Monday, 23 July 2007

Introduction to Topic Detection and Tracking

James Allan
Chapter 1: Topic Detection and Tracking: Event-based Information Organization (2002)

This first chapter gives an overview of TDT (Topic Detection and Tracking). TDT is like other IR (information retrieval) tasks; however, one main difference is the lack of existing stories: a system has few or no example stories of a topic to work from.

It gives definitions of the words topic and event, both of which need to be understood before we go any further. A topic is a 'set of news stories that are strongly related to some seminal real-world event'. The event itself can be thought of as a 4-vector (a point in space and time) that triggers a topic.

The next important distinction is between a topic, an event and a subject. A subject is what the story is about. The system that we will be looking at is event based rather than subject based, so we will always be looking at topics that are triggered by some event. Another important factor is that event-based systems are temporal: the seminal event is anchored at some point in time, and the topics it triggers evolve over time.

The chapter then discusses the 5 tasks that relate to TDT:
  1. Story Segmentation: This involves dividing a continuous news stream into individual stories. It is a problem of finding the boundaries between news stories.
  2. First Story Detection: This is trying to find the onset of a new topic in a stream.
  3. Cluster Detection: This is grouping stories, as they arrive, into clusters that share a topic.
  4. Tracking: This involves finding additional stories on a topic, given a small number of sample stories.
  5. Story Link Detection: This is deciding whether two given stories are topically linked.
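
To make tasks 2 to 5 a bit more concrete, here is a minimal sketch in Python. It is not the method from the chapter: it assumes stories are already segmented (so task 1 is skipped), represents each story as a plain bag of words, and compares stories with cosine similarity against an arbitrary threshold of 0.3. All names (vectorize, cosine, linked, detect, track) and the sample stories are made up for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Lowercased bag-of-words term counts (a deliberately crude representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def linked(story_a, story_b, threshold=0.3):
    """Story link detection: are the two stories similar enough to be linked?"""
    return cosine(vectorize(story_a), vectorize(story_b)) >= threshold

def detect(stream, threshold=0.3):
    """First story detection and cluster detection in a single pass.

    Stories are handled in arrival order. A story joins the most similar
    existing cluster if the similarity reaches the threshold; otherwise it
    starts a new cluster and is reported as a first story.
    Returns (clusters, first_stories) as lists of story indices.
    """
    clusters = []       # each cluster: {"centroid": Counter, "members": [indices]}
    first_stories = []
    for i, story in enumerate(stream):
        vec = vectorize(story)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)   # centroid = summed term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
            first_stories.append(i)
    return [c["members"] for c in clusters], first_stories

def track(samples, stream, threshold=0.3):
    """Tracking: indices of stream stories similar to a few sample stories."""
    centroid = Counter()
    for sample in samples:
        centroid.update(vectorize(sample))
    return [i for i, story in enumerate(stream)
            if cosine(vectorize(story), centroid) >= threshold]

# Hypothetical, already-segmented news stream.
stories = [
    "earthquake strikes coastal city overnight",
    "rescue teams search coastal city after earthquake",
    "parliament votes on new budget bill",
    "budget bill passes parliament after late vote",
]

clusters, first_stories = detect(stories)
print(clusters, first_stories)
print(linked(stories[0], stories[1]), linked(stories[0], stories[2]))
print(track([stories[0]], stories[1:]))
```

On this toy stream the two earthquake stories end up in one cluster and the two budget stories in another, with the first story of each flagged as new. A real system would need better term weighting and a tuned threshold; the point here is only how the tasks relate to each other.
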
The remainder of the chapter looks into the history of the TREC-sponsored events.
