Chapter 4: Topic Detection and Tracking: Event Based Information Organisation (2002)
This paper deals with the TDT tasks using a probabilistic approach. The power of a probabilistic model is that it gives you a range of choices with associated probabilities, and here human interaction with the system can turn a once-unsupervised task into a semi-supervised one. Because the underlying model is mathematically principled, it is much more likely to cope with new, unseen data than an ad hoc model that may only work for data already seen in the training set.
The first thing to be discussed is the different methods employed to measure story-topic similarity. This is the cornerstone of both the detection and tracking tasks. Four main methods are used to score the similarity between stories and topics; ultimately these scores must be normalised so that they can be meaningfully compared.
- Probabilistic Topic Spotting (TS) Measure: Here we assume that the words in the story are generated by a Hidden Markov Model (HMM). The probability of each word is calculated under the assumption that the story is 'on topic', and this can be tested against a model of previously seen text. A log score is used to avoid numerical underflow. Equation 4.1 uses Bayes' Rule to develop the model.
- Probabilistic Information Retrieval (IR) Measure: Given some training scores, we look at the probability that a story is relevant given a query about the story and topic. The measure is based on the conditional probability that a story could generate the query from the words it contains.
- Probabilistic IR with Relevance Feedback Measure: This is similar to the above method, with the addition of a feedback step that adds words to the query. The feedback can be weighted according to the query and its context, so this model takes more features of the words into account when making decisions.
- Probabilistic Word Presence Measure: This looks for certain required words; there are always certain words whose presence indicates the topic. This method can be risky with a small number of stories, since there is not enough evidence for the system to make reliable judgements.
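The topic-spotting idea above can be sketched as a unigram log-likelihood ratio against a background model. This is a simplification of the paper's HMM formulation; the function names, additive smoothing, and toy word counts are my own assumptions, not BBN's exact model:

```python
import math
from collections import Counter

def ts_score(story_words, topic_counts, background_counts, alpha=0.5):
    """Log-likelihood ratio of a story under a topic model versus a
    background (general-English) model. Summing logs avoids underflow."""
    topic_total = sum(topic_counts.values())
    bg_total = sum(background_counts.values())
    vocab = len(set(topic_counts) | set(background_counts)) or 1
    score = 0.0
    for w in story_words:
        # Additive smoothing so an unseen word does not zero the product.
        p_topic = (topic_counts[w] + alpha) / (topic_total + alpha * vocab)
        p_bg = (background_counts[w] + alpha) / (bg_total + alpha * vocab)
        score += math.log(p_topic) - math.log(p_bg)
    return score

topic = Counter("earthquake quake rescue aftershock rescue".split())
background = Counter("the a of said new year people quake".split())
on_topic = "rescue teams reached the quake zone".split()
off_topic = "the new year sales figures said people".split()
assert ts_score(on_topic, topic, background) > ts_score(off_topic, topic, background)
```

A positive score means the topic model explains the story better than the background model, which is the 'on topic' decision the TS measure makes.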
Once the scores are normalised, they can be combined to give a more accurate measure of similarity. BBN used linear regression to combine the four measures after normalisation.
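The normalise-then-combine step can be sketched as z-score normalisation followed by a weighted sum. The weights below are made-up placeholders; in the paper they would come from the regression fit, and the toy scores are assumptions:

```python
def normalise(scores):
    """Z-score normalisation so measures on different scales are comparable."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(s - mean) / std for s in scores]

def combine(measure_scores, weights):
    """Weighted linear combination of the normalised measures, in the
    spirit of BBN's regression-based fusion (weights here are invented)."""
    normed = [normalise(m) for m in measure_scores]
    return [sum(w * col[i] for w, col in zip(weights, normed))
            for i in range(len(normed[0]))]

# Four measures scored over three stories (toy numbers).
ts = [5.1, -2.0, 0.3]
ir = [0.9, 0.2, 0.5]
rf = [1.2, 0.1, 0.7]
wp = [1.0, 0.0, 1.0]
fused = combine([ts, ir, rf, wp], weights=[0.4, 0.3, 0.2, 0.1])
```

Because every measure is brought onto the same scale first, the regression weights express only the relative trust in each measure, not its raw magnitude.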
Tracking and detection are similar systems; both build on the story-topic similarity measures.
With tracking we cannot rely entirely on vocabulary, because as a topic evolves words are removed and added. One important part of the overall system for tracking is time: the time gap gives a good indication of whether a story is on topic, though as always there will be exceptions.
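One simple way to fold time into tracking is to decay the similarity score by the gap since the topic was last active. The exponential form and the half-life value below are my assumptions for illustration, not the paper's actual time model:

```python
def time_weighted_score(similarity, days_since_last_on_topic, half_life=30.0):
    """Decay a story-topic similarity score by the elapsed time: stories
    far from the topic's last activity are less likely to be on topic.
    (Decay form and half-life are assumptions, not BBN's exact model.)"""
    decay = 0.5 ** (days_since_last_on_topic / half_life)
    return similarity * decay

recent = time_weighted_score(2.0, 1)    # almost undiminished
stale = time_weighted_score(2.0, 120)   # four half-lives later
assert recent > stale
```

The exceptions the notes mention (e.g. anniversary coverage of an old event) are exactly the cases such a decay would penalise unfairly, so time should temper the vocabulary score rather than replace it.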
With detection, BBN use a modified incremental k-means algorithm to cluster the stories, creating new clusters when needed and re-evaluating the assignments.
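The detection step can be sketched as greedy incremental clustering: each incoming story joins its most similar cluster, or opens a new one when no cluster clears a threshold. This is a simplified stand-in for BBN's modified incremental k-means (it omits the re-evaluation pass), and the similarity function, threshold, and toy stories are assumptions:

```python
def overlap(story, cluster):
    """Jaccard word-overlap between a story and a cluster's vocabulary."""
    vocab = set().union(*cluster)
    words = set(story)
    return len(words & vocab) / len(words | vocab)

def incremental_cluster(stories, threshold=0.2):
    """Assign each story to the best existing cluster, or start a new one
    when no similarity exceeds the threshold (new-event detection)."""
    clusters = []  # each cluster is a list of stories (word lists)
    for story in stories:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = overlap(story, cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([story])  # first story of a new topic
        else:
            best.append(story)
    return clusters

docs = [
    "quake hits city rescue teams".split(),
    "rescue teams search quake rubble".split(),
    "election results announced today".split(),
]
clusters = incremental_cluster(docs)
```

Here the two quake stories share enough vocabulary to merge, while the election story starts a new cluster; a fuller k-means variant would then revisit earlier assignments as the cluster centroids shift.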