Wednesday 25 July 2007

Probabilistic Approaches to Topic Detection and Tracking

Tim Leek, Richard Schwartz, and Srinivasa Sista (BBN Technologies)
Chapter 4: Topic Detection and Tracking: Event-Based Information Organisation (2002)

This paper deals with treating the TDT tasks probabilistically. The power of a probabilistic model is that it gives you a range of scored choices rather than a single hard decision, and here human interaction with the system can turn a once unsupervised task into a semi-supervised one. Because the underlying model is mathematically principled, it is much more likely to cope with new, unseen data than an ad hoc model that may only work for data already seen in the training set.

The first thing to be discussed is the set of methods employed to measure story-topic similarity. This is the cornerstone of both the detection and tracking tasks. They use four main methods to score the similarity between stories and topics; in the end, however, the scores must be normalised to make any meaningful comparison between them.
  1. Probabilistic Topic Spotting (TS) Measure: Here the words in the story are assumed to be generated by a Hidden Markov Model (HMM).
    The probability of each word is calculated under the hypothesis that the story is 'on topic', and this can be tested against what has been seen before.
    They use a log score to ensure there is no underflow when multiplying many small probabilities. Equation 4.1 uses Bayes' Rule to develop the model (see the sketch after this list).
  2. Probabilistic Information Retrieval (IR) Measure: Given some training scores, we look at the probability that a story is relevant given a query built from the topic.
    The measure is based on the conditional probability that a certain story would generate that query from the words present in it.
  3. Probabilistic IR with Relevance Feedback Measure: This is similar to the above method, with the addition of a feedback step that adds words to the query.
    The feedback can also be weighted according to the query and the context, so this model takes more features of the words into account when making decisions.
  4. Probabilistic Word Presence Measure: This looks for certain required words; that is, there are always certain words whose presence indicates the topic.
    This method can be risky with a small number of stories, since there is not enough evidence for the system to make reliable judgements.
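To make the topic-spotting score concrete, here is a minimal sketch in Python. It scores a story by the log-likelihood ratio of a topic unigram model against a General English background model; the Counter-based models, the mixture weight alpha, and the crude add-one smoothing are illustrative stand-ins for the HMM and the estimation described in the chapter.

```python
import math
from collections import Counter

def ts_log_score(story_words, topic_counts, ge_counts, alpha=0.5):
    """Log-likelihood ratio that the story came from the topic model
    rather than the General English (GE) background model."""
    topic_total = sum(topic_counts.values())
    ge_total = sum(ge_counts.values())
    score = 0.0
    for w in story_words:
        # Back off to the GE model so unseen topic words do not zero out
        # the whole story; alpha is an illustrative mixture weight.
        p_ge = (ge_counts[w] + 1) / (ge_total + 1)
        p_topic = alpha * (topic_counts[w] / topic_total) + (1 - alpha) * p_ge
        # Summing logs avoids the numeric underflow the chapter warns about.
        score += math.log(p_topic) - math.log(p_ge)
    return score

# Toy example: a tiny topic model and background model.
topic = Counter("earthquake rescue aftershock earthquake damage".split())
ge = Counter("the of and said earthquake people the to".split())
print(ts_log_score("earthquake aftershock hits the city".split(), topic, ge))
```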
The next section deals with how to normalise the scores. Each of the measures has a different range and a different set of units. We cannot simply normalise over on-topic stories, as there are far too few of them; the footprint of on-topic stories is small. Rather, we calculate a z-score for the story against the off-topic distribution. This means using the data from the stories already present to see how far a story's score lies from the off-topic mean; the z-score is easily calculated, and from it we can set the thresholds required by TDT.
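A minimal sketch of this normalisation, assuming we hold a sample of scores from known off-topic stories; the sample values and the decision threshold below are purely illustrative.

```python
import statistics

def z_normalise(raw_score, off_topic_scores):
    """z-score of a raw similarity score against the off-topic score
    distribution; on-topic stories are too rare to model directly."""
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    return (raw_score - mu) / sigma

off_topic = [2.1, 1.8, 2.4, 1.9, 2.2]   # scores of known off-topic stories
z = z_normalise(5.0, off_topic)          # how unusual is this story's score?
print(z, z > 3.0)                        # 3.0 is an illustrative threshold
```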

Once the scores are normalised, they can be combined to give a more accurate measure of similarity. BBN used linear regression to combine the four measures after normalisation.
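A sketch of the combination step; in the chapter the weights come from a linear regression fitted on training topics, whereas the numbers here are made up.

```python
def combined_score(z_scores, weights):
    """Weighted linear combination of the four normalised measures."""
    return sum(w * z for w, z in zip(weights, z_scores))

# z-scores for TS, IR, IR-with-feedback and word-presence (illustrative),
# combined with hypothetical regression weights.
print(combined_score([3.1, 2.4, 2.9, 1.2], [0.4, 0.3, 0.2, 0.1]))
```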

Tracking and detection are similar systems; both are built on the story-topic similarity measures.
With tracking we cannot rely entirely on vocabulary, because as a story evolves words drop out and new ones appear. One important feature of the overall system for tracking is time: a story's time gives a good indication of whether it is on topic, though as always there will be exceptions.
For detection, BBN use a modified incremental k-means algorithm to cluster the stories, creating new clusters and re-evaluating assignments as needed; a simplified sketch follows.
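Here is a simplified sketch of that clustering loop, using cosine similarity over word counts as a stand-in for the combined story-topic measure; the threshold value and the absence of the re-evaluation pass are simplifications of BBN's modified incremental k-means.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (a stand-in
    for the normalised story-topic measures in the chapter)."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def incremental_cluster(stories, threshold=0.2):
    """Assign each arriving story to its closest existing cluster, or
    start a new one; the re-evaluation pass is omitted for brevity."""
    clusters = []  # each cluster is a summed word-count Counter
    for story in stories:
        vec = Counter(story.split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.update(vec)       # fold the story into an existing topic
        else:
            clusters.append(vec)   # first story on a new topic
    return clusters

print(len(incremental_cluster([
    "earthquake hits city",
    "earthquake rescue effort in city",
    "election results announced",
])))
```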
