Wednesday 1 August 2007

Technical Description of project.

The project will be compared to the results presented by Yang et.al (2004). We will restrict the system to tracking, detection and first story detection. A description of Yang et.al approaches are described in detail. Commentary is also provided about suitability to this project and possible alterations are also given.

Tracking

Section 3 deals with Topic and Event tracking. This is the core of the system that I will be developing. The approach used by Yang et.al involved using a variety of IR (information retrieval) techniques. These include
  • kNN (k – nearest neighbours)
  • D – Trees (decision Trees)
  • Rocchio
  • LM (Language Modelling)
Further they combine these methods using the BORG (Best Overall Result Generator).

One issue that is raised is the tuning of parameter for each method. Each method involves its own intrinsic parameters that need to be tuned to have the best performance. One issue is that you may tune for a given set, but this will inevitably reduce performance on unseen sets. The solution is to use the BORG model that combines data from many sources and makes a judgement. Further you tune each IR technique on the same set of data and evaluate on another set of data. The issue that is solved is the fact that there are few positive training examples that are present. This comes from the nature of the problem, news stories do not last very long.

The system that I will develop (hereafter VB.sys) may or may not use the a concept similar to BORG. It depends solely on the computation time that each method adds to the system. The VB.sys will more than likely involve parameter optimization and so the data sets need to be split into training and evaluation.

Each of the IR techniques are now discussed in turn:

Rocchio
Rocchio uses a vector space description of each given story. The vector space representation is the weights of all words within the story. Yang uses a common IR version of the TF-IDF for the story. From this we calculate the centroid of a given cluster which is a strong representation of the cluster. To calculate this we use n of the top ranking stories and then calculate the centroid. We then simply calculate the cosine between the centroid and the story in question.
In the VB.sys i will not be using the Rocchio method for the reason that is described in [13 in paper]. This is if the centroid is not well formed classification and clustering is not reliable and not accurate.

kNN
k - Nearest Neighbours is an instance based classification. This is much better suited to the task at hand as it does not rely on a centroid. Yang introduces slight variations to traditional kNN due to the small number of positive training examples. kNN uses zones around the story being tested in order to classify. The reason that this improves on standard kNN is that it also uses negative examples. this means the overall evidence we use is much larger and thus can get a better result with few positive stories.
In the VB.sys we will definitely use kNN or the augmented kNN algorithms. This is because it works well with few positive examples and does not rely on the presence of a well defined centroid like the Rocchio method.

Language Modeling
Yang borrow the techniques used by BBN system to implement their language modelling classifier. The BBN topic spotter is a Naive Bayesian classifier that uses smoothing to ensure unseen events are treated correctly. Other techniques that could be used KL - divergence, Hidden Markov models, as well us other techniques.
In the VB.sys we will use some form of language modelling. This is because language modelling is described well in all appears as being a good technique to classify the data. One issue is that it may be processor intensive.

BORG
The BORG track then uses a z - score to combine these values. and see how far from the mean of a cluster the current story is. It then will decide on whether to add to a cluster or not.

Each of the system that are described use some threshold. That is they make binary decisions based on the some threshold. The tuning of this threshold is also an important consideration to ensure we get a worthwhile system.

Topic Detection

Detection relies on single pass clustering. The single pass cluster is important because we are trying to perform real time clustering. Yang gives three ways for the clustering to occur.
The first way is to use GAC (Grouped Average Clustering). This uses a look ahead window to cluster using time as a major feature to aid clustering. A small look ahead window is also compared with a look - back window to either merge to an old cluster or not merge and be its own cluster.

The other system that they describe language model vectors. We then cluster based on estimated probabilities for an on target story. We smooth the data set as well using expectation maximisation algorithm based on some training data. We then work out an overall likelihood score.

the VB.sys will use a similar set of techniques that Yang uses to develop the detection.

First Story Detection

Yang et.al use a novelty score to define whether there is a new story. That is how dis - similar is it to the nearest neighbour. We look at the vector space model to find the novelty score. It is very sensitive to different sizes of the the look back clusters.
This will be used in the VB.sys as it is the best way to find and quantify new clusters.


The remainder of the paper look at Segmentation and Story link detection. Also there is a large amount spend looking into multi lingual TDT. This gives a good basis to the future work that will be done.

No comments: