Monday, 30 July 2007

Things I Need to write down

Here a few things that I've read or come across that I need to get down before i forget.
  • Topic Detection is a problem of real time clustering (or near real time). Its purpose is to be able to develop clusters of stories in one pass given a small (possibly zero size) history of the cluster.
  • Topic detection and first story detection go hand in hand. In fact you can use almost identical algorithms with the only difference being you threshold for FSD.
  • It seems that the best results are obtained using mixture models. That is using several techniques in order to obtain a result.
  • The main trade offs that will occur in TDT is False Alarm Probability and Missed Probability. In order to get a good system we aim to try and minimise both. However they are generally trade offs. A system with low false alarm generally has a higher miss rate. The reason can be seen in the way they generalise.
  • Most of the clustering seems to use a vector space model, and then KNN clustering techniques.
  • A main issue that i need to look into is run time. That is ensure that the system works in an acceptable time limit, while at the same time doing its job.
  • I need to also ensure that the system is a technical RESEARCH PROJECT that has some interesting research as well as the system that is business orientated. This will become clearer after i type up the initial system approach, and include references. These will then make sure I'm heading in the right technical direction for the project.

Developing the Ideas

In the EBUS5003 lab today Rafael said that the best way to move on with the project. His suggestion was that I pick a research paper from the TDT TRECs and try to emulate their results, and then extend the work to my data set and my application.

This seems like a good idea as then i will at least have a bench mark of where to go. The paper that I will use is the paper in 'Topic Detection and Tracking: Event Based Information Organization', from CMU. It is Chapter 5 (Multi-strategy learning for Topic Detection and Tracking Yang

The reason I choose this paper is that it covers all the areas that i will deal with, as well as having a quite detailed history of the way it has been constructed. The results and evaluation methodology is also well reported.

Overall the project i will be doing will concentrate on 3 tasks involved in TDT, all of which are described in detail. These tasks are:
  1. First Story Detection
  2. Topic Detection (Clustering)
  3. Tracking
My next plan is to obtain a copy of the TDT3 Corpus (which is already annotated as i do not want to spend too much time doing that). I've emailed Rafael about is as it seems you have to pay (allot) for it, which I don't really want to do.

I will also be obtaining a laptop from Dan tomorrow that i will be able to use for the duration of the project. This will be my working laptop where i will be able to use it at Vision Bytes. I will be going there more often from next week.

The full text of the article and the way that i will attack the problems will be up tomorrow.

Further I'm reinstalling Linux at home :-) For both EBUS COSC and Thesis, might even consider putting it on the laptop, but not sure. Think I'll keep this running Visa for the sanity at work.

Data Clustering

The following is some research into conventional clustering algorithms. These are taken from several sources:

The purpose of clustering is to give structure to some unlabelled data. That is we want to organise similar data, into groups (clusters). Clusters are those objects that are similar. Thus the similarity measure is an important part of clustering. For any clustering system we require;
  • Scalability
  • Able to deal with various attributes
  • Can cluster regardless of shape of the cluster
  • minimal requirements for domain knowledge
There are many different algorithms that we can use to cluster. Most however involve representing a document as a vector. Further we can use other information about the document to add to the vector.

Once we form the vector we need to find the distance between two topics, or a topic and some centroid. The centroid is the centre of a cluster and can be useful when clustering on the fly. The types of distances that can be used, this can be using simple Euclidean distances. However often more complex distance / similarity measures are required. Other considerations need to be the units of the vectors and scaling that will need to take place in order to make the cluster well formed.

More on the algorithms and distance measures in next summary.

Wednesday, 25 July 2007

Probabilistic Approaches to topic Detection and Tracking

Tim Leek, Richard Schwartz, and Srinivasa Sista (BBN Technologies)
Chapter 4: Topic Detection and Tracking: Event Based Information Organisation (2002)

This paper deals with the problems with treating the TDT tasks as a probabilistic approach. The power of a probabilistic model is that it is able to give you a range of choices, and here human interaction with the system can allow us to make a once unsupervised task into a semi - supervised task. The underlying model is a mathematically developed model and so is much more likely to be able to deal with new unseen data, rather than an ad hoc model that may only work for seen data (in the training set).

The first thing that to be discussed is the different methods employed to measure story - topic similarity. This is the corner stone of both detection and tracking tasks. There are 4 main methods that they use in order to show the similarity between stories and topics. At the end of the day however these must be normalised in order to make some meaningful comparison.
  1. Probabilistic Topic Spotting (TS) Measure: Here we assume that the words in the story represent a Hidden Markov Model (HMM).
    Here the probability of the next word are calculated if the story was 'on topic' and whether this has been seen before can be tested.
    They use a log score to ensure there is no underflow. Equation 4.1 uses Bayes Rule to develop the model.
  2. Probabilistic Information Retrieval (IR) measure:Given some training scores we look at the probability that a story is relevant given a query about the story and topic.
    The measure is based on the conditional probability that a certain story will be able to generate a query based on the words present.
  3. Probabilistic IR with Relevance Feedback measure: This is similar to the above method with the addition of a feedback system that adds words to the query.
    This feedback system can also be weighted according to the query and the context. This model takes more features about the words when making the decisions.
  4. Probabilistic Word Presence Measure: This looks for certain required words. That is there are always certain words that will be able to indicate the topic.
    This method can be risky with a small number of stories since there is not enough evidence for the system to make judgements.
The next section deals with how to normalise the scores. Each of the measures has a different range and a different set of units. We simply cannot normalise over each story, as there are far to few stories that are on topic, so the foot print of on topic stories is small. Rather we calculate a z- score for the story being off - topic. This means we need to use the data about the stories already present and see how far from the mean is the story located, the z - score is easily calculated and from this we can generate the thresholds required by TDT.

Once we can normalise the scores the can combine them to give a more accurate measure of similarity. BBN have used a linear regression to combine the 4 measures, once they have been normalised.

The tracking and detection are similar systems. They both use the story - topic similarity measures.
With Tracking we cannot totally rely on vocabulary because as a story evolves words are removed and added. One important part of the over all system in terms of tracking is time. Time gives a good measure about being on topic or not. But as always there will be exceptions.
With Detection BBN use a modified incremented k - means algorithms to cluster the stories. And if need be create new clusters and re evaluate.

Final Project Plan

I sent Rafael the final version of the plan today to get him to look it over quickly. Hopefully i can then hand it in asap.

I am already slightly behing as i need to type up most of the reading material onto the Blog, this will be done today and tonight. I will then be back on track, and finish the UML diagrams and also the requirements document.

Monday, 23 July 2007

Topic Dectection and Tracking Evaluation Overview

Johnathan G. Fiscuss and George R. Doddington
Chapter 2: Topic Detection and Tracking: Event Based Information Organisation (2002)

Here we define some important terms:
  • Topic: A seminal event or activity, along with all directly related events and activities.
  • Story: A story is on topic if it discusses events or activities that are directly connected to the topic's seminal event.
There does need to be some limit to the time that we consider the story to be on topic.

The next section talks about the evaluation metrics that are used in order to test the system that is built. Each of the 5 tasks are evaluated as a detection task. For each task there is some input data, a hypothesis and then the system needs to decide whether it is true.

For each task we can consider binary decisions that is it is a target or a non - target. The terminology that we use to evaluate the system is as follows:
  • If the reference annotation and the system response agree (that is either both say it is a target or both say it is not a target) then we say we are correct.
  • If the system response not a target but we have an annotation saying it is a target then we have a 'missed detection'
  • If the system response is a target however the annotation says not a target we say we have a 'false alarm'
The remainder of the chapter again deals with the 5 tasks and a more detailed description of the tasks. Further is mentions some assumptions that we make. We shall revisit this when we look at the tasks that we are performing.

Introduction to Topic Detection and Tracking

James Allan
Chapter 1: Topic Detection and Tracking: Event based Information Organization (2002)

This first chapter looks at an overview of the TDT (Topic Detection and Tracking). TDT is like other IR (information retrieval) tasks however one main issue is the lack of existing stories.

It gives a definition of the words topic and event. Both of which need to understood before we go any further. So a topic is a 'set a news stories that are strongly related to some seminal real - world event'. The event itself can be thought of a 4 vector (space and time) that will trigger a topic.

The next important distinction is what is a topic, event and the subject. A subject is what the story is about. The system that we will be looking at is an event based system rather than a subject based. So we always will be looking at things that have some event that trigger the topics. Another important factor is that event based systems are temporal. That is the seminal event is anchored at some time, and they will evolve over time.

We then discuss the 5 tasks that relate to TDT:
  1. Story Segmentation: This involves dividing some continuous news stream into individual stories. This is a problem of finding boundaries between news stories.
  2. First Story Detection: This is trying to find the onset of a new topic in some stream.
  3. Cluster Detection: this is when we group stories as they arrive into the groups of similar topics.
  4. Tracking: This involves finding additional stories given some small samples.
  5. Story Link Detection: This is looking at whether two given stories are topically linked.
The remainder of the chapter looks into the history of TREC sponsored events.

Tuesday, 3 July 2007

Project Plan

So if haven't been here in a while! But now its time for the plan to start to take shape.
The first thing is i'll be making a plan for the project. This will include Gnatt chart, the resource mangement.
The first will be a draft and will be reworked.