Monday, 27 August 2007
Friday, 24 August 2007
The partial draft is due soon, and I need the following to be done for it:
- The introduction and background should be done.
- The development chapter should cover what I have done so far, plus a skeleton of what I'm doing.
- Introduction is done
- I have a lit review and background of where I'm coming from; however, I need to add some things about the Vision Bytes perspective, as well as technical details.
- Development needs to be written up and a skeleton provided.
Monday, 6 August 2007
Hopefully he'll email me ASAP, so I can get hold of these files and start doing some proper work. Also I'm meeting Waleed on Thursday and getting some data in the form of objects. I need to have something to show for this.
Carrot2 is integrated in many things, and has the advantage that it involves some visualisation. This could be useful to get something before I meet with the design guy.
On another note, I think I need to serve the Linux box from Intuition. It seems most logical, since the net is fast and I can set up port forwarding etc.
Sunday, 5 August 2007
However, I should keep in mind that I need to do this, so keep things like data connectivity and what is doing what in mind.
Thursday, 2 August 2007
It may also be worthwhile getting some of the Vision Bytes data (say 10-20 tagged stories) into some format.
In the short term I may need to use XSLT to transform the LDC data into Vision Bytes formats, but I'll wait till I see both in their final forms.
Waleed is in charge of the innovations team and also worked on the Story Segmenter, which my system will need to talk to. Here is an overview of the meeting along with some technical descriptions and a plan of attack.
The system that I will need to know about has 4 distinct areas.
- Acquisition System: Takes the video stream, audio stream, and caption text and records them into the database. It does this line by line, so that the raw caption text is located within the database. They also save other details about the caption text which may be of use (but currently I think they will not be useful).
- Database System: The database system is a SQL Server instance that simply acts as the repository.
- Story Segmenter: Here the stories that have been recorded can be further segmented (in the case of a news program). This is where all the metadata is also added. This metadata is added manually by a team of reporters who go through the story and add time-in and time-out stamps as well as titles and descriptions (the descriptions are somewhat automated).
- Web Interface (DCP): Displays the information from the database to the user.
- Title information (full text title)
- Timing information (when and how long it is)
- Captions (raw caption data)
- Program information (channel program name)
Further, he said that a lot of the information that I need is already in a Java object (Speech Object), which I can use to get an overall idea of the system and build a system that will be easier to integrate in the long run.
Early next week I'll be meeting with Waleed in order to get a copy of this object as well as a snapshot of their database, to ensure the data formats are correct and I can begin developing with their formats in mind.
The question of when this will run was also raised. Several options were discussed, but the most favourable one was to use a message or queue bean to tell my system that a new story has been completed and that it should run its classifier over it. Alternatively, we may simply run it hourly or so. It may also be beneficial for them to run it on all their existing data, so the computational cost of the Java code is important and needs to be investigated further.
From here my aim by Monday is to get something in Java that uses beans to communicate, reads information, populates some objects and then runs simple statistics on these objects. This will make sure I get beans back in my head, and also give me some initial stats about the data that I will be using.
Wednesday, 1 August 2007
Section 3 deals with Topic and Event tracking. This is the core of the system that I will be developing. The approach used by Yang et al. involves a variety of IR (information retrieval) techniques. These include:
- kNN (k-nearest neighbours)
- D-Trees (decision trees)
- LM (language modelling)
One issue that is raised is the tuning of parameters for each method. Each method has its own intrinsic parameters that need to be tuned for best performance. The danger is that parameters tuned on a given set tend to fit that set too closely, which reduces performance on unseen sets. The solution is to use the BORG model, which combines results from many sources and makes a judgement. Further, you tune each IR technique on the same set of data and evaluate on another set of data. The issue this addresses is that there are few positive training examples available. This comes from the nature of the problem: news stories do not last very long.
The system that I will develop (hereafter VB.sys) may or may not use a concept similar to BORG. It depends solely on the computation time that each method adds to the system. The VB.sys will more than likely involve parameter optimisation, and so the data sets need to be split into training and evaluation.
Each of the IR techniques is now discussed in turn:
Rocchio uses a vector space description of each given story. The vector space representation holds the weights of all words within the story. Yang uses a common IR version of TF-IDF for the story. From this we calculate the centroid of a given cluster, which is a strong representation of the cluster. To calculate this we take n of the top-ranking stories and average them. We then simply calculate the cosine between the centroid and the story in question.
In the VB.sys I will not be using the Rocchio method, for the reason described in [13 in paper]: if the centroid is not well formed, classification and clustering are neither reliable nor accurate.
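Even though Rocchio is not planned for the VB.sys, the scoring it relies on is easy to sketch for comparison. The snippet below is a minimal illustration only: the function names, the toy stories and the particular TF-IDF variant (log-scaled TF with a smoothed IDF) are my assumptions, not the exact weighting Yang uses.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenised stories into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many stories does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            # Log-scaled TF times a smoothed IDF (one common variant, assumed here).
            term: (1.0 + math.log(count)) * math.log((n_docs + 1) / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def centroid(vectors):
    """Average a list of sparse vectors into one centroid vector."""
    totals = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: weight / len(vectors) for term, weight in totals.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data (hypothetical): three election stories and one football story.
stories = [
    ["election", "vote", "party"],
    ["election", "vote", "poll"],
    ["football", "goal", "match"],
    ["election", "party", "poll"],
]
vectors = tf_idf_vectors(stories)
topic_centroid = centroid(vectors[:2])  # n = 2 on-topic stories form the centroid
on_topic_score = cosine(topic_centroid, vectors[3])
off_topic_score = cosine(topic_centroid, vectors[2])
```

The weakness noted above shows up directly here: everything hinges on `topic_centroid`, so with only a couple of positive stories one odd story can drag the centroid away from the topic.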
k-Nearest Neighbours is an instance-based classifier. This is much better suited to the task at hand as it does not rely on a centroid. Yang introduces slight variations to traditional kNN due to the small number of positive training examples. The modified kNN uses zones around the story being tested in order to classify. The reason that this improves on standard kNN is that it also uses negative examples. This means the overall evidence we use is much larger, and thus we can get a better result with few positive stories.
In the VB.sys we will definitely use kNN or the augmented kNN algorithms. This is because they work well with few positive examples and do not rely on the presence of a well-defined centroid like the Rocchio method.
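To make the idea of using negative examples concrete, here is a rough sketch of a kNN-style score that averages similarity to the nearest positive stories and subtracts the average similarity to the nearest negative ones. This is my own simplification in the spirit of the augmented kNN described above; the function names and the `k_pos` / `k_neg` parameters are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def knn_score(story, positives, negatives, k_pos=2, k_neg=2):
    """Mean similarity to the nearest positives minus mean similarity to the nearest negatives."""
    nearest_pos = sorted((cosine(story, p) for p in positives), reverse=True)[:k_pos]
    nearest_neg = sorted((cosine(story, n) for n in negatives), reverse=True)[:k_neg]
    return _mean(nearest_pos) - _mean(nearest_neg)


# Toy vectors (hypothetical): one positive and one negative training story.
positives = [{"election": 1.0, "poll": 1.0}]
negatives = [{"football": 1.0, "goal": 1.0}]

on_topic = knn_score({"election": 1.0, "vote": 1.0}, positives, negatives)
off_topic = knn_score({"goal": 1.0}, positives, negatives)
```

A story would be accepted as on topic when its score is above some tuned threshold; with this scoring a positive value means it sits closer to the positive examples than to the negative ones.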
Yang borrows the techniques used by the BBN system to implement their language modelling classifier. The BBN topic spotter is a Naive Bayes classifier that uses smoothing to ensure unseen events are treated correctly. Other techniques that could be used include KL-divergence and Hidden Markov Models, among others.
In the VB.sys we will use some form of language modelling. This is because language modelling is described in all the papers as being a good technique to classify the data. One issue is that it may be processor intensive.
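As a rough idea of what a smoothed language-model classifier looks like, the sketch below builds two unigram models (on-topic and background) with add-one style smoothing and scores a story by the log-likelihood ratio between them. This is a generic Naive Bayes style illustration under my own assumptions (class and function names, toy data, the `alpha` smoothing parameter); it is not the BBN topic spotter itself.

```python
import math
from collections import Counter


class UnigramModel:
    """Unigram language model with additive (Laplace-style) smoothing."""

    def __init__(self, docs, vocab, alpha=1.0):
        self.counts = Counter(term for doc in docs for term in doc)
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab)
        self.alpha = alpha

    def log_prob(self, term):
        # Smoothing keeps unseen terms from getting probability zero.
        return math.log(
            (self.counts[term] + self.alpha)
            / (self.total + self.alpha * self.vocab_size)
        )

    def log_likelihood(self, doc):
        return sum(self.log_prob(term) for term in doc)


def topic_log_ratio(doc, topic_model, background_model):
    """Positive when the topic model explains the story better than the background."""
    return topic_model.log_likelihood(doc) - background_model.log_likelihood(doc)


# Toy data (hypothetical).
topic_docs = [["election", "vote"], ["election", "poll"]]
background_docs = [["football", "goal"], ["weather", "rain"], ["election", "vote"]]
vocab = {term for doc in topic_docs + background_docs for term in doc}

topic_model = UnigramModel(topic_docs, vocab)
background_model = UnigramModel(background_docs, vocab)

on_topic = topic_log_ratio(["election", "vote"], topic_model, background_model)
off_topic = topic_log_ratio(["football", "goal"], topic_model, background_model)
```

The cost per story is one pass over its words per model, so the processor load mentioned above would mostly come from the number of topics being tracked rather than from any single comparison.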
The BORG track then uses a z-score to combine these values and see how far from the mean of a cluster the current story is. It will then decide whether to add the story to a cluster or not.
Each of the systems described uses some threshold. That is, they make binary decisions based on that threshold. The tuning of this threshold is also an important consideration to ensure we get a worthwhile system.
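A z-score makes scores from different methods comparable before they are combined and thresholded. The sketch below normalises each method's raw score against reference scores for that method and then averages the results. The method names, reference values and the choice of a simple average are my assumptions for illustration; the paper's exact combination rule may differ.

```python
import statistics


def z_score(score, reference_scores):
    """How many standard deviations `score` lies from the mean of the reference scores."""
    mean = statistics.mean(reference_scores)
    std = statistics.pstdev(reference_scores)
    return (score - mean) / std if std > 0 else 0.0


def combined_score(scores, reference):
    """Average the per-method z-scores so no single score scale dominates."""
    z_values = [z_score(scores[name], reference[name]) for name in scores]
    return sum(z_values) / len(z_values)


# Hypothetical reference scores per method, e.g. collected on a tuning set.
reference = {
    "knn": [0.1, 0.2, 0.3],       # cosine-style scores
    "lm": [-10.0, -8.0, -6.0],    # log-likelihood-style scores
}

typical = combined_score({"knn": 0.2, "lm": -8.0}, reference)
strong = combined_score({"knn": 0.3, "lm": -6.0}, reference)
accept = strong >= 1.0  # the threshold 1.0 is illustrative and would need tuning
```

Without the normalisation step the language-model score (around -8) would swamp the kNN score (around 0.2), which is why the raw values cannot just be added together.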
Detection relies on single pass clustering. The single pass cluster is important because we are trying to perform real time clustering. Yang gives three ways for the clustering to occur.
The first way is to use GAC (Group Average Clustering). This uses a look-ahead window to cluster, using time as a major feature to aid clustering. A small look-ahead window is also compared with a look-back window to decide whether to merge into an old cluster or form a new cluster.
The other system that they describe uses language model vectors. We then cluster based on estimated probabilities of a story being on target. We also smooth the data set using the expectation maximisation algorithm based on some training data. We then work out an overall likelihood score.
The VB.sys will use a similar set of techniques to those Yang uses to develop the detection.
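Single-pass clustering, as mentioned at the start of this part, can be sketched in a few lines: each incoming story either joins its most similar existing cluster or starts a new one. This is a bare-bones version under my own assumptions (cosine similarity against a running-mean centroid, a fixed illustrative threshold); it leaves out the time windows that GAC uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stories, threshold=0.3):
    """Assign each story, in arrival order, to the best cluster or open a new one."""
    clusters = []  # each cluster: {"centroid": sparse vector, "members": story indices}
    for index, story in enumerate(stories):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(story, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(index)
            n = len(best["members"])
            old = best["centroid"]
            # Update the centroid as a running mean over the cluster members.
            best["centroid"] = {
                term: (old.get(term, 0.0) * (n - 1) + story.get(term, 0.0)) / n
                for term in set(old) | set(story)
            }
        else:
            clusters.append({"centroid": dict(story), "members": [index]})
    return [cluster["members"] for cluster in clusters]


# Toy stream (hypothetical): election and football stories interleaved.
stream = [
    {"election": 1.0, "vote": 1.0},
    {"football": 1.0, "goal": 1.0},
    {"election": 1.0, "poll": 1.0},
    {"goal": 1.0, "match": 1.0},
]
groups = single_pass_cluster(stream)
```

Because every story is looked at once and never revisited, early mistakes stay in the clusters, which is part of why the threshold matters so much.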
First Story Detection
Yang et al. use a novelty score to decide whether there is a new story, that is, how dissimilar it is to its nearest neighbour. We look at the vector space model to find the novelty score. It is very sensitive to the size of the look-back clusters.
This will be used in the VB.sys as it is the best way to find and quantify new clusters.
The remainder of the paper looks at Segmentation and Story Link Detection. There is also a large amount of space spent on multilingual TDT. This gives a good basis for the future work that will be done.
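One simple way to read the novelty score is as one minus the similarity to the nearest earlier story. The sketch below flags a story as a first story when that value is above a threshold, looking back over a bounded window of past stories. The window size, the threshold and the function names are illustrative assumptions on my part.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def novelty(story, history):
    """1 minus similarity to the nearest neighbour; 1.0 when there is no history."""
    if not history:
        return 1.0
    return 1.0 - max(cosine(story, old) for old in history)


def first_story_flags(stories, threshold=0.6, window_size=100):
    """Return one boolean per story: True if it looks like the start of a new topic."""
    history = deque(maxlen=window_size)  # bounded look-back window
    flags = []
    for story in stories:
        flags.append(novelty(story, history) >= threshold)
        history.append(story)
    return flags


# Toy stream (hypothetical).
stream = [
    {"election": 1.0, "vote": 1.0},
    {"football": 1.0, "goal": 1.0},
    {"election": 1.0, "poll": 1.0},
    {"goal": 1.0, "match": 1.0},
]
flags = first_story_flags(stream)
```

Shrinking `window_size` to 1 in this toy stream makes every story look new, which gives a feel for how sensitive the result is to how far back the comparison reaches.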
Weekly tasks: Initial Approaches to consider:
The following is a description of what I will need to do in the next week to get some preliminary work done.
- Obtaining data. This data needs to be the TDT3 or TDT4 corpus for good comparison with the literature. Further, I need to get data from Vision Bytes in the form of tagged (or partially tagged) stories. Most likely this should be done from the DB rather than by manual extraction (so onsite).
- A preliminary clustering algorithm that will cluster in real time. In order to do this I will first need some feature extraction. That is, get the features of the data and place them into some vector space or language model.
- A system that will perform first story detection and track (and rank) the stories.
- Write up background, literature review, requirements and initial work for draft.
Reference paper is Multi-strategy Learning for Topic Detection and Tracking; Yang Y. et al.
The task at hand will be to produce a system that is capable of performing TDT. In conventional TDT there are 5 tasks that are performed. The scope of this project will be limited to 3 of the 5 tasks. These are:
- First Story Detection.
- Story / Topic Detection
- Topic Tracking
The aims of the project are:
- Reproduce a system that is capable of performing the 3 tasks of TDT with equivalent performance to that described by Yang et al. (2002).
- Augment this system in such a way as to introduce checks and assisted learning mechanisms to ensure that the system will perform in a short time with a higher level of accuracy.
- Incorporate the system into the existing Vision Bytes framework.
At this point it is appropriate to talk about what results we are aiming for, that is, to start thinking about the evaluation mechanisms. In the domain of TDT the two metrics that we are most concerned with are:
- False alarm probability: The system says that a story belongs to a certain event or is the first story of a topic, whereas some gold standard (human or annotated text) says it does not.
- Missed Target: The system says NO, while the gold standard says YES.
In general these two quantities are related: a system tuned to accept more stories has a higher false alarm probability but a lower missed detection probability. Thus the system that is developed is a trade-off between missed detections and false alarms. Yang et al. use a combination of several methods in order to overcome this issue.
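The trade-off is easy to see with a few made-up numbers. The sketch below computes the miss and false alarm probabilities for a set of scored stories at a given decision threshold; the scores, labels and function name are invented for illustration.

```python
def error_rates(scores, labels, threshold):
    """Return (P_miss, P_false_alarm) for decisions of the form score >= threshold."""
    decisions = [score >= threshold for score in scores]
    # Miss: gold standard says target, system says no.
    misses = sum(1 for said_yes, is_target in zip(decisions, labels)
                 if is_target and not said_yes)
    # False alarm: gold standard says non-target, system says yes.
    false_alarms = sum(1 for said_yes, is_target in zip(decisions, labels)
                       if said_yes and not is_target)
    targets = sum(1 for is_target in labels if is_target)
    non_targets = len(labels) - targets
    p_miss = misses / targets if targets else 0.0
    p_false_alarm = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_false_alarm


# Hypothetical system scores and gold-standard labels (True = on target).
scores = [0.9, 0.7, 0.4, 0.2]
labels = [True, False, True, False]

strict = error_rates(scores, labels, threshold=0.8)   # few false alarms, more misses
lenient = error_rates(scores, labels, threshold=0.3)  # few misses, more false alarms
```

Moving the threshold from 0.8 down to 0.3 removes the miss but introduces a false alarm, which is exactly the relationship described above.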
The technical aspects of this project deal with the development of a faster system to perform the tasks of TDT. The area of TDT has been dormant for 2 years since the last TDT conference. This project aims to use lessons learned from this time, combined with the latest technology in shallow parsing and vector space models, in order to develop a more reliable and robust system. The shallow parsing refers to quicker and more efficient algorithms in the areas of chunking and POS tagging [need citation]. The improved vector space models refer to models that not only include the words but also metadata, and represent these as sub-vectors [need citation].
From here the next step is to combine the paper by Yang with the ideas from my project in a detailed description. This needs to be done ASAP for Rafael. Tomorrow’s meeting with Dan will concentrate on the 3rd aim, and trying to understand where my system will fit in to the larger project. This will be important in order to see what information I will need to extract and be able to display. I will also be meeting with the graphic designer soon to be able to develop the layout for the system in terms of Vision Bytes requirements. Again this will help with an understanding of the data that I will need to display and work with.
Monday, 30 July 2007
- Topic Detection is a problem of real time clustering (or near real time). Its purpose is to be able to develop clusters of stories in one pass given a small (possibly zero size) history of the cluster.
- Topic detection and first story detection go hand in hand. In fact you can use almost identical algorithms, with the only difference being your threshold for FSD.
- It seems that the best results are obtained using mixture models. That is using several techniques in order to obtain a result.
- The main trade-off that will occur in TDT is between False Alarm Probability and Miss Probability. In order to get a good system we aim to minimise both; however, a system with a low false alarm rate generally has a higher miss rate. The reason can be seen in the way they generalise.
- Most of the clustering seems to use a vector space model, and then kNN clustering techniques.
- A main issue that I need to look into is run time. That is, ensure that the system works within an acceptable time limit, while at the same time doing its job.
- I also need to ensure that this is a technical RESEARCH PROJECT that has some interesting research as well as a system that is business orientated. This will become clearer after I type up the initial system approach and include references. These will then make sure I'm heading in the right technical direction for the project.
This seems like a good idea as then I will at least have a benchmark of where to go. The paper that I will use is in 'Topic Detection and Tracking: Event Based Information Organization', from CMU. It is Chapter 5 (Multi-strategy Learning for Topic Detection and Tracking, Yang et al.)
The reason I chose this paper is that it covers all the areas that I will deal with, as well as having a quite detailed history of the way it has been constructed. The results and evaluation methodology are also well reported.
Overall, the project I will be doing will concentrate on 3 tasks involved in TDT, all of which are described in detail. These tasks are:
- First Story Detection
- Topic Detection (Clustering)
- Topic Tracking
I will also be obtaining a laptop from Dan tomorrow that I will be able to use for the duration of the project. This will be my working laptop, which I will be able to use at Vision Bytes. I will be going there more often from next week.
The full text of the article and the way that I will attack the problems will be up tomorrow.
Further, I'm reinstalling Linux at home :-) for both EBUS COSC and Thesis. I might even consider putting it on the laptop, but I'm not sure. Think I'll keep this running Vista for the sanity at work.
The purpose of clustering is to give structure to some unlabelled data. That is, we want to organise similar data into groups (clusters). Clusters are groups of objects that are similar, so the similarity measure is an important part of clustering. For any clustering system we require:
- Able to deal with various attributes
- Can cluster regardless of the shape of the cluster
- Minimal requirements for domain knowledge
Once we form the vector we need to find the distance between two topics, or between a topic and some centroid. The centroid is the centre of a cluster and can be useful when clustering on the fly. The simplest distance that can be used is the Euclidean distance; however, more complex distance / similarity measures are often required. Other considerations are the units of the vectors and the scaling that will need to take place in order to make the clusters well formed.
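A small example shows why the choice of distance and the scaling matter. Below, a short story and a long story use the same words in the same proportions: plain Euclidean distance treats them as far apart, while cosine distance (or Euclidean distance after scaling to unit length) treats them as the same. The vectors and function names are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse vectors (term -> weight)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def cosine_distance(a, b):
    """1 minus cosine similarity; ignores vector length, only direction matters."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if not norm_a or not norm_b:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def normalise(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else dict(vec)


# Same word proportions, very different story lengths (hypothetical counts).
short_story = {"election": 1.0, "vote": 1.0}
long_story = {"election": 10.0, "vote": 10.0}

raw_gap = euclidean(short_story, long_story)
angle_gap = cosine_distance(short_story, long_story)
scaled_gap = euclidean(normalise(short_story), normalise(long_story))
```

For news text, where story length varies a lot between a headline update and a full report, this is one argument for cosine-style measures or for normalising before clustering.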
More on the algorithms and distance measures in next summary.
Wednesday, 25 July 2007
Chapter 4: Topic Detection and Tracking: Event Based Information Organisation (2002)
This paper deals with treating the TDT tasks using a probabilistic approach. The power of a probabilistic model is that it is able to give you a range of choices, and here human interaction with the system can allow us to turn a once unsupervised task into a semi-supervised task. The underlying model is mathematically derived and so is much more likely to be able to deal with new unseen data than an ad hoc model that may only work for seen data (in the training set).
The first thing to be discussed is the different methods employed to measure story-topic similarity. This is the cornerstone of both detection and tracking tasks. There are 4 main methods that they use in order to show the similarity between stories and topics. At the end of the day, however, these must be normalised in order to make some meaningful comparison.
- Probabilistic Topic Spotting (TS) Measure: Here we assume that the words in the story represent a Hidden Markov Model (HMM).
Here the probability of the next word is calculated assuming the story is 'on topic', and whether this has been seen before can be tested.
They use a log score to ensure there is no underflow. Equation 4.1 uses Bayes Rule to develop the model.
- Probabilistic Information Retrieval (IR) measure: Given some training scores we look at the probability that a story is relevant given a query about the story and topic.
The measure is based on the conditional probability that a certain story will be able to generate a query based on the words present.
- Probabilistic IR with Relevance Feedback measure: This is similar to the above method with the addition of a feedback system that adds words to the query.
This feedback system can also be weighted according to the query and the context. This model takes more features about the words when making the decisions.
- Probabilistic Word Presence Measure: This looks for certain required words. That is there are always certain words that will be able to indicate the topic.
This method can be risky with a small number of stories since there is not enough evidence for the system to make judgements.
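The point about log scores in the first measure above is worth seeing in numbers. Multiplying many small word probabilities quickly underflows to zero in floating point, whereas summing their logarithms stays well behaved. The probabilities below are invented purely to show the effect.

```python
import math


def product_score(word_probs):
    """Naive likelihood: multiply the word probabilities directly."""
    return math.prod(word_probs)


def log_score(word_probs):
    """Log-likelihood: sum of log probabilities, safe from underflow."""
    return sum(math.log(p) for p in word_probs)


# A hypothetical 100-word story where every word has probability 1e-5.
word_probs = [1e-5] * 100

naive = product_score(word_probs)  # true value 1e-500 is too small for a float
stable = log_score(word_probs)     # about -1151.3
```

Two stories that both underflow to 0.0 cannot be ranked against each other, while their log scores still can.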
Once we can normalise the scores we can combine them to give a more accurate measure of similarity. BBN have used a linear regression to combine the 4 measures, once they have been normalised.
Tracking and detection are similar systems. They both use the story-topic similarity measures.
With Tracking we cannot totally rely on vocabulary, because as a story evolves words are removed and added. One important part of the overall system in terms of tracking is time. Time gives a good measure of being on topic or not. But as always there will be exceptions.
With Detection, BBN use a modified incremental k-means algorithm to cluster the stories and, if need be, create new clusters and re-evaluate.
I am already slightly behind as I need to type up most of the reading material onto the blog; this will be done today and tonight. I will then be back on track, and will finish the UML diagrams and also the requirements document.
Monday, 23 July 2007
Chapter 2: Topic Detection and Tracking: Event Based Information Organisation (2002)
Here we define some important terms:
- Topic: A seminal event or activity, along with all directly related events and activities.
- Story: A story is on topic if it discusses events or activities that are directly connected to the topic's seminal event.
The next section talks about the evaluation metrics that are used in order to test the system that is built. Each of the 5 tasks is evaluated as a detection task. For each task there is some input data and a hypothesis, and the system needs to decide whether the hypothesis is true.
For each task we can consider binary decisions, that is, something is either a target or a non-target. The terminology that we use to evaluate the system is as follows:
- If the reference annotation and the system response agree (that is, either both say it is a target or both say it is not a target) then we say we are correct.
- If the system response is not a target but we have an annotation saying it is a target, then we have a 'missed detection'.
- If the system response is a target but the annotation says not a target, we say we have a 'false alarm'.
Chapter 1: Topic Detection and Tracking: Event based Information Organization (2002)
This first chapter gives an overview of TDT (Topic Detection and Tracking). TDT is like other IR (information retrieval) tasks; however, one main issue is the lack of existing stories.
It gives a definition of the words topic and event, both of which need to be understood before we go any further. A topic is a 'set of news stories that are strongly related to some seminal real-world event'. The event itself can be thought of as a 4-vector (space and time) that will trigger a topic.
The next important distinction is between a topic, an event and a subject. A subject is what the story is about. The system that we will be looking at is event based rather than subject based, so we will always be looking at things that have some event that triggers the topic. Another important factor is that event-based systems are temporal: the seminal event is anchored at some time, and topics evolve over time.
We then discuss the 5 tasks that relate to TDT:
- Story Segmentation: This involves dividing some continuous news stream into individual stories. This is a problem of finding boundaries between news stories.
- First Story Detection: This is trying to find the onset of a new topic in some stream.
- Cluster Detection: This is when we group stories as they arrive into groups of similar topics.
- Tracking: This involves finding additional stories given some small samples.
- Story Link Detection: This is looking at whether two given stories are topically linked.
Tuesday, 3 July 2007
The first thing is I'll be making a plan for the project. This will include a Gantt chart and the resource management.
The first will be a draft and will be reworked.
Monday, 21 May 2007
Wednesday, 16 May 2007
Tuesday, 15 May 2007
The name of the topic seems to be almost finalised (yay, cause now I can tell people what I'm doing)
"TV News Topic Detection and Tracking"
I also met Jorge, a PhD student who is also doing some text mining but in the realm of e-learning, which may be useful later.
Anyway, the topic for the project itself is a TDT (Topic Detection and Tracking) topic. So the two things that I'll need to look into over the next few weeks are:
- Clustering of data (i.e. placing text documents into n-dimensional vector spaces)
- The detection of new and old stories.
Then after this we also consider it as a time-evolving problem and see how news stories come in over time. We notice that there is a lot of information in the first three days and then not much information after.
Some other points that Rafael suggested are:
- Look at the research group at CMU (http://nyc.lti.cs.cmu.edu/clair/new/)
- Look at TREC where they have been doing this sort of analysis for some time (http://trec.nist.gov/)
- Also try and get hold of the Reuters RCV1 data, which we can use as the benchmark for my further work. Will need to get it from uni as it's 2.5GB uncompressed (but tagged!) http://trec.nist.gov/data/reuters/reuters.html
Monday, 14 May 2007
- The topic's main aim is to track a news story over time, and to provide this information to the clients as part of the current site.
- Currently the stories are viewed by program e.g. Sunrise and all the stories. Instead this system will be viewed by story and will then show you all the stories that are related AND / OR updates on the same story.
- There will need to be a distinction made between a repeat story (e.g. the story is broadcast on Sunrise and then the same story, or maybe almost the same, is then broadcast again on the morning news.) and a story with new information. Both would need to be displayed and tagged appropriately.
- There are 5 real pieces of data that I will have access to and will have to use to perform all of the tasks:
- The title
- The quick description
- The full text (maybe)
- The date and time
- The duration
- Currently the system is built in C#; however, they are bringing a new J2EE and Hibernate based system into live use in June, so it would be best to work with this system, as it is their long-term platform. Also, I will have access to the existing database and so won't need to replicate it. However, I may need to add certain new data structures in order to complete the task.
- The system will involve some human interaction at some point in order to verify any tasks that it has done, e.g. Are two stories related, here we could use a learning algorithm to improve performance. However the main goal would be to minimise this.
- There should be both similar stories (broad topics such as Iraq) and stories on the same event and how it develops (e.g. a particular kidnapping).
Dan also said that he would organise meetings with their architect and their engineers to explain how their systems work. Will need to do this soon.
So as the plan goes we now start doing some research into the topic (mainly scientific journals)
Sunday, 13 May 2007
The following is a basic outline of initial ideas about the direction of the project. They range from requirements to technical and business aspects that won't come up for a while.
High Level ideas:
- The tracking of a given news event over time.
- Identifying the topic.
- Looking at the event over different TV stations.
- Providing an interface for clients to be able to retrieve this information.
- Which stories to track
- How to get the events (automated or use human interaction)
- Distinguishing between new events and old events (i.e. same topic but different event, best example would be Iraq is it a new story on it or a continued story)
- Where will clients view this data and how will they access it.
- Where in the production cycle will this fit.
- Technologies to be used i.e. Language, tools to be used [Java, J2EE]
- Version control (Subversion)
- Database to use. [PostgreSQL, MSQL]
- Mix of technical and theoretical components [How much of the thesis should be about the theory that is being used, and how much should be about building the working system itself to be used by
The general area of the topic is: News Topic tracking over time. That is looking at how news stories evolve over time, and more importantly how they are covered.
The topic is with Vision Bytes, a company that already has access to a large variety of textual data.