Monday 27 August 2007

partial draft

I updated Dan today about what's going on.

Also, I need to get some more documentation down for the parts of the system that are done. I also need to start looking at long-term integration and start ensuring I'm using samples etc.

Friday 24 August 2007

long time no post

So recently I have been busy with other subjects, and with having an epileptic attack and all.

The partial draft is due soon, and I need the following done for that:
  • The introduction and background should be done.
  • The development chapter should contain what I have done so far plus a skeleton of what I'm doing.
In terms of whats been done:
  • Introduction is done
  • Have a lit review and background of where I'm coming from; however, I need to add some things about Vision Bytes' perspective, as well as technical details.
  • Development needs to be written up and a skeleton provided.
The plan is to get it done by Monday, show Rafael, and get some comments.

Monday 6 August 2007

Update on getting TDTx

Went to Fisher today and talked to Peter McNeice about getting the TDT3. He said that I may only get access via the old system that John Patrick holds, so he is going to look into it and see if I can get access.

Hopefully he'll email me ASAP so I can get hold of these files and start doing some proper work. Also, I'm meeting Waleed on Thursday and getting some data in the form of objects. I need to have something to show for this.

Clustering Tools

Today in the EBUS lab Rafael suggested I use a clustering tool called Carrot2. This was mainly because Weka isn't good enough for this task; Carrot2 is made for web applications.
http://project.carrot2.org


Carrot2 is integrated into many things, and has the advantage that it includes some visualisation. This could be useful for having something to show before I meet with the design guy.

On another note, I think I need to serve the Linux box from Intuition. It seems most logical, since the net there is fast and I can set up port forwarding etc.

Sunday 5 August 2007

Getting Something Going

When I looked into getting the system going a bit further I realised something: it really isn't worth using beans at this early stage. Rather, I should get some foundation system working as a standalone project before integrating it with bean technology.
However, I should keep in mind that I will need to do this eventually, so things like data connectivity and what is doing what need to be kept in mind.

Thursday 2 August 2007

Update on getting TDTx

I was contacted by a member of the library staff, Peter McNiece, who gave me an account for the LDC. However, I need to get the corpus as a physical copy and am unable to download it, so I've emailed him and will wait for a reply.
It may also be worthwhile getting some of the Vision Bytes data into some format, say 10-20 tagged stories.
In the short term I may need to use XSLT to transform the LDC data into the Vision Bytes format, but I'll wait till I see both in their final forms. A sketch of how that transformation might be driven from Java is below.
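Should it come to that, this is roughly how I'd apply a stylesheet using the standard JAXP API that ships with Java. The file names are placeholders, since neither format is settled yet.

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Apply a (hypothetical) LDC-to-Vision-Bytes stylesheet to one story file.
public class LdcToVb {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("ldc-to-vb.xsl"));
        t.transform(new StreamSource("ldc-story.xml"),
                    new StreamResult("vb-story.xml"));
    }
}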

Tech meeting with Vision Bytes

Today I had a meeting with Dan and Waleed about the current structure of the system and where my system will fit into it.

Waleed is in charge of the innovations team and also worked on the Story Segmenter, which my system will need to talk to. Here is an overview of the meeting along with some technical descriptions and a plan of attack.

The system that I will need to know about comprises 4 distinct areas.
  1. Acquisition System: Takes the video stream, audio stream, and caption text and records them into the database. It does this line by line, so the raw caption text is located within the database. They also save other details about the caption text which may be of use (though currently I don't think they will be).
  2. Database System: The database is a SQL Server instance that simply acts as the repository.
  3. Story Segmenter: Here the stories that have been recorded can be further segmented (in the case of a news program). This is also where all the metadata is added. The metadata is added manually by a team of reporters who go through the story and add time-in and time-out stamps as well as titles and descriptions (the descriptions are somewhat automated).
  4. Web Interface (DCP): Displays the information from the database to the user.
The information I need will be:
  • Title information (full text title)
  • Timing information (when and how long it is)
  • Captions (raw caption data)
  • Program information (channel and program name)
These are all located within beans. Different components of the system access these beans and display the relevant information differently. Waleed explained that I will probably need to use 2 or 3 of the beans to get the information I need.
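To fix ideas for myself, here is a hypothetical sketch of the kind of bean I expect to be working with. The field names are my guesses, not Vision Bytes' actual API.

import java.io.Serializable;
import java.util.Date;

// Hypothetical story bean -- a stand-in for whatever Vision Bytes actually uses.
public class StoryInfo implements Serializable {
    private String title;        // full text title
    private Date airTime;        // when the story went to air
    private long durationSecs;   // how long it ran
    private String captions;     // raw caption data
    private String channel;      // channel name
    private String programName;  // program name

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getCaptions() { return captions; }
    public void setCaptions(String captions) { this.captions = captions; }
    // remaining getters and setters omitted for brevity
}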

Further, he said that a lot of the information I need is already in a Java object (Speech Object), which I can use to get an overall idea of the system and build a system that will be easier to integrate in the long run.

Early next week I'll be meeting with Waleed to get a copy of this object as well as a snapshot of their database, to ensure the data formats are correct and I can begin developing with their formats in mind.

The question of when this will run was also raised. Several options were discussed, but the most favourable one was the use of a message or queue bean to tell my system that a new story has been completed and that the classifier should be run over it. Alternatively, it may simply run hourly or so. It may also be beneficial for them to run it over all their existing data, so computational cost in the Java is important and needs to be investigated further.
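The queue-bean option would look something like the following message-driven bean. The queue name and message contents are assumptions on my part, and the classifier call is a placeholder.

import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// Fires the classifier whenever the segmenter announces a completed story.
@MessageDriven(mappedName = "jms/NewStoryQueue")  // hypothetical queue name
public class NewStoryListener implements MessageListener {
    public void onMessage(Message message) {
        try {
            if (message instanceof TextMessage) {
                String storyId = ((TextMessage) message).getText();
                // classifier.classify(storyId);  // placeholder call
            }
        } catch (javax.jms.JMSException e) {
            e.printStackTrace();
        }
    }
}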

From here, my aim by Monday is to get something in Java that uses beans to communicate, and that can read information, populate some objects, and then run simple statistical workings over those objects. This will make sure I get beans back into my head, and also give me some initial stats about the data that I will be using.
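The "simple statistical workings" need be nothing fancier than term frequencies over the raw captions, along these lines:

import java.util.HashMap;
import java.util.Map;

// Toy statistics pass: term frequencies over raw caption text.
public class CaptionStats {
    public static Map<String, Integer> termFrequencies(String captions) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String token : captions.toLowerCase().split("\\s+")) {
            if (token.length() == 0) continue;
            Integer count = tf.get(token);
            tf.put(token, count == null ? 1 : count + 1);
        }
        return tf;
    }

    public static void main(String[] args) {
        // prints the counts, e.g. {the=2, news=1, weather=1} (order may vary)
        System.out.println(termFrequencies("the news the weather"));
    }
}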

Wednesday 1 August 2007

Technical Description of Project

The project will be compared to the results presented by Yang et al. (2004). We will restrict the system to tracking, detection, and first story detection. Yang et al.'s approaches are described in detail below. Commentary is also provided about their suitability for this project, and possible alterations are given.

Tracking

Section 3 deals with topic and event tracking. This is the core of the system that I will be developing. The approach used by Yang et al. involved using a variety of IR (information retrieval) techniques. These include:
  • kNN (k-nearest neighbours)
  • D-Trees (decision trees)
  • Rocchio
  • LM (language modelling)
Further, they combine these methods using BORG (Best Overall Result Generator).

One issue that is raised is the tuning of parameters for each method. Each method has its own intrinsic parameters that need to be tuned for best performance. The risk is that you may tune for a given set, but this will inevitably reduce performance on unseen sets. The solution is the BORG model, which combines data from many sources and makes a judgement; in addition, each IR technique is tuned on one set of data and evaluated on another. The other issue this addresses is that there are few positive training examples, which comes from the nature of the problem: news stories do not last very long.

The system that I will develop (hereafter VB.sys) may or may not use a concept similar to BORG. It depends solely on the computation time that each method adds to the system. VB.sys will more than likely involve parameter optimisation, so the data sets need to be split into training and evaluation sets.

Each of the IR techniques is now discussed in turn:

Rocchio
Rocchio uses a vector space description of each given story. The vector space representation is the weights of all words within the story; Yang uses a common IR version of TF-IDF for these weights. From this we calculate the centroid of a given cluster, which is a strong representation of the cluster. To calculate it we take the n top-ranking stories and compute their centroid, then simply calculate the cosine between the centroid and the story in question.
In VB.sys I will not be using the Rocchio method, for the reason described in [13 in paper]: if the centroid is not well formed, classification and clustering are neither reliable nor accurate.
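For reference, the core similarity computation is just a cosine between sparse TF-IDF weight maps. A minimal sketch, with the term-to-weight map representation being my own choice:

import java.util.Map;

// Cosine similarity between two sparse TF-IDF vectors held as term->weight maps.
public class Cosine {
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // shared terms only
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0)
                ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}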

kNN
k-nearest neighbours is an instance-based classification method. It is much better suited to the task at hand as it does not rely on a centroid. Yang introduces slight variations to traditional kNN due to the small number of positive training examples: zones around the story being tested are used in order to classify it. The reason this improves on standard kNN is that it also uses negative examples, meaning the overall body of evidence is much larger, so we can get a better result with few positive stories.
In VB.sys we will definitely use kNN or the augmented kNN algorithms. This is because they work well with few positive examples and do not rely on the presence of a well-defined centroid like the Rocchio method does. A toy version of the idea is sketched below.
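This is my reading of the positive/negative evidence idea, not the paper's exact formulation: on-topic neighbours add to the score and off-topic neighbours subtract from it, so sparse positive evidence is still usable.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy augmented-kNN scorer over precomputed story similarities.
public class KnnTracker {
    static class Neighbour {
        final double similarity;   // e.g. cosine to the test story
        final boolean onTopic;     // label from the training data
        Neighbour(double s, boolean t) { similarity = s; onTopic = t; }
    }

    public static double score(List<Neighbour> neighbours, int k) {
        List<Neighbour> sorted = new ArrayList<Neighbour>(neighbours);
        Collections.sort(sorted, new Comparator<Neighbour>() {
            public int compare(Neighbour a, Neighbour b) {
                return Double.compare(b.similarity, a.similarity);
            }
        });
        double score = 0;
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            Neighbour n = sorted.get(i);
            score += n.onTopic ? n.similarity : -n.similarity;
        }
        return score;  // compare against a tuned threshold
    }
}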

Language Modeling
Yang borrows the techniques used by the BBN system to implement the language modelling classifier. The BBN topic spotter is a Naive Bayes classifier that uses smoothing to ensure unseen events are treated correctly. Other techniques that could be used include KL-divergence and hidden Markov models.
In VB.sys we will use some form of language modelling, because language modelling is well described in the literature and appears to be a good technique for classifying this data. One issue is that it may be processor intensive.
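A minimal smoothed unigram model, to make the idea concrete. I've used add-one (Laplace) smoothing here as a stand-in; the BBN system's actual smoothing scheme may well differ.

import java.util.HashMap;
import java.util.Map;

// Add-one-smoothed unigram topic model: scores log P(story | topic).
public class UnigramTopicModel {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private int total = 0;
    private final int vocabularySize;

    public UnigramTopicModel(int vocabularySize) {
        this.vocabularySize = vocabularySize;
    }

    public void addTrainingToken(String token) {
        Integer c = counts.get(token);
        counts.put(token, c == null ? 1 : c + 1);
        total++;
    }

    public double logLikelihood(String[] storyTokens) {
        double logP = 0;
        for (String token : storyTokens) {
            Integer c = counts.get(token);
            int count = (c == null) ? 0 : c;
            // smoothing keeps unseen words from zeroing out the score
            logP += Math.log((count + 1.0) / (total + vocabularySize));
        }
        return logP;
    }
}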

BORG
The BORG track then uses a z-score to combine these values and see how far the current story is from the mean of a cluster. It then decides whether or not to add the story to that cluster.

Each of the systems described uses some threshold; that is, they make binary decisions based on that threshold. The tuning of this threshold is also an important consideration in ensuring we get a worthwhile system.
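As I understand it, the z-score step is just standard normalisation, which puts the different classifiers' raw scores on a comparable scale before the threshold decision:

// z-score a raw classifier output against the recent score distribution,
// so differently scaled methods can be combined and thresholded together.
public class ZScore {
    public static double zScore(double raw, double[] recentScores) {
        if (recentScores.length == 0) return 0;
        double mean = 0;
        for (double s : recentScores) mean += s;
        mean /= recentScores.length;
        double variance = 0;
        for (double s : recentScores) variance += (s - mean) * (s - mean);
        double stdDev = Math.sqrt(variance / recentScores.length);
        return stdDev == 0 ? 0 : (raw - mean) / stdDev;
    }
}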

Topic Detection

Detection relies on single-pass clustering. Single-pass clustering is important because we are trying to perform real-time clustering. Yang gives three ways for the clustering to occur.
The first way is to use GAC (Group Average Clustering). This uses a look-ahead window to cluster, with time as a major feature to aid the clustering. A small look-ahead window is also compared with a look-back window, to either merge the story into an old cluster or leave it as its own cluster.

The other system they describe uses language model vectors. We then cluster based on estimated probabilities of an on-target story. We also smooth the data set using the expectation maximisation algorithm, based on some training data, and then work out an overall likelihood score.

VB.sys will use a set of techniques similar to those Yang uses for detection.

First Story Detection

Yang et al. use a novelty score to decide whether there is a new story; that is, how dissimilar it is to its nearest neighbour. We use the vector space model to find the novelty score. It is very sensitive to different sizes of the look-back clusters.
This will be used in VB.sys as it is the best way to find and quantify new clusters.
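My paraphrase of the idea, not the paper's exact formula: a story d counts as novel when it is far from everything in the look-back window W, e.g.

    novelty(d) = 1 - max over c in W of cos(d, c)

with a first story declared when novelty(d) exceeds a tuned threshold.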


The remainder of the paper looks at segmentation and story link detection. There is also a large amount spent looking into multilingual TDT. This gives a good basis for the future work that will be done.

work for next week (and a bit)

Weekly tasks: initial approaches to consider:

The following is a description of what I will need to do in the next week to get some preliminary work done.

  1. Obtaining data. This data needs to be the TDT3 or TDT4 corpus for good comparison with the literature. Further, I need to get data from Vision Bytes in the form of tagged (or partially tagged) stories. Most likely this should be done from the DB rather than by manual extraction (so onsite).
  2. A preliminary clustering algorithm that will cluster in real time. In order to do this I will first need some feature extraction; that is, get the features of the data and place them into some vector space or language model.
  3. A system that will perform first story detection and track (and rank) the stories.
  4. Write up background, literature review, requirements and initial work for draft.

Approach to TDT for TV news broadcast data.

The reference paper is Multi-Strategy Learning for TDT, Yang Y. et al.

The task at hand will be to produce a system that is capable of performing TDT. In conventional TDT there are 5 tasks that are performed. The scope of this project will be limited to 3 of the 5 tasks. These are:
  1. First Story Detection.
  2. Story / Topic Detection.
  3. Topic Tracking.

The aims of the project are:
  1. Reproduce a system that is capable of performing the 3 tasks of TDT with performance equivalent to that described by Yang et al. (2002).
  2. Augment this system in such a way as to introduce checks and assisted learning mechanisms to ensure that the system will perform in a short time with a higher level of accuracy.
  3. Incorporate the system into the existing Vision Bytes framework.

At this point it is appropriate to talk about what results we are aiming for; that is, to start thinking about the evaluation mechanisms. In the domain of TDT, the two metrics that we are most concerned with are:
  • False alarm probability: the system says that a story belongs to a certain event, or is the first story on a topic, whereas some gold standard (human or annotated text) says it is not.
  • Missed target: the system says NO, while the gold standard says YES.

In general these two quantities are related: having a high false alarm probability results in a lower missed detection rate. Thus the system that is developed is a trade-off between missed detections and false alarms. Yang et al. use a combination of several methods in order to overcome this issue.
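For the record, the standard TDT evaluations fold these two error rates into a single detection cost, which (as I remember it; worth double-checking against the TDT evaluation plan) has the form

    C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)

where the C terms are fixed cost weights and P_target is the prior probability of a story being on target.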

The technical aspects of this project deal with the development of a faster system to perform the tasks of TDT. The area of TDT has been dormant for 2 years, since the last TDT conference. This project aims to use lessons learned over this time, combined with the latest technology in shallow parsing and vector space models, in order to develop a more reliable and robust system. Shallow parsing refers to quicker and more efficient algorithms in the areas of chunking and POS tagging [need citation]. The improved vector space models refer to models that include not only the words but also metadata, representing these as sub-vectors [need citation].

From here the next step is to combine the paper by Yang with the ideas from my project in a detailed description. This needs to be done ASAP for Rafael. Tomorrow's meeting with Dan will concentrate on the 3rd aim: trying to understand where my system will fit into the larger project. This will be important in order to see what information I will need to extract and be able to display. I will also be meeting with the graphic designer soon to develop the layout for the system in terms of Vision Bytes' requirements. Again, this will help with an understanding of the data that I will need to display and work with.