eng project 07: meeting with rafael

I met with Rafael today to get the project confirmation signed (I still need to hand it in, though). I also got a lot of information about how I need to approach the topic; the summary is below.

The name of the topic seems to be almost finalised (yay, 'cause now I can tell people what I'm doing):

"TV News Topic Detection and Tracking"

I also met Jorge, a PhD student who is also doing some text mining, but in the realm of e-learning, which may be useful later.

Anyway, the project itself is a TDT (Topic Detection and Tracking) topic. So there are two things that I'll need to look into over the next few weeks:

Clustering of data (i.e. placing text documents into n-dimensional vector spaces)
The detection of new and old stories.
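To get a feel for the first point, here is a rough sketch of how text documents could be placed into a vector space and compared. It is only my illustration, not something Rafael prescribed: the TF-IDF weighting, the whitespace tokeniser, and the function names (`tokenize`, `tfidf_vectors`, `cosine`) are all assumptions chosen for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Very rough tokeniser (an assumption): lowercase, split on whitespace.
    return text.lower().split()

def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} vector.

    Each distinct term is one dimension of the vector space.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so that weights are never zero.
        vectors.append({
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "election results announced today",
    "election results disputed",
    "football final tonight",
]
vecs = tfidf_vectors(docs)
# The two election stories should be closer to each other
# than either is to the football story.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

Documents that end up close together in this space (high cosine similarity) are candidates for the same cluster.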

So we have a clustering problem as shown below:

After this, we also treat it as a time-evolving problem and look at how news stories arrive over time. We notice that there is a lot of information in the first three days and then not much after that.
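The "new versus old story" idea combined with the time aspect could be sketched as a single pass over a stream of stories. This is a minimal sketch under my own assumptions: the similarity threshold of 0.3, the three-day window (borrowed from the observation above), plain word counts instead of a proper weighting, and the name `track_stories` are all hypothetical choices, not the method we will end up using.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Plain word counts (an assumption; a real system would weight terms).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def track_stories(stream, threshold=0.3, window_days=3):
    """Label each (day, text) item as a new story or part of an old one.

    `stream` must be sorted by day. Returns (day, topic_id, is_new) tuples.
    `threshold` and `window_days` are illustrative values only.
    """
    topics = []   # each topic: {"id", "centroid", "last_seen"}
    results = []
    for day, text in stream:
        vec = bag_of_words(text)
        best, best_sim = None, 0.0
        for topic in topics:
            # Ignore topics that have been quiet for longer than the window.
            if day - topic["last_seen"] > window_days:
                continue
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            # Old story: fold the document into the topic's running counts.
            best["centroid"].update(vec)
            best["last_seen"] = day
            results.append((day, best["id"], False))
        else:
            # New story: start a fresh topic.
            topic_id = len(topics)
            topics.append({"id": topic_id, "centroid": Counter(vec),
                           "last_seen": day})
            results.append((day, topic_id, True))
    return results

stream = [
    (0, "earthquake hits city"),
    (1, "earthquake city rescue continues"),
    (1, "football final tonight"),
    (10, "earthquake hits city"),
]
for row in track_stories(stream):
    print(row)
```

Note one design choice here: a story that reappears after the window has expired is treated as a new topic rather than matched to the old one. Whether that is the right behaviour is one of the things I'll need to read up on.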

Some other points that Rafael suggested are:

Look at the research group at CMU (http://nyc.lti.cs.cmu.edu/clair/new/)
Look at TREC, where they have been doing this sort of analysis for some time (http://trec.nist.gov/)
Also try to get hold of the Reuters RCV1 data, which we can use as the benchmark for my further work. I will need to get it from uni as it's 2.5GB uncompressed (but tagged!): http://trec.nist.gov/data/reuters/reuters.html

So I have some direction now! I will read up on clustering and TDT in the literature over the next week or two and update the blog on these topics.

Tuesday, 15 May 2007
