The name of the topic seems to be almost finalised (yay, because now I can tell people what I'm doing):
"TV News Topic Detection and Tracking"
I also met Jorge, a PhD student who is also doing some text mining, but in the realm of e-learning, which may be useful later.
Anyway, the project itself is a TDT (Topic Detection and Tracking) topic, so there are two things I'll need to look into over the next few weeks:
- Clustering of data (i.e. placing text documents into n-dimensional vector spaces)
- The detection of new and old stories.
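To make these two points a bit more concrete for myself, here is a minimal sketch of how they might fit together. It is not the method we have settled on, just one common baseline: each document becomes a TF-IDF vector (one dimension per word), and a story counts as "new" if its cosine similarity to every earlier story is below a threshold. The function names, the threshold of 0.2 and the toy headlines are all my own assumptions for illustration. It also cheats by computing word weights over the whole batch, which a real online system could not do.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    # Map each document to a sparse vector {word: weight}.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each word once per doc
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so every weight stays positive.
        vectors.append(
            {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(weight * b[w] for w, weight in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_new_stories(docs, threshold=0.2):
    # Return one flag per document, in arrival order: True means "new story".
    # threshold is an assumed value and would need tuning on real data.
    vectors = tfidf_vectors(docs)
    flags = []
    for i, vec in enumerate(vectors):
        best = max((cosine(vec, earlier) for earlier in vectors[:i]), default=0.0)
        flags.append(best < threshold)
    return flags


# Toy headlines (made up), in the order they "arrive".
headlines = [
    "Earthquake hits city centre",
    "Earthquake damage in city centre grows",
    "Election results announced tonight",
]
print(detect_new_stories(headlines))
```

The second headline shares most of its words with the first, so it gets flagged as an old story, while the other two have nothing similar before them and get flagged as new.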
After this we will also treat it as a time-evolving problem and see how news stories develop over time. We notice that there is a lot of information in the first three days of a story and not much after that.
Some other points that Rafael suggested are:
- Look at the research group at CMU (http://nyc.lti.cs.cmu.edu/clair/new/)
- Look at TREC where they have been doing this sort of analysis for some time (http://trec.nist.gov/)
- Also try to get hold of the Reuters RCV1 data, which we can use as the benchmark for my further work. I will need to get it from uni as it's 2.5GB uncompressed (but tagged!): http://trec.nist.gov/data/reuters/reuters.html