- Topic detection is a real-time (or near-real-time) clustering problem. Its purpose is to build clusters of stories in a single pass, given only a small (possibly empty) history for each cluster.
- Topic detection and first story detection (FSD) go hand in hand. In fact, you can use almost identical algorithms, the only difference being the threshold you use for FSD.
- It seems that the best results are obtained using mixture models, that is, by combining several techniques to obtain a result.
- The main trade-off in TDT is between false alarm probability and miss probability. A good system aims to minimise both, but lowering one generally raises the other: a system with a low false alarm rate generally has a higher miss rate. The reason can be seen in the way such systems generalise.
- Most of the clustering approaches seem to use a vector space model, followed by KNN clustering techniques.
- A main issue that I need to look into is run time. That is, I need to ensure that the system runs within an acceptable time limit while still doing its job.
- I also need to ensure that this is a technical RESEARCH PROJECT that contains some interesting research, as well as a system that is business orientated. This will become clearer after I type up the initial system approach and include references. These will then make sure I'm heading in the right technical direction for the project.
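
To make the notes above concrete, here is a minimal sketch of single-pass clustering with a first-story threshold. It rests on several assumptions that are not from anything I have read: a plain bag-of-words term-frequency vector as the vector space model, cosine similarity, and assignment to the nearest cluster centroid (swapped in for KNN to keep the example short). The function names and the toy stories are made up for illustration. A story whose best similarity falls below the threshold starts a new cluster and is flagged as a first story; the last function measures the miss and false alarm rates of those flags.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words term-frequency vector (a deliberately simple model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(stories, threshold):
    """Cluster stories in one pass, in arrival order.

    Returns (cluster id per story, first-story flag per story).
    """
    centroids = []  # one summed term vector per cluster (hypothetical choice)
    labels, is_first = [], []
    for text in stories:
        vec = vectorise(text)
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            # Nothing similar enough: start a new cluster, flag a first story.
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
            is_first.append(True)
        else:
            # Similar enough: fold the story into the existing cluster.
            centroids[best].update(vec)
            labels.append(best)
            is_first.append(False)
    return labels, is_first


def error_rates(predicted_first, true_first):
    """Return (miss rate, false alarm rate) for first-story decisions."""
    pairs = list(zip(predicted_first, true_first))
    misses = sum(1 for p, t in pairs if t and not p)
    false_alarms = sum(1 for p, t in pairs if p and not t)
    n_first = sum(1 for t in true_first if t)
    n_other = len(true_first) - n_first
    miss_rate = misses / n_first if n_first else 0.0
    fa_rate = false_alarms / n_other if n_other else 0.0
    return miss_rate, fa_rate


# Toy data, invented for this sketch: two topics, two stories each.
STORIES = [
    "earthquake hits city centre",
    "earthquake city rescue teams",
    "election results announced today",
    "election results recount",
]
TRUE_FIRST = [True, False, True, False]

if __name__ == "__main__":
    for threshold in (0.0, 0.3, 0.9):
        labels, flags = single_pass(STORIES, threshold)
        print(threshold, labels, error_rates(flags, TRUE_FIRST))
```

On this toy data the threshold moves the two error rates in opposite directions. A threshold of 0.0 merges every story into one cluster and misses the second first story, while 0.9 flags every story as new, so every follow-up story becomes a false alarm. A threshold of 0.3 happens to separate the two topics cleanly here, which says nothing about real news streams.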
Monday, 30 July 2007
Things I Need to Write Down
Here are a few things that I've read or come across that I need to get down before I forget.