Monday, 21 May 2007

open source

Rafael suggested using open source; however, he wanted me to find out how Dan would feel about it. The open source projects that we would look into are:
  • Spring
  • Hibernate
  • Lucene
One issue is where the code goes if I modify any of these, and whether it can be used by both the uni and Vision Bytes. The specifics would depend on the license. Initially Dan is fine 'in principle' with using open source, so that's good; the nitty gritty will still need to be worked out.

Wednesday, 16 May 2007

now using vista

Well, today I got Vista (Ultimate) installed and working. It wasn't too painful, and it's actually quite a nice operating system, with lots of things I didn't even know existed. So we're now a Vista-powered blog and project :)

Tuesday, 15 May 2007

meeting with rafael

I met with Rafael today to get the project confirmation signed (I still need to hand it in, though). I also got a lot of information about the way I need to approach the topic; the summary is below.

The name of the topic seems to be almost finalised (yay, because now I can tell people what I'm doing):

"TV News Topic Detection and Tracking"

I also met Jorge, a PhD student who is also doing some text mining but in the realm of e-learning, which may be useful later.

Anyway, the project itself is a TDT (Topic Detection and Tracking) topic. So there are two things that I'll need to look into over the next few weeks:
  • Clustering of data (i.e. placing text documents into n-dimensional vector spaces)
  • The detection of new and old stories.
So we have a clustering problem as shown below:
After this we also treat it as a time-evolving problem and look at how news stories arrive over time. We notice that there is a lot of information in the first three days and then not much after that.
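To make the clustering idea a bit more concrete, here is a minimal sketch of my own (not something from the meeting). It assumes plain word counts as the vector for each story, cosine similarity as the closeness measure, and a made-up threshold of 0.3. A story that is not close enough to any existing cluster starts a new one, which is one simple way to think about detecting new versus old stories. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TdtSketch {

    // Turn a text into a sparse term-frequency vector: one dimension per distinct word.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: 1.0 for identical direction, 0.0 for no shared words.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Single-pass clustering: each story joins the most similar cluster seed,
    // or becomes the seed of a new cluster if nothing reaches the threshold.
    // The threshold is an assumed tuning value, not a recommended setting.
    public static List<Integer> assignClusters(List<String> stories, double threshold) {
        List<Map<String, Integer>> seeds = new ArrayList<>();
        List<Integer> assignment = new ArrayList<>();
        for (String story : stories) {
            Map<String, Integer> vector = termFrequencies(story);
            int best = -1;
            double bestSim = threshold;
            for (int i = 0; i < seeds.size(); i++) {
                double sim = cosine(vector, seeds.get(i));
                if (sim >= bestSim) {
                    best = i;
                    bestSim = sim;
                }
            }
            if (best < 0) {
                seeds.add(vector);          // nothing similar enough: a "new" story
                best = seeds.size() - 1;
            }
            assignment.add(best);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> stories = Arrays.asList(
                "Flood warning issued for river towns",
                "River towns flood warning extended",
                "Election date announced by prime minister");
        System.out.println(assignClusters(stories, 0.3)); // prints [0, 0, 1]
    }
}
```

The first two stories share most of their words and land in the same cluster, while the third shares none and starts a new one. A real system would probably want weighting (e.g. TF-IDF) and some notion of time, but that is for later reading.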

Some other points that Rafael suggested are:
So I have some direction now! I will read up on clustering and TDT in the literature over the next week or two and update the blog on these topics.

Monday, 14 May 2007

topic meeting with dan

Today I went into Vision Bytes and met with Dan to get a better understanding of the topic in question and how best to aid the company. The meeting was a good one, as I cleared up some key areas, including how best to apply the project to the company. I'll detail the areas of discussion below:
  • The topic's main aim is to track a news story over time, and to provide this information to the clients as part of the current site.
  • Currently the stories are viewed by program, e.g. Sunrise and all of its stories. Instead, this system will be viewed by story and will then show you all the stories that are related AND / OR updates on the same story.
  • A distinction will need to be made between a repeat story (e.g. the story is broadcast on Sunrise and then the same story, or maybe almost the same, is broadcast again on the morning news) and a story with new information. Both would need to be displayed and tagged appropriately.
  • There are 5 real pieces of data that I will have access to and will have to use to perform all of the tasks:
    • The title
    • The quick description
    • The full text (maybe)
    • The date and time
    • The duration
  • Currently the system is built in C#; however, they are bringing a new J2EE and Hibernate based system into live use in June, so it would be best to work with that system, as it is their long-term platform. I will also have access to the existing database and so won't need to replicate it. However, I may need to add certain new data structures in order to complete the task.
  • The system will involve some human interaction at some point in order to verify the tasks it has done, e.g. "Are these two stories related?" Here we could use a learning algorithm to improve performance. However, the main goal would be to minimise this interaction.
  • There should be both similar stories (broad topics such as Iraq) and stories about the same event and how it develops (e.g. a particular kidnapping).
I also got Dan to sign the form saying he will be the co-supervisor. I will need to talk to Rafael and confirm what exactly needs to change.
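As a rough sketch of the repeat-versus-update distinction above, the following holds the five pieces of data in one class and compares two stories by word overlap (Jaccard similarity). Everything here is my own assumption for illustration: the field names, the use of Jaccard overlap, and the two thresholds (0.8 for a repeat, 0.25 for an update) are hypothetical and are not taken from the Vision Bytes system.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StoryRelationSketch {

    enum Relation { REPEAT, UPDATE, UNRELATED }

    // Holds the five available pieces of data; field names are hypothetical.
    static final class Story {
        final String title;
        final String description;
        final String fullText;            // may be null: the full text is only "maybe" available
        final LocalDateTime broadcastAt;
        final Duration duration;

        Story(String title, String description, String fullText,
              LocalDateTime broadcastAt, Duration duration) {
            this.title = title;
            this.description = description;
            this.fullText = fullText;
            this.broadcastAt = broadcastAt;
            this.duration = duration;
        }

        // Distinct lower-cased words from whatever text is available.
        Set<String> words() {
            String text = title + " " + description + " " + (fullText == null ? "" : fullText);
            Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")));
            words.remove("");
            return words;
        }
    }

    // Jaccard overlap: shared distinct words divided by all distinct words.
    static double overlap(Story a, Story b) {
        Set<String> shared = new HashSet<>(a.words());
        shared.retainAll(b.words());
        Set<String> all = new HashSet<>(a.words());
        all.addAll(b.words());
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    // Assumed thresholds: near-identical text is a repeat, partial overlap is an update.
    static Relation classify(Story earlier, Story later) {
        double sim = overlap(earlier, later);
        if (sim >= 0.8) {
            return Relation.REPEAT;
        }
        if (sim >= 0.25) {
            return Relation.UPDATE;
        }
        return Relation.UNRELATED;
    }

    public static void main(String[] args) {
        LocalDateTime morning = LocalDateTime.of(2007, 5, 14, 6, 0);
        Duration length = Duration.ofSeconds(90);
        Story first = new Story("Bridge collapse",
                "A bridge collapsed in the city this morning", null, morning, length);
        Story rerun = new Story("Bridge collapse",
                "A bridge collapsed in the city this morning", null, morning.plusHours(3), length);
        Story followUp = new Story("Bridge collapse",
                "Rescue crews search the collapsed bridge as the city opens an inquiry",
                null, morning.plusHours(12), length);
        Story other = new Story("Election date",
                "The prime minister announced the election date", null, morning, length);

        System.out.println(classify(first, rerun));    // prints REPEAT
        System.out.println(classify(first, followUp)); // prints UPDATE
        System.out.println(classify(first, other));    // prints UNRELATED
    }
}
```

The date, time and duration are carried along but not used in the comparison yet. They could later help, for example by treating a near-identical story with the same duration a few hours later as a likely repeat, and the human verification step could be used to tune the thresholds.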

Dan also said that he would organise meetings with their architect and their engineers to explain how their systems work. I will need to do this soon.

So, as planned, I now start doing some research into the topic (mainly scientific journals).

Sunday, 13 May 2007

initial topic plan

The following is a basic outline of initial ideas about the direction of the project. They range from requirements to technical and business aspects that won't come up for a while.

High Level ideas:

  • The tracking of a given news event over time.
  • Identifying the topic.
  • Looking at the event over different TV stations.
  • Providing an interface for clients to be able to retrieve this information.

Some basic issues that will come up

  • Which stories to track
  • How to get the events (automated or use human interaction)
  • Distinguishing between new events and old events (i.e. same topic but different event; the best example would be Iraq: is it a new story or a continuation of an existing one?)
  • Where will clients view this data and how will they access it.
  • Where in the production cycle will this fit.

Other issues:

  • Technologies to be used, i.e. language and tools [Java, J2EE]
  • Version control (Subversion)
  • Database to use. [PostgreSQL, MSQL]
  • Mix of technical and theoretical components [How much of the thesis should be about the theory that is being used, and how much should be about building the working system itself to be used by

topic development

So over the next few days my main aim is to obtain a topic name and also get the requirements for the project. Tomorrow (Monday) I'll be meeting with both Daniel and Rafael about the project and the direction I should be heading in. So far I have only read a few papers on the general topic. I will be posting a detailed analysis of each paper when I get the time :).
The general area of the topic is news topic tracking over time. That is, looking at how news stories evolve over time and, more importantly, how they are covered.
The topic is with Vision Bytes, a company that already has access to a large variety of textual data.

Thursday, 10 May 2007

my blog

Well, here it is, my blog! It's mainly going to be a place for my project, so I can keep track of ideas and get people to talk about it. So here it goes; the aim is to get an HD!