Monday, 30 July 2007

Data Clustering

The following is some research into conventional clustering algorithms. These are taken from several sources:

The purpose of clustering is to give structure to unlabelled data. That is, we want to organise similar data into groups (clusters). A cluster is a group of objects that are similar to one another, so the similarity measure is an important part of clustering. For any clustering system we require:
  • Scalability
  • The ability to deal with various attributes
  • The ability to find clusters regardless of their shape
  • Minimal requirements for domain knowledge
There are many different algorithms that we can use to cluster. Most, however, involve representing a document as a vector. Further, we can use other information about the document to add to the vector.
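To make the idea of a document as a vector concrete, here is a minimal sketch that uses plain word counts. The word-count representation, the whitespace tokenisation, and the function names `build_vocabulary` and `to_vector` are assumptions chosen for illustration; the post does not say which representation its sources use.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of every distinct lowercase word in the documents."""
    words = set()
    for text in documents:
        words.update(text.lower().split())
    return sorted(words)


def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical example documents, not taken from any real data set.
docs = ["the cat sat", "the dog sat down", "cat and dog"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Each position in a vector corresponds to one word in the vocabulary, so every document ends up with the same length. Extra information about a document could be appended as further components.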

Once we form the vectors, we need to find the distance between two topics, or between a topic and some centroid. The centroid is the centre of a cluster and can be useful when clustering on the fly. The distance can be a simple Euclidean distance; however, more complex distance or similarity measures are often required. We also need to consider the units of the vector components and any scaling needed to make the clusters well formed.
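The pieces mentioned above can be sketched in a few lines of Python. This is only an illustration: cosine similarity stands in as one example of a "more complex" measure, and min-max scaling as one possible way to handle differing units. Neither choice comes from the post itself, and all function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def min_max_scale(vectors):
    """Rescale each component to [0, 1] so no single unit dominates the distance."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(v, lows, highs)]
        for v in vectors
    ]


# Hypothetical cluster of two points and a new point to compare against it.
cluster = [[0.0, 0.0], [2.0, 4.0]]
centre = centroid(cluster)
print(euclidean_distance([4.0, 6.0], centre))
```

When clustering on the fly, comparing a new vector against each cluster's centroid avoids comparing it against every member of every cluster.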

More on the algorithms and distance measures in the next summary.
