Cluster analysis is the process of grouping objects into clusters, with the aim that each cluster contains items that are closely related to each other.
Hierarchical clustering is a technique for representing items as a hierarchy of clusters. The position of an item in the hierarchy represents how similar it is to the other items - similar items are located near to each other. Hierarchical clustering can be achieved using either a "bottom up" or "top down" approach. With a "bottom up" approach each item starts in its own cluster and clusters are combined. In a "top down" approach all items start in a single cluster and clusters are split.
Pseudo code for an iterative "bottom up" approach to hierarchical clustering is shown below:
Create array (a) containing a separate cluster for each item
WHILE number of elements in a > 1
    Find the two clusters (c1 and c2) in a that are closest to each other
    Create a new cluster (c3) which is a composite of c1 and c2
    Remove c1 and c2 from a
    Add c3 to a
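The pseudo code above could be sketched in Python as follows. This is an illustrative implementation, not a definitive one: the helper names are invented here, the points are assumed to be 2D tuples, and the distance between clusters is taken as the distance between their centroids (other linkage rules, such as single or complete linkage, are equally valid choices).

```python
import math

def centroid(points):
    """Mean position of a cluster's points (assumes 2D tuples)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def cluster_distance(a, b):
    # Centroid linkage: distance between the mean positions of two clusters.
    return math.dist(centroid(a), centroid(b))

def agglomerate(items):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in items]          # each item starts in its own cluster
    merges = []                              # records (c1, c2) at each step
    while len(clusters) > 1:
        # Compare every pair of remaining clusters to find the closest two.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        c2 = clusters.pop(j)                 # remove higher index first
        c1 = clusters.pop(i)
        merges.append((c1, c2))
        clusters.append(c1 + c2)             # composite cluster replaces c1 and c2

    return merges

items = [(0, 0), (0, 1), (5, 5), (5, 6)]
history = agglomerate(items)
# The early merges pair the nearby points; the final merge joins both groups.
```

The merge history returned here is exactly the tree structure that a dendrogram visualises: each entry records one level of the hierarchy.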
A convenient feature of hierarchical clustering is that the resulting tree structure can be visually represented as a dendrogram.
A potential drawback of using hierarchical clustering is the time required to generate the full sequence of clusters. This is particularly an issue when using large or frequently changing datasets. The larger the dataset the greater the complexity: in each iteration, every remaining element has to be compared to every other remaining element. If a dataset is updated to add or remove an element, the full process has to be repeated to generate the new structure.
K-Means Clustering is a technique for partitioning a collection of items into k distinct clusters - where k is a value specified in advance. Each item being clustered will belong to one, and only one, cluster.
The most common implementation of k-means clustering is an iterative process referred to as Lloyd's algorithm. The starting point is to create k centroids (each representing the central point of a cluster). The initial positions of the centroids can be set either randomly or using a heuristic function. Each item is assigned to the centroid it is closest to. The position of each centroid is then recalculated as the average of the items assigned to it. The process of item assignment and recalculating positions is repeated until the positions of the centroids have stabilised. All items assigned to a particular centroid represent a single cluster.
Pseudo code for a k-means clustering algorithm is shown below:
Create array (a) containing k clusters at random positions
REPEAT
    FOR EACH i IN the array of items to be clustered
        Assign i to the cluster in a that it is closest to
    FOR EACH c IN a
        Change position of c to the average of the items assigned to it
UNTIL values in a are no longer being altered
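The steps above can be sketched in Python as a minimal version of Lloyd's algorithm. The assumptions here are illustrative: items are 2D tuples, the centroids are initialised by sampling k of the items, and a fixed seed is used so the run is repeatable.

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to the given point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Average position of a list of 2D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def k_means(items, k, seed=0):
    """Lloyd's algorithm: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(items, k)         # initialise from k randomly chosen items
    while True:
        # Assignment step: each item joins the cluster of its nearest centroid.
        assignments = [closest(p, centroids) for p in items]
        # Update step: move each centroid to the average of its assigned items.
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(items, assignments) if a == i]
            new_centroids.append(mean(members) if members else centroids[i])
        if new_centroids == centroids:       # positions have stabilised
            return centroids, assignments
        centroids = new_centroids

items = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, assignments = k_means(items, k=2)
# The three points near the origin end up in one cluster, the three
# points near (9, 9) in the other.
```

Note the guard for an empty cluster in the update step: if no items are assigned to a centroid, its position is simply left unchanged for that iteration. This is one of several common ways of handling the case.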
When using k-means clustering you must decide in advance how many distinct clusters you want. If the initial positions of the centroids are randomly generated then you may get different results from multiple attempts at clustering the same set of data.