## Cluster Analysis

**Cluster analysis** is the process of grouping (clustering) objects into groups (clusters). The aim is for each cluster to contain items that are closely related to each other, while items in different clusters are comparatively dissimilar.

### Hierarchical Clustering

**Hierarchical clustering** is a technique for representing items as a hierarchy of clusters. The position of an item in the hierarchy represents how similar it is to the other items - similar items are located near each other. Hierarchical clustering can be achieved using either a "bottom up" or a "top down" approach. With a "bottom up" approach, each item starts in its own cluster and clusters are progressively combined. In a "top down" approach, all items start in a single cluster and clusters are progressively split.

Pseudo code for an iterative "bottom up" approach to hierarchical clustering is shown below:

```
Create array (a) containing a separate cluster for each item
WHILE number of elements in a > 1
    Find the two clusters (c1 and c2) in a that are closest to each other
    Create a new cluster (c3) which is a composite of c1 and c2
    Remove c1 and c2 from a
    Add c3 to a
```
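The pseudo code above can be sketched in Python. This is a minimal illustration, assuming the items are 2-D points and that "closest" means single-linkage distance (the shortest distance between any pair of points in the two clusters); the function names are chosen here for illustration only.

```python
import math

def distance(c1, c2):
    """Single-linkage distance: closest pair of points between two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(items):
    """Merge clusters bottom-up until a single cluster remains.

    Returns the list of composite clusters in the order they were created,
    so the final element contains every item.
    """
    clusters = [[p] for p in items]  # one cluster per item to start
    merges = []
    while len(clusters) > 1:
        # Find the two clusters (c1 and c2) that are closest to each other
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Create a composite cluster, remove c1 and c2, add the composite
        composite = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(composite)
        merges.append(composite)
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
merges = agglomerate(points)
print(merges[0])   # the first merge combines the closest pair of items
print(merges[-1])  # the final cluster contains all items
```

Note the nested search for the closest pair: every remaining cluster is compared against every other on each iteration, which is the source of the cost discussed below.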

**Hierarchical Clustering Example**

A convenient feature of hierarchical clustering is that the resulting tree structure can be visually represented as a dendrogram.

A potential drawback of using hierarchical clustering is the time required to generate the full sequence of clusters. This is particularly an issue when using large or frequently changing datasets. The larger the dataset, the greater the complexity as, for each iteration, every remaining element has to be compared to every other remaining element. If a dataset is updated, to add or remove an element, the full process has to be repeated to generate the new structure.

### K-Means Clustering

**K-Means Clustering** is a technique for partitioning a collection of items into *k* distinct clusters - where *k* is a value specified in advance. Each item being clustered will belong to one, and only one, cluster.

The most common implementation of k-means clustering is an iterative process referred to as Lloyd's algorithm. The starting point is to create *k* centroids (each representing the central point of a cluster). The initial positions of the centroids can be set either randomly or using a heuristic function. Each item is assigned to the centroid it is closest to. The position of each centroid is then recalculated as the average of the items assigned to it. The process of item assignment and recalculating positions is repeated until the positions of the centroids have stabilised. All items assigned to a particular centroid represent a single cluster.

Pseudo code for a k-means clustering algorithm is shown below:

```
Create array (a) containing k clusters at random positions
REPEAT
    FOR EACH i IN the array of items to be clustered
        Assign i to the cluster in a that it is closest to
    FOR EACH c IN a
        Change position of c to the average of the items assigned to it
UNTIL values in a are no longer being altered
```
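A minimal Python sketch of Lloyd's algorithm, assuming 2-D points and Euclidean distance. For reproducibility this version seeds the centroids with the first *k* items rather than random positions; a real implementation would typically randomise or use a heuristic, as described above.

```python
import math

def kmeans(items, k, max_iter=100):
    """Partition items into k clusters using Lloyd's algorithm.

    Returns (centroids, clusters), where clusters[i] holds the items
    assigned to centroids[i].
    """
    centroids = list(items[:k])  # initial centroid positions (deterministic)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in items:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned items
        # (an empty cluster keeps its previous position)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # positions have stabilised
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # one centroid per cluster
```

On this data the algorithm converges in a few iterations, with one centroid settling among the points near the origin and the other among the points near (8, 8).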

**K-Means Clustering Example**

When using k-means clustering you must decide in advance how many distinct clusters you want. If the initial positions of the centroids are randomly generated then you may get different results from multiple attempts at clustering the same set of data.