Clustering
Clustering is used for finding groups or “clusters” of data for which the true groups/labels are unknown. k-means is a common clustering technique that assigns cluster “centroids” (an average of the points that make up a cluster) and then reassigns points to the new cluster centroids, iteratively. The algorithm stops when points don’t change their cluster assignments. k-means requires deciding upfront the value of $k$.
Iris example
Suppose we measure some properties of flowers, species of iris to be precise (data available from UCI). We have a table like the following:
Sepal length (cm)  Sepal width (cm)  Petal length (cm)  Petal width (cm) 

5.1  3.5  1.4  0.2 
4.9  3.0  1.4  0.2 
4.7  3.2  1.3  0.2 
…  …  …  … 
Here are two plots of these data.
There are more plots we could study (e.g. Sepal width vs. Petal length, etc.). However, sometimes we may have many more than just four dimensions in our data. Plotting is not always the best way to study the data.
k-means clustering
We want to know whether more than one distinct kind of iris plant is represented in these data. In other words, we want to classify each measured flower as some species of iris. But we don’t know, a priori, which points belong in which groups. We see in the second graph that the bottom left cluster of points is most likely a distinct group.
There exist many algorithms to automatically cluster data like these. We’ll look at k-means, which is among the simplest and most intuitive. It works as follows.

1. Decide how many clusters we want. Call this $k$. (Because we have to choose $k$, this method is “semi-supervised” rather than wholly “unsupervised.”)

2. Create $k$ random cluster means (also called “centroids”). Our data come in four dimensions; thus, each cluster mean will be four-dimensional. We can choose random values for each dimension for each of the $k$ clusters or we can choose a random data point to represent each initial cluster mean.

3. For each measured flower (each row in the table of data), use Euclidean distance (which works in any number of dimensions) to determine which cluster’s mean is closest to the measurements. Assign this flower to that cluster. (Note it may have already been assigned to that cluster.)

4. Now that all flowers have been assigned (or reassigned) to clusters, recalculate the cluster means. This simply involves summing all data vectors in the cluster and dividing by the number of members in the cluster.

5. Go back to step 3 until no cluster assignments change.
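The loop described above can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the function names (`euclidean`, `closest`, `kmeans`), the `seed` and `max_iter` parameters, and the three rows marked "hypothetical" are assumptions made for the example. It uses the "pick random data points" option for the initial means.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest(point, means):
    """Index of the mean nearest to point."""
    return min(range(len(means)), key=lambda i: euclidean(point, means[i]))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialization: use k distinct data points as the starting means.
    means = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest mean.
        new_assignments = [closest(p, means) for p in points]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each mean becomes the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return means, assignments


# First three rows come from the table above; the rest are made up.
flowers = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2),
    (6.3, 3.0, 5.0, 1.8),  # hypothetical
    (6.5, 3.1, 5.2, 2.0),  # hypothetical
    (6.1, 2.9, 4.9, 1.7),  # hypothetical
]
means, assignments = kmeans(flowers, k=2)
print(assignments)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.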
The result is that each flower (each row in the table) is assigned to a cluster. We can graph these clusters with color.
If we get a new set of measurements for a new plant, we can predict (approximate, guess) its membership in one of the clusters by finding which cluster mean is closest.
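Predicting membership for a new plant amounts to a nearest-mean lookup. The sketch below assumes two made-up cluster means and a made-up measurement; none of these numbers come from a real fit, and `predict` is a hypothetical helper name.

```python
import math


def predict(point, means):
    """Return the index of the cluster mean closest to point."""
    distances = [math.dist(point, m) for m in means]
    return distances.index(min(distances))


# Hypothetical cluster means, for illustration only (not fitted to real data).
means = [
    (5.0, 3.4, 1.5, 0.2),
    (6.3, 2.9, 5.0, 1.7),
]
new_flower = (4.8, 3.1, 1.6, 0.3)  # hypothetical new measurement
print(predict(new_flower, means))
```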
Unfortunately, these clusters are not entirely accurate. The following graphs show the real clusters.
Confusion matrix
We can think of k-means as a classifier, at least in the sense that it is attempting to group same-class data together. An easy way to understand exactly how badly, and in what ways, the classifier misclassified is to write a confusion matrix:
0  1  2  

0  28  22  Iris-setosa
47  3  0  Iris-versicolor
50  0  0  Iris-virginica
In a confusion matrix, the predicted class labels (0, 1, 2) are written along the top (column names). The true class labels (Iris-setosa, etc.) are written along the right side. Each cell in the matrix is a count of how many instances of a true class were classified as each of the predicted classes.
With a confusion matrix, we can see at a glance whether some true class is very confused. A perfect classification will produce a confusion matrix with all zeros except on the diagonal. A confused classification will have large values not on the diagonal.
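A confusion matrix is just a table of counts over (true class, predicted cluster) pairs. The following sketch builds one from two parallel lists; the labels and predictions are a tiny made-up example, not the results shown above, and `confusion_matrix` is a hypothetical helper name.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted):
    """Count how often each true class landed in each predicted cluster."""
    counts = Counter(zip(true_labels, predicted))
    classes = sorted(set(true_labels))
    clusters = sorted(set(predicted))
    # One row per true class, one column per predicted cluster.
    matrix = {t: [counts[(t, c)] for c in clusters] for t in classes}
    return matrix, clusters


# Made-up labels and cluster assignments, for illustration only.
true_labels = ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
               "Iris-versicolor", "Iris-virginica"]
predicted = [1, 1, 0, 2, 0]

matrix, clusters = confusion_matrix(true_labels, predicted)
print(clusters)
for name, row in matrix.items():
    print(row, name)
```

Because cluster numbers are arbitrary, the "diagonal" only makes sense after matching each cluster to the true class it mostly contains.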
Voronoi diagram
If we look at a random x,y coordinate in the cluster graph, we will find that one cluster (one mean) is closest. Thus we can color each x,y point (each pixel) with the color of its assigned cluster. The result is a diagram that divides the whole space into cells. The borders between cells are exactly halfway between the two closest cluster means. This kind of diagram is called a Voronoi diagram (named after Georgy Voronoi).
From Wikipedia.
View an animation showing the Voronoi diagram in each step of the k-means algorithm.
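The "color each pixel by its nearest mean" idea can be sketched on a small character grid. The three 2D means and the grid size below are made up for illustration, and `voronoi_grid` is a hypothetical helper name.

```python
import math


def voronoi_grid(means, width, height):
    """Label every integer (x, y) cell with the index of its nearest mean."""
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = min(range(len(means)),
                          key=lambda i: math.dist((x, y), means[i]))
            row.append(nearest)
        grid.append(row)
    return grid


# Hypothetical 2D cluster means, for illustration only.
means = [(2, 2), (9, 2), (5, 6)]
grid = voronoi_grid(means, width=12, height=8)
for row in grid:
    # Each digit plays the role of a color.
    print("".join(str(label) for label in row))
```

Cells that are exactly equidistant from two means are given to the lower-numbered mean here; that tie-breaking rule is an arbitrary choice of this sketch.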
These diagrams are actually quite useful in many areas of AI. For example, consider a robot navigating through a building. Draw walls and other obstacles with points. Then construct the Voronoi diagram; the Voronoi borders will be equidistant between the nearest walls. If the robot simply follows these Voronoi borders, then it will stay as far away from the walls as is physically possible. This is usually the safest route for a robot. A visualization of this process can be found at Robot Path Planning Using Generalized Voronoi Diagrams.
Benefits of k-means

Very simple algorithm.

Reasonably fast (although its “worst case” behavior is poor).
Drawbacks of k-means
A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in a way so that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When for example applying k-means with a value of $k=3$ onto the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With $k=2$, the two visible clusters (one containing two species) will be discovered, whereas with $k=3$ one of the two clusters will be split into two even parts. In fact, $k=2$ is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result relies on the data set to satisfy the assumptions made by the clustering algorithms. It works well on some data sets, while failing on others. (Wikipedia)
How to choose k
Check out the Wikipedia article about calculating the silhouette, which measures how similar each point is to its own cluster compared with the nearest other cluster. Because the silhouette is not defined for a single cluster, we can start at $k=2$ and increase it until we find a max silhouette (or at least a “local” max).
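As a rough sketch of the silhouette calculation: for each point, let $a$ be its mean distance to the other members of its own cluster and $b$ its mean distance to the members of the nearest other cluster; the point's score is $(b - a) / \max(a, b)$, and the overall silhouette is the average over all points. The function name and the four data points below are assumptions made for the example. To choose $k$, one would compute this value for the clustering produced at each candidate $k$ and compare.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up points forming two obvious groups, for illustration only.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two groups
print(silhouette(points, good), silhouette(points, bad))
```

Values near 1 indicate tight, well-separated clusters; values below 0 suggest points sit closer to another cluster than to their own.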
k-means in Spark
Using Yahoo Music user ratings with columns userid,artistid,rating:
1 1000125 90
1 1006373 100
1 1006978 90
1 1007035 100
1 1007098 100
1 1007723 100
1 1008659 100
1 1008916 100
1 1012809 70
1 1014635 100
1 1016419 100
1 1016470 100
We need to create vectors representing each user’s ratings for all artists (even those they didn’t rate, which we’ll call a 50 rating). Then we’ll run k-means.
# spark-submit --master local[10] kmeans.py
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans, KMeansModel
import numpy as np
import re
from math import sqrt

# if a user didn't rate an artist, we give the artist a 50 rating for that user
def createUserVector(artistsList, ratings):
    # should be few user ratings (but lots of artists), so create a dict of user ratings
    userRatings = {}
    for (artistid, rating) in ratings:
        userRatings[artistid] = rating
    # now create vector for all artists
    result = []
    for artistid in artistsList:
        if artistid in userRatings:
            result.append(userRatings[artistid])
        else:
            result.append(50)
    return np.array(result)

if __name__ == "__main__":
    conf = SparkConf().setAppName("kmeans")
    sc = SparkContext(conf=conf)
    # input file format: userid, artistid, rating
    ratings = sc.textFile("file:///home/jeckroth/cinf401examples/spark/clustering/ydata.txt")
    # each record becomes a list [userid, artistid, rating]
    ratings = ratings.map(lambda line: [int(x) for x in re.split(r'\s+', line.strip())])
    # need to find all unique artistids to know how many dimensions in user vector
    artists = ratings.map(lambda r: r[1])
    artists = artists.distinct()
    artistsCount = artists.count()
    artistsList = artists.takeOrdered(artistsCount)
    print("artist count:", len(artistsList))
    # now create a rating vector for each user
    # make userid a key so we can group
    ratingsPerUser = ratings.map(lambda r: (r[0], (r[1], r[2]))).groupByKey()
    # finally, for each user, produce a vector for user's rating for each artist
    ratingsUserVectors = ratingsPerUser.map(lambda kv: createUserVector(artistsList, kv[1]))
    clusters = KMeans.train(ratingsUserVectors, 5, maxIterations=50)
    print(clusters)

    # Euclidean distance from a point to the center of its assigned cluster
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    # note: this sums the distances themselves, not their squares
    WSSSE = ratingsUserVectors.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))
    clusters.save(sc, "file:///home/jeckroth/cinf401examples/spark/clustering/output")
Run on delenn:
spark-submit --master local[10] --driver-memory 100G --executor-memory 16G kmeans.py