 # 6 5 The CURE Algorithm 15 13 Advanced

Welcome back to Mining
of Massive Datasets. We’re going to continue or discussion of clustering by
looking at the CURE Algorithm. CURE is an acronym that stands for
Clustering Using Representatives. But before we get to the algorithm itself,
let’s see why we need it. Remember, we’ve looked at the,
in the last lecture, for clustering very large datasets
that don’t fit in memory. The BFR algorithm is great, because
you can scan the data in one pass, and obtain clusters. The problem, though,
is that the BFR algorithm makes very strong assumptions about the data,
and about what the clusters look like. The first assumption that the BFR
Algorithm makes is that the clusters are normally distributed
in each dimension. That in each dimension there
is a fixed centroid and a fixed standard deviation that the, that
each cluster follows along each dimension. The second strong assumption
that the BFR Algorithm makes is that the axes are fixed. So the clusters then,
if you follow both these assumptions, the, that the clusters are normally distributed
in each dimension and the axes are fixed. The clusters that are discovered
by the BFR Algorithm had this the cigar kind of shape that,
that you see here on the left. it, it, it could either be a horizontal
cigar shape or a vertical cigar shape. Or a circle, which is kind of
a limiting case of, of an ellipse. But if your clusters actually
are not oriented along the x or the y axis in this case, or along the axis
in general in the multi-dimensional case, but are at an angle to the axis as,
as I show in the, in the fig, in, in the second picture here. that’s, that’s not okay. The BFR Algorithm will not find a cluster
that looks like a tilted ellipse. It can only find clusters that look like
either upright or horizontal ellipses. And clusters actually look very,
very different, like the picture on the extreme right here where there are two
clusters and the clusters look kind of like crescent moons except in, in opposite
directions, those would definitely not be found by the BFR Algorithm because
they don’t look like cigars at all. They don’t look like ellipses at all or
in any dimension. So, that’s the kind of cluster that will
never be found by the BFR Algorithm. So the BFR Algorithm, even though it’s super efficient makes the
strong assumptions the clusters are going to look like the, the pictures on the
extreme left, and not like the other two. And we’d like to avoid this assumption,
and try to find clusters regardless
of what they actually look like, because we don’t control what the clusters
look like in the, in the, in the data. The CURE Algorithm tries to fix this
problem with the with the BFR Algorithm. The CURE Algorithm assumes
a Euclidean distance. I remember a Euclidean distance metric
means between any two points, we can always find a mid point of two points by
taking the average of those two points. However, the CURE Algorithm, unlike the BFR Algorithm, allows
clusters to assume any shape whatsoever. There is no restriction on the,
on the shape of the clusters. So in the CURE Algorithm any of
these clusters, the the, the first, the second, or
the third are perfectly fine. The CURE Algorithm works,
can find clusters of, of those shapes. Now difference between
the CURE Algorithm and the BFR Algorithm is that the BFR we
represent each cluster using its centroid. Whereas in the CURE Algorithm,
we’re going to use instead of a centroid, we’re going to represent each cluster by
a collection of representative points. So, instead of representing
a cluster by one point, we’re going to represent
it by many points. Here’s an example of a dataset where
the clusters don’t look anything at all like ellipses or cigars. So, this data, on x axis we have
the age of faculty members at a university like Stanford and
on the y axis we have their salaries. Now these are, these are, this is not the
actual data, but more a representation of what the data might look like, although
it’s based on, on real world experience. Now, the, this,
the data points marked by h, are salaries of faculty
members in the humanities. Where the, data points marked with an e
are salaries of faculty members in, in engineering, departments. And as you can see, it’s apparent from
the, from the graph here, that in the humanities the, the starting salary
is, is somewhat lower than in engineering. A humanities, faculty member
starts at a much lower salary than an engineering faculty member. But, as, over time,
as their tenure increases, the salary of a humanities,
faculty member, keeps increasing and eventually overtakes the salary of
a an engineering faculty member. But in the salary of engineering faculty
members increases a little bit with their tenure, but then kind of flattens out,
it doesn’t increase beyond that. And this is just a phenomenon that has
been observed in, in terms of salaries at, at most universities and
presumably this is because, in, in the engineering departments
the fields keep changing so much that there’s a lot of
value in bringing in new faculty with with new interests and
and new you know, new expertise. Whereas in the humanities I
guess you age better as you age. So if you sort of look at the look
at these two sets of salaries and you try to cluster them, what you really
want in an ideal world is, is, is two, is two clusters. One that you know looks at
the engineering salaries, and one that looks at
the humanities salaries and see it, and cleanly separate out these two data
points into, into two separate clusters. Now when you’re looking at the data,
remember you do know that some of these of salaries corresponding to
engineering faculty members and some to humanities facilities members so
so the clustering algorithm doesn’t have
access to this information but you’d yet like it to find these find
these clusters in the data. Now it’s too much to hope that
a clustering algorithm can actually find these exact clusters because
these are overlapping clusters, and most clustering algorithm cannot find
cluster that actually overlap with with each other where a single
data point is in two clusters. But at the very least, we can hope that
the clustering algorithm finds some approximation for these clusters. For example we might want it
to discover one cluster there. Another cluster there. And a third cluster there. So that would be a nice, nice outcome for
from from any clustering algorithm. And the CURE Algorithm can indeed
find clusters of this nature. The CURE Algorithm is
a two pass algorithm. And let’s look at the,
the first pass of the two pass algorithm. In the first pass, we sample a random set of points from the,
dataset that’s given to us. Remember the dataset is really large and
doesn’t fit in memory at all. It’s sitting on disk somewhere. But we’re going to randomly sample
a point from this very large dataset. And we’re only going to sample enough
points, that, that fit in memory. we’ll, we’ve covered techniques for,
sampling, in another lecture, so you know how to do this already. Now, once we have a lot of,
sample points that, you know, that you’ve randomly sampled, we’re going to
use any main memory clustering algorithm. So, for example, you can use
a hierarchical clustering algorithm that we covered in our previous lecture. And you can cluster the the sample points
that are in memory using the hierarchical clustering algorithm to create
an initial set of clusters. Right? And because hierarchical clustering is is,
is, is, is a complex algorithm, it can actually
find clusters of the nature you know, it doesn’t have any kind of restriction
on the kind of clusters that it can find. So it can find clusters of any shape. Now, once we’ve actually clustered
the sample points, and fi, figured out initial sort of clusters,
for each of those clusters, we’re going to pick representative
points to represent them. So what we’re going to do,
is we’re going to pick a number, k, and let’s say 4, and we’re going to find pick
k representative points for each cluster. Now our goal is to, you know,
take a cluster like this. Let’s say, here’s a cluster that, that
we found using hierarchical clustering. And we want to find a representative
point, that are, you know, as far away from each other to sort of get a good
coverage of the cluster as possible. For example, we might want to, if, if we
find the first representative point there, you might want to find the second there,
the third there, and the fourth there. So we have four points that are sort
of well distributed in the cluster and if they, they cover as much of the
cluster’s, you know, volume as possible. And the, you know,
we’ve sort of in our previous lecture, we’ve discussed how to pick, points that
are as dispersed as possible in a cluster. The technique is to, for each cluster,
first pick, the first point at random, and then you pick the second point
to be within the cluster but be as far away from
the first point as possible. And you pick the third point to
be still within the cluster but as far away from points one and
two as possible, and so on. And we’ve covered this technique in a,
in a previous previous lecture. So once we’ve picked these
representative points you know, what we’re going to do is that we’re
going to create the synthetic points. And the synthetic points
are obtained by moving each, representative point a certain, distance
towards the centroid of the cluster. Right?
So, we have the cluster. We know its centroid. And we have these,
these k representative points. Now, we’re going to take each
representative point and we’re going to create a synthetic point, that is obtained
by moving the representative point a fixed fraction, maybe 20%,
towards the centroid of the cluster. And the 20% here is a,
is a parameter to the algorithm. So let’s, let’s look at an example,
that should make this clear. So, here are faculty salary data. And let’s say when you run a hierarchical
clustering on a sample of the data, let’s say that this is in fact a sample
of the data and not the actual data. When you run a hierarchical
clustering on this, let’s say the, the here is the first
cluster that’s found. Here is the second cluster and
here’s the third cluster. So we end up with these three
clusters that are found by the hierarchical clustering algorithm. Now, once we’ve found these
clustering algorithms, let’s take one of these clusters and and let’s say we want to find four remote, or
representative point for each cluster. And let’s start at
the cluster to the right. And there,
there’s the first representative point, let’s say it’s the one that there
that we’ve ended up picking. Now we want to pick the second
representative point to be in this cluster, but as far away from the first
representative point as possible. So that’s the point we end up picking. The third representative
point is going to be, still in the cluster but as far away
from these two points as possible. So we end up with that point there. And the fourth representative point
similarly ends up being that, that point there. Now, once you have picked
these four remote points for the cluster we’re going to move these
points towards a centroid of the cluster. No, we, we don’t actually going
to affect the data itself. But we’re going to you know,
pick synthetic points that are closer to the centroid of a cluster than the remote
points that we’ve actually picked. Now the centroid of the, of the cluster is somewhere in the middle
of the cluster there so each of these points is move, going to move
towards the center of the circle here. So when you move the points 20%
towards the centroid, we end up with these synthetic points and these
synthetic points are manufactured points, are going to be the representative
points for this cluster. And we repeat this process for
the other clusters as well. So for each cluster we end up with k, in this case 4, represented
points representing that cluster. So that’s the first pass
of the CURE Algorithm. In the, in the second pass, of the, of
the algorithm, we scan the whole dataset. Remember, so far we’ve been working with, a sample
of the dataset that fits in memory. Now we go back to the whole dataset and,
which is sitting on disc. And we re-scan the whole dataset,
and visit each point p. And, and
we’re going to take the point p and we’re going to place it in
the cluster that’s closest to it. And the definition of
closest is very simple. We’re going to find the point you know,
the, we’re going to take p, and we’re going to find the, the, among the set of
representative points of all the clusters. We’re going to find the representative
point, that’s closest to p, and we’re going to assign p to the cluster,
belonging to that representative point. And that’s it. It’s a very, very simple algorithm, where
we just scan the data in one pass, and we place, each point, in, in its
closest cluster, which is the cluster, with the closest representative point. Now the CURE is a, it’s a very,
very simple algorithm as as we saw. It just requires a main memory sample and clustered cluster,
finding some representative points, and then one scan of the entire dataset to
place each data point into you know, in, into the closest cluster, but
surprisingly CURE really works well for a large number of used cases in finding,
complicated clusters of the, of the kind that we saw with
the faculty’s salaries and so on. Our discussion of the CURE Algorithm
wraps up our discussion of clustering, which we’ve covered over
the past several lectures. Remember that the clustering point,
problem is one where we’re given a large set of points with
a notion of distance between the points, usually in a very
high-dimensional data space. And we, and our go, goal is to group these
points into some number of clusters, with the points that are closer
together being in the same cluster. We’ll look at a variety of
algorithms to do clustering. We started with Agglomerative
hierarchical clustering where we looked at the notion of centroid and
clustroid. We realize that hierarchical clustering,
while it’s extremely flexible and can produce clusters of any shape,
it’s really, really complex. And it doesn’t really work well or scale well to large datasets
that don’t fit in memory. Thus we looked at other algorithms
that scale well to large dataset. We started with a k-means algorithm. Then we looked at the BFR or