Welcome back to Mining

of Massive Datasets. We’re going to continue or discussion of clustering by

looking at the CURE Algorithm. CURE is an acronym that stands for

Clustering Using Representatives. But before we get to the algorithm itself,

let’s see why we need it. Remember, we’ve looked at the,

BFR algorithm, or the Bradley-Fayyad-Reina algorithm,

in the last lecture, for clustering very large datasets

that don’t fit in memory. The BFR algorithm is great, because

you can scan the data in one pass, and obtain clusters. The problem, though,

is that the BFR algorithm makes very strong assumptions about the data,

and about what the clusters look like. The first assumption that the BFR

Algorithm makes is that the clusters are normally distributed

in each dimension. That in each dimension there

is a fixed centroid and a fixed standard deviation that the, that

each cluster follows along each dimension. The second strong assumption

that the BFR Algorithm makes is that the axes are fixed. So the clusters then,

if you follow both these assumptions, the, that the clusters are normally distributed

in each dimension and the axes are fixed. The clusters that are discovered

by the BFR Algorithm had this the cigar kind of shape that,

that you see here on the left. it, it, it could either be a horizontal

cigar shape or a vertical cigar shape. Or a circle, which is kind of

a limiting case of, of an ellipse. But if your clusters actually

are not oriented along the x or the y axis in this case, or along the axis

in general in the multi-dimensional case, but are at an angle to the axis as,

as I show in the, in the fig, in, in the second picture here. that’s, that’s not okay. The BFR Algorithm will not find a cluster

that looks like a tilted ellipse. It can only find clusters that look like

either upright or horizontal ellipses. And clusters actually look very,

very different, like the picture on the extreme right here where there are two

clusters and the clusters look kind of like crescent moons except in, in opposite

directions, those would definitely not be found by the BFR Algorithm because

they don’t look like cigars at all. They don’t look like ellipses at all or

in any dimension. So, that’s the kind of cluster that will

never be found by the BFR Algorithm. So the BFR Algorithm, even though it’s super efficient makes the

strong assumptions the clusters are going to look like the, the pictures on the

extreme left, and not like the other two. And we’d like to avoid this assumption,

and try to find clusters regardless

of what they actually look like, because we don’t control what the clusters

look like in the, in the, in the data. The CURE Algorithm tries to fix this

problem with the with the BFR Algorithm. The CURE Algorithm assumes

a Euclidean distance. I remember a Euclidean distance metric

means between any two points, we can always find a mid point of two points by

taking the average of those two points. However, the CURE Algorithm, unlike the BFR Algorithm, allows

clusters to assume any shape whatsoever. There is no restriction on the,

on the shape of the clusters. So in the CURE Algorithm any of

these clusters, the the, the first, the second, or

the third are perfectly fine. The CURE Algorithm works,

can find clusters of, of those shapes. Now difference between

the CURE Algorithm and the BFR Algorithm is that the BFR we

represent each cluster using its centroid. Whereas in the CURE Algorithm,

we’re going to use instead of a centroid, we’re going to represent each cluster by

a collection of representative points. So, instead of representing

a cluster by one point, we’re going to represent

it by many points. Here’s an example of a dataset where

the clusters don’t look anything at all like ellipses or cigars. So, this data, on x axis we have

the age of faculty members at a university like Stanford and

on the y axis we have their salaries. Now these are, these are, this is not the

actual data, but more a representation of what the data might look like, although

it’s based on, on real world experience. Now, the, this,

the data points marked by h, are salaries of faculty

members in the humanities. Where the, data points marked with an e

are salaries of faculty members in, in engineering, departments. And as you can see, it’s apparent from

the, from the graph here, that in the humanities the, the starting salary

is, is somewhat lower than in engineering. A humanities, faculty member

starts at a much lower salary than an engineering faculty member. But, as, over time,

as their tenure increases, the salary of a humanities,

faculty member, keeps increasing and eventually overtakes the salary of

a an engineering faculty member. But in the salary of engineering faculty

members increases a little bit with their tenure, but then kind of flattens out,

it doesn’t increase beyond that. And this is just a phenomenon that has

been observed in, in terms of salaries at, at most universities and

presumably this is because, in, in the engineering departments

the fields keep changing so much that there’s a lot of

value in bringing in new faculty with with new interests and

and new you know, new expertise. Whereas in the humanities I

guess you age better as you age. So if you sort of look at the look

at these two sets of salaries and you try to cluster them, what you really

want in an ideal world is, is, is two, is two clusters. One that you know looks at

the engineering salaries, and one that looks at

the humanities salaries and see it, and cleanly separate out these two data

points into, into two separate clusters. Now when you’re looking at the data,

remember you do know that some of these of salaries corresponding to

engineering faculty members and some to humanities facilities members so

so the clustering algorithm doesn’t have

access to this information but you’d yet like it to find these find

these clusters in the data. Now it’s too much to hope that

a clustering algorithm can actually find these exact clusters because

these are overlapping clusters, and most clustering algorithm cannot find

cluster that actually overlap with with each other where a single

data point is in two clusters. But at the very least, we can hope that

the clustering algorithm finds some approximation for these clusters. For example we might want it

to discover one cluster there. Another cluster there. And a third cluster there. So that would be a nice, nice outcome for

from from any clustering algorithm. And the CURE Algorithm can indeed

find clusters of this nature. The CURE Algorithm is

a two pass algorithm. And let’s look at the,

the first pass of the two pass algorithm. In the first pass, we sample a random set of points from the,

dataset that’s given to us. Remember the dataset is really large and

doesn’t fit in memory at all. It’s sitting on disk somewhere. But we’re going to randomly sample

a point from this very large dataset. And we’re only going to sample enough

points, that, that fit in memory. we’ll, we’ve covered techniques for,

sampling, in another lecture, so you know how to do this already. Now, once we have a lot of,

sample points that, you know, that you’ve randomly sampled, we’re going to

use any main memory clustering algorithm. So, for example, you can use

a hierarchical clustering algorithm that we covered in our previous lecture. And you can cluster the the sample points

that are in memory using the hierarchical clustering algorithm to create

an initial set of clusters. Right? And because hierarchical clustering is is,

is, is, is a complex algorithm, it can actually

find clusters of the nature you know, it doesn’t have any kind of restriction

on the kind of clusters that it can find. So it can find clusters of any shape. Now, once we’ve actually clustered

the sample points, and fi, figured out initial sort of clusters,

for each of those clusters, we’re going to pick representative

points to represent them. So what we’re going to do,

is we’re going to pick a number, k, and let’s say 4, and we’re going to find pick

k representative points for each cluster. Now our goal is to, you know,

take a cluster like this. Let’s say, here’s a cluster that, that

we found using hierarchical clustering. And we want to find a representative

point, that are, you know, as far away from each other to sort of get a good

coverage of the cluster as possible. For example, we might want to, if, if we

find the first representative point there, you might want to find the second there,

the third there, and the fourth there. So we have four points that are sort

of well distributed in the cluster and if they, they cover as much of the

cluster’s, you know, volume as possible. And the, you know,

we’ve sort of in our previous lecture, we’ve discussed how to pick, points that

are as dispersed as possible in a cluster. The technique is to, for each cluster,

first pick, the first point at random, and then you pick the second point

to be within the cluster but be as far away from

the first point as possible. And you pick the third point to

be still within the cluster but as far away from points one and

two as possible, and so on. And we’ve covered this technique in a,

in a previous previous lecture. So once we’ve picked these

representative points you know, what we’re going to do is that we’re

going to create the synthetic points. And the synthetic points

are obtained by moving each, representative point a certain, distance

towards the centroid of the cluster. Right?

So, we have the cluster. We know its centroid. And we have these,

these k representative points. Now, we’re going to take each

representative point and we’re going to create a synthetic point, that is obtained

by moving the representative point a fixed fraction, maybe 20%,

towards the centroid of the cluster. And the 20% here is a,

is a parameter to the algorithm. So let’s, let’s look at an example,

that should make this clear. So, here are faculty salary data. And let’s say when you run a hierarchical

clustering on a sample of the data, let’s say that this is in fact a sample

of the data and not the actual data. When you run a hierarchical

clustering on this, let’s say the, the here is the first

cluster that’s found. Here is the second cluster and

here’s the third cluster. So we end up with these three

clusters that are found by the hierarchical clustering algorithm. Now, once we’ve found these

clustering algorithms, let’s take one of these clusters and and let’s say we want to find four remote, or

representative point for each cluster. And let’s start at

the cluster to the right. And there,

there’s the first representative point, let’s say it’s the one that there

that we’ve ended up picking. Now we want to pick the second

representative point to be in this cluster, but as far away from the first

representative point as possible. So that’s the point we end up picking. The third representative

point is going to be, still in the cluster but as far away

from these two points as possible. So we end up with that point there. And the fourth representative point

similarly ends up being that, that point there. Now, once you have picked

these four remote points for the cluster we’re going to move these

points towards a centroid of the cluster. No, we, we don’t actually going

to affect the data itself. But we’re going to you know,

pick synthetic points that are closer to the centroid of a cluster than the remote

points that we’ve actually picked. Now the centroid of the, of the cluster is somewhere in the middle

of the cluster there so each of these points is move, going to move

towards the center of the circle here. So when you move the points 20%

towards the centroid, we end up with these synthetic points and these

synthetic points are manufactured points, are going to be the representative

points for this cluster. And we repeat this process for

the other clusters as well. So for each cluster we end up with k, in this case 4, represented

points representing that cluster. So that’s the first pass

of the CURE Algorithm. In the, in the second pass, of the, of

the algorithm, we scan the whole dataset. Remember, so far we’ve been working with, a sample

of the dataset that fits in memory. Now we go back to the whole dataset and,

which is sitting on disc. And we re-scan the whole dataset,

and visit each point p. And, and

we’re going to take the point p and we’re going to place it in

the cluster that’s closest to it. And the definition of

closest is very simple. We’re going to find the point you know,

the, we’re going to take p, and we’re going to find the, the, among the set of

representative points of all the clusters. We’re going to find the representative

point, that’s closest to p, and we’re going to assign p to the cluster,

belonging to that representative point. And that’s it. It’s a very, very simple algorithm, where

we just scan the data in one pass, and we place, each point, in, in its

closest cluster, which is the cluster, with the closest representative point. Now the CURE is a, it’s a very,

very simple algorithm as as we saw. It just requires a main memory sample and clustered cluster,

finding some representative points, and then one scan of the entire dataset to

place each data point into you know, in, into the closest cluster, but

surprisingly CURE really works well for a large number of used cases in finding,

complicated clusters of the, of the kind that we saw with

the faculty’s salaries and so on. Our discussion of the CURE Algorithm

wraps up our discussion of clustering, which we’ve covered over

the past several lectures. Remember that the clustering point,

problem is one where we’re given a large set of points with

a notion of distance between the points, usually in a very

high-dimensional data space. And we, and our go, goal is to group these

points into some number of clusters, with the points that are closer

together being in the same cluster. We’ll look at a variety of

algorithms to do clustering. We started with Agglomerative

hierarchical clustering where we looked at the notion of centroid and

clustroid. We realize that hierarchical clustering,

while it’s extremely flexible and can produce clusters of any shape,

it’s really, really complex. And it doesn’t really work well or scale well to large datasets

that don’t fit in memory. Thus we looked at other algorithms

that scale well to large dataset. We started with a k-means algorithm. Then we looked at the BFR or

the Bradley-Fayyad-Reina algorithm which an implementation of k-means for

really large datasets. But we found out that BFR

has certain limitations. So we looked at another

algorithm called CURE or Clustering Using Representatives

that overcomes these limitations.