Welcome back to Mining of Massive Datasets. We've been looking at the K-means algorithm for clustering. We've seen that K-means produces good clusters, and that each scan, or round, of K-means takes time linear in the size of the data. But it can take a very large number of rounds for K-means to converge. So our question is: can we come up with an algorithm that does something like K-means clustering, but in one pass over the data, when we have a really large dataset that doesn't fit in memory? The answer to that question is the Bradley-Fayyad-Reina, or BFR, algorithm, which we're going to cover in this segment.

BFR is a variant of K-means that is optimized for very large, disk-resident datasets: the dataset is so large that it doesn't fit in memory, so we want to scan it just once from disk and come up with a clustering. To accomplish this, the BFR algorithm makes a very strong assumption. It assumes not only that we have a Euclidean space, which is an assumption all K-means algorithms make, but also that each cluster is normally distributed around a centroid in Euclidean space. Let's see what this means.

Let's start with a refresher

on the normal distribution. Here is a variable x, and here is its frequency distribution, assuming that x is normally distributed. Let's say x has mean 0 and standard deviation sigma. If you look at the frequency distribution, we end up with this bell-shaped curve, and most observations of x cluster closely around the mean, zero. In fact, about 68% of the values (0.34 plus 0.34 here) lie within one standard deviation of the mean, that is, plus or minus one sigma. About 95% of the values lie at a distance of at most 2 sigma from the mean, and about 99% of observations lie at a distance of at most 3 sigma from the mean. This is just a refresher; most of you will recall this from

your elementary statistics classes. Now, here is the nice thing. Suppose the data is normally distributed in each dimension. Say x here is a dimension, and the mean of x is zero, which means that the centroid of a given cluster is at zero along the x dimension. Then we know that about 68% of the points lie within one standard deviation of the centroid along the x dimension. We can say the same thing about each dimension, and we assume that all the dimensions are independent of each other. So in general, given that a point belongs to a cluster, we can quantify the likelihood of finding that point at a certain distance from the centroid of the cluster. This is the property that we're going to use in the BFR algorithm, as will become apparent later on.

Now, although in this case I've shown a single standard deviation sigma, we don't assume that the standard deviation is the same along every dimension. Each dimension has its own mean and its own standard deviation. The implication is that the clusters actually look like ellipses aligned along the axes. For example, here's the pink cluster: notice that it's a flat ellipse aligned along the x axis, which means that its standard deviation is higher along the x axis than along the y axis. The green cluster, in contrast, is more elongated along the y dimension, which means that its standard deviation along the y dimension is larger than along the x dimension. And the blue cluster looks more like a circle, which means it has roughly the same standard deviation along both dimensions.

Let's start with an overview

of the BFR algorithm. Remember that the dataset we're looking at is very large and actually sits on disk; it is so large that all of it doesn't fit into memory. So what we're going to do is page the data into memory one chunk at a time. When we bring a chunk of data in and read it, we also maintain, in memory, metadata about all the clusters that we've created so far. We bring a chunk of the data into memory, we update the metadata about the clusters, and then we throw that chunk of data away and bring in the next chunk. We continue with this process until we've gone through all the chunks of data, at which point we've ended up with our final clustering.
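The chunk-at-a-time loop just described can be sketched in code. This is a minimal, hypothetical illustration rather than the full algorithm: the function names are my own, `assign` is a stand-in for the "sufficiently close" test defined later in this segment (the Mahalanobis criterion), and the compression and retained sets are omitted. Clusters are summarized by the (N, SUM, SUMSQ) statistics introduced later on.

```python
# Sketch of the BFR outer loop: stream the data one chunk at a time,
# fold each chunk into per-cluster summaries, then discard the chunk.

def new_summary(dims):
    # Cluster summary: N (point count), SUM and SUMSQ (per-dimension).
    return [0, [0.0] * dims, [0.0] * dims]

def add_point(summary, point):
    # Incremental update: N += 1, SUM += point, SUMSQ += squares.
    summary[0] += 1
    for i, x in enumerate(point):
        summary[1][i] += x
        summary[2][i] += x * x

def centroid(summary):
    n, sums, _ = summary
    return [s / n for s in sums]

def bfr_pass(chunks, centroids, assign):
    dims = len(centroids[0])
    summaries = [new_summary(dims) for _ in centroids]
    for chunk in chunks:              # one main-memory load at a time
        for p in chunk:
            add_point(summaries[assign(p, centroids)], p)
        # the chunk itself is now thrown away; only summaries survive
    return summaries
```

In this sketch `assign` might simply return the index of the nearest centroid; real BFR would instead route points that are not sufficiently close to any centroid into the compression and retained sets.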

That's the high-level overview of the BFR algorithm. To summarize: the points are read from disk one main-memory-full at a time, and most points from previous memory loads are summarized by simple statistics, since we don't have space to store all those points. We'll see what these statistics are shortly. To begin, when we load the first chunk of data into memory, we select the initial K centroids for the K-means algorithm from that chunk using some sensible approach, for example one of the techniques we saw in the K-means lecture prior to this one.

As we go through the data, we'll see that points fall into three different categories that we need to keep track of. The first category is called the discard set, or DS. The discard set contains points that are close enough to the centroid of an existing cluster. If a point is close enough to the centroid of an existing cluster, we just update the metadata, the statistics of that cluster, to account for the fact that we've added the new point to the cluster, and then we throw the point away. That's why this set is called the discard set: we discard points that are close enough to the centroid of a cluster, so we don't need to keep them around any longer.

The second set of points is called the compression set, or CS. These are points that are close to each other but not close to any existing centroid. They form mini-clusters of their own, but they don't naturally merge with any of our existing K clusters. What we do with these points is compress them into mini-clusters and summarize them, without assigning them to any of the existing clusters. The resulting summarized sets are what we call the compression set. We'll see more about this later.

Finally, we have the third set, called the retained set, or RS. The retained set contains points that are isolated: they don't fit into any of the existing clusters, and they're not close enough to any other points to be folded into a compressed set. They're outliers that are on their own. Eventually, as we load in more points, we might find that they get close to some compression set, but at the moment they're too far away to be assigned anywhere, so they're kept, isolated, as part of the retained set. The retained set is the only set whose points we actually have to keep in memory; for both the discard set and the compression set, we can throw away the actual points and keep only the summary metadata in memory.

So here's a picture that will help you

visualize these three classes of points. Remember, there are three sets we need to keep track of: the discard set, the compression set, and the retained set. Here is one of the K clusters that we're maintaining, and here is its centroid. Any point that's close enough to the centroid of this cluster is assigned to the cluster and becomes part of the discard set; we don't need to keep track of that point anymore. Here are a bunch of smaller mini-clusters. The points in these mini-clusters are close to each other, so we can cluster them into mini-clusters, but they're too far away from any of the existing clusters, so we can't merge their points with, say, this cluster here. We keep track of them independently as mini-clusters, and the points in these mini-clusters are said to be in the compressed set; for them we only need to keep track of the mini-cluster, not each individual point. And finally, we have isolated points that are not close to any mini-cluster, not close to each other, and not close to any of the existing clusters, so we keep track of them independently as individual points. These are the points of the retained set, or RS.

So let's look at how we summarize

sets of points that are in the discard set. Remember, each discard set pertains to one of the K clusters that we're keeping track of. For each of the K clusters, we keep track of the number of points in the cluster; let's call that N. We also keep track of a vector SUM, which is just the dimension-wise sum of the N points in the cluster: the ith component of the SUM vector is the sum of the coordinates of the points along the ith dimension. Similarly, the SUMSQ vector is the sum of the squares of the coordinates: its ith component is the sum of the squares of the coordinates of all N points along the ith dimension. In a moment it will be apparent why we maintain the SUM and SUMSQ vectors; it should be fairly apparent already why we need to keep track of the number of points in a cluster.

Notice that if d is the number of dimensions, we can represent a cluster by 2d + 1 values, independent of the size of the cluster. The cluster can have 10 points or 10 million points; the number of values we use to represent it is still 2d + 1. Why is this? Well, we need one value to store N, the number of points in the cluster; the SUM is a d-dimensional vector; and the SUMSQ is also a d-dimensional vector. So overall we need 2d + 1 values to represent the cluster, independent of the number of points in it.

Now, the centroid of the cluster is just the average along each dimension, and it's very simple to calculate: we divide each component of the SUM vector by N, the total number of points, and we end up with a vector that is the centroid of the cluster. We can also calculate the variance of the cluster along any dimension. The variance along the ith dimension is given by the formula (SUMSQ_i / N) − (SUM_i / N)^2: take the SUMSQ value along the ith dimension, divide it by N, and subtract the square of the SUM value along the ith dimension divided by N. This is the formula for the variance from elementary statistics. And the standard deviation along dimension i is just the square root of the variance along dimension i. So now it should be apparent why we maintain the SUM and SUMSQ vectors for each cluster: using them and the number of points in the cluster, it's quite easy to compute both the centroid of the cluster and the standard deviation along every dimension.

Next we're going to look at how

we actually do the clustering. We page in one chunk of data points at a time, and we update the cluster metadata: the N, SUM, and SUMSQ values that we just saw, for each cluster. Once we have a chunk of points in memory, we find the points that are sufficiently close to a cluster centroid. We'll see what "sufficiently close" means in a moment, but for now assume we know that a given point is sufficiently close to the centroid of one of the K clusters, which means that point belongs in the discard set. We merge the point into the cluster and throw the point away. Here is how we do that: when we add the point to the cluster, we have to update N, SUM, and SUMSQ for that cluster. To update N, we just increment it by one, to indicate that we've added a point to the cluster. To update SUM, we just add the point, component-wise, to the SUM vector for the cluster. And to update SUMSQ, we compute the square of the point's value along each dimension and add it to the SUMSQ vector for the cluster. Here we see the advantage of the N, SUM, SUMSQ representation: it's very easy to incrementally add points to a cluster without doing complex computations.

So the points that are close to the centroid of some cluster have been added to that cluster. But what about the points that are not close to any cluster? Here is how we deal with those. We take the points in this memory load that we have not added to the discard set, together with the points in the old retained set, the points from before that we haven't done anything with, and we use some main-memory clustering algorithm to cluster them all. We could use K-means clustering, or we could use hierarchical clustering; it doesn't matter. Once we've done that clustering, the clusters that come out, which we'll call mini-clusters, go into the compression set, and the outlying points that don't fall into any cluster according to this main-memory algorithm go into the new retained set.

Now we've added a whole bunch of mini-clusters to the compression set, and it might happen that one of the new mini-clusters is actually very close to one of the existing mini-clusters, in which case we might want to merge those mini-clusters within the compression set. We'll see in a moment what metric we can use to determine that two mini-clusters are close enough to be merged.
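As a concrete illustration of why the (N, SUM, SUMSQ) representation is so convenient, here is a small sketch, with hypothetical function names, of updating a summary with a new point and then recovering the centroid and per-dimension standard deviation using the formulas given earlier:

```python
import math

# One cluster is summarized as (N, SUM, SUMSQ), as described in this segment.

def add_point(summary, point):
    # Incremental update: N += 1, SUM += point, SUMSQ += point squared,
    # all component-wise; no other computation is needed.
    n, sums, sumsqs = summary
    return (n + 1,
            [s + x for s, x in zip(sums, point)],
            [sq + x * x for sq, x in zip(sumsqs, point)])

def cluster_stats(summary):
    n, sums, sumsqs = summary
    centroid = [s / n for s in sums]
    # variance_i = SUMSQ_i / N - (SUM_i / N)^2
    variance = [sq / n - (s / n) ** 2 for s, sq in zip(sums, sumsqs)]
    stddev = [math.sqrt(v) for v in variance]
    return centroid, variance, stddev
```

For example, a one-dimensional cluster holding the points 1 and 3 is summarized as (2, [4.0], [10.0]), giving centroid 2, variance 10/2 − 2² = 1, and standard deviation 1.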

So we might consider doing that. Now, if this is not the last round, we simply stop at this point, keep the compression set and the retained set as they are, and read the next chunk of data points into memory. But if this is the last round, then we take each mini-cluster in the compression set and merge it with the cluster whose centroid is closest, and we take each point in the retained set and assign it to its nearest centroid. And that's the end of our clustering. At the end of the algorithm, we've gone through the entire dataset exactly once, paging it into memory exactly once, and we've produced K clusters.

There are two questions that we left unanswered during the discussion of the algorithm. The first question is: how do we decide whether a point is close enough to a cluster that we will add the point to that cluster? We touched on this briefly during our discussion of the algorithm, and now we're going to look at a way to determine whether a point is close enough to a cluster that we will add it to the cluster and make it part of the discard set. The second question is: how do we decide whether two mini-clusters in the compression set are close enough that they deserve to be combined into one mini-cluster, or whether we should leave them as separate mini-clusters? Let's look at that next.

Let's start with the first question, which

is how we decide whether a point is close enough to one of the cluster centroids that we should add the point to that cluster. The key concept here is the Mahalanobis distance. The Mahalanobis distance quantifies the likelihood of a point belonging to a cluster with a given centroid, and it depends crucially on our assumption that the points are normally distributed about the centroid along each dimension.

Let's say cluster C has centroid (c1, ..., cd); remember there are d dimensions, and the values c1 through cd represent the mean along each of the d dimensions. Let's say the standard deviations along those dimensions are sigma_1 through sigma_d. And let's say we have a point P = (x1, ..., xd). What we need to determine is whether this point P is close enough to the cluster C that we can add it to C, or whether it's too far away to be considered part of C.

We define the normalized distance of the point from the centroid along dimension i to be y_i = (x_i − c_i) / sigma_i. That is, we take the value of the point along the dimension, subtract the centroid along that dimension, and divide by the standard deviation along that dimension. This gives the normalized distance along that dimension. Now consider a point that, along dimension i, is at a distance of one standard deviation from the centroid. That means x_i − c_i equals sigma_i, and so the normalized distance y_i equals one. So y_i measures the number of standard deviations away from the mean along dimension i. The Mahalanobis distance of point P from cluster C is then the square root of the sum of the squares of the normalized distances y_i along each dimension.

Let's see how we use the Mahalanobis distance. Suppose point P is one standard deviation away from the centroid along each dimension. We just saw that in that case each y_i is going to be 1, since y_i measures the number of standard deviations away from the centroid along dimension i. And since there are d dimensions, the Mahalanobis distance of the point P from cluster C is going to be the square root of d. Going back to our bell curve for the Gaussian, or normal, distribution: the probability that a point lies within one standard deviation of the centroid in each dimension is 0.68 (0.34 plus 0.34 here), so about 68% of a cluster's points have Mahalanobis distance at most the square root of d. Similarly, about 95% of points have Mahalanobis distance within 2 times the square root of d, and about 99% of points have Mahalanobis distance less than 3 times the square root of d.

This motivates the Mahalanobis acceptance criterion for the BFR algorithm: accept point P into cluster C if its Mahalanobis distance from the cluster centroid is less than some threshold. For example, we might set the threshold to be 3 times the square root of d, which means that if a point truly belongs to the cluster, there is less than a 1% chance that it lies farther away than this threshold. The second question that we left open

during our discussion of the algorithm was whether two subclusters, or mini-clusters, in the compression set should be combined into one. A simple way to address this question is to compute the variance of the combined subcluster. The way we do that is to just add the N, SUM, and SUMSQ values of the two subclusters to each other; then we can compute the centroid and the variance of the hypothetical combined subcluster. If we find that the combined variance is below some threshold, that may indicate that the two subclusters actually do belong together, and we might decide to combine them into a single cluster in the compression set.

This is just one approach to answering this question; there could be others. For example, some dimensions may be more important than others in the data: height may be more important than weight, say, and so we might want to weight dimensions differently. Or we might want to use a density-based metric, like the ones we studied when we looked at hierarchical clustering.
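To tie the two criteria together, here is a small sketch of both tests: the Mahalanobis acceptance criterion for adding a point to a cluster, and the combined-variance check for merging two (N, SUM, SUMSQ) mini-cluster summaries. The function names and the default threshold constant are illustrative choices, not prescribed by the algorithm:

```python
import math

def mahalanobis(point, centroid, stddev):
    # Square root of the sum of squared normalized distances
    # y_i = (x_i - c_i) / sigma_i, one per dimension.
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, stddev)))

def accept(point, centroid, stddev, k=3.0):
    # BFR acceptance criterion: admit the point if its Mahalanobis distance
    # is below k * sqrt(d); with k = 3 we reject less than ~1% of points
    # that truly belong, under the normality assumption.
    return mahalanobis(point, centroid, stddev) < k * math.sqrt(len(point))

def combined_variance(s1, s2):
    # Tentatively merge two (N, SUM, SUMSQ) summaries by component-wise
    # addition; a small combined per-dimension variance suggests the two
    # mini-clusters belong together.
    n = s1[0] + s2[0]
    sums = [a + b for a, b in zip(s1[1], s2[1])]
    sumsqs = [a + b for a, b in zip(s1[2], s2[2])]
    return [sq / n - (s / n) ** 2 for s, sq in zip(sums, sumsqs)]
```

For instance, a point one standard deviation away along each of four dimensions has Mahalanobis distance sqrt(4) = 2, well under the 3·sqrt(4) = 6 threshold, so it would be accepted into the cluster.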
