 # 6 4 The BFR Algorithm 25 01

Welcome back to Mining
of Massive Datasets. Be looking at K-means algorithms for
clustering. And we’ve sort of seen that the K-means
algorithm is, you know, produces good clusters and it’s linear for each scan or
each round of the K-means clustering. But it can take a really, large number of rounds to convert
the K-means clustering algorithm. So, our question is: Can we come up with
an algorithm that can do something like k-Means clustering but in one pass
over the data, if you have a really, really large dataset that
doesn’t fit in memory? The answer to that question is
going to cover in this segment. BFR is a variant of K-means. And is sort of optimized for really large
data set that are actually disclicids. The data site is so
large that it doesn’t fit in memory. And so you really want to just
can the dataset once in disk and come up with the clustering. However, to accomplish this the BFR
algorithm makes a very strong assumption. It assumes,
not only that we have a Euclidian space, which is an assumption that all K-means
algorithms make, but it also assumes that each cluster is normally distributed
around a centroid in Euclidean space. Let’s see what this means. Let’s start with a refresher
on the normal distribution. Here’s a variable, x, and here its,
here is its frequency distribution. Assuming that x is normally distributed. Let’s say x has, mean 0 and
standard deviation sigma. Then, if you look at the frequency
distribution, it’s it’s apparent that, you know, we, we sort of end
up in this bell-shaped curve. and, most, observations of x are going to
cluster very closely around the mean zero. In fact about 68% of the values which is
you know here 0.34 plus 0.34 so that’s 68% of observations are going to lie within
the one standard deviation of the mean. Plus or minus one sigma from the mean. Right? and, about 95% of the values
are going to lie, at a distance of at most 2
sigma from the mean, plus or minus 2 sigma from, from the mean. And about 99% of observations
are going to be at a distance of about 3 sigma from the mean. All right so, this is just a refresher. It should be evident to, most of you should recall this from
your elementary statistics classes. Now the, the nice thing, if, if you assume tha,t the, that the data is
normally distributed in each dimension. Let’s say X here is a dimension,
and, the mean of x is zero, which means that the centroid in x
dimension for a given cluster is at zero. Then, we, we sort of know that,
about 68% of the points, lie within one standard deviation,
off the centroid, along x dimension. And we can say the same thing
about each dimension, and we assume that all the dimensions
are independent of each other. So in general, we can quantify the
likelihood given, you know, given a point, given the point, belongs to a cluster,
we can quantify the likelihood of finding that point at a certain distance
from the centroid of the cluster. And this is the property that we are going
to use, in the BFR algorithm as, you know, as it’ll become apparent later on. Now recall that,
although in this case, you know, I’ve shown,
single standard deviation sigma. We don’t make the assumption that
a standard deviation is actually the same along every dimension. Each dimension actually has its own mean,
and its own standard deviation. Now the implication of the, of this idea
that each dimension has its own mean and its own standard deviation is that the,
the, the clusters actually look like, ellipsis, that are aligned along the axis. For example, here’s here’s a cluster
the pink cluster, and the, notice that it’s, it’s off a flat ellipse
that’s aligned along the, the x axis. Which means that the, the standard
deviation is higher along the x axis than along the y axis for
this particular cluster. Whereas the green cluster here,
has, you know, is more elongated along the y dimension. Which means that its standard deviation
along the y dimension is more than its standard deviation along the x dimension. Whereas the blue cluster is like,
looks more like a circle, which means that it has roughly the same
of the BFR algorithm. Remember that the data, set that
we’re looking at is very, very large. So and it’s actually on disc. Now the data is so large,
that it doesn’t actually fit. All of it doesn’t really fit into memory,
right? So what we’re going to do is we’re going
to page the data one chunk of the data in, you know, at, at a time into memory. Right? When we bring a chunk of data in,
and, and read it, we also, at the same time, in memory,
that we’ve created thus far. Right?
We’re going to bring a chunk of the data into memory. We’re going to update
that chunk of data away. And then we’re going to bring
the next chunk of data into memory. And we’re going to continue with this
process until we’ve gone through all, all the chunks of data, at which point
we’ve ended up with our final clustering. Okay?
This is the, this is the high-level overview of the,
BFR algorithm. To summarize, the points are read from
disk, one main memory full at a time. Right? And most points from previous memory
loads are actually summarized by simple statistics, since we don’t
have space to store all those points, we summarize all those points
with simple statistics. And we’ll see what these statistics are,
going forward. To begin with we’re going to, when we
load the first set of data, or the first chunk of data into memory, we’re going
to select, the initial K centroids for, the k-Means algorithm from that chunk
of data using some sensible approach. For example you might use one of
the techniques that we saw in the, in the k-Means lecture, prior to this. Now as we go through, we,
we’ll see that points fall into three, three different categories that we, that
we need, that we need to keep track of. The first category of points
are points that are close enough to the centroid of an existing cluster. So if a point is close to a centroid of an
existing cluster, then we just update the, metadata, the statistics of
that cluster to account for the fact that we’ve added
the new point to the cluster. And then we throw the point away. That’s why this set of point
is called the discard set. Because we discard points that are close
enough to the centroid of a cluster. So we don’t need to keep
them around any longer. The, the second set of points is
called a compression set, or the CS. Now, these are points that
are close to each other, but are not close to any existing, centroid. Right? So these points form a separate
mini cluster of their own, but they don’t really naturally merge
with any of our existing K-clusters. So what we do with these points, is that
we compress them into mini clusters and summarize them. But we don’t assign them to
any of the existing clusters. and, and the, the resulting, you know, com, summarized sets are what we
call the compression set, or the CS. And we’ll see more about this later. Finally, we have the, the third set,
which is called the retained set. And a retained set contains
points that are kind of isolated. They don’t fit into any of the existing
clusters, and they’re not close, close together to any
of the other points so they can be, you know,
made by any compressed set. They’re kind of outliers
that are on their own. And these are called the retained set or
RS. These are isolated points. and, you know, we might eventually, fu,
in the future, as we load in more points, we might find that they get close to,
you know, a, a compression set. But at the moment they are too far
away to be assigned to, and, and they just kept, isolated,
as part of the retained set of points. So the retained set of points. Is the only set of points that we
actually have to keep track of in memory. Both the discard set and
the compression set, we can throw away the actual points and
visualize these three, class of points. Remember, there are three,
sets that we need to keep track of. The discard set, the compression set and
the retained set. So heres, you know a cluster, one of the
K-clusters that, that we’re maintaining, and, here is it’s centroid, and any point
that’s close enough to, to the centroid of this cluster is assigned to this cluster
and becomes part of the discard set. We don’t need to keep track
of the point anymore. Now here are a bunch of,
you know, smaller, mini-clusters. Now these are compressed sets. These, the points in, in these
mini-clusters are close to each other, so we could actually cluster them into
these mini-clusters, but these, but they’re too far away from any
of the existing clusters, so I can’t really merge the,
points in this mini cluster with the, with this cluster here, but I can keep track
of them independently as a mini cluster. So the points that are in these
mini clusters are said to be in the compressed set. So for the points, we just need to
keep track of the mini cluster. We don’t need to keep track of the,
each individual point. And so
these points are in the compressed set. And finally, we have isolated points that
are not close to any mini-cluster or they are not close to each other. They are not close to any
of the existing cluster, so we just keep track of them
independently as individual points and these are the points that
are the retained set, or RS. So let’s look at how we summarize,
sets of points that are in the, in the discard set. Remember the discard set is,
is particular this discard set, pertains to one of the K-clusters
that we’re keeping track of. So each of the K-clusters, we,
we keep track of the number of points in the cluster, let’s,
let’s say, let’s call that N. we, keep, track of vector SUM, and SUM is just the,
sum of the points that are in the cluster. So there are N points in the cluster. And we, sum them, dimension wise. And the SUM vector is just the sum of
all the points that are in the cluster. So the ith component of the SUM vector
is the sum of the coordinates of the points along the ith dimension,
and so on. And similarly,
the sum square vector is just the sum of the squares of the coordinates. So the ith component of the, sum square
vector is the sum of the squares of the coordinates of all the end-points
along the ith dimension. In a moment, it’ll be apparent
why we maintain the sum and the sum square vectors. It should be fairly apparent that we
need to keep track of the number of points in a cluster. Notice that if d is the number
of dimensions, we can represent a cluster by 2d plus 1 values,
independent of the size of the cluster. The cluster can have 10 points or
10 million points. The number of values that we use to
represent that cluster is two d plus one, where d is the number of dimensions. Why is this? Well we need one, one point or one value, for to, to, to store n, the number of,
the number of points in the cluster. The sum is a d dimensional vector, and the sum squares are on
the D dimensional vector. So, overall, we need two d + 1
values to represent the cluster, and this is independent on the number
of points in the cluster. Now, the centroid of the cluster is
just the average along each dimension. And the, the, it’s very simple
to calculate the centroid. We just have to go to the SUM vector and divide each component of the SUM vector
by N and the total number of points, and then we end up with a vector that’s
just the centroid of the cluster. We can also calculate the variance of the,
of the, of the cluster along any dimension. The variance of the, cluster along the ith dimension is
just given by this formula here. You take the, SUMSQ values along the ith
dimension, divide them by N, and then subtract the square of the SUM values
along the ith dimension divided by N. This is, from elementary statistics,
the, the, the, the formula for standard, the variance,
from elementary statistics. And the,
standard deviation along the dimension i, it, it’s the square root of
the variance along the dimension i. All right, so we just take the, the
variance along dimension i, and you take the square root of that, and you get
the standard deviation along, dimension i. So now it should be apparent,
why we maintain the, the sum and the sum square of vectors,
for, for each cluster. Because using them and
the number of values in the cluster, it’s quite easy to compute both the,
the centroid of the cluster. And the standard deviation
along every dimension. Next we’re going to look at how
we actually do the clustering. We’re going to] page in one chunk of, data points at a time and they’re
going to update, the cluster metadata. And we just saw the metadata that we’re
going to, that it keep track of for each cluster, the N, the SUM and
the SUM squared, values. So once we have a chunk of points in a,
in memory, we’re going to find the points that are
sufficiently close to a cluster centroid. And we’ll see what sufficiently
close means in a moment, but let’s just assume that, that for
now we know that given a point that it’s, it’s sufficiently close to a cluster
centroid for one of the, the K-clusters, which means, that it is that, which means
at that point it is in the discard set. They’re going to merge the point of
the cluster and throw the point away. And here is how we do that. We, we add the point to the cluster and
then we add the point to the cluster we have to update the N, the sum, and
the sum squared for that cluster. To our just end, we just are incremented
by one to indicate that we’ve added a value to the,
added a point to the cluster. To income and sum, we just have to add the
point to the sum value for the cluster. And to update sum square, we just have re,
to compute the square of the, the, the values, of the,
of that point along each dimension. And add that to the sum square vector for
the cluster. Now we see the advantage of the NSUM and
SUMSQS presentation. It’s very easy to incrementally add value,
you know, add points to a cluster and increment for a cluster without
doing complex computations. Now the points that are actually close to, the centroid of a cluster we’ve added,
to a cluster. However, the points that are not
close to any cluster, we know, we know how to deal with those points. What we’re going to do
is to take these points, the points that are in this memory load
of points, but that are not, you know, that we have not,
added to the discard set. And you’re going to take those points, and we’re going to add in the old, retained
set, the, the, those points from before. That they’ve not done anything with. And you’re going to take these new
points and the old reading set, and we are going to use some main memory
clustering algorithm to cluster them all. You would use either Kamian’s clustering,
or we could use hierarchical clustering. It doesn’t matter. We use any main memory clustering
algorithm and recluster these points. Now, once we do that clustering,
the clusters that, that come all, which we’ll call,
we’ll call these the mini clusters. They go into the compress set. And the outlying points, that don’t,
fall into any cluster in this, according to this main memory algorithm. Go into the new, retained set. Now we have, now we’ve added a whole bunch
of mini-clusters into the compressed set. And it might happen that two of the new,
mini clusters, you know, one of the new mini-clusters that we,
we’re adding to the compressed set is actually very close to one of the existing
mini-clusters so we might want to merge, these, these mini-clusters in,
in, in the compressed set. and, so that’s, we will see in a moment,
what metric we might use to determine that, mini-clusters are close
enough and, and, and need to be merged. Right?
So we might consider doing that. and, and if this is the last round,
You know, if this is not the last round,
we just drop at this point. And we just read in the next chunk
of data, data points into memory. But if this is the last round,
then we take each, mini-cluster, or each compressed set in the CS. And we merge it to its
closest cluster centroid. And we take each point
in the retained set, or the RS, and we, we assign it to,
to its nearest centroid. And that’s the end of our clustering. If this is not the last round,
we just keep the compressed set and the retained set as it is. And we read in the next
chunk of data points, and we repeat the whole process again. So the end of, the algorithm,
we have gone through the, the entire data set exactly once. We page the entire data set to
memory exactly once, and we, we’ve produced K-clusters. And there are two questions that, we left unanswered during
the discussion of the algorithm. The first question is,
how do we decide if a point is close enough to a cluster that we
our discussion of the algorithm, and now we’re going to look at, a way to determine
if a point is close enough to a cluster that we will add it to the cluster and
make it part of the discard set. The second question,
which we left unanswered, is how do we decide whether two
mini-clusters in the compressed set are close enough that they deserve to
be combined into one mini-cluster, or whether we should just leave
them as separate, mini-clusters. Let’s look at that next. Let’s start with the first question which
is how do we decide whether a point that, that we is close enough to one
of the cluster centroids that we decide to add that point to the cluster? The key, concept here is something
called the Mahalanobis distance. And the Mahalanobis distance,
quantifies the, likelihood of a point belonging,
to, a centroid. Let’s, and the Mahalanobis distance, depends very, crucially on our
the centroid along each dimension. Let’s say, cluster C has, centroids, C one through C d
remember there are d dimensions. And the value C one through C d
represent the, the mean along each of the d dimensions so the centroid
of the cluster is C one through C d. And, let’s say the standard dimen,
dimensions, along those dimensions, along, standard deviations along those
dimensions are sigma 1 through sigma d. And let’s say we have a point,
x1 through xd. Now, what we need to determine is if this
point P is close enough to the cluster C, so that we can add the point
P to the cluster C, or is it too far away to be
considered part of the cluster C? We’re going to define,
the normalized distance of a point, along dimension i from the centroid to be yi,
which is xi minus ci divided by sigma i. We take the, the value of the point along
the dimension, subtract the centroid along the dimension, and divide by the standard
deviation along that dimension. And this gives the normalized
distance along that dimension. Now, consider the point, along dimension
i that is at a distance of one standard deviation away from the centroid. What this means is that x i
minus c i is equal to sigma i. But if x i minus c i is equal to sigma i, then the normalized distance in dimension
i, which is y i, is equal to one. So Y i measures the number of standard
deviations away from the mean, along dimension i. The Mahalanobis distance of point P
from cluster C is the square root of the sum of the squares of the normalized,
distances Y i, along each dimension. Let’s see what,
how we use the Mahalanobis distance. Now suppose point P is one
standard deviation away from the centroid along each dimension. Now we saw that in that case,
each yi is going to be 1, since yi measures the number of standard deviations
away from the centroid along dimension i. And since there are D dimensions, the,
the mahalanobis distance to the point P from cluster C is going to be square root
of D, where D is the number of dimensions. Now going back to ou,r bell curve from
the, Gaussian, or normal, distribution. the, the probability,
that each YI is equal to one, or that the point is one standard
deviation of A along each dimension. 1 standard deviation, not less,
in each dimension, is 0.68. 0.34 + 0.34 here, and so
the probability, that, that point is Mahalanobis distance of
square root of d or less, is 68 percent or point, 68 of those points are in
this area for each dimension, right? And similarly, the probability that
the Mahalanobis distance of a point belonging to a cluster is,
within, two times square of D, from, it-it-it’s-it’s 95%, and, an-and 99% of points have Mahalanobis distance
less than three times, square root of D. So, this sort of motivates us to,
formulate the Mahalanobis acceptance criterion for the BFR algorithm, which
is to accept point P into the cluster C. It its Mahalanobis distance
from the center, is, from the cluster centroid is
less than some three sets ratio. For example, we might, bring the threshold
to be three times, square root of D. Which means that, it’s, the-the-the prob,
that it’s less than 1% probability, that the, the-the pont, is, you know,
does not belong to that, cluster. The second question that we left open
during our discussion of the algorithm, was whether two subclusters or mini clusters in the compress set should
be combined or compressed into each other. The simple way to address this
question is to compute the variance of the combined sum, subcluster. The way we do that is to just add N,
SUM and SUM square values of the two
clusters to each other. And then we can compute the,the centroid,
and the variance of a combined
hypothetical combined sub-cluster. If we find that the combined
variance is below, some threshold. Uh,then that might indicate to us
that these two clusters actually do belong together, and
we might decide to combine those two, sub-clusters in the compressed
set into a single cluster. Now this is just one approach,
to this, to answer this question and there, there,
there could be other approaches. For example, some dimensions will
be more important than others. You know, in the, in,
in the data for example. Height may be more,
more important than weight, in, and so we might want to rate dimensions,
differently. Or we might want to use
a density-based metric, like we, studied when we looked at
algorithmic clustering

## One thought to “6 4 The BFR Algorithm 25 01”

1. shivangi gupta says:

Thank you!
Very well explained