Euclidean Distance & Cosine Similarity | Introduction to Data Mining part 18

Euclidean Distance & Cosine Similarity | Introduction to Data Mining part 18


So when we’ve got real values, and this is sort of
a primer for the boot camp, a reminder
for those of you who’ve been out of math
classes for a while, when we’ve got purely continuous
data, we will often use Euclidean
distance as a way of measuring
similarity, or really dissimilarity, because it gets higher the
more unlike the objects are. So this formula
might be a little intimidating to some people. But I promise you that you
are familiar with Euclidean distance. You just maybe
don’t know the term. Euclidean distance
is what you’d hear called just “the distance formula”
in your high school
algebra classes. And most people have seen it in
two dimensions, and sometimes three. But one of the very nice things
about the Euclidean distance is that it generalizes very
naturally to as many dimensions as you want. So in order to calculate the
Euclidean distance between two data objects, we take the
difference in each attribute value, square it, and then sum
that and take the square root. So for instance, we
have four points here, p1 at (0,2), p2 at (2,0), p3 at (3,1),
and p4 at (5,1), all plotted at different places. And we can construct
a distance matrix describing how dissimilar
all of our points are. So p1 and p4 are the
most dissimilar. They’re the farthest
apart, whereas p2 and p3 are the most similar. They’re the closest together. p3 is also fairly
similar to p4, whereas p2 is somewhat
less similar to p4.
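
To make that concrete, here is a minimal Python sketch (not from the lecture itself) that builds the same Euclidean distance matrix for those four points:

    import math

    # The four points from the example
    points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

    def euclidean(a, b):
        # difference in each attribute, squared, summed, square-rooted
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # print one row of the distance matrix per point
    for name_i, p_i in points.items():
        row = [round(euclidean(p_i, p_j), 3) for p_j in points.values()]
        print(name_i, row)

Running it, p1 and p4 come out farthest apart (about 5.099) and p2 and p3 closest together (about 1.414), matching the comparisons above.
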
So another distance metric
that we see, particularly in the context of documents,
is called cosine similarity. So we have documents. We have turned them
into term vectors. We can find how similar, and cosine similarity is
a measure of similarity, not of dissimilarity. We can find how similar
the two documents are by thinking of each of them
as vectors, taking their dot product– which, for those
of you who never had it or don’t remember
your college vector calculus classes– you take each attribute,
attribute by attribute, and you multiply them
together across your two different objects. So 3 times 1, 2
times 0, 0 times 0. Maybe this is play and this is
coach and this is tournament. And so we’ll do our
count, and then we’ll multiply them all together
document to document, and sum that all up. And then we end up dividing by
the product of the magnitudes. Each magnitude is just: you square each
attribute, add them all up, and take the square root, and then we
multiply the two magnitudes together. So in this case we have
a dot product of 5. We have magnitudes for d1 and d2
of 6.481 and 2.449. So we multiply these two
together and divide 5 by that. And we end up with a
cosine similarity of 0.315.
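
Here is the same calculation as a small Python sketch. The full ten-component term vectors are an assumption on my part; the lecture only reads out the first few products, but these vectors reproduce the quoted dot product of 5, the magnitudes 6.481 and 2.449, and the 0.315 result:

    import math

    # Assumed term-count vectors; only the first few entries are read out
    # in the lecture, the rest are filled in to match the quoted numbers.
    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

    # dot product: multiply attribute by attribute, then sum
    dot = sum(x * y for x, y in zip(d1, d2))       # 3*1 + 2*0 + ... = 5

    # magnitude of each vector: square each attribute, sum, square root
    mag_d1 = math.sqrt(sum(x * x for x in d1))     # about 6.481
    mag_d2 = math.sqrt(sum(x * x for x in d2))     # about 2.449

    print(dot / (mag_d1 * mag_d2))                 # about 0.315
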
Cosine similarity is a really
nice metric for documents
clean 0 to 1 measurement that suffers less from the
curse of dimensionality than something like
Euclidean distance does. Because document vectors
tend to get very, very long, since there are a lot
of different words in a given language and any given
document might contain many of
them, cosine similarity is a way to avoid some of
the curse of dimensionality. And we’ll talk
about this more when we talk about encoding documents
more directly in the boot camp.

5 thoughts to “Euclidean Distance & Cosine Similarity | Introduction to Data Mining part 18”

  1. Sorry, I'm trying to understand how you arrived at 0.3150 for the cosine similarity.
    You said: "so we multiply those two together (magnitudes) and divide 5 by that". I've tried the formula you stated, the dot product of d1 and d2 divided by the product of the magnitudes:
    5 / (6481 * 2245) = 3.436462…

    Also, in the first part, I didn't understand how you calculated the distances. Is it the Pythagorean theorem?
