So when we’ve got real values– and this is sort of a primer for the boot camp, a reminder for those of you who’ve been out of math classes for a while– when we’ve got continuous data, purely continuous data, we will often use Euclidean distance as a way of measuring similarity. Really, as a way of measuring dissimilarity, because it’s higher the more unlike the objects are.

This formula might be a little intimidating to some people, but I promise you that you are familiar with Euclidean distance; you just maybe don’t know the term. Euclidean distance is what you’d hear called the distance formula in your high school algebra classes. Most people have seen it in two dimensions, and sometimes three, but one of the very nice things about Euclidean distance is that it generalizes very naturally to as many dimensions as you want. So in order to calculate the Euclidean distance between two data objects, we take the difference in each attribute value, square it, sum those squares, and take the square root.
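For reference, the formula being described is the standard Euclidean distance between two objects $x$ and $y$ with $n$ attributes:

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$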

So for instance, we have four points here at (0,2), (2,0), (3,1), and (5,1), all plotted at different locations. And we can construct a distance matrix describing how dissimilar all of our points are. So p1 and p4 are the most dissimilar; they’re the farthest apart. Whereas p2 and p3 are the most similar; they’re the closest together. p3 is also fairly similar to p4, whereas p2 is somewhat less similar to p4.
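A minimal sketch of that calculation, assuming the four points are labeled p1 through p4 in the order given above:

```python
import math
from itertools import combinations

# The four example points, in the order given above.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """Difference in each attribute, squared, summed, then square-rooted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Print each entry of the (symmetric) distance matrix, pair by pair.
for (name_a, a), (name_b, b) in combinations(points.items(), 2):
    print(f"d({name_a}, {name_b}) = {euclidean(a, b):.3f}")
```

This should print d(p1, p4) ≈ 5.099 as the largest distance and d(p2, p3) ≈ 1.414 as the smallest, matching the comparisons above.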

So another distance metric that we see, particularly in the context of documents, is called cosine similarity. So we have documents, and we have turned them into term vectors. Cosine similarity is a measure of similarity, not of dissimilarity: we can find how similar two documents are by thinking of each of them as vectors and taking their dot product. For those of you who never had it, or don’t remember your college vector calculus classes, you take each attribute, attribute by attribute, and you multiply them together across your two different objects. So 3 times 1, 2 times 0, 0 times 0. Maybe this attribute is “play” and this is “coach” and this is “tournament.” And so we’ll take our counts, multiply them together document to document, and sum that all up. And then we end up dividing by the product of the magnitudes. A magnitude is just: you square each attribute, add them all up, and take the square root.
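Written out, this is the standard cosine similarity formula, where $d_1 \cdot d_2$ is the dot product and $\|d_1\|$, $\|d_2\|$ are the magnitudes:

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$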

So in this case, we have a dot product of 5. We have magnitudes for d1 and d2 of 6.481 and 2.449. So we multiply those two together and divide 5 by that, and we end up with a cosine similarity of 0.315.
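The full term vectors aren’t read out here, so the ones in this sketch are assumptions, chosen to be consistent with the products mentioned above (3×1, 2×0, 0×0) and to reproduce the numbers on the slide:

```python
import math

# Hypothetical term-count vectors; only the first few products (3*1, 2*0,
# 0*0) are read out in the lecture, so the rest are assumed for illustration.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

# Dot product: multiply attribute by attribute, then sum.
dot = sum(a * b for a, b in zip(d1, d2))    # 5

# Magnitudes: square each attribute, add them up, take the square root.
mag_d1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
mag_d2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6)  ≈ 2.449

print(round(dot / (mag_d1 * mag_d2), 3))    # 0.315
```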

Cosine similarity is a really nice metric for documents because it gives us this very clean measurement (0 to 1 for term counts, which can’t be negative; in general, cosine similarity ranges from -1 to 1) that suffers less from the curse of dimensionality than something like Euclidean distance does. Because document vectors tend to get very, very long– there are a lot of different words in a given language, and a given document might have lots of different words in it– cosine similarity is a way to avoid some of the curse of dimensionality. And we’ll talk about this more when we talk about encoding documents more directly in the boot camp.
