Lecture 4: Machine Learning 3 – Generalization, K-means | Stanford CS221: AI (Autumn 2019)

Homework 1 has a hard deadline. In section tomorrow, we're going to go through the backpropagation example, which I went through very briefly in the last lecture, talk about nearest neighbors, which I did in one minute, and also talk about scikit-learn, which is a really useful tool for doing machine learning and might be useful for your final projects. So please come to section.

All right, let's do a little review of where we are. We've been talking about machine learning, in particular supervised learning, where we start with feature extraction: we take examples and convert them into feature vectors, which are more amenable to our learning algorithm. We can use either linear predictors or neural networks, which give us scores; the score is defined either via a simple dot product between the weight vector and the feature vector, or via some fancier non-linear combination. At the end of the day, these model families give us score functions, which can then be used for classification or regression. We also talked about loss functions as a way to assess the quality of a particular predictor. In linear classification, we had the zero-one loss and the hinge loss as examples of loss functions we might care about. The training loss is an average over the losses on individual examples. And to optimize all of this, we can use stochastic gradient descent, which takes an example (x, y), computes the gradient on that particular example, and then updates the weights based on it. Hopefully this is all review; a minimal sketch of that update loop follows.
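Here is a rough sketch of stochastic gradient descent on the hinge loss for a linear classifier, just to make the review concrete. The toy data, feature representation, step size, and iteration count are all illustrative assumptions, not course code.

```python
# Sketch: stochastic gradient descent on the hinge loss for a linear
# classifier over sparse feature vectors. Toy data and hyperparameters.
from collections import defaultdict

train = [({"good": 1, "plot": 1}, +1), ({"bad": 1, "plot": 1}, -1)]

def dot(w, phi):
    return sum(w[f] * v for f, v in phi.items())

w = defaultdict(float)   # weight vector, starts at zero
eta = 0.1                # step size
for epoch in range(100):
    for phi, y in train:
        # Hinge loss: max(0, 1 - (w . phi) * y). Its gradient is
        # -phi * y when the margin is below 1, and zero otherwise.
        if dot(w, phi) * y < 1:
            for f, v in phi.items():
                w[f] += eta * v * y   # w <- w - eta * gradient

print(dict(w))
```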
Now let me ask a question and be a little philosophical: what is the true objective of machine learning? How many of you think it is to minimize the error on the training set? Show of hands. No one? But this is what we've been talking about: minimizing error on the training set. Okay, maybe that's not right, then. What about minimizing training error with regularization? Regularization is probably a good idea; how many of you think that's the goal? What about minimizing error on the test set? That seems closer: test accuracy is something you might care about. What about minimizing error on unseen future examples? The majority of you think that's the right answer. What about learning about machines? Who doesn't want to learn about machines? That's actually the true objective. The correct answer is minimizing error on unseen future examples. I think all of you have the intuition that we learn on data, but what we really care about is how the predictor performs in the future, because we're going to deploy it in a system, and the future is unseen. So how do we think about all these other things, the training set, regularization, the test set? That's something we'll come back to later.

There are two topics today. I want to talk about generalization, which is a subtle but important thing to keep in mind when you're doing machine learning. And then we're going to switch gears and talk about unsupervised learning, where we don't have labels but can still do something useful.

So we've been talking about training loss. I've made a big deal about writing down what you want and then optimizing it. So the question is: is the training loss a good objective function? Let's take it literally. Suppose we really wanted to just minimize the training loss; what would we do? Here's an algorithm for you: store your training examples, and define your predictor as follows. If you see a particular example from your training set, output the label you saw in training. Otherwise, segfault and crash. This is great! It minimizes the training error perfectly; it gets zero loss, assuming your training examples don't have conflicts. But you're all laughing because this is clearly a bad idea. A toy version of this memorization predictor is sketched below.
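Here is that "algorithm" written out, purely to make the joke concrete; the toy examples are hypothetical.

```python
# Sketch of the "memorize the training set" predictor described above:
# zero training error, useless on anything unseen.
train = {("good", "movie"): +1, ("bad", "movie"): -1}

def predict(x):
    if x in train:
        return train[x]              # perfect on the training set
    raise RuntimeError("segfault")   # crashes on anything unseen

print(predict(("good", "movie")))    # +1
# predict(("new", "movie"))          # would raise: the extreme overfit
```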
Somehow, purely following this minimize-the-training-error objective is not the right thing. This is an extreme example of overfitting. Overfitting is the phenomenon where you have some data, usually with noise, and you try too hard to fit a predictor to it. If you fit a green squiggly line through every point, you get zero training error, but you miss the big picture, which is the underlying black curve. The same thing happens in regression: you have a bunch of noisy points, and if you try hard enough you can fit them exactly, but you miss the general trend. Overfitting can really bite you if you're not careful.

So let's formalize this a little more. How do we assess whether a predictor is good? Because if we can't measure it, we can't optimize it. The key idea is that we care about error on unseen future examples. That's a fine aspiration to write down, but how do we actually optimize it? It's the future, and it's unseen. How do we get a handle on it? Typically, people do the next best thing: gather a test set that is supposed to be representative of the kinds of inputs you will see, and then guard it carefully and make sure you don't touch it too much. Because what happens if you start looking at the test set, or in the worst case train on it? The test set's role as a surrogate for unseen future examples completely goes away. Even just looking at it repeatedly while trying to optimize it can put you in the overfitting regime, so be careful. I want to emphasize that the test set is only a surrogate for what you actually care about; don't blindly try to make test accuracy numbers go up at all costs. But for now, let's assume we have a test set to work with.

There's a peculiar thing about machine learning, which is a kind of leap of faith: the training algorithm only operates on the training set, and then all of a sudden you move to unseen examples, or the test set, and you're expected to do well. Why would you expect that to happen? As I alluded to on the first day of class, there are some pretty deep mathematical reasons why this works, but rather than get into the math, I want to give you an intuitive picture of this gap.

Remember the picture of all predictors: all the functions you could possibly want in your wildest dreams. When you define a feature extractor or a neural net architecture, or any sort of structure, you're saying, "I'm only interested in this set of functions, not all functions." Learning then tries to find some element of the class of functions you've set out. There's a useful decomposition here. Let g be the best function in your class, the best predictor you could possibly get: if some oracle came and set your neural net weights optimally, how well could you do? Now there are two gaps. The first is approximation error: the difference between f*, the true predictor that always gets the right answer, and g, the best thing in your class. This measures how good your hypothesis class is. Remember, last time we said we want the hypothesis class to be expressive: if you only have linear functions and your data looks sinusoidal, the class is not expressive enough to capture the data. The second is estimation error: the difference between the best thing in your hypothesis class and the function you actually find. This measures how good the learned predictor is relative to the potential of the hypothesis class. You've defined the set of functions you're willing to consider, but at the end of the day, based on a finite amount of data, you can't get to g; learning only gives you some estimate, f-hat.

In more mathematical terms, look at the error of the predictor your learning algorithm actually returns, minus the error of the best predictor possible, which in many cases is zero. All we do is subtract and add the error of g, and then read off the two terms: the estimation error is the difference between the error of the learned predictor and that of g, and the approximation error is the difference between the error of g and that of f*.
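Written out (with f-hat the learned predictor, g the best predictor in the hypothesis class, and f* the true predictor), the decomposition is just adding and subtracting Err(g):

```latex
\mathrm{Err}(\hat f) - \mathrm{Err}(f^*)
  = \underbrace{\mathrm{Err}(\hat f) - \mathrm{Err}(g)}_{\text{estimation error}}
  + \underbrace{\mathrm{Err}(g) - \mathrm{Err}(f^*)}_{\text{approximation error}}
```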
This decomposition is a useful way to conceptualize the trade-offs, so let's explore it a little. Suppose I increase the size of the hypothesis class: I add more features, or I increase the dimensions of my neural network. What happens? The approximation error will go down. Why? Because we're taking a minimum over a larger set: g always achieves the minimum possible error over the class, and if I make the class larger, I have more possibilities for driving that error down. But the estimation error will go up as I make my hypothesis class more expressive, because it's harder to estimate something more complex. I'm leaving that vague for now; there's a mathematical way to formalize it, which you can ask me about offline. So you can see there's a tension here: you want to make your hypothesis class large enough to drive down the approximation error, but not so large that it becomes impossible to estimate.

Now that we have this abstract framework, what knobs can we tune? How do we control the size of the hypothesis class? We're going to talk about essentially two strategies. Strategy one is dimensionality. Remember that for linear classifiers, a predictor is specified by a weight vector, which is d numbers, and we can change d: make it smaller by removing features, or larger by adding features. Pictorially, reducing d reduces the dimensionality of your hypothesis class: in three dimensions you have three degrees of freedom, a ball; remove one of the dimensions and you have a disc in two dimensions. Concretely, and this is a bit heuristic, you can manually add features if they seem to help and remove features if they don't, modulating the dimensionality of your weight vector. There are also automatic feature selection methods, such as boosting or L1 regularization, which are outside the scope of this class; if you take a machine learning class you'll learn more about them. The main point is that by setting the number of features, you can vary the expressive power of your hypothesis class.

The second strategy is to look at the norm, the length, of the weight vector. This one is maybe a little less obvious. Again, for linear predictors, the weight vector is a d-dimensional vector, and you look at how long it is. Pictorially, each weight vector w can be thought of as a point, and a circle contains all the weight vectors up to a certain length; by making the circle smaller, you're considering a smaller set of weight vectors. At that level it's perhaps intuitive, but what does this actually look like? Suppose we're doing one-dimensional linear regression, plotting x against y. In one dimension, w is just a single number, and that number is the slope of the line. So let's draw some slopes.
Saying the weight has small magnitude is saying the slope is small, closer to 0. Say this line has slope 1, so w = 1; then allowing magnitudes up to 1 makes any slope between -1 and 1 fair game. If you cut that bound in half, you're looking at a smaller window of slopes, and if you keep reducing it, you converge to essentially flat, constant functions. You can understand this two ways. One: the set of possible weight vectors you're considering shrinks, because you're adding the constraint that they be small. Two, from the picture: what you're really doing is making the function smoother. A flat function is the smoothest function; it doesn't vary much. A complicated function can jump up very steeply, and with quadratic features it can also come down really quickly, giving very wiggly functions; those tend to be more complicated. Any questions about this so far?

Question: If there's latent structure within the data set, then by trying not to overfit, aren't we just assuming the data came from something normal and reasonable, and not really capturing its full scope? And if there's a causal model relating the parameters, would regularization impede those relations?

So the question is: if there's particular structure inside your data set, for example if some things are sparse or low-rank, or there's a causal model among your parameters, how do you capture that, and does regularization get in the way? All of this is very generic: we're not making assumptions about what the classifier or the features are, so these are big hammers that you can just apply. If you have models with more structural domain knowledge, for example Bayesian networks later in the class, then there's much more you can do. These are just two techniques for controlling overfitting in a generic way.

Question: To make sure I'm understanding correctly: this approach constrains the magnitude of each element of the vector w, whereas the other one was counting the elements of w?

Yes. Let's look at w in three dimensions: w = (w1, w2, w3). The first method says, let's just kill some of these elements and make the vector shorter. This one, formally, takes the squared value of each element, sums them, and takes the square root; that's what the norm is. It says the elements should be small according to that particular metric. And how to actually do that is what I'm going to get to next.
So far this is just intuition about why you want the hypothesis class to be small. How do you actually implement it? There are several ways, but the most popular is to add regularization. By regularization I mean: take your original objective, TrainLoss(w), and add a penalty term, (lambda/2) times the squared length of w. Lambda is called the regularization strength; it's just a positive number, say 1. Adding this penalty says: optimizer, you should try to make the training loss small, but you should also try to keep the weights small. If you study convex optimization, there's a duality between this penalized (Lagrangian) form, where you add a penalty on the weight vector, and the constrained form, where you minimize the training loss subject to the norm of w being less than some value. The penalized form is the one you'll typically see in practice.

Question: Is it the same w in both terms?

Yes, it's important that these be the same w, and you're optimizing the sum, so the optimizer makes the trade-off: if driving the training loss down makes the penalty shoot up, that's not good, and it will try to balance the two. It's basically saying: try to fit the data, but not at the expense of having huge weight vectors. Another way to say it is Occam's razor: if there's a simple way to fit your data, do that, instead of finding some really complicated weight vector that fits the data. Prefer simple solutions.

Once you have this objective, we have a standard crank to turn it into an algorithm: gradient descent. If you take the derivative, you get the gradient of the training loss plus lambda times w, the gradient of the penalty term. So you're doing gradient descent as before, except you're also shrinking the weights toward 0 by lambda. If the regularization strength lambda is large, you're really pushing down on the magnitude of the weights. The optimizer steps in a direction that makes the training loss small, but also pushes the weights toward 0. In the neural nets literature this is also known as weight decay, and in optimization and statistics it's known as L2 regularization, because the penalty is the Euclidean 2-norm. A minimal sketch of the update is below.
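Here is a minimal sketch of gradient descent with an L2 penalty, using squared loss on toy data; the dataset, step size eta, and lam are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Sketch: gradient descent on TrainLoss(w) + (lam/2) * ||w||^2,
# with squared loss on toy data. eta and lam are illustrative.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.5])

w = np.zeros(2)
eta, lam = 0.1, 1.0
for t in range(100):
    grad_loss = X.T @ (X @ w - y) / len(y)  # gradient of the training loss
    w -= eta * (grad_loss + lam * w)        # extra lam * w shrinks w toward 0

print(w)
```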
Here's another strategy that intuitively gets at the same idea but is in some sense cruder: early stopping. The idea is very simple: instead of training for 100 iterations, you train for 50. Why is this a good idea? The intuition is that you start with the weights at 0, the smallest possible norm, and every time you update on the training set, the norm generally goes up. There's no guarantee it always goes up, but that's what generally happens. So if you stop early, you give the norm less opportunity to grow: fewer updates generally translates to a lower norm. You can make this formal mathematically, but the connection is not as tight as with the explicit regularization from the previous slide. So the lesson here is: try to minimize the training error, but don't try too hard.

Question: Does this depend on how we initialize the weights?

Most of the time you initialize from some baseline: either 0, or for neural nets maybe random vectors around 0, but they're pretty small weights, and the weights usually grow from small to large. There are other cases, say pre-training, where you start from a pre-trained model's weights and do gradient descent from there; then early stopping is basically saying, don't go too far from your initialization.

Question: Does this mean we're no longer focusing on the training loss?

It's always going to be a combination. The optimizer is still pushing down on the training loss by taking gradient updates; notice that the gradient of the regularizer doesn't actually appear here. The regularization comes in implicitly, through the fact that you stop early. It's always a balance between minimizing the training loss and making sure your classifier weights don't get too complicated.
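As a sketch, here is one common variant of early stopping, which goes slightly beyond the fixed-iteration version described in lecture: run gradient descent, monitor error on a held-out set, and keep the weights that did best there. The data and hyperparameters are toy assumptions.

```python
import numpy as np

# Sketch of early stopping: run gradient descent, monitor held-out
# error, and remember the best iterate. Toy data and hyperparameters.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 5)), rng.normal(size=40)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)

w = np.zeros(5)
eta = 0.05
best_w, best_err = w.copy(), np.inf
for t in range(200):
    w -= eta * X_train.T @ (X_train @ w - y_train) / len(y_train)
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_err:
        best_w, best_err = w.copy(), val_err  # keep the best iterate

# best_w plays the role of "stopping early": fewer updates, smaller norm.
print(best_err, np.linalg.norm(best_w))
```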
Question: How do you decide what values of lambda and T to set?

These are called hyperparameters, and I'll talk more about that now. First, here's the general philosophy you should have in machine learning: try to minimize the training error, because that's really the only thing you can do, that's the data you have; but do so in a way that keeps your hypothesis class small. Minimize the training error, but don't try too hard.

Going back to the earlier question: throughout this presentation there have been all sorts of properties of the learning algorithm, which features you use, the regularization strength, the number of iterations, the step size for gradient descent. These are all hyperparameters. So far they're just magical values given to the learning algorithm, but someone has to set them. How? Here's an idea: let's choose hyperparameters to minimize the training error. How many of you think that's a good idea? Not too many. Why is it bad? You can overfit. Suppose you chose the lambda that minimizes the training error. The learning algorithm says, "I want to make the training loss go down; what's this penalty doing in the way? Let's set lambda to 0, and then I don't have to worry about it." It's cheating, in a way. Likewise, early stopping would say: don't stop, just keep going, because more iterations always drive the training error lower. So that's no good.

How about choosing hyperparameters to minimize the test error? How many of you say that's a good idea? Not so good either, it turns out. Why? This again stresses the point that the test error is not the thing you ultimately care about. If you tune hyperparameters on the test set, it becomes an unreliable estimate of the error on truly unseen examples, because it becomes less and less unseen. (Yes, we could do cross-validation, which I'll describe in a second.) I want to emphasize this point for your final project: you have your test set sitting there, and you should not fiddle with it too much, or it becomes less reliable.

So you can't use the test set; what do you do? Here's the idea behind a validation set: take your training set, sacrifice some amount of it, maybe 10% or 20%, and use it to estimate the test error. The test set is off to the side, locked in a safe; you're not going to touch it. You tune hyperparameters on the validation set and use it to guide your model development.

Question: Is the proportion itself a hyperparameter?

[LAUGHTER] It is, but I usually don't tune it. How you choose it is a balance: you want the validation set large enough to give reliable estimates, but you also want most of your data for training.

Question: So how do we choose hyperparameters like lambda and T?

You try particular values, say lambda = 0.01 and 0.1, run your algorithm for each, look at the validation error, and choose the one with the lowest. It's pretty crude, but that's what people do; a minimal sketch follows.
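Here is a minimal sketch of that loop. A closed-form regularized least-squares solver stands in for "run the learning algorithm", and the candidate lambda values and toy data are illustrative assumptions.

```python
import numpy as np

# Sketch of hyperparameter search on a validation set: train once per
# candidate lambda, evaluate on validation, keep the best.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 5)), rng.normal(size=40)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)

def train(lam):
    # Regularized least squares in closed form (stand-in for SGD).
    d = X_train.shape[1]
    return np.linalg.solve(X_train.T @ X_train + lam * np.eye(d),
                           X_train.T @ y_train)

best = min((np.mean((X_val @ train(lam) - y_val) ** 2), lam)
           for lam in [0.01, 0.1, 1.0, 10.0])
print("best validation error %.3f with lambda %s" % best)
```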
Question: Is there a better way to search for hyperparameters than trying one value after another?

Grid search is generally fine; random sampling is fine. There are fancier things based on Bayesian optimization which might give you some benefit, but the jury's still out on that, and they're more complicated. You can also use better learning algorithms that are less sensitive to the step size, so you don't have to nail it, as in "0.1 works but 0.11 doesn't"; you don't want that. The high-level answer is that there's no principled formula where you just evaluate an expression for lambda and you're done. This is the dirty side of machine learning: there's always tuning that needs to happen to get good results.

Question: Is this process usually automated, or is it manual?

Increasingly, it's automated. It requires a fair amount of compute: with a large data set, even training one model might take a while, and now you're talking about training, say, 100 models. So it can be very expensive, though there are things you can do to make it faster. In general, I'd advise against tuning hyperparameters blindly, especially when you're learning the ropes. Doing it manually and getting intuition for things like the step size is still valuable; once you get the hang of it, maybe you can automate. But don't automate too early.

Question: If small changes in hyperparameters produce big changes in prediction accuracy, does that mean the model is not robust?

Yes, it probably means your model is not as robust. And sometimes you don't change a hyperparameter at all and still get varying model performance, so you should check that first: there can be inherent randomness, especially with neural networks, which can get stuck in local optima; all sorts of things can happen.

Final question so we can move on: how do you find the optimal hyperparameters? You basically have a for-loop: for lambda in some set of values, for T in some set of values, train on the training examples, evaluate the model on the validation set, get a number, and use whichever setting gives the lowest number. And usually you just have to be in the ballpark; think in orders of magnitude, not 99 versus 100. If it really matters down to a precise value, you probably have other things to worry about.

Okay, let's move on. What I'm going to do now is go through a sample problem, because the theory of machine learning and the practice of it are actually quite different in terms of the things you have to think about. Here's a simplified named-entity recognition problem. Named-entity recognition is a popular task in NLP where you try to find names of people, locations, and organizations.
The input is a string containing a potential name with its left and right context words, and the goal is to predict whether x contains a person (+1) or not (-1).

So here's the recipe for success when you're doing your final project or something similar. You get a data set. If it hasn't already been split, split it into train, validation, and test, and lock the test set away. First, look at the data to get some intuition; always make sure you understand your data, and don't immediately start coding up the fanciest algorithm you can think of. Then you repeat: implement a feature, or change the architecture of your network; set some hyperparameters and run the learning algorithm; look at the training and validation error rates to see how you're doing, whether you're underfitting or overfitting. In some cases you can look at the weights, which is easy for linear classifiers and harder for neural nets. I also recommend looking at the predictions of your model; I always try to log as much information as I can, so that I can go back and understand what the model is trying to do. Then you brainstorm some improvements, and you keep doing this until you're either happy or out of time, and then you run on the final test set and get the final error rates, which go in your report.

Let's go through an example of what this looks like, based on the code base for the sentiment homework. We start by reading a training set. There are about 7,000 lines; each line contains the label, -1 or +1, along with the input, which is the left context, the actual entity, and the right context. We also have a development (validation) set. The code learns a predictor, taking the training set and a feature extractor that we're going to fill out, and it outputs both the weights and some error analysis we can use to look at the predictions. There's also a test evaluation, which I'm not going to run for now.

So the first thing is to define the feature extractor, Phi of x. We'll use the sparse map representation of features. There's a really handy data structure called defaultdict: it's like a map, except that if you access an element that isn't there, it returns zero. So Phi is a defaultdict, and we return Phi. This is the simplest feature extractor you can come up with: the dimensionality is zero, because there are no features. But we can run it and see how we do. Over a number of iterations, you can see that learning isn't doing anything, because there are no features to update.
But it doesn't crash, which is good. I'm getting 72% error, which is pretty bad, but I haven't really done anything, so that's to be expected. Now let's start defining some features. Remember what x is: something like "took Mauritius into", with the entity flanked by left and right context. Let's break it up: tokens = x.split() gives me a list of tokens, and then left, entity, right are tokens[0] (here, "took"), tokens[1:-1] (everything up to the last token), and tokens[-1] (the last token). Now I can define a feature template. Remember, a nice way to go about this is to define the template "entity is ___". In code this is pretty transparent: I plug the entity's value into the template to get a feature name, and set that feature's value to 1.

Let's run this. Oops: entity is a list, so I'll just join it into a string. Now the training error is pretty low; I'm fitting the training set pretty well. But remember, I don't care about the training error; I care about the held-out error. One note: the output says "test" here, but it's really the validation set (I should probably change that); it's just whatever non-training set you pass in. This is still at 20% error, which is not great. The extractor at this point looks roughly like the sketch below.
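Here is a reconstruction of that extractor as a sketch; the exact homework code differs, and the example string is from the lecture.

```python
from collections import defaultdict

# Sketch of the feature extractor built up in lecture: one indicator
# feature for the whole entity string. Not the exact homework code.
def extract_features(x):
    phi = defaultdict(float)          # unseen keys default to 0
    tokens = x.split()
    left, entity, right = tokens[0], tokens[1:-1], tokens[-1]
    phi["entity is " + " ".join(entity)] = 1  # template: entity is ___
    return phi

print(dict(extract_features("took Mauritius into")))
```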
At this point, remember, I want to go back and get some intuition for what's going on, so let's look at the weights. Here's the learned weight vector: these features all have weight 1, and they correspond to the person names seen at training time, because whenever I see a person name, giving that feature weight 1 gets that training example right. At the bottom are the entities that are not people's names. So this is a sanity check that it's doing what it's supposed to do. The nice thing about these really interpretable features is that you can almost compute in your head what the weights should be.

Question: Doesn't this create one feature for almost every example?

Yes, essentially one feature for every entity, and most of them are unique; there are about 3,900 features here. We're going to change that; we're not done yet.

The other thing we want to look at is the error analysis. Here's an example: "Eduardo Romero". The ground truth is positive, but we predicted -1. Why? Because this feature has weight 0, and it has weight 0 because we never saw this name at training time. We did get some right: we saw "Senate" at training time and rightly predicted -1. You look at these errors and say, okay, maybe we should add more features. Look at this example: maybe the context helps, because if you have "governor ___", you probably know it's a person; only people can be governors. So let's add a feature "left is ___", and for symmetry, "right is ___". These define indicator features on the context; in this case, "took" and "into". Now I have three feature templates. Train the model, and I'm down to 11% error, so I'm making some progress.

Let's look at the error analysis again. I'm getting the earlier case correct now; what else am I getting wrong? "Simitis blamed Felix Mantilla". Again, it hasn't seen this exact string (actually, maybe it did see it, but it still got it wrong). There's a general intuition, though: even if you've never seen "Felix Mantilla", if you see "Felix"-something, chances are it's probably a person. Not always, but as we noted before, features are not meant to be deterministic rules; they're just pieces of information that are useful. So let's define a feature for every word in the entity; remember, the entity is the list of tokens between left and right. I'll say "entity contains ___" for each word. Run this again, and now I'm down to 6% error, which is a lot better, and the Felix example now comes out right.

What else can I do? The general strategy I'm following here, which is not necessarily always the right one, is to start with very specific features and generalize as you go. How can I generalize more? Look at "Kurdistan": if a word ends in "-stan", maybe it's less likely to be a person. I actually don't know, but suffixes and prefixes are probably helpful. So I'll add features "entity contains prefix ___" and "entity contains suffix ___", heuristically using the first four and last four characters of each word. Run again, and now I'm down to 4% error. I'm going to stop here. At this point, you can actually run on the test set, and we get about 4% error there as well. (I'll admit this was all planned out so that the test error would go down; more often than not, you'll add a feature you really, really think should help, and it doesn't, for whatever reason.) All the templates together look roughly like the sketch below.
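Here is a sketch of the final set of feature templates from the walkthrough: entity, left/right context, per-word, and character prefix/suffix features. The length-4 prefixes and suffixes are the heuristic from lecture; the exact homework code differs.

```python
from collections import defaultdict

def extract_features(x):
    phi = defaultdict(float)
    tokens = x.split()
    left, entity, right = tokens[0], tokens[1:-1], tokens[-1]
    phi["entity is " + " ".join(entity)] = 1   # whole-entity indicator
    phi["left is " + left] = 1                 # left context word
    phi["right is " + right] = 1               # right context word
    for word in entity:
        phi["entity contains " + word] = 1
        phi["entity contains prefix " + word[:4]] = 1
        phi["entity contains suffix " + word[-4:]] = 1
    return phi

print(dict(extract_features("governor Felix Mantilla took")))
```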
Question: When you add a feature, is the error guaranteed not to get worse, or can it get worse?

Some of the time it just doesn't move; that's probably the most common case. But sometimes it can go up, if you add a really bad feature. Note that the learning algorithm doesn't consider the validation error at all: the more features you add, generally the training error goes down, and all the algorithm knows is that it's driving the training error down. It doesn't know anything about generalization. And what I showed was definitely the happy path: when you go and actually do machine learning, more often than not the test error will not go down, so don't get too frustrated; just keep trying.

Question: Are we expected to keep optimizing after reaching, say, 5% error?

It really depends. There's a limit to every data set: data sets have noise, and you definitely shouldn't optimize below the noise limit. One thing you might imagine is an oracle, say human agreement: if your data set is annotated by humans, and humans only agree with each other 97% of the time, then as a general rule you can't really do better than 3% error. There are exceptions, but that's the rule of thumb.

Question: What if you do all your training, you're happy, and then you run on the test set and the result is not good? What do you do?

Many things could be happening. One is that your test set might actually be different for whatever reason; maybe it was collected on a different day, and your performance just doesn't hold up on it. In that case, well, that's your test error. Remember, if you didn't look at it, the test error is an honest representation of how good the model is, and if it's not good, that's just the truth: your model is not that good. In some cases there's a bug, something was misprocessed in a way that wasn't really fair, so if the result is way off the mark you should investigate: if I had gotten 70% error, something was probably wrong. But if it's in the ballpark, that's what you have to deal with. You also want your validation error to be representative of your test error, so you don't have surprises at the end of the day. I think it's fine to run on the test set once to make sure there are no catastrophic problems, but aggressive tuning on the test set is something I'd warn against.

Question: Is there a standard for how to split the data into train, validation, and test: what percentage of the data to allocate to each, and whether to randomize?
It depends on how large your data set is. Generally, people shuffle the data and randomly split it into train, validation, and test, maybe 80%, 10%, 10% as a rule of thumb. There are cases where you don't want to do that: for example, you might want to train on the past and test on the future, because that simulates a more realistic setting. Remember, the test set is meant to be as representative as possible of the situations you would see in the real world.

Question: The examples were labeled +1 and -1; do you have to do that manually?

I personally did not label this data set; it's a standard data set that someone labeled. Sometimes labels come from crowd workers, sometimes from experts; it varies. Sometimes they come from grad students. It's actually a good exercise to go label data; I've labeled a lot of data in my life.

Okay, let's go on. Switching gears now, let's talk about unsupervised learning. So far we've talked about supervised learning, where the training set contains input-output pairs: you're given the input, and this is the output your predictor should produce. But this is timely: we were just talking about how fully labeled data is very expensive to obtain. Ten thousand examples is actually not that much; you can often have 100,000 or even a million examples, which you do not want to sit down and annotate yourself. So here's another possibility: unsupervised learning, where the training data contains only inputs. Unlabeled data is much cheaper to obtain in certain situations. For example, if you're doing text classification, there's a lot of text out there: people write a lot on the Internet, and you can easily download gigabytes of text, all of it unlabeled. If you can do something with it, you've turned it into gold. The same goes for images, videos, and so on. It's not always possible to obtain unlabeled data: if the data comes from some device that you built yourself and you only have one of, you're not going to get much data. But we're going to focus on the case where you have a basically infinite amount of data and you want to do something with it.

Here are some examples I want to share with you. This is a classic example from NLP that goes back to the early '90s. The idea is clustering: the input is a bunch of raw text, lots of news articles, and you put it into an algorithm that I'm not going to describe; instead, let's look at the output. It returns a bunch of clusters, where each cluster has a certain set of words associated with it. When you look at the clusters, they're pretty coherent: the first cluster is roughly days of the week, the second is months, the third is materials of some sort, the fourth is synonyms of "big", and so on.
The critical thing to note is that the input was just raw text. Nowhere did someone say, "Hey, these are days of the week; learn them and I'll test you later." It's all unsupervised. On a personal note, this is the example that got me into NLP research when I was doing my Masters: I looked at it and thought, wow, you can take unlabeled data and mine really interesting signals out of it. More recently, there are word vectors, which do something very similar: instead of clustering words, they embed words into a vector space. If you zoom in, each word is associated with a particular position, and words that are similar happen to be close by in vector space: here are country names, here are pronouns, here are years, months, and so on. This operates on a very similar principle. There are also contextualized word vectors, like ELMo and BERT if you've heard of those, which have been taking the NLP community by storm more recently.

On the vision side, you can also do unsupervised learning. This is an example from 2015 where a clustering algorithm, which also jointly learns the features with a deep neural network, can identify different types of digits: zeros, and nines, and fours that look like nines, and threes, and fives that look like threes, and so on. Remember, this is not doing classification: you're not telling the algorithm "here are fives, here are twos." It's just looking at examples and finding structure: these are kind of the same thing, and these are also the same thing. Sometimes, but not always, these clusters actually correspond to labels. Here's another example, with ships, planes, and birds that look like planes; you can see it's not doing classification, just looking at visual similarity.

So the general idea behind unsupervised learning is that data has a lot of rich latent structure; by that I mean there are patterns in it. We want to develop methods that can discover this structure automatically. There are multiple types of unsupervised learning, such as clustering and dimensionality reduction, but we're going to focus on clustering, in particular K-means clustering, for this lecture.

So let's define clustering more formally. You're given a set of points, x_1 through x_n, and you want to output an assignment of each point to a cluster; the assignment variables are z_1 through z_n. For every data point i, z_i tells you which of the K clusters it belongs to, a value from 1 through K. Pictorially, on the board: say I have seven points. If I give you only these seven points and say, "I want you to cluster them into two clusters," intuitively you can see there's maybe a left cluster over here and a right cluster over there. But how do we formulate that mathematically?
Here's the K-means objective function, the principle by which we're going to derive clusterings. K-means says that every cluster (there are going to be two in this example) is associated with a centroid, which I'll draw as a red square. The centroid is a point in the same space as the data points, and it represents where the cluster is. Then I associate each point with a particular centroid, which I'll denote by a blue arrow from the point to the centroid. These two quantities together represent the clustering: the locations of the clusters in red, and the assignments of points to clusters in blue. Of course, neither the red nor the blue is known; both are things we're going to have to optimize.

But first we have to define the optimization objective. Intuitively, what do we want? We want each point phi(x_i) to be close to its centroid: for a centroid to be really representative of the points in its cluster, it should be close to all of them. This is captured by an objective function that sums over all the points and, for each point, measures the squared distance between that point and the centroid the point is assigned to. Remember, z_i is a number between 1 and K, so it indexes which of the centroids, mu_1 or mu_2, we're talking about.
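In symbols, with phi(x_i) the feature vector of point i, z_i in {1, ..., K} its assignment, and mu_1 through mu_K the centroids, the objective just described is:

```latex
\min_{z_1,\dots,z_n} \; \min_{\mu_1,\dots,\mu_K} \;
  \sum_{i=1}^{n} \bigl\lVert \phi(x_i) - \mu_{z_i} \bigr\rVert^2
```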
Question: How does each point get assigned to a centroid?

That's specified by the z's, which are optimized along with everything else; a priori, you don't know.

Question: Do we know how many clusters there are?

In general, no. The number of clusters is another hyperparameter, something you have to set before you run K-means. When you're tuning, you try different numbers of clusters and see which works better.

So we need to choose the centroids and the assignments jointly: find the assignments z and the centroids mu that make this number as small as possible. How do we do that? Let's look at a simple one-dimensional example and build up some intuition. We're in 1D, with four points at 0, 2, 10, and 12. I want to cluster, and intuitively there should be two clusters here, hence two centroids. Suppose I know the centroids: someone magically tells you they're at 1 and 11, and now you have to figure out the assignments. How would you do it? Take the first point: its distance to one centroid is 1, and to the other is 11. Which is smaller? 1 is smaller, so that's where the point goes. Same with the next point, where 1 is smaller; for the other two points, 11 is smaller. And that's it.

Mathematically, you compare the distance from each point to each of the centroids and choose the centroid that is closest. You can convince yourself that, with the centroids fixed, this is how you minimize the objective function: choosing a centroid that is farther away gives a larger value, and you want the value as small as possible. (By the way, I think this value on the slide should be one, not two.)

Now let's do it the other way around. Suppose I have the assignments: these two points are in one cluster, and those two are in cluster two, and now I have to place the centroids. Should I place one here? Here? Where? Looking at the slide: for the first cluster, I know 0 and 2 are assigned to it, and I want the sum of squared distances to its centroid mu to be as small as possible. If you did the first homework, you know that for these squared-distance objectives, the minimizer is the average of the points. So you can solve it in closed form: given the assignments, the first centroid should be at 1, the average of 0 and 2, and the second at 11, the average of 10 and 12.

Question: What's the difference between a centroid and an assignment?

When you're clustering with K clusters, there are K centroids, so here two; those are the red squares. The assignments are the associations between the points and the centroids, so there are n assignments, and those are the things that move.

Question: Is K a hyperparameter?

Yes, K, the number of clusters, is a hyperparameter you can tune.

So here's a chicken-and-egg problem: if I knew the centroids, I could pretty easily come up with the assignments, and if I knew the assignments, I could come up with the centroids, but I don't know either one. How do I get started? The key idea is alternating minimization, a general idea in optimization that is usually not a bad idea: you have a hard problem, and you can solve it by tackling two easy problems in turn. So here's the K-means algorithm. Step one: given the centroids (moving to more general notation, mu_1 through mu_K), figure out the assignments. For every data point, assign it to the cluster with the closest centroid: check, for each cluster 1 through K, how far the point is from that centroid, and take the smallest value. Step two: flip it around. Given the cluster assignments z_1 through z_n, find the best centroids: for each cluster k from 1 through K, set its centroid to the average of the points assigned to that cluster.
You just sum over all the points i which have been assigned to cluster k, add up all their feature vectors, and then divide by the number of things you summed over, okay? So putting it together: to optimize this objective function, the K-means reconstruction loss, first you initialize mu_1 through mu_K randomly (there are many ways to do this), and then you just iterate: set the assignments given the centroids, then set the centroids given the assignments. Just alternate. Yeah? This makes sense for coordinates, or for images, where if you read in a similar image by bytes it looks similar. But what about words, where words that are spelled totally differently can have the same semantic meaning? How would you map them to the same location to cluster around? Yeah, so the question is: for images, distances in pixel space kind of make sense, but if you have words, you shouldn't be looking at, say, the edit distance between them; two synonyms like "big" and "large" look very different but are somehow similar. So this is something that word vectors address, which we're not going to talk about. Basically, you want to capture the representation of a word by its context: the contexts in which "big" and "large" occur are going to be similar, and you can construct context vectors that give you a better representation. We can talk more offline. Yeah? [inaudible] where you can get stuck at a local minimum, or are you guaranteed if you run it enough times [inaudible]? Yeah, you can get stuck, and I'll show you an example. Any other questions about the general algorithm? Yeah? Is it unstable, in the sense that you get stuck and then have to [inaudible] multiple times? Yeah, I'll answer that; I'll show you an example. Should you use a fixed number of iterations, or some kind of criterion, like the assignments not changing anymore, as the stopping condition? Yeah, so this is written with a fixed number of iterations t. Typically, you would monitor this objective function, and once it stops changing very much, you just stop. Actually, the K-means algorithm is guaranteed to always converge to a local minimum. So why don't I just show you this demo; I think it'll make some things clearer. Okay, so here I have a bunch of points. This is a JavaScript demo; you can go and play around and change the points if you want, and it's linked off the course website. And I'm going to run K-means. So I initialize with these three centroids, and these regions are the points that would be assigned to each centroid; this is a Voronoi diagram of the centroids. And this is the loss function, which hopefully should be going down. Okay, so now I iterate. Iteration one: I assign the points to the clusters. These get assigned to blue, this one gets assigned to red, these get assigned to green. Then step two is optimizing the centroids: given all the blue points, I put the centroid smack in the middle of those blue points, and the same with green and red. Notice that now these points are in the red region.
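Pausing the demo for a moment, here is what the whole loop looks like in code. This is a minimal NumPy sketch of the algorithm just described, not the demo's own implementation; the function name, the fixed iteration count, and the initialization scheme (sampling K of the data points) are illustrative choices.

import numpy as np

def kmeans(phi_x, K, num_iters=100, seed=0):
    """phi_x: (n, d) array of feature vectors. Returns (assignments z, centroids mu)."""
    rng = np.random.default_rng(seed)
    n, _ = phi_x.shape
    # Step 0: initialize the centroids, here by picking K distinct data points.
    centroids = phi_x[rng.choice(n, size=K, replace=False)].astype(float)
    z = np.zeros(n, dtype=int)
    for _ in range(num_iters):
        # Step 1: assign each point to its closest centroid.
        dists = ((phi_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (n, K)
        z = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        for k in range(K):
            if np.any(z == k):  # guard against empty clusters
                centroids[k] = phi_x[z == k].mean(axis=0)
    return z, centroids

# The 1D example from earlier: four points at 0, 2, 10, 12, with K = 2.
points = np.array([[0.0], [2.0], [10.0], [12.0]])
z, mu = kmeans(points, K=2)
print(z, mu.ravel())  # e.g. assignments [0 0 1 1] and centroids [1. 11.] (cluster labels may be swapped)

Now, back to the demo.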
So if I reassign, these become red, and then I can iterate and keep on going, and you can see that the algorithm eventually converges to a clustering where these points are blue, these points are red, and these are green. And if you keep on running it, you're not going to make any more progress, because if the assignments don't change, then the centroids aren't going to change either. Okay. So let me actually skip this part; I was just going to do it on the board, but I think you get the idea. So let's talk about this local-minima problem. K-means is guaranteed to converge to a local minimum, but it's not guaranteed to find the global minimum. So if you think of this as a cartoon visualization of the objective function: by going downhill, you can get stuck here, but you won't get to that point over there. So let's take the example with different random seeds. Say you initialize here, so all three centroids start here. If I run this, now I get this other solution, which is actually a lot worse: remember, the other one was 44 and this is 114, and that's where the algorithm converged. You're just stuck. So in practice, people typically try different initializations, run it from different random points, and then just take the best. There's also a particular way of initializing called K-means++, where you put down one point, then put down the next point as far away as possible, and so on. That spreads out the centroids so they don't interfere with each other, and it generally works pretty well. But there's still no guarantee of converging to a global optimum. Okay, any questions about K-means? Yeah? [inaudible] How do you choose K? You guys love these hyperparameter-tuning questions. So one picture you can draw is the following: plot K against the loss you get with K clusters. Usually, with one cluster the loss is going to be very high, and at some point it goes down, and you generally cut it off once it stops going down by very much. So you can monitor that curve. Another thing you can do is hold out a validation set, measure the reconstruction error on the validation set, and choose the K that minimizes it; K is just another hyperparameter that you can tune. Yeah? How is the training loss calculated? [inaudible] How's the training loss calculated? So the training loss is this quantity: you sum over all your points, and for each point you take the distance between that point and its assigned centroid, square it, and add all those numbers up. Okay.
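In practice, you would rarely implement the random restarts and the K-means++ seeding yourself; library implementations have both built in. As a minimal illustration (the data and parameter values here are just for the running example), scikit-learn's KMeans does what was just described:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[0.0], [2.0], [10.0], [12.0]])  # the four 1D points from earlier
# init='k-means++' spreads out the initial centroids; n_init=10 runs ten
# random restarts and keeps the solution with the lowest objective.
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.labels_)           # the assignments z, e.g. [1 1 0 0]
print(km.cluster_centers_)  # the centroids mu: 1 and 11
print(km.inertia_)          # the K-means training loss, here 4.0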
So to wrap up. Oh, actually, I have more slides here. [LAUGHTER] So, in unsupervised learning, you're trying to leverage a lot of data, and we can get around a difficult optimization problem by doing this alternating minimization. These will be quick. Just to summarize the learning section: we've talked about feature extraction, and I want you to think about the hypothesis class that's defined by a set of features. Prediction, which boils down to what kind of model you're using for classification and regression: for supervised learning you have linear models and neural networks, and for clustering you have the K-means objective. Loss functions, where in many cases all you need to do is compute the gradient. And then there's generalization, which we talked about in the first half of this lecture and which is really important to think about: the test set, remember, is only a surrogate for unseen future examples. A lot of the ideas we presented are actually quite old. The idea of least squares for regression goes back to Gauss, who was trying to solve an astronomy problem. Logistic regression came from statistics. In AI, there was learning done as early as the 1950s, for playing checkers. As I mentioned on the first day of class, there was a period when learning fell out of favor, but it came back with backpropagation, and then much of the '90s brought a more rigorous treatment of optimization and a formalization of when algorithms are guaranteed to converge. And in the 2000s, people looked at structured prediction, and there was a revival of neural networks. Some things we haven't covered here: feedback loops. Learning assumes a static view where you take data, train a model, and then go and generate predictions. But if you deploy the system in the real world, those predictions are going to come back around and become data, and those feedback loops can cause problems you might not be aware of if you're only thinking, "I'm just doing my machine-learning thing." Another question: how can you build classifiers that don't discriminate? When you train a classifier, you're minimizing the average loss over the training set, so by construction you're trying to drive down the losses on common examples. But often you get situations where minority groups get pretty high loss, because they look different, almost like outliers, and the model isn't really able to fit them; the training loss just doesn't care. There are techniques, like distributionally robust optimization, that try to get around some of these issues. There are also privacy concerns: how can you learn if you don't have access to an entire dataset? There are techniques based on randomization that can help. And then interpretability: how can you understand what the algorithms are doing, especially with a deep neural network you've learned? There's work on this which I'm happy to discuss with you offline. So, we've concluded three lectures on machine learning, but I want you to think about learning in the most general way possible: programs should improve with experience. We've talked about linear classifiers and all the nuts and bolts of basically reflex models, but in the next lectures, we're going to see how learning can be used in state-based models and also variable-based models. Okay, with that, that concludes the lecture. Next week, Dorsa will be giving the lecture on state-based models.
