Hello world, it’s Siraj! Let’s visualize some data, shall we? In a recent lab study, a group of scientists strapped participants with a fitness tracking device, then asked them to do a series of exercises while recording physical measurements. I’ve got that dataset, and we’ll visualize it in a 2D graph so we can make some discoveries from it. We live in a three-dimensional world, so we can understand things in one, two, and three dimensions pretty easily, but data can be complex AF! Sometimes data demands that we reason in hundreds or even thousands of dimensions. At some fundamental level, our puny biological brains just can’t do that, so we’ve invented machine learning to help us learn patterns in our data that we can’t recognize ourselves. Take AlphaGo, for example. Because it could reason about so many possibilities at once, it made moves that at first seemed strange to the world champion it played against, but it ended up beating him by doing that. [Thug Life] Or IBM’s Watson, which reportedly diagnosed cancer better than top doctors because it was able to analyze millions of cancer research papers at once and match a patient’s genetic profile to what it had learned. And there are so many things that machine learning (ML) hasn’t yet been applied to, so opportunity is ripe. “And if Trump turns off the satellites, California will launch its own damn satellite. We’re going to collect that data.” [Thug Life Indeed] Our exercise data is kind of complex as well, but we’re going to figure out how to visualize it so that we can understand it. So let’s take a look at this data. Each row represents a different person, and each column is one of many physical measurements, like the position of their arm or forearm. Each person also gets one of five class labels, like sitting or standing, that represents the activity they’ve done. There’s a lot going on here, but you’ll notice that some of these cells have empty values.
So the first thing we want to do is clean our data by removing them. Let’s first import pandas, our data analysis library, which will help us read our CSV file. Then we’ll import NumPy to help us transform our data into a format our model can understand, scikit-learn to help us create our machine learning model, and matplotlib to eventually help us visualize our data. Now that we’ve imported our dependencies, we can use pandas’ read_csv function to download that exercise data directly from the web and store it in a dataframe variable. We’ll also create a variable to store the number of rows in the data by taking the zeroth element of the dataframe’s shape, which gives us the row count. We’ll call the isnull function, then the sum function, to get the total count of null (empty) values in each column of our dataset, then we’ll create another variable that identifies the columns with no empty values, using the previous variable as a filter. Now we can remove the columns with missing elements by keeping ONLY the non-empty columns. Also, if we look at our data, the first seven columns don’t hold information we can use to differentiate between our classes, so let’s remove them as well using the ix indexer, which takes the index range of the columns we want to delete; we’ll specify from the start up to column 7. We’re going to take this clean data and transform it into a set of vectors, which we can then feed to our learning algorithm. A vector is a set of numbers, and it’s how we represent data in machine learning. We’ll create vectors to represent the features for each person in our data. Let’s grab all of our features from our data and store them in a variable we call X.
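The cleaning steps above can be sketched in pandas. This is a minimal sketch on a tiny synthetic dataframe, since the real CSV URL and column names aren’t reproduced here; the column names and values below are made up for illustration, and the modern iloc indexer stands in for the old ix call from the video.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the exercise dataset (hypothetical columns).
data = pd.DataFrame({
    "user_name": ["a", "b", "c"],     # metadata column, like the first 7 in the video
    "arm_x":     [1.0, 2.0, 3.0],
    "forearm_y": [0.5, None, 1.5],    # a column with a missing value
    "belt_z":    [2.0, 2.1, 1.9],
    "classe":    ["sitting", "standing", "walking"],
})

num_rows = data.shape[0]                 # shape[0] is the row count
nulls_per_column = data.isnull().sum()   # missing-value count per column

# keep only the columns that have zero missing values
clean = data.loc[:, nulls_per_column == 0]

# drop the leading metadata column(s); the video drops the first seven
# via the now-removed .ix indexer, but iloc does the same job today
clean = clean.iloc[:, 1:]

# everything except the class label becomes our feature matrix X
X = clean.drop(columns=["classe"]).values
```

On the real dataset the same three moves apply: filter out columns with nulls, slice off the leading metadata columns, and take the remaining values as X.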
Then we want to standardize those features using the StandardScaler object from scikit-learn. In math terms, this means shifting the distribution of each feature to have a mean of zero and a standard deviation of one, which is a way of saying “make all the features operate on the same scale so they’re all in proportion to one another.” This will improve the quality of our results. We’ll store the resulting 70-dimensional feature vectors in the X_std variable, since 70 is the number of features we have. [Cue Dope Rap] “Been sitting here tryna understand this dataset! Each feature’s a dimension, and I know that’s correct [damn!] If each feature equals one dimension, that ain’t no sweat. Checking heartbeat, that’s easy: if it goes up, there’s a threat [woof] But behold, once I add time, it goes up and down. When I add temperature, we can move it all around [ya!] But weight and height and strength and a hundred more features(!) I can’t visualize that, I’ll get me a seizure. Let’s reduce dimensionality to two or three, so we can see and understand this dataset easily.” [Data city?] There’s an entire subfield of machine learning called dimensionality reduction that lets us represent high-dimensional data in a 2D or 3D space. Even a picture can be considered to have 32 million dimensions if we treat every single pixel as a dimension, but it can also be considered to have just two dimensions: the length and width of the photo. We just need to find the intrinsic low dimensionality hidden in our data so we can visualize it. One of the most popular ways to do this is called t-SNE, which stands for “t-Distributed Stochastic Neighbor Embedding” (say that three times fast, go!). t-SNE will allow us to reduce our vectors’ dimensionality to just two(!). It does this by taking each one of our 70-dimensional feature vectors and finding the similarity between it and every other vector. These similarities are represented as values and stored in a similarity matrix.
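The standardization step can be sketched in a few lines. This is a minimal sketch on a made-up two-column feature matrix (the real data has 70 columns); it just shows that StandardScaler puts features on very different scales onto the same footing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on wildly different scales, standing in
# for the 70 real measurement columns.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# shift each column to mean 0 and scale it to standard deviation 1
X_std = StandardScaler().fit_transform(X)
```

After the transform, each column has mean ~0 and standard deviation ~1, so no single feature dominates the similarity calculations just because its raw numbers are bigger.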
It then creates a second similarity matrix for the projected map points, which will contain our final representation of the dataset. Our first similarity matrix represents where we are, and our second represents where we ideally want to be. We can minimize the difference between these two matrices using a process known as gradient descent. This slowly updates the positions of the low-dimensional map points over time so that their similarities match those of the original high-dimensional vectors. When it’s done, we can use the trained map points to plot our data in 2D space. We’ll initialize our t-SNE model via scikit-learn and set the number of components to two; this parameter specifies how many dimensions we want our end result to be in. We’ll fit it on our feature vectors and store the resulting two-dimensional feature vectors in the x_test_2d variable. Now that we have that, we can plot our points on a 2D graph by first creating a legend for our class labels, then plotting each point using matplotlib. We’ll define the location of our legend and show our graph. We can see here that points of the same class tend to cluster together, and t-SNE helps make that happen without knowing the classes of the feature vectors we fed it: it learned how to represent the similarity between these classes in a two-dimensional space. We could further analyze this plot to study why certain classes cluster together and what conclusions that gives us. For example, these two classes are near each other since the actions are similar, so perhaps the more movement an exercise requires, the farther away it will cluster from the rest. There are a couple of live demos of t-SNE on the web as well, like this one that visualizes a bunch of tweets; you can see that similar-sounding tweets tend to cluster together. So, to break it down: 1) High-dimensional data is everywhere, and machine learning can help us understand it.
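The embedding and plotting steps above can be sketched end to end. This is a minimal sketch on synthetic data: two made-up “activity” clusters in ten dimensions stand in for the real 70-dimensional feature vectors, and the perplexity value is an assumption chosen to suit the small sample count (it must be smaller than the number of samples).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for two activity classes.
X_std = np.vstack([rng.normal(0, 1, (20, 10)),
                   rng.normal(5, 1, (20, 10))])
y = np.array([0] * 20 + [1] * 20)

# n_components=2 asks t-SNE for a two-dimensional embedding.
x_test_2d = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(X_std)

# One scatter call per class so matplotlib builds a legend entry for each.
for label, color in [(0, "tab:blue"), (1, "tab:orange")]:
    pts = x_test_2d[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], c=color, label=f"class {label}")
plt.legend(loc="best")   # define the legend's location
plt.savefig("tsne_plot.png")
```

On the real data, the only change is feeding in the standardized 70-dimensional vectors and looping over the five class labels instead of two.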
2) If we reduce the dimensionality of our data to a 2D or 3D space, we can visualize it ourselves, and 3) t-SNE is a popular dimensionality reduction technique that you can use via scikit-learn. The challenge winner for last week’s video is Keegan Taylor [congrats!]. He used TensorFlow to train a neural net to classify Pokémon by their type, and the classifier had a seventy-five percent accuracy after training, which is pretty incredible; more than anyone else had! Badass of the week! And the runner-up is Vishal Batchu, with very well-documented and clean code. This week’s coding challenge is to use t-SNE to visualize a Game of Thrones dataset that I’ll provide, and to write down something you’ve discovered once you’ve visualized it. Your GitHub submissions should go in the comments, and I’ll announce the winner next time. Please subscribe, and for now I’m going to make my New Year’s prediction, I mean resolution, so thanks for watching!