The Best Way to Visualize a Dataset Easily

Hello world, it's Siraj! Let's visualize some data, shall we? In a recent lab study, a group of scientists strapped a fitness tracking device onto some participants, then asked them to do a bunch of exercises while recording physical measurements. I've got that dataset, and we'll visualize it in a 2D graph so we can make some discoveries from it.

We live in a three-dimensional world, so we can understand things in one, two, and three dimensions pretty easily, but data can be complex AF! Sometimes data demands that we reason in hundreds or even thousands of dimensions. At some fundamental level our puny biological brains just can't do that, so we've invented machine learning to help us learn patterns in our data that we can't recognize ourselves. Take AlphaGo, for example. Because it could reason about so many possibilities at once, it made moves that at first seemed strange to the world champion it played against, but it ended up beating him by doing that. [Thug Life] Or IBM's Watson: it consistently diagnosed cancer better than the best doctors because it was able to analyze millions of cancer research papers at once and match a patient's genetic profile to what it had learned. And there are so many things that machine learning (ML) hasn't yet been applied to, so opportunity is ripe. "And if Trump turns off the satellites, California will launch its own damn satellite. We're going to collect that data." [Thug Life Indeed]

Our exercise data is kind of complex as well, but we're going to figure out how to visualize it so that we can understand it. So let's take a look at this data. Each row represents a different person, and each column is one of many physical measurements, like the position of their arm or forearm. Each person gets one of five class labels, like sitting or standing, that represents the activity they've done. There's a lot going on here, but you'll notice that some of these cells have empty values, so the first thing we want to do is clean our data by removing them.

Let's first import pandas, our data analysis library, which will help us read our CSV file. Then we'll import NumPy to help us transform our data into a format our model can understand; scikit-learn will help us create our machine learning model, and matplotlib will eventually help us visualize our data. Now that we've imported our dependencies, we can use pandas' read_csv function to download that exercise data directly from the web and store it in a dataframe variable. We'll also create a variable to store the number of rows in the data by taking the zeroth element of the dataframe's shape. We'll call the isnull function, then the sum function, to get the total count of null or empty values in each column of our dataset, then we'll create another variable that keeps only the non-empty columns, using the previous result as a filter. Now we can remove the columns with missing elements by selecting only the non-empty columns. Also, if we look at our data, the first seven columns don't contain information we can use to differentiate between our classes, so let's remove them as well using the ix function, which asks for the indices of the columns we want; we'll drop everything from the start up to column seven.
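To make these steps concrete, here's a minimal sketch of the loading and cleaning pipeline in pandas. The dataset URL is a placeholder and the column layout is assumed from the description above, so treat it as an illustration rather than the exact notebook from the video:

    import pandas as pd

    # Download the exercise data directly from the web (placeholder URL).
    data = pd.read_csv('https://example.com/exercise_data.csv')

    num_rows = data.shape[0]           # zeroth element of shape = row count
    null_counts = data.isnull().sum()  # count of null/empty values per column

    # Keep only the columns with no missing values.
    full_columns = null_counts[null_counts == 0].index
    data = data[full_columns]

    # Drop the first seven columns, which don't help differentiate the classes.
    # (.iloc is the modern replacement for the deprecated .ix used in the video.)
    data = data.iloc[:, 7:]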
We're going to take this clean data and transform it into a set of vectors, which we can then feed to our learning algorithm. A vector is a set of numbers, and it's how we represent data in machine learning. We'll create vectors to represent the features for each person in our data. Let's grab all of our features from the data and store them in a variable we'll call X. Then we want to standardize those features using scikit-learn's StandardScaler object. In math terms, this means shifting the distribution of each feature to have a mean of zero and a standard deviation of one, which is a way of saying "make all the features operate on the same scale, so they're all in proportion to one another." This will improve the quality of our results. We'll store the resulting 70-dimensional feature vectors in the x_std variable, since 70 is the number of features we have.

[Cue Dope Rap] "I've been sitting here tryna understand this dataset! Each feature's a dimension and I know that's correct [damn!] If each feature equals one dimension, that ain't no sweat. Checking heartbeat, that's easy: if it goes up, there's a threat [woof]. But behold, once I add time, it goes up and down. When I add temperature, we can move it all around [ya!] But weight and height and strength and a hundred more features(!), I can't visualize that, I'll get me a seizure. Let's reduce dimensionality to two or three, so we can see and understand this dataset easily."

There's an entire subfield of machine learning called dimensionality reduction that lets us represent high-dimensional data in a 2D or 3D space. Even a picture can be considered to have 32 million dimensions if we treat every single pixel as a dimension, but it can also be considered to have just two dimensions: the length and width of the photo. We just need to find the intrinsic low dimensionality hidden in our data so we can visualize it. One of the most popular ways to do this is called t-SNE, which stands for t-Distributed Stochastic Neighbor Embedding (say that three times fast, go!). t-SNE will allow us to reduce our vectors' dimensionality to just two. It does this by taking each of our 70-dimensional feature vectors and computing the similarity between it and every other vector. These similarities are represented as values and stored in a similarity matrix. It then creates a second similarity matrix for the projected map points, which will contain our final representation of the dataset. Our first similarity matrix represents where we are, and our second represents where we ideally want to be. We can minimize the difference between these two matrices using a process known as gradient descent, which gradually updates the positions of the low-dimensional map points so that their similarities match those of the original high-dimensional vectors. When it's done, we can use those map points to plot the data in 2D space.

We'll initialize our t-SNE model via scikit-learn and set the number of components to two; this parameter specifies how many dimensions we want the end result to have. We'll fit it on our feature vectors and store the resulting two-dimensional feature vectors in the x_test_2d variable. Now that we have that, we can plot our points on a 2D graph by first creating a legend for our class labels, then plotting each point using matplotlib. We'll define the location of our legend and show our graph.
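Here's a sketch of that standardize, embed, and plot pipeline, picking up from the cleaned dataframe above. The label column name 'classe' and the color and marker choices are assumptions for illustration:

    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.manifold import TSNE

    X = data.drop('classe', axis=1).values  # 'classe' is an assumed label column
    y = data['classe'].values

    # Shift each feature to mean 0 and standard deviation 1.
    x_std = StandardScaler().fit_transform(X)

    # Reduce the 70-dimensional vectors to 2 dimensions.
    x_test_2d = TSNE(n_components=2).fit_transform(x_std)

    # Plot each class with its own color and marker, then show a legend.
    markers = ('s', 'd', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    for idx, cl in enumerate(sorted(set(y))):
        plt.scatter(x=x_test_2d[y == cl, 0], y=x_test_2d[y == cl, 1],
                    c=colors[idx], marker=markers[idx], label=cl)
    plt.legend(loc='upper left')
    plt.show()

One caveat: t-SNE is computationally heavy, so on a large dataset the fit can take many minutes; running it on a random sample of rows first is a common way to iterate faster.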
We can see here that points of the same class tend to cluster together, and t-SNE helps make that happen without knowing the classes of the feature vectors we fed it; it learned how to represent the similarity between these classes in a two-dimensional space. We could further analyze this plot to study why certain classes cluster together and what conclusions that gives us. For example, these two classes are near each other since the actions are similar, so perhaps the more movement an exercise requires, the farther away it will cluster from the rest. There are a couple of live demos of t-SNE on the web as well, like this one that visualizes a bunch of tweets; you can see that similar-sounding tweets tend to cluster together.

So to break it down: 1) high-dimensional data is everywhere, and machine learning can help us understand it; 2) if we reduce the dimensionality of our data to a 2D or 3D space, we can visualize it ourselves; and 3) t-SNE is a popular dimensionality reduction technique that you can use via scikit-learn.

The challenge winner for last week's video is Keegan Taylor [congrats!]. He used TensorFlow to train a neural net to classify Pokémon by their type, and the classifier had a seventy-five percent accuracy after training, which is pretty incredible, more than anyone else had. Badass of the week! The runner-up is Vishal Batchu, with very well-documented and clean code. This week's coding challenge is to use t-SNE to visualize a Game of Thrones dataset that I'll provide, and to write down something you've discovered once you've visualized it. Your GitHub submissions should go in the comments, and I'll announce the winner next time. Please subscribe, and for now I'm going to make my New Year's prediction, I mean resolution, so thanks for watching!

100 thoughts on "The Best Way to Visualize a Dataset Easily"

  1. The end visualization has a completely different output each time the code is run. How can this visualization be conclusive of anything factual?

  2. Siraj, you should have mentioned how complex t-SNE is. My laptop hung for 10 minutes and I had to shut it down. How long did your machine take to reduce the dimensions for this particular example?

  3. Wow. And I thought doing "=a1+a2" in an Excel cell was amazing… between this and advanced computer rendering like Blender, I really feel old lol! I will just sit back with wide eyes enjoying all the amazing breakthroughs!

  4. Watching your video is like watching a Bollywood movie: exaggerated facial expressions and open body movements, then suddenly bursting into a rap + dance. But I like the video very much; it is technically very informative, and you delivered it in such a way that it is amusing to watch! Keep up the good job, bro!

  5. After the end of the video, I was shocked to realize that just 7 minutes had passed instead of an hour. Siraj, you are literally awesome.

  6. You are really amazing; the hand movements, memes, and funny things you do make me want to stay, and I don't get bored at all! Good job, special thanks.

  7. You are amazing @siraj, you made boring mathematics exciting 🙂
    I am a big fan of yours, keep it up!

  8. Learn to rap machine learning, ho yeah 😀

    What's the advantage of t-SNE versus PCA?
    I found the answer: https://www.quora.com/What-advantages-does-the-t-SNE-algorithm-have-over-PCA

  9. Hey! I hope you read this. I need a visualization of what HAPPENS in a neural network (I mean, how the signal flows, which neurons become more and more active, which links become thicker, etc.). Is there software for this? Let me know here or make a video on this, please. Greets!

  10. Every time I see this thumbnail I think of Napoleon Dynamite getting electrocuted in the crotch on the time-traveling bike.

  11. 1:30 "data science made easy in just 2 minutes"… actually very impressive how many key steps in the data workflow are covered in such a short amount of time.

  12. Can I ask for advice here: I am currently learning Python from scratch. I studied economics as my major and want to be a data scientist, so what should my plan be?

  13. Let's say that my institution has a firewall and I cannot download the dataset using the URL. Is there another way to obtain it? Thanks!

  14. I got this error (using another dataset):

    plt.scatter(x=x[cl, 0], y=x[cl, 1], c=color_map[idx], marker=markers[idx], label=cl)

    IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

  15. The sklearn API has changed. The line with "train_test_split" is now:
    from sklearn.model_selection import train_test_split

    Everything else seems to work.

  16. Sir, I want to make a project in which I have to track a targeted person after seeing that person only once. What could the approach be?

  17. Even the song was developed on dimensionality, nice rap yo! And that's right, consciousness is bound by (x), and our neuronal archetypes are biological programs for dimensionality reduction. Let's take a moment to appreciate how significant the compute power of our wetware is… In 50 years, how close will we be to processing data with the bandwidth and bitrate of true-to-life sensory input?
