ml5.js Pose Estimation with PoseNet

[DING] Hello, and welcome to another
Beginner’s Guide to Machine Learning video tutorial. In this video, I
am going to cover the pre-trained model, PoseNet. I’m going to look at what PoseNet is, how to use it with the ml5.js library alongside the p5.js library, and how to track your body in the browser in real time. The model I’m looking at, as I mentioned, is called PoseNet. [MUSIC PLAYING] With any machine
learning model that you use, the first question you
probably want to ask is, what are the inputs? [MUSIC PLAYING] And what are the outputs? [MUSIC PLAYING] And in this case,
the PoseNet model is expecting an image as input. [MUSIC PLAYING] And then as output, it
is going to give you an array of coordinates. [MUSIC PLAYING] In addition to each of
these xy coordinates, it’s going to give you a
confidence score for each one. [MUSIC PLAYING] And what do all these xy
coordinates correspond to? They correspond to the
keypoints on a PoseNet skeleton. [MUSIC PLAYING] Now, the PoseNet skeleton
isn’t necessarily an anatomically
correct skeleton. It’s a set of 17 points that you can see right over here, from the nose all the way down to the right ankle. The model tries to estimate where those positions are on the human body and gives you xy coordinates, as well as how confident it is about each of those points. One other important question
you should ask yourself and do some research about whenever
you find yourself using a pre-trained model out
of the box, something that somebody else trained,
is who trained that model? Why did they train that model? What data was used
to train that model? And how is that data collected? PoseNet is a bit of an odd
case, because the model itself, the trained model
is open source. You can use it. You can download it. There are examples for it in TensorFlow, TensorFlow.js, and ml5.js. But the actual code
for training the model, from what I understand or
what I’ve been able to find, is closed source. So there aren’t
a lot of details. A data set that’s used
often in training models around images is COCO, or
Common Objects In Context. And it has a lot
of labeled images of people striking poses
with their keypoints marked. So I don’t know for
a fact whether COCO was used exclusively
for training PoseNet, whether it was used
partially or not at all. But your best bet
for a starting point for finding out as much as you
can about the PoseNet model is to go directly to the source. The GitHub repository
for PoseNet is a good starting point; in fact, there’s a PoseNet 2.0 coming out. I would also highly suggest you read the blog post “Real-time Human Pose Estimation in the Browser with TensorFlow.js” by Dan Oved, with editing and illustrations by Irene Alvarado and Alexis Gallo. So there’s a lot of excellent
background information about how the model was trained
and other relevant details. If you want to learn more
about the COCO image data set, I also would point you towards
the Humans of AI project by Philipp Schmitt, which is an
artwork, an online exhibition that takes a critical look
at the data in that data set itself. If you found your way to
this video, most likely, you’re here because you’re
making interactive media projects. And PoseNet is a
tool that you could use to do real time body
tracking very quickly and easily. It’s, frankly, pretty amazing that you can do this with just a webcam image. So one way to get started, which in my view is one of the easiest, is with the p5 Web Editor and the p5.js library. I have a sketch here which connects to the camera and just draws the image on a canvas. You also want to make sure you have the ml5.js library imported, and that would be through a script tag in index.html. Once you’ve got all that set up, we’re ready to start coding.
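As a rough sketch of that starting point (the script-tag comment is a placeholder rather than exact URLs, and the canvas size is just an assumption), sketch.js might look something like this:

    // Assumes index.html loads both libraries via script tags before sketch.js,
    // roughly: <script src=".../p5.js"></script> and <script src=".../ml5.min.js"></script>
    // (grab the current URLs from the p5.js and ml5.js websites).

    let video; // p5 webcam capture element

    function setup() {
      createCanvas(640, 480);
      video = createCapture(VIDEO); // connect to the camera
      video.hide();                 // hide the default DOM video; we draw the frames ourselves
    }

    function draw() {
      image(video, 0, 0); // draw the current webcam frame onto the canvas
    }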
So I’m going to create a variable called poseNet, and I’m going to say poseNet equals ml5.poseNet(). All the ml5 functions are initialized the same way, by referencing the ml5 library, dot, the name of the function, in this case poseNet. Now typically, there’s some
arguments that go here. And we can look up what
those arguments are, by going to the
documentation page. Here we can see there
are a few different ways to call the PoseNet function. I want to do it the
simplest way possible. I’m just going to give it the
video element and a callback for when the model is
loaded, which I don’t even know that I need. [MUSIC PLAYING] I’ll make sure there are no errors and run this again. And we can see “poseNet ready.” So I know I’ve got my syntax right: I’ve called the poseNet function and I’ve loaded the model.
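Building on the sketch above, that call might look like this inside setup(); note that modelLoaded is just my own name for the callback, it isn’t required by ml5:

    let poseNet; // declared globally, alongside video

    function setup() {
      createCanvas(640, 480);
      video = createCapture(VIDEO);
      video.hide();
      // Simplest form: the video element plus a callback for when the model loads.
      poseNet = ml5.poseNet(video, modelLoaded);
    }

    // Runs once the PoseNet model has finished loading.
    function modelLoaded() {
      console.log('poseNet ready');
    }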
The way PoseNet works is actually a bit different than everything else in the ml5 library: it works based on event handlers. So I want to set up a pose event by calling this method, on(). On 'pose', I want this function to execute. Whenever the PoseNet model detects a pose, call this function and give me the results of that pose. I can add that right here in setup: poseNet.on('pose', ...), and then I’m going to give it a callback called gotPoses. [MUSIC PLAYING] And now, presumably, every single time it detects a pose, it sees me, it sees my skeleton, and it will log that to the console right here.
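Wiring up that event might look like this, with gotPoses as the callback name used in the video:

    function setup() {
      createCanvas(640, 480);
      video = createCapture(VIDEO);
      video.hide();
      poseNet = ml5.poseNet(video, modelLoaded);
      // Fire gotPoses every time PoseNet detects one or more poses in the video.
      poseNet.on('pose', gotPoses);
    }

    // poses is an array with one element per detected person.
    function gotPoses(poses) {
      console.log(poses);
    }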
Now that it’s working, I can see a bunch of objects being logged. Let’s take a look at what’s
inside those objects. The p5 console is very useful
for your basic debugging. In this case, I really want
to dive deep into this object that I’m logging here,
the poses object. So in this case, I’m going to
open up the actual developer console of the browser. I could see a lot of stuff being
logged here very, very quickly. I’m going to pick any one
of these and unfold it. So I can see that
I have an array. And the first element
of the array is a pose. There can be multiple
poses that the model is detecting if there’s
more than one person. In this case, there’s just one. And I can look at this object. It’s got two properties, a
pose property and a skeleton property. Definitely want to come back
to the skeleton property. But let’s start with
the pose property. I can unfold that, and we
could see, oh my goodness, look at all this stuff in here. So first of all,
there’s a score. I mentioned that with
each one of these xy positions of every keypoint,
there is a confidence score. There is also a confidence score
for the entire pose itself. And because the camera’s
seeing very little of me, it’s quite low, just around 30%. Then I can actually access
any one of those keypoints by its name: nose, left eye, right eye, all of these, all the way down once again to right ankle.
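Roughly, each element of that poses array looks something like this (an abbreviated sketch with made-up values, not the full object):

    // Abbreviated sketch of one element of the poses array:
    const example = {
      pose: {
        score: 0.3,                  // confidence for the entire pose
        nose: { x: 301, y: 144 },    // each keypoint accessible by name...
        leftEye: { x: 285, y: 130 }, // ...all the way down to rightAnkle
        keypoints: [                 // the same 17 points as an array
          { part: 'nose', score: 0.99, position: { x: 301, y: 144 } },
          // ... 16 more
        ],
      },
      skeleton: [ /* pairs of connected keypoints, covered later */ ],
    };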
So let’s actually draw something based on any of those keypoints. We’ll use my nose. I’m going to make the assumption that there’s always only going to be a single person; if there were multiple people, I’d want to do this differently. I’m going to hit stop and make a variable called pose. Then I’m going to say, if it’s found a pose, and I can check that by just checking the length of the array: if the length of the array is greater than zero, then pose equals poses[0]. I’m going to take the first pose from the array and store it in the global variable. But actually, if you remember, the object in the array has two properties, pose and skeleton. So it seems there’s a lot of redundant lingo here, but I’m going to say poses[0].pose. [MUSIC PLAYING] This could be a good place to use the confidence score, like only actually use it if it’s of a high confidence. But I’m just going to take any pose that it gives me.
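A sketch of that gotPoses callback under the single-person assumption:

    let pose; // most recent pose; assumes only one person in frame

    function gotPoses(poses) {
      // Only store something if the model actually found a pose.
      if (poses.length > 0) {
        pose = poses[0].pose;
      }
    }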
Then in the draw function, I can draw something based on that pose. So, for example, let me give myself a red nose. [MUSIC PLAYING] So now if I run the sketch, ah, I got an error. Why did I get that error? The reason is that it hasn’t found a pose yet, so there is no nose for it to draw. So I should always check to make sure there is a valid pose first, [MUSIC PLAYING] and then draw that circle. And there we go. I now have a red dot always following my nose.
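In draw(), that might look like the following, guarded so nothing is drawn until the first pose arrives (the circle size is just a guess at what’s on screen):

    function draw() {
      image(video, 0, 0);
      // Only draw once gotPoses has stored at least one pose.
      if (pose) {
        fill(255, 0, 0);
        noStroke();
        ellipse(pose.nose.x, pose.nose.y, 32, 32); // red dot on the nose keypoint
      }
    }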
If you’re following along, pause the video and try to add two more points where your hands are. Now, there isn’t actually a hand keypoint; it’s a wrist keypoint. But that’ll probably work for our purposes. I’ll let you try that. [TICKING] [DING] How did that go? OK, I’m going to add it for you now. [MUSIC PLAYING]
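Those two extra circles might look like this, using the leftWrist and rightWrist keypoint names (sizes again arbitrary):

    function draw() {
      image(video, 0, 0);
      if (pose) {
        fill(255, 0, 0);
        noStroke();
        ellipse(pose.nose.x, pose.nose.y, 32, 32);
        // PoseNet has no "hand" keypoint, so the wrists stand in for the hands.
        ellipse(pose.leftWrist.x, pose.leftWrist.y, 32, 32);
        ellipse(pose.rightWrist.x, pose.rightWrist.y, 32, 32);
      }
    }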
Let’s see if this works. Whoo, this is working terribly; I’m only almost kind of getting it right. And there we go. But why is it working so poorly? Well, first of all, I’m barely showing; I’m only showing it from my waist up, and most likely the model was trained on full-body images. [MUSIC PLAYING] Now I’ve turned the camera to point at me over here, and I’m further away. And you can see how much more accurate this is, because it sees so much more of my body. I’m able to control where the wrists are and get pretty accurate tracking as I’m standing further away from the camera. There are also some other interesting tricks we could try. For example, I could estimate distance from the camera by looking at how far apart the eyes are. [MUSIC PLAYING] So for example here, I’m storing the right eye and left eye locations in separate variables, and then calling the p5 distance function to see how far apart they are. And then I can just take that distance and assign it to the size of the nose. So as I get closer, the nose gets bigger. And you almost can’t tell, because it’s sizing relative to my face. But it gives it more of the realistic appearance of an actual clown nose that’s attached, by changing its size according to the proportions of what it’s detecting in the face.
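A sketch of that eye-distance trick with p5’s dist() function, replacing the fixed-size nose from before:

    function draw() {
      image(video, 0, 0);
      if (pose) {
        // Eye-to-eye distance is a rough proxy for how close I am to the camera.
        const d = dist(pose.rightEye.x, pose.rightEye.y, pose.leftEye.x, pose.leftEye.y);
        fill(255, 0, 0);
        noStroke();
        ellipse(pose.nose.x, pose.nose.y, d, d); // the nose grows as I get closer
      }
    }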
You might be asking yourself, well, what if I want to draw all the points, all the points that it’s tracking? For convenience, I was referencing each point by name: right eye, left eye, nose, right wrist. But there’s actually a keypoints array that has all 17 points in it. So I can use that to just loop through everything if that’s what I want to do. [MUSIC PLAYING] So I can loop through all of the keypoints and get the xy of each one, [MUSIC PLAYING] and then I can draw a green circle at each location. Oops. So that code didn’t work, because I forgot that each element, each keypoint, is more than just an xy: it’s got the confidence score, it’s got the name of the part, and a position. So I need pose.keypoints[i].position.x and pose.keypoints[i].position.y. Now I believe this’ll work. And here we go. The only thing I’m not seeing are my ankles. Oh, there we go! It got kind of accurate there. Here’s my pose. OK, so you can see I’m getting all the points of my body right now, standing probably about six feet away from the camera.
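That loop over the keypoints array might look like this (green circles, arbitrary size):

    function draw() {
      image(video, 0, 0);
      if (pose) {
        // Each keypoint has a part name, a confidence score, and a position.
        for (let i = 0; i < pose.keypoints.length; i++) {
          const x = pose.keypoints[i].position.x;
          const y = pose.keypoints[i].position.y;
          fill(0, 255, 0);
          noStroke();
          ellipse(x, y, 16, 16); // green circle at each of the 17 keypoints
        }
      }
    }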
There’s one other aspect of this that I haven’t shown you yet. If you’ve seen demos of PoseNet and some of the examples, the points are connected with lines. So on the one hand, you could just memorize the connections, like always draw a line from the shoulder to the elbow and the elbow to the wrist. But PoseNet, I presume based on the confidence scores, will dynamically give you back which parts are connected to which parts. And that’s in the skeleton property of the object found in the array that was returned to us. So I could actually add a new global variable called skeleton; this would’ve been good for Halloween. Skeleton equals, and let me just stop this for a second, poses[0].skeleton. I can loop over the skeleton. [MUSIC PLAYING] And skeleton is actually a two-dimensional array, because in the second dimension it holds the two locations that are connected. So I can say a equals skeleton[i][0], and b is [MUSIC PLAYING] skeleton[i][1]. And then I can just draw a line between the two of them. [MUSIC PLAYING] I look at every skeleton connection, I get the two parts, part A and part B, and just draw a line between the x’s and y’s of each of those. [MUSIC PLAYING] Make it a kind of thicker line and give it the color white. And let’s see what this looks like. And there we go.
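Putting the skeleton piece together, a sketch of the updated gotPoses and the line-drawing loop; each entry of skeleton is a pair of keypoint objects, so the positions come out of a position property just like in the keypoints array:

    let skeleton; // pairs of keypoints that PoseNet reports as connected

    function gotPoses(poses) {
      if (poses.length > 0) {
        pose = poses[0].pose;
        skeleton = poses[0].skeleton;
      }
    }

    function draw() {
      image(video, 0, 0);
      if (pose) {
        // ... keypoint circles as before ...
        for (let i = 0; i < skeleton.length; i++) {
          const a = skeleton[i][0]; // one end of the connection
          const b = skeleton[i][1]; // the other end
          stroke(255);
          strokeWeight(4);
          line(a.position.x, a.position.y, b.position.x, b.position.y);
        }
      }
    }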
That’s pretty much everything you could do with the ml5 PoseNet function. So for you, you might try to do something like make googly eyes. That’s something I actually did in a previous video, where I looked at an earlier version of PoseNet. And you could also look at some
of these other examples that demonstrate other aspects. For example, you can actually
find the pose of a JPEG that you load rather than
images from a webcam. But what I want to
do, which I’m going to get to in a
follow-up video to this, is not take the outputs
and draw something. But rather, take these outputs
and feed them as training data into an ml5 neural network. What if I say, hey, every time I make this pose, label that a Y. And every time I make this pose, label that an M, a C, an A, you see where I’m going. Could I create a pose classifier? I can use all of the xy positions, label them, and train a classifier to make guesses as to my pose. This is very similar to what I did with the Teachable Machine image classifier. The difference is, with
the image classifier, as soon as I move the
camera to a different room with different lighting
and a different background with a different
person, it’s not going to be able to recognize
the pose anymore, because that was trained on the raw pixels. This is actually just trained
on the relative positions. So in theory, if somebody around the same size as me swapped in, it would recognize their pose. And there’s actually a
way that I could just normalize all the data,
so that it would work for anybody’s pose potentially. So you can train your
own pose classifier that’ll work generically in a
lot of different environments. So if you make something with ml5 PoseNet, or with PoseNet in another environment, please share it with me. I’d love to check it out. You can find the
code for everything in this video in the link
in this video’s description. And I’ll see you in the future
“Coding Train” ml5 Machine Learning Beginner,
whatever, something video [WHISTLE] Goodbye. [MUSIC PLAYING]

38 thoughts to “ml5.js Pose Estimation with PoseNet”

  1. The question is:
    Can it be done in Processing?
    Not just the pose estimation in the video, but other things that were shown in the latest neural network videos.

  2. I just want to stand up and applaud for the great work you are doing… You are funny, enthusiastic, full of so much energy literally all the time … Your tutorials are so easy to understand and follow… I am lucky that I came across a teacher like you…. Keep up the Good Work..!!

  3. 3:01 "if you found your way here it's because you are making some interactive projects". Well no no, we just love your videos and whether they are relevant or not to my daily life, I watch it chooo choooo! :p

  4. Hello Dan, congratulations on the content of the videos. I did a test with pathfind and I have a question, if you look at the example: http://rodrigo-kulb.com.br/rota It works sometimes and sometimes it doesn't work. Could you help?
    If you reload the page it works sometimes.
    Video => https://www.youtube.com/watch?v=jwRT4PCT6RU

  5. Teachable machines literally makes using ML super easy.. and I love how easy it is to use with p5js and python

  6. Very interesting material. Now, what I want to know is how to take these skeletons inputs and apply them onto a still image and use AI to animate that still image with the captured motion. Upcoming video, maybe? Please, please, pretty please???

  7. Thanks to Dan Oved who let me know that you can read more about the model and how it was trained in the research paper! https://arxiv.org/abs/1803.08225

  8. dear daniel,
    a few days ago i bought myself an arduino, and i've been playing with it since then, and to my honest surprise i've found out that the ide that the arduino uses is built on top of the one that.. processing uses! so, first of all, congratulations! but.. that brings me to my question to you: would you be willing to make tutorials on the visualization of arduino's data coming from its serial port in processing?
    also i would love to see you make some arduino project in your cabana, like a plant watering system, or a lamp that lights up either because of your movement or lack of sunlight.

  9. Hey Dan, I think you know that, but just in case…

    You can declare your variables inline like this

    let video, posenet , pose

    And yep, to make it more readable

    let video,
    ___posenet ,
    ___pose

    And yep , you can assign values right here

    let video = something,
    ___posenet = something,
    ___pose = something

    It's not so repetitive and it's faster to work with, since when you're adding a new line you're just adding a comma and a new variable.
    I mean yeah… if you're using semicolons after declaring it's the same thing, but if you're not…
    The world will not be destroyed, you know 😀

  10. Awesome, really appreciate what you are doing!
    The applicability of this topic is huge.
    keep doing and Thank you!

  11. Omg! I was looking for something like this…. This is awesome! I got a perfect idea for my next project 🙂 Thanks man.

  12. Thanks for this Dan.
    I was wondering if you could make car recognition using license and color. From photo detect if it is vehicle and from there extract license plate & vehicle color and output it using only javascript? I have seen this in python but not javascript, would be glad if there is video from you on this 😀

  13. I would enjoy some Java implementations/videos as well, with Java libraries like Processing, JavaFX, and deeplearning4j/TensorFlow Java on Eclipse, or maybe even, though I know it's a stretch, some C++ with SFML and open source libraries like ImGui.
