CS50 2018 – Lecture 4 – Data Structures

CS50 2018 – Lecture 4 – Data Structures

[MUSIC PLAYING] DAVID J. MALAN: All right, this is
CS50 and this is a lecture four. So we’re here in beautiful
Lowell Lecture Hall and Sanders is in use today. And we’re joined by some
friends that will soon be clear and present in just a moment. But before then, recall that last
time we took a look at CS50 IDE. This was a new web-based programming
environment similar in spirit to CS50 Sandbox and CS50 Lab,
but added a few features. For instance, what features
did it add to you– to your capabilities? Yeah? AUDIENCE: Debugger. DAVID J. MALAN: What’s that? AUDIENCE: The debugger. DAVID J. MALAN: The debugger. So debug50, which opens
that side panel that allows you to step through your code,
step by step, and see variables. Yeah? AUDIENCE: Check50. DAVID J. MALAN: Sorry, say again? AUDIENCE: Check50. DAVID J. MALAN: Check50 as well,
which is a CS50 specific tool that allows you to check the
correctness of your code much like the teaching fellows
would when providing feedback on it. Running a series of tests
that pretty much are the same tests that a
lot of the homework’s will encourage you
yourself to run manually, but it just automates the process. And anything else? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: So that is true too. There’s a little hidden Easter egg
that we don’t use this semester, but yes indeed. If you look for a
small puzzle piece, you can actually convert your C code
back to Scratch like puzzle pieces and back and forth, and back to forth,
thanks to Kareem and some of the team. So that is there, but by
now, it’s probably better to get comfortable with text as well. So there’s a couple of
our other tools that we’ve used over time of course
besides check50 and debug50. We’ve of course used printf
and when is printf useful? Like when might you want to use it
beyond needing to just print something because the problem set tells you to. Yeah? AUDIENCE: To find where your bug is. DAVID J. MALAN: Yeah, so
to find where your bug is. If you just, kind of, want to print out
variables, value or some kind of text so you know what’s going on
and you don’t necessarily want to deploy debug50, you can do that. When else? AUDIENCE: If you have a long
formula for something [INAUDIBLE] and you want to see [INAUDIBLE]. DAVID J. MALAN: Good. Yeah. AUDIENCE: How running– like
going through debug50 50 times. DAVID J. MALAN: Indeed. Well, in real life– so you
might want to use printf when you have maybe a nested loop, and
you want to put a printf inside loop so as to see when it kicks in. Of course, you could
use debug50, but you might end up running debug50 or clicking
next, next, next, next, next, next, next so many times that
gets a little tedious. But do keep in mind, you can just put
a breakpoint deeper into your code as well and perhaps remove an
earlier breakpoint as well. And honestly, all the time, whether
it’s in C or other languages, do I find myself occasionally using
printf just to type out printf in here just so that I can literally see if
my code got to a certain point in here to see if something’s printed. But the debugger you’re
going to find now and hence forth so much more
powerful, so much more versatile. So if you haven’t already gotten to
the habit of using debug50 by all means start and use those breakpoints
to actually walk through your code where you care to see what’s going on. So style50, of course, checks the style
of your code much like the teaching fellows might, and it
shows you in red or green what spaces you might want to
delete, what spaces you might want to add just to pretty things up. So it’s more readable
for you and others. And then what about help50? When should you instinctively
reach for help50? AUDIENCE: When you don’t
understand an error message. DAVID J. MALAN: Exactly. Yeah, when you don’t
understand an error message. So you’re compiling something. You’re running a command. It doesn’t really quite work and
you’re seeing a cryptic error message. Eventually, you’ll get the muscle
memory and the sort of exposure to just know, oh, I
remember what that means. But until then, run help50 at the
beginning of that same command, and it’s going to try to
detect what your error is and provide TF-like feedback on
how to actually work around that. You’ll see two on the course’s
website is a wonderful handout made by Emily Hong, one of
our own teaching fellows, that introduces all of
these tools, and a few more, and gets you into the habit
of thinking about things. It’s kind of a flow chart. If I have this problem,
then do this or else if I have this problem
do this other thing. So to check that out as well. But today, let’s introduce
really the last, certainly for C, of our command line tools
that’s going to help you chase down problems in your code. Last week, recall that we had
talked about memory a lot. We talked about malloc,
allocating memory, and we talked about freeing
memory and the like. But it turns out, you
can do a lot of damage when you start playing with memory. In fact, probably by now, almost
everyone– segmentation fault? [LAUGHTER] Yeah, so that’s just one of the
errors that you might run into, and frankly, you might have
errors in your code now and hence forth that have bugs
but you don’t even realize it because you’re just getting lucky. And the program is just not
crashing or it’s not freezing, but this can still happen. And so Valgrind is a command
line program that is probably looks the scariest of
the tools we’ve used, but you can also use
it with help50, that just tries to find what are called
memory leaks in your program. Recall that last week
we introduced malloc, and malloc lets you allocate memory. But if you don’t free that memory, by
literally calling the free function, you’re going to constantly ask your
operating system, MacOS, Linux, Windows, whatever, can
I have more memory? Can I have more memory? Can I have more memory? And if you never, literally, hand it
back by calling free your computer may very well slow down
or freeze or crash. And frankly, if you’ve ever had that
happen on your Mac or PC, very likely that’s what some human accidentally did. He or she just allocated
more and more memory but never really got around
to freeing that memory. So Valgrind can help you find those
mistakes before you or your users do. So let’s do a quick example, let
me go CS50 IDE, and let me go ahead and make one new program here. We’ll call it memory.c because
we’ll see later today how I might chase down those memory leaks. But for now, let’s start with something
even simpler, which all of you may be done by now, which is
to accidentally touch memory that you shouldn’t, changing it, reading
it and let’s see what this might mean. So let me do the
familiar at the top here. Include standard IO. Well, let’s not even do that yet. Let’s just do this first. Let’s do int, main(void),
just to start a simple program and in here let me go ahead and
just call a function called f. I don’t really care what
its name is for today. I just want to call a function
f, and then that’s it. Now this function f, let me go ahead
and define it as follows, void f(void). It’s not going to do
much of anything at all. But let’s suppose, just for the sake
of discussion, that f’s purpose in life is just to allocate memory
for whatever useful purpose, but for now it’s just
for demonstration’s sake. So what’s the function with
which you can allocate memory? AUDIENCE: Malloc. DAVID J. MALAN: Malloc. So suppose I want malloc
space for, I don’t know, something simple like just one integer. We’re just doing this for
demonstration purposes, or actually let’s do more,
10 integers, 10 integers. I could, of course, do– well, give me
10, but how many bytes do what I want? How many bytes do I
need for 10 integers? AUDIENCE: sizeof(int). DAVID J. MALAN: Yeah, so I
can do literally sizeof(int) and most likely the size
of an int is going to be? AUDIENCE: Four. DAVID J. MALAN: Four, probably. On many systems today, it’s
just 4 bytes or 32 bits, but you don’t want to hard code that
lest someone else’s computer not use those same values. So the size of an int. So 10 times the size of an int. Malloc returns what type of data? What does that hand me back? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Yeah, returns
an address or a pointer. Specifically, the address, 100, 900,
whatever, of the chunk of memory it just allocated for you. So if I want to keep that around,
I need to declare a pointer. Let’s just call it x for today
that stores that address. Could call it x, y, z, whatever, but
it’s not an int that it’s returning. It’s the address of an int. And remember, that’s what
the star operator now means. The address of some data type. It’s just a number. All right, so now if I were to– first, let’s clean this up. Turns out that you use malloc,
I need to use stdlib.h. We saw that last week, albeit
briefly, and then of course if I’m going to call f, what do
I have to do to fix this code? AUDIENCE: You need to declare. DAVID J. MALAN: Yeah, I
need to declare it up here, or I could just move f’s
implementation up top. So I think this works, even
though this program at the moment is completely stupid. It doesn’t do anything useful,
but it will allocate memory. And I’ll do something
with it as follows. If I want to change the first
value in this chunk of memory, well how might I do that? Well, I’ve asked the computer
for 10 integers or rather space for 10 integers. What’s interesting about
malloc is that when it returns a chunk of memory for
you it’s contiguous, back-to-back. And when you hear
contiguous or back-to-back, what kind of data structure
does that recall to mind? AUDIENCE: An array. DAVID J. MALAN: An array. So it turns out we can treat
this just random chunk of memory like it’s an array. So if we want to go to the first
location in that array of memory, I can just do this and
put in the number say 50. Or if I want to go to the
next location, I can do this. Or if I want to do the next
location, I can do this. Or if I want to go to the last
location, I might do this, but is that good or bad? AUDIENCE: Bad. DAVID J. MALAN: Why bad? AUDIENCE: It’s– it’s out of bounds DAVID J. MALAN: Yeah,
so it’s out of bounds. Right? This is sort of week one style
mistakes when it came to loops. Recall, with for loops or while
loops, you might go a little too far, and that’s fine. But now we actually will
see we have a tool that can help us notice these things. So hopefully, just visually, it’s
apparent that what I have going on here is just– on line 12,
I have a variable x that storing the address
of that chunk of memory. And then on line 13, I’m just
trying to access location 10 and set the value 50 there. But as you note, there
is no location 10. There’s location 0, 1, 2, 3, all
the way through 9, of course. So how might we detect
this with a program? Well, let me go ahead and increase
my terminal window just a bit here, save my file, and let me
go ahead and compile make memory. OK, all is well. It compiled without any
error messages, and now let me go ahead and run memory, enter. All right, so that worked pretty well. Let’s actually be a little more
explicit here just for good measure. Let me go ahead and print something out. So printf, %i for an integer, and
let’s make it just more explicit. You inputted %i and
then comma x bracket 10. And what do I have to
include you use printf? AUDIENCE: stdio.h. DAVID J. MALAN: Yeah, so stdio. So let’s just quickly
add that, stdio.h, save. All right, let me recompile
this, make memory, enter. And now let me go ahead and do ./memory. Huh? Feels like it’s a correct program. And yet, for a couple of weeks now
we’ve been claiming that mm-hmm, don’t do that. Don’t go beyond the
boundaries of your array. So how do we reconcile this? Feels like buggy code or at least
we’ve told you it’s buggy code, and yet it’s working. Yeah? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: That’s a
good way of putting it. AUDIENCE: It’s still very similar. We want that. DAVID J. MALAN: OK. AUDIENCE: So we can theoretically– it just created a program. DAVID J. MALAN: Yeah, and I
think if I heard you correctly, you said C doesn’t
scream if you go too far? AUDIENCE: Yeah. DAVID J. MALAN: Yeah, OK. So that’s a good way of putting it. Like, you can get
lucky in C. And you can do something that is objectively,
pedagogically, like technically wrong, but the computer’s not going to crash. It’s not going to freeze
because you just get lucky. Because often, for
performance reasons, when you allocate space for
10 integers, you’re actually going to get
a chunk of memory back that’s a little bigger than you need. It’s just not safe to assume
that it’s bigger than you need, but you might just get lucky. And you might end up having more memory
that you can technically get away with touching or accessing or changing,
and the computer’s not going to notice. But that’s not safe because
on someone else’s Mac or PC, their computer might just be operating
a little bit differently than yours, and bam, that bug is going
to bite them and not you. And those are the hardest, most annoying
bugs to chase down as some of you might have experienced. Right? It works on your computer but
not a friends or vise versa. These are the kinds of
explanations for that. So Valgrind can help us track down
even these most subtle errors. The program seems to be working. Check50 or tools like
it might even assume that it’s working because it
is printing the right thing, but let’s take a look at what
this program Valgrind thinks. Let me increase the size of
the terminal window here, and go ahead and type
in Valgrind ./memory. So same program name ./memory but I’m
prefixing it with the name Valgrind. All right? Unfortunately, Valgrind
is really quite ugly, and it prints out a whole
bunch of stuff here. So let’s take a look. At the very top, you’ll see
all these numbers on the left, and that’s just an
unfortunate aesthetic. But we do see some useful information. Invalid read of size 4 and
then it has these cryptic looking letters and numbers. What are those? They’re just addresses and hexadecimal. It doesn’t really matter
what they are, but Valgrind can tell us where the memory is
that’s acting up suspiciously. You can then see next to that,
that Valgrind is pointing to function f on memory. c 15th line. So that’s perhaps helpful,
and then main on line 8 because that’s the
function that was called. So Valgrind is actually kind of nice in
that it’s showing us all the functions that you called from bottom up,
much like the stack from last week. And so something’s going wrong
line 15, and if we go back to that, let’s see line 15 was– well, sure enough. I’m actually trying to
access that memory location and frankly I did it on line 14 as well. So hopefully fixing one or both
of those will address this issue. And notice here, this frankly just
gets overwhelming pretty quickly. And then, oh, 40 bytes in one block
are definitely lost in lost record. I mean, this is the problem
with Valgrind, honestly. It was written some years ago,
not particularly user friendly, but that’s fine we have
a tool to address this. Let me go ahead and rerun
Valgrind with help50, enter, and see if we can’t
just assist with this. All right, so still the same amount of
black and white input but down here now help50 is noticing, oh, I can help
you with an invalid write of size 4. So it’s still at the same
location, but this time– or rather same file,
memory.c but line 14. And we propose, looks like you’re
trying to modify 4 bytes of memory that isn’t yours, question mark. Did you try to store something
beyond the bounds of an array? Take a closer look at
line 14 of memory.c. So hopefully, even though
Valgrind’s output is crazy esoteric, at least that yellow output will
point you toward, ah, line 14. I’m indeed touching 4 bytes,
an integer, that shouldn’t be. And so let’s go ahead and fix this. If I go into my program,
and I don’t do this. Let’s change it to location 9,
and location 9 here and save. Then let me go ahead and
rerun Valgrind without help50. All right, progress except– oops. Nope, no progress. I skipped the step. Yeah, I didn’t recompile it. A little puzzled why
I saw the same thing. So now let’s rerun Valgrind
and here it seems to be better. So I don’t see that
same error message up at the very top like we did before, but
notice here, 40 bytes in one blocks. OK, that was bad grammar in
the program, but are definitely lost in loss record 1 of 1. So I still don’t quite understand that. No big deal. Let’s go ahead and run help50 and
see what the second of two errors apparently is here. So here it’s highlighting those lines. 40 bytes and one blocks are definitely
lost, and looks like your program leaked 40 bytes of memory. Did you forget the free memory
that you allocated with malloc? Take a closer look at
line 13 of memory.c. So in this case line 13
indeed has a call to malloc. So what’s the fix for this problem? AUDIENCE: Free. DAVID J. MALAN: Per help50
or your own intuition? What do I have to add to this program? AUDIENCE: Free. AUDIENCE: Free. Yeah, free, and where does that go? Right here. So we can free the memory. Why would this be bad? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Exactly. We’re freeing the memory, which is
like saying to the operating system, I don’t need this anymore. And yet, two lines later we’re
using it again and again. So bad. We didn’t do that mistake
last week, but you should only be freeing memory
when, literally, you’re ready to free it up and give
it back, which should probably be at the end of the program. So let me go ahead and re-save
this, Open, up my terminal window, recompile it this time, and now,
let me run Valgrind one last time without help50. And still a little verbose, but
zero errors, from zero contexts. That sounds pretty good. And moreover, it also explicitly
says, all heap blocks were freed. And recall that the heap,
is that chunk of memory that we drew visually up here, which
is where malloc takes memory from. So, done. So this is kind of the
mentality with which to have when approaching the
correctness of your code. Like, it’s one thing to run sample
inputs, or run the program like I did. All looked well. It’s one thing to run tools like
check50, which we humans wrote. But we too are fallible, certainly,
and we might not think of anything. And thankfully, smart humans have
made tools, that at first glance, might be a little hard to use. Like debug 50, as is Valgrind now. But they ultimately help you
get your code 100% correct without you having to struggle visually
over just staring at the screen. And we see this a lot in
office hours, honestly. A lot of students, to their credit,
sort of reasoning through, staring at the screen, just trying to
understand what’s going wrong, but they’re not taking any additional
input other than the characters on the screen. You have so many tools that can feed
you more and more hints along the way. So do acquire those instincts. Any questions on this? Yeah? AUDIENCE: Sir, if you had a main
function that took arguments. Would you run Valgrind with
those arguments as well? DAVID J. MALAN: Yes, indeed. So Valgrind works just like
debug 50, just like help50. If you have command line
arguments, just run them as usual, but prefix your command with Valgrind,
or maybe even help50 Valgrind, to help one with the other. Good question. Other thoughts? Yeah? AUDIENCE: Where does
the data go [INAUDIBLE]?? DAVID J. MALAN: Good question. So at the end of the
day, think about what’s inside the computer, which
is just something like this. So physically, it’s
obviously still there. It’s just being treated
by the operating system– Mac, OS, Windows, Linux, whatever,
as like a pool of memory. We keep drawing it as a grid that
looks a little something like this. So the operating systems job is to just
keep track of which of those squares is in use, thanks to malloc. And which has been freed. And so you can think of
it as having little check marks next to them saying,
this is in use, this is in use, these others are not in use. So they just go back on the so-called
free list into that pool of memory. Good question. If you take a higher level course
on operating systems in fact, or CS61 or 161 at Harvard, you’ll
actually build these kinds of things yourself. And implement tools
like, malloc, yourself. Yeah? AUDIENCE: So why did we have to allocate
memory in this case, and what happens [INAUDIBLE]? DAVID J. MALAN: Good question. Why did we have to allocate
memory in this case? We did not. This was purely, as mentioned,
for demonstration purposes. If we had some program
in which we wanted to allocate some amount of memory,
then this is how we might do it. However, a cleaner
way to do all of this, would have been to say, hey, computer,
give me 10 integers like this, and not have to worry
about memory management. And that’s where we began in week
one, just using arrays on the stack, so to speak. Not using malloc at all. So the point is only, that once
you start using malloc, and free, and memory more generally, you
take on more responsibilities than we did in week one. Good question. And the others? All right. So, turns out, there’s one
more tool, in all seriousness. This is the thing. [? DDB50. ?] So debug 50 is an allusion
to a very popular tool called, GDB 50, [? Gnu ?] debugger. It’s an older tool that you
won’t use at the command line, but it’s what makes debug 50 work. Turns out, there’s a thing. And there’s an actual
Wikipedia article that you might have clicked on in my email last
night, called rubber duck debugging. And frankly, you don’t have to go as
all out, as excessive, as we did here, but the purpose of this technique,
of rubber duck debugging, is to keep, literally, like a rubber
duck on your shelf, or on your desk. And when you have a bug and you don’t
have the luxury of a teaching fellow, or a roommate who took CS50, or a more
technical friend who can help walk you through your code, literally,
start walking through your code verbally, talking to the duck saying,
well, online 2, I’m declaring main, and on line 3, I’m allocating
space for an array. And then, on line 4, I’m calling– ah! That’s what I’m doing wrong. So if any of you have ever had that
kind of moment, whether in office hours, or alone, where you’re
either talking in your head, or you’re talking through
your code to someone else. And here, she doesn’t
even have to respond. You just hear yourself saying the
wrong thing, or having that aha moment. You can approximate that by just keeping
one of these little guys on your desk, and have that conversation. And it’s actually not as crazy
sounding as it actually is. It’s that process of just talking
through your code logically, step by step, in a way that you can’t
necessarily do in your own mind. At least I can’t. When you hear yourself
say something wrong, or that didn’t quite
follow logically, bam, you can actually have that aha moment. So on the way out today, by all
means, take any one of these ducks. That took quite a long, time for
[? Colten ?] to lay out today. And we’ll have more at office hours in
the weeks to come, if you would like. So some of you might recall such
a duck from [? Currier ?] House last year too, which was
a cousin of his as well. All right. So that is rubber duck debugging. Now, last week, recall that we
began to take off training wheels. We’d use for a few
weeks, the CS50 library. And that’s kind of in the past now. That was just a technique,
a tool, via which we could get user input a little
more pleasantly, than if we actually started dealing with memory early on. And we revealed last week that
a “string”, quote, unquote, is just what, underneath the hood in C? Say again. An array of characters. And even more specifically, it’s a
synonym S-T-R-I-N-G for what actual data type? char star, as we’ve called it. So a char star is just
the computer scientists way of describing a
pointer to a character, or rather the address
of a character, which is functionally equivalent to saying an
array of memory, or sequence of memory. But it’s kind of the more precise,
more technical way of describing it. And so now that we know that we have
char stars underneath the hood, well, where is all of that coming from? Well, indeed, it maps
directly to that memory. We keep pointing out that something
like this is inside of your computer. And we can think of the memory
as just being chunks of memory, all of whose bytes are numbered. 0 on up to 2 gigabytes, or 2
billion, whatever the value might be. But of course last week, we pointed
out that you think about this memory not as being hardware per se, but as
just being this pool of memory that’s divided into different regions. The very top of your
computer’s memory, so to speak, is what we call the text segment. And what goes in the text
segment of your computer’s memory when you’re running a program? Text is like, poor choice of
words, frankly, but what is it? Say again. AUDIENCE: File Headers? DAVID J. MALAN: Not the
file headers, in this case. This is in the context of running a
program, not necessarily saving a file. Yeah? AUDIENCE: String literals. DAVID J. MALAN: Not
string literals here, but they’re nearby, actually, in memory. AUDIENCE: Functions. DAVID J. MALAN: Functions, closer. Yeah. The text segment of
your computer’s memory is where, when you double
click a program to run it, or in Linux, when you do dot
flash something, to run it. That’s where the zeros and ones of
your actual program, the machine code, that we talked about in week
zero, is just loaded into RAM. So recall from last week, that, you
know, anything physical in this world– hard drives, solid
state drives, is slow. So those devices are slow, but RAM, the
stuff we keep pulling up on the screen, is relatively fast. If only because it has no moving parts. It’s purely electronic. So when you double click a
program on your Mac or PC, or do dot slash something
in Linux, that is loading from a slow
device, your hard drive, where the data is stored long
term, into RAM or memory, where it can run much more quickly and
pleasurably in terms of performance. And so, what does this
actually mean for us? Well, it’s got to go somewhere. We just decided, humans,
years ago that it’s going to go at the top, so to
speak, of this chunk of memory. Below that though, are the more
dynamic regions of memory– the stack and the heap. And we said this a moment ago, and last
week as well, what goes on the heap? Or who uses the heap? AUDIENCE: Dynamic memory. DAVID J. MALAN: Dynamic memory. Any time you call malloc, you’re
asking the operating system for memory from the so-called heap. Anytime you call free, you’re sort
of conceptually putting it back. Like, it’s not actually going anywhere. You’re just marking it as available for
other functions and variables to use. The stack, meanwhile, is used for what? AUDIENCE: Local variables. DAVID J. MALAN: Local variables
and any of your functions. So main, typically takes a
sliver of memory at the bottom. If main calls another function, it
gets a sliver of memory above that. If that function calls one, it
gets a sliver of memory above that. So they each have their own
different regions of memory. But of course, these arrows,
both pointing at each other, doesn’t seem like such a good design. But the reality, is
bad things can happen. You can allocate so much memory that,
bam, the stack overflows the heap. Or the heap overflows the stack. Thus was born websites like
Stack Overflow, and the like. But that’s just a reality. If you have a finite amount
of memory, at some point, something’s going to break. Or the computer’s going to have
to say, mm-mm, no more memory. You’re going to have to quit some
programs, or close some files, or whatnot. So that was only to say that
that’s how the memory is laid out. And we started to explore
this by way of a few programs. This one here– it’s a little dark here. This one here, was a swap function. Now it’s even darker. It was a swap function that actually
did swap two values, A and B. But it didn’t actually work
in the way we intended. What was broken about this
swap function last week? Like, I’m pretty sure it worked. And when our brave volunteer came up and
swapped the orange juice and the milk, that worked. So like, the logic was correct, but
the program itself did not work. Why? AUDIENCE: It changed the
values of the copy variables. DAVID J. MALAN: Exactly. It changed values in the
copies of the variable. So recall, that when
main was the function we called, and it had two values, x
and y, that chunk of memory was here. That chunk of memory was here. And it had like the numbers 1 and 2. But when it called the swap function,
that got its own chunk of memory. So main was at the bottom,
swap was above that. It had its own chunks of
memory called, a and b, which initially, got the values 1 and 2. 1 and 2 were indeed
successfully swapped, but that had no effect on x and y. So we fixed that. With the newer version of
this program, of course, it looked a lot more cryptic at
first glance, but in English, could someone just describe
what it is that happens in this example that was more correct? Like, what does this
program do line by line? Yeah? AUDIENCE: Instead of passing
copies of the variables, you pass pointers to their addresses. DAVID J. MALAN: Exactly. Instead of passing the values of
the variables, thereby copying them, it passes the addresses
of those variables. So that’s like saying, I don’t
technically care where it is in memory, but I do need to know that
it is somewhere in memory. So instead of passing
an x in the number 1, let’s suppose that x
is at location 100– my go to example. It’s actually the number 100
that’s going to go there. And if y is at the location
like, 104, well, it’s 104 that’s going to go there, which
are not the values we want to swap, but those are sort of like little
maps, or breadcrumbs if you will, that lead us to the right location. So that when we execute this
code, what we’re ultimately swapping in those three lines, is
this and this, and all along the way, recall, we’re using a
temporary variable there that can be just thrown away after. So that’s what pointers
allowed us to do. And that’s what allowed us to actually
change values on the so-called stack, even by calling on other function. All right. Any questions then, on where we left off
last time with the stack and with swap? No? All right. So recall we introduced Binky as
well, who lost his head at one point, but why? What went horribly, horribly awry
with this scene from last week’s film from Stanford? Binky was doing everything
correctly, right? Like, moving values. 42 was successful. And then, yeah? AUDIENCE: He tried to
dereference something that wasn’t pointing to any actual address. DAVID J. MALAN: Exactly. He tried to dereference a pointer, an
address, that wasn’t actually pointing to a valid address. Recall that this was the line in code
in question that was unlucky and bad. Star y, means, go to the address
in y, and do something to it. Set it equal to the number 13. But the problem was, that in
the code we looked at last week, all we did at the start was say, hey,
computer give me a pointer to an int, and call it x. Do the same, and call it y. Allocate space and point x at it. But we never did the same for y. So whereas x contained, last week, the
address of an actual chunk of memory, thanks to malloc, what did y
contain at that point in the story? The yellow line there. What did y contain? What value? AUDIENCE: Null. DAVID J. MALAN: Null. Maybe. Maybe. But it’s not obvious because there’s
no mention of null in the program. We might get lucky. Null is just 0. And sometimes we’ve seen that 0 are
the default values in a program. So maybe. But I say, maybe, and I’m hedging why. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Yeah. And it doesn’t allocate– well,
allocate, is not quite the right word. That suggests you are
allocating actual memory. It’s a garbage value. There’s something there. Right? My Mac has been running for a few hours. And your Macs, and PCs, and phones,
are probably running all day long. Or certainly when the lid is up. And so, your memory is getting
used, and unused, and used. Like, lots of stuff is going on. So your computer is not filled
with all zeros or all ones. If you look at it at some
random point in the day, it’s filled with like bunches
and bunches of zeros and ones from previous programs
that you quit long ago. Windows you have in the
background and the like. So, the short of it
is, when you’re running a program for the first time, that’s
been running now for some time, it’s going to get messy. That big rectangle of memory is
going to have some ones over here some zeros over here and vise versa. So they’re garbage values, because
those bytes have some values in them. You just don’t necessarily
know what they are. So the point is, you should
never ever dereference a pointer that you have not set yourself. Maybe you will crash. Maybe it won’t crash. Valgrind can help you find
these things but sometimes. But it’s just not a safe operation. And lastly, the last thing
we introduced last week, which will be the stepping stone for
what problems we’ll solve this week, was struct. So struck is kind of cool, in that
you can design your own custom data structures. C is pretty limited out
of the box, so to speak. You only have chars and boules,
and floats, and ints, and doubles, and longs, and str– well, we don’t even
have strings, per se. So it doesn’t really come with many
features, like a lot of languages do. Like Python, which we’ll
see in a few weeks. So with struct in C,
you have the ability to solve some problems of your own. For instance, with the
struct, we can actually start to implement our own features. Or our own data types. For instance, let me go up here. And let me go ahead and
create a file called say, student, or rather destruct dot h. So recall that dot h is a header file. Thus far, you have used header
files that other people made. Like, CS50 dot h, and standard IO
dot h, and standard [? lid ?] dot h, but you can make your own. Header files are just files that
typically contain code that you want to share across multiple programs. And we’ll see more of this in time. So let me go ahead and
just save this file. And suppose that I want to
represent a student in memory. A student of course, is
probably going to have what? For instance, how about
a string for their name, a string for their dorm– but
string is kind of two weeks ago. Lets call this char star. And lets call name, char star. And so you might want to associate like,
multiple pieces of data with students. Right? And you don’t want to have
multiple variables, per se. It would be nice to kind of
encapsulate these together. And recall at the very
end of last week, we saw this feature where you
can define your own type, with typedef, that is
a structure itself. And you can give it a name. So in short, simply by executing
this these lines of code, you have just created
your own custom data type. It’s now called student. And every student in the world
shall have, per this code, a name and a dorm associated with them. Now, why is this useful? Well the program, we looked at
the very end of last time looked a little something like this. Instruct zero dot c,
we had the following, I first allocated some
amount of space for student. I asked the user what’s the
enrollment in the class or whatnot? That gives us an int. And then, we allocated an array of
type student, called students, plural. This was an alternative,
recall, to doing something like this, string names enrollment,
and string dorms enrollment. Which would work. You could have two separate
arrays, and you’d just have to remember that name zero
and dorm zero is the same human. But why do that if you
can keep things together. So with structs, we
were able to do this. Give me this many student structures,
and call the whole array, students. And the only new syntax we introduce to
satisfy this goal, was what operator? AUDIENCE: The dot. DAVID J. MALAN: The dot. Yeah. So in the past, recall from like
week two, we introduced arrays. And arrays allow you to do
square bracket notation. So that is no different
from a couple of weeks back. But if your array is not storing
just integers, or chars, or floats, or whatever, it’s actually storing
a structure, like a student, you can get at that student’s name
by literally just saying dot name. And you can get at their
dorm by doing dot dorm. And then everything else is the same. This is what’s called, encapsulation. And it’s kind of like a fundamental
principle of programming where, if you have some real
world entity, like a student, and you want to represent
students with code, yeah, you can have a bunch of arrays that all
have called names, dorms, emails, phone numbers, but that just gets messy. You can instead encapsulate all of that
related Information about a student into one data structure so that now you
have, per week zero, an abstraction. Like, a student is an abstraction. And if we break that abstraction,
what is a student actually? Not in the real world, but
in our code world here? Student is an abstraction. It’s a useful word, all of us can
kind of agree means something, but technically, what
does it apparently mean? A student is actually a name in
a dorm, which really kind of is diminutive to everyone in this
room, but we’ve distilled it in code to just those two values. So there we have encapsulation. You’re kind of encapsulating
together multiple values. And you’re abstracting away
just have a more useful term, because no one is going to want
to talk in terms of lines of code to describe anything. So, same topic as in the past. So, now we have the ability to come
up with our own custom data structures it seems. That we can store anything
inside of them that we want. So let’s now see how
poorly we’ve been designing some things for the past few weeks. So it turns out that much
of the code, hopefully we’ve been writing in recent
weeks has been correct, but we’ve been not necessarily
designing solutions in the best way. Recall that when we have
this chunk of memory, we’ve typically treated
it as at most, an array. So just a contiguous chunk of memory. And thanks to this very simple
mental model, do we get strings, do we get arrays of students now. But arrays aren’t necessarily the
best data structure in the world. Like, what is a downside of an array
if you’ve encountered ones thus far. In C, what’s a downside of an array? Yeah? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Can or cannot? AUDIENCE: Cannot. DAVID J. MALAN: You cannot. That is true. So in C, you cannot mix data
types inside of an array. They must all be ints, they must all
be chars, they must all be students. It’s a bit of a white lie
because technically, you can have something called a void star,
and you can actually map– but yes. That is true though, strictly
speaking– cannot mix data types. Though frankly, even though
other languages let you do that, it’s not necessarily the
best design decision. But sure, a limitation. Other thoughts. Yeah? AUDIENCE: The size cannot change. DAVID J. MALAN: The size cannot change. Let’s focus on that one. Because that’s sort of even
more constraining it would seem. So if you want an array for,
say, two values, what do you do? Well, you can do something like
int, x, bracket, 2, semi-colon. And what does that actually give you
inside of your computer’s memory? It gives you some chunk
that we’ll draw a rectangle. This is location 0. This is location 1. Suppose that, oh, a few minutes
later, you change your mind. Oh, darn, I just took a– I want to type in a
third value, or I want to add another student to the array. Where do you put that? Well, you don’t. If you want to add a third
value to an array of size 2, what’s your only option in C? AUDIENCE: You make a new array. DAVID J. MALAN: You make a new array. So literally. And if this array had
the number like 42, and this had the number 13, the only
way to add a third number is to allocate a second array, copy the values into
the same locations, 42, 13, and then, we’ll add another value, 50. And then, so that you’re not
using up twice as much space almost permanently, now you
can sort of free somehow, or stop using that chunk of memory. So that’s fine. It’s correct what we just did. But what’s the running
time of that process? Recall a couple of weeks ago, we started
talking about efficiency and design. What’s the running time
of resizing an array. AUDIENCE: Too long. DAVID J. MALAN: Say Again. AUDIENCE: I said, too long. DAVID J. MALAN: Too long. Fair. But let’s be more precise. Big o of– big o of what? AUDIENCE: N. DAVID J. MALAN: N. What’s n? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: OK. True. But what does n represent? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Yeah. So you don’t actually have to not know. It’s just a general answer. In this case, however long
the array is, call it n. It is that many steps to
resize it into that plus 1. Technically it’s big o, over n, plus 1. But recall in our discussion,
“The big o notation,” we just ignore the smaller terms– the plus
1s, the divided by 2s, the plus n. We focus only on the most powerful
term in the expression, which is just n here. So yes, if you have an array
of size 2, and you resize it into an array of size 3,
or really, n plus 1, that’s going to take me roughly n steps. Technically n plus 1 steps. But n steps. Ergo big o of n. So it’s a linear process. So possible but not
necessarily the fastest thing because he literally had to
move all those damn values around. So what would be better than this? And if you’ve programed before, you
might have the right instincts already. How do we solve this problem? Yeah? AUDIENCE: Would you allocate more
memory at the end of the array? DAVID J. MALAN: Reallocate more
memory at the end of the array. So it turns out c does have
a function called, realloc. Perfectly, if not obviously,
named that reallocates memory. And if you pass it, the address of
a chunk of memory you’ve allocated, and the operating system
notices, oh, yeah you got lucky. I’ve got more memory at
the end of this array, it will then allocate that additional
RAM for you, and let you use it. Or worst case, if there’s
nothing available at the end of the array in memory,
because it’s being used by something else in your program. That’s fine. Realloc will take on the responsibility
of creating another array somewhere in memory, copying all of
that data for you into it, and returning the address
of that new chunk of memory. Unfortunately, that’s still linear. Yeah? AUDIENCE: Is this all
being done in the heap? Or– DAVID J. MALAN: This is
all being done in the heap. Malloc, and realloc, and
free, all operate on the heap. Yes. So that is a solution, but it doesn’t
really speak to the efficiency. Yeah? AUDIENCE: Could you use linked list? DAVID J. MALAN: Yeah. What is a linked list? Go ahead. AUDIENCE: It’s when you have an element
that points to different elements. DAVID J. MALAN: OK. Points to other elements. Yeah. So let me speak to what’s
the fundamental issue here. The fundamental problem is much like
painting yourself into a corner, so to speak, as the cliche goes. With an array, you’re deciding in
advance how big the data structure is and committing to it. Well, what if you just do the opposite. Don’t do that. If you want initially, room for
just one value, say one integer, only ask the computer for that. Give me space for one integer and
I’ll put my number 42 in here. And then, if and only if,
you want a second integer, do you ask the computer
for a second integer. And so the computer, as by a malloc,
or whatnot, will give you another one like, the number 13. And if you want a third, just ask the
same question of the operating system. Each time just getting
back one chunk of memory. But there’s a fundamental gotcha here. There’s always a trade off. So yes, this is possible. You can call malloc three times. Each time asking for a chunk of
memory of size 1, instead of size 3, for instance. But what’s the price you pay? Or what problem do we
still need to solve? Yeah? AUDIENCE: They’re not
stored next to each other. DAVID J. MALAN: Yeah. They’re not being stored
next to each other. So even though I can think of this as
being the first element, the second, and the third, you do not have, in
this story, random access to elements. And random access, ergo,
random access memory, or RAM, just means that arithmetically,
like, mathematically, you can jump to location 0, location 1,
location 2, randomly, or in constant time. Just instantly. Because if they’re all back to back
to back, all you have to do is like, add 1, or add 4, or whatever to
the address, and you’re there. But the problem is, if you’re
calling malloc again and again and again, there’s no guarantee
that these things are even going to be proximal to one another. These second chunks of
memory might end up– if this is a big chunk of
memory we’ve been talking about, where the heaps up here,
and the stacks down here– 42 might end up over here. The next chunk of memory,
50, might end up over here. The third chunk might end up over here. So you can’t just jump from
location 0, to 1, to 2, because you have to somehow remember
where location 0, and 1, and 2, are. So how do we solve this? Even if you haven’t programed before,
like, what would a solution be here? AUDIENCE: Somehow store [INAUDIBLE]. DAVID J. MALAN: OK. Somehow storing the addresses of– AUDIENCE: Of the [INAUDIBLE] DAVID J. MALAN: All right. So let’s just suppose, for the sake of
discussion, that this chunk of memory ended up at location 100. This one ended up at like 150. This one ended up at like 475. Whatever those values are. It would seem that somehow or other
I need to remember three values– 100, 150, and 475. So where can I store that? Well, it turns out, I can be a
little clever but a little greedy. I could say to malloc, you know what,
every time I call you, don’t just give me space for an integer,
give me space for an integer plus the address of another integer. So if you’ve ever kind of seen like
popcorn strung together on a string, or any kind of chain link fence
where one link is linking to another. We could create the
equivalent of– oops not that. We could create the equivalent
of this kind of picture, where each of these squares, or nodes,
we’ll start calling them, kind of links graphically to the other. Well, we’ve seen these
links, or these pointers, literally arrows that are
pointing implemented in code. An arrow or a pointer
is just an address. So you know what? We should just ask malloc not for
enough space for just the number 42, we should instead, ask for a little
more memory in each of these squares, making them pictorially rectangles now. So that now, yes, we do have
these arrows conceptually pointing from one location to another. But what values do I actually want
to put in these new additional boxes? AUDIENCE: The addresses of the next. DAVID J. MALAN: The
addresses of the next. So they’re like little breadcrumbs. So in this box here, associated
with the first value, should be the address
of my second value, 475. Associated with my second
value here, per the arrow– and let me draw the arrow
from the right place. –from the arrow, should be the
address 150, because that’s the last. And then, from this extra
box, what should I put there? Yeah? AUDIENCE: Slash 0 or something? DAVID J. MALAN: Yeah. So probably, the equivalent of slash 0,
which in the world of pointer’s recall, is null. So just a special value that means
that’s it, this is the end of the line. That still leaves us with room to
add a fourth value and point to it, but it for now, signifies very clearly
to us there’s nothing actually there. So what did we just do? We created a list of values
50, oh sorry 42, 50, 13, but we linked to them together. First, pictorially, with just arrows. Like any human might
with a piece of chalk. But technically in
code, we could do this by just storing addresses
in each of these places. So just to be clear then, what might
this actually translate to in code? Well, what if I proposed this. In code, we might do
something like this. If we want to store an integer. We’re of course, going to need to
store like int n, we’ll call it. n will represent 42, or 50, or 13. But if we want to
create a data structure, we might want to start giving
this data structure a name. I called it, a moment ago, node, which
is a CS term for a node in a linked list, so to speak. And it looks like this. So typedef means, give me my own type. Struct means, make it a
structure, like a student was. And then, node, which is going
to be the name of this thing. And I’ll explain in a moment why I
have the word node twice this time. But I left room on the board
for just one more line. In addition to an int,
called n, or whatever, I need to somehow represent
in code, the additional memory that I want malloc to
give me for the address. So first of all, these are
addresses of what data types? Each of those three new boxes. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: They are the addresses
of integers in that point in the story. But technically, what is
this box really pointing to? Is it pointing specifically to the ints? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: It’s pointing to that
whole chunk of memory, if you will. So if you start thinking about each
of these rectangles as being a node, and each of the arrows as
pointing to another node, we need to somehow express, I need
to somehow store a pointer to a node. In other words, each of these arrows
needs to point to another node. And in code, we could say this. Right? Like, let’s give it a name. Instead of n, which is the
number, let’s call it next. So next, shall be the name of this field
that points to the next node in memory. And node star, what does that
mean in English, if you will? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Say again? AUDIENCE: Pointing to an address. DAVID J. MALAN: Pointing to an address. Right? It looks different. Node is a new word
today and that’s fine. But node star, just means
a pointer to a node. The address of a node. And it turns out that
this is a custom structure so we actually have to say this. But it’s the same principle even though
things are kind of escalating quickly here, we just need to values, an int,
and then, a pointer to another thing. That other thing is
going to be another node. And we’re just using a node,
frankly, to encapsulate two values– an int and a pointer. And the way you express in C,
albeit somewhat cryptically, a pointer, or one of those arrows, is
you say give me a variable called next, have it point to a
structure called node. Or rather, have it be the address
of a structure of type node. Yeah? AUDIENCE: How can you [? reveal ?]
the timing of struct node [INAUDIBLE]?? DAVID J. MALAN: Good question. So this feels like a circular kind of
definition because I’m defining a node, and yet, inside of a node is a node. That is OK because of the star. It is necessary in C– remember that C always is
kind of read top to bottom. So accordingly, this very first line
of code here, typedef struct note, at that point in the story,
when clang has read that line, it knows that a phrase,
struct node, exists. AUDIENCE: That’s why you
say nodes [INAUDIBLE].. DAVID J. MALAN: Exactly. Exactly. We didn’t need to do this with
students because there were no pointers involved to other students. But yes, in this case. So in short, this tells clang, hey,
clang, give me a structure called node. And then, in here, we say,
hey, clang, each of those nodes shall have two things,
an integer called n, and a pointer to another one of
these data structures of type node, and call the whole thing, node. It’s a bit of a mouthful. But all this is, is the following. Let me go ahead and erase all of this. All this data type is– if we get rid of the picture
we draw on the fly there. –is this says, hey, clang,
give me a data structure that pictorially looks like this. It’s divided into two parts. The first part is called n, the
second type is called, next. This data type is of type int. This is a pointer to another such node. And that’s it. Even though the code looks
complex, the idea is exactly that. Yeah? AUDIENCE: [INAUDIBLE]? Why do you have to
say struct node again? DAVID J. MALAN: Good question. The reason is, as just
came up a moment ago, clang and C, in general, are kind of dumb. They just read code top to bottom. And the problem is, you have to
declare the name of this structure as being a struct node
before you actually use it. It’s similar in spirit to our discussion
of prototypes– y functions need to be mentioned way up top. This just says to clang, give
me a type called struct node. You don’t know what it’s
going to look like yet. But I’ll finish my thought later. And then in here, we’re just
telling clang, inside of that node should be an integer, as well as,
a pointer to the very type of thing I’m in the middle of defining. But if I had left off the word node
up there, and just said struct, you couldn’t do that because it
hasn’t seen the word N-O-D-E yet. That’s all. Other questions? All right. So if I now have a data
structure called node, I can use it to kind of stitch
together these linked lists. And maybe just the very
things a little bit, and to start giving away
some ducks, would folks be comfortable with volunteering
to solve a problem here? Yeah? OK. Come on up. 1, 2– AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Sure. Or you can take a duck and run. OK. 1, 2, how about 3? Come on over here, 3. So if you want to be our first
pointer, you can be number 5. Come on over here. You want to be number 9. And one more. One more volunteer. Come on over here. Yeah. All right. So– I’ll meet you over here. OK, 17. All right. So if you’d like to– just so we pick this up for
those following along at home. If you would like to just
say hello to the audience. ANDREA: Hi, I’m Andrea. [? COMEY: ?] Hi, [? I’m Comey. ?] [? KYONG: ?] Hi, [? I’m Kyong. ?] SPEAKER 2: Hi, I’m [INAUDIBLE]. DAVID J. MALAN: Wonderful. OK. If you wouldn’t mind all just taking
a big step back over the ducks, just so that we’re a
little farther back. Let’s go ahead and do this. If you’re our first pointer, if you
could come over here for instance, and just stand outside the ducks. And if you guys could come a little
over here in front is still fine. So here we have the
makings of a linked list. And what’s your name again? [? COMEY: ?] [? Comey. ?] DAVID J. MALAN: [? Comey ?] is
our first pointer if you will. Via [? Comey’s ?]
variable are we just going to keep track of the first
element of the linked list. So if you could, with your
left hand, represent first. Just point over at–
what was your name again? ANDREA: Andrea. DAVID J. MALAN: So
Andrea is the number 9. If you could use your left
hand to point at number 5. And if you could use your left
hand, yep, to point at number 17. And your left hand to just point at
null, which we’ll just call the ground. So you don’t want to
just point it randomly because that would be like following
a bogus pointer, so here means null. All right. So this is a linked list. All you need to store are
linked list of three values is three nodes, inside of
which are three integers, and their left hands represents
that next pointer, so to speak. [? Comey’s ?] a little different,
in that she’s not holding a value. She’s not holding an integer. Rather, holding just the
name of the variable, first. So you’re the only one that’s
different here fundamentally. So suppose I want to
insert the number 20? Could someone volunteer to be number 20? OK. Come on up. All right. And what’s your name? ERIC: Eric. DAVID J. MALAN: Eric. Eric, you’re the number 20. And Eric, actually, let’s see. Actually can we do this? Let me give– let me make
this a little more different. OK. That never happened. OK. Eric, give me that please. I want to insert Eric as number 5. So Eric, I’m keeping this list sorted. So where, obviously, you’re going to go? ERIC: Go right there. DAVID J. MALAN: All right. But before you do that, let’s just
consider what this looks like in code. In code, presumably, we have
malloced Eric from the audience. I’ve given him a value, n of number 5. And his left hand is like, it’s garbage
value right now, because it’s not pointing to anything specific. So he’s got two values– an integer,
and a left hand representing the next pointer. If the goal is to put
Eric in sorted order. What should our steps be? Like, whose hand should point
where, and in what order? Yeah. Give us one step. AUDIENCE: You should point to number 9. DAVID J. MALAN: OK so you
should point at number 9, which is equivalent to saying,
point at whatever first. Where [? Comey ?] is pointing at. So go ahead and do that. All right next? What’s the next step? Someone else? Someone else. Almost there. Yeah? AUDIENCE: First should point to 5. DAVID J. MALAN: OK. So first, or [? Comey, ?]
could you point to 5. And that’s fine. You don’t even have to move. Right? This is the beauty of a linked list. It doesn’t matter where
you are in memory, it’s the whole beauty of these
pointers, where you can literally point at that other location. It’s not an array where they need
to be standing back to back to back. They can be pointing anywhere. All right. Let’s go ahead and insert one more. Who wants to be say, 55? Big value. Yeah. Come on down. All right. What’s your name? [? KYONG: ?] [? Kyong. ?] DAVID J. MALAN: [? Kyong. ?] OK. So come on over. So we’ve just malloced
[? Kyong ?] from the audience. I’ve given him his end value of 55. His left hand is just some
garbage value right now. How do we insert [? Kyong ?]
in the right order? Where is the obviously supposed to go? In sorted order, he
obviously belongs at the end. But here’s the catch
with the linked list. Just like when we’ve discussed
searching and sorting in the past, the computer is pretty blind
to all but just one value. And the linked list, at the moment– like, I don’t know that these
three, these four, exist. All I know really, is
that [? Comey ?] exists. Because via this first
pointer, is the only access to the rest of the elements. And so what’s cool about a linked
list, but perhaps not obvious, is that you only– the most important value is the first. Because from the first value,
you can get to everyone else. It’s not useful– excuse me
for me to remember, Andrea? –Andrea alone, because
if I do, I’ve just lost track of [? Comey ?] and more
importantly, because of his number, Eric. So all I have to do really,
is remember [? Comey. ?] So if the goal now is to insert number
55, what steps should come first? No pun intended. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Say again. AUDIENCE: Finding the first space. DAVID J. MALAN: OK. Finding the first space. So I’m going to start at [? Comey, ?]
and I’m going to follow this pointer. Number 5, does 55 belong here? No. So I’m going to follow this
pointer and get to Andrea. Does 55 belong here? No. Gonna follow her pointer,
and 22, does it belong here? No. I follow this pointer, 26? No. But you have a free hand, it turns out. So what step should come next? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: We could have
you point at 55, and now done. So relatively simple, but what
was the running time of this? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: It’s big o of n. It’s linear. Because I had to start at
the beginning, even though we humans have the luxury
of just eyeballing it. Saying, oh, obviously, he
belongs way at the end. Mm-mm. Not in code. Like, we have to start at the beginning
to reverse the whole darn list, until we get linearly to the very end. And now we’re done. Let’s try one last one. How about 20? Yeah. Great. Come on down. What’s your name? JAMES: James. DAVID J. MALAN: James. All right, James. All right. So we just malloced James,
given him the number 20. He obviously belongs
roughly in the middle. What’s the first step? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Sorry? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: All right. So we start with [? Comey, ?] again. All right. First, OK. 5, do you belong here? No. Let me follow the link. OK 9, do you belong here? No. Do you belong at 22– ooh. But what did I just do wrong? I went too far. At least in this story. Like, I literally–
Andrea is behind me now. OK. So can I follow the pointer backwards? You can’t. Like in every picture we’ve
drawn, and every example we’ve done with an address, we only
have the address of the next pointer. We don’t have what’s called, a doubly
linked list, at least in this story, where I can just turn around. So that was a bug. So I need to start over instead. First, OK 5, OK 19– what I really need in
code, ultimately, is to kind of peek ahead and not
actually move– not that far. Just to 22. Peek ahead at 22 and realize,
oh, that’s going to be too far. This is not yet far enough. So let’s go ahead and bring James over. Well, actually, you can
stay there physically. But what step has to happen first? I know now he belongs in here. You want to point at him? OK. Point at him. ANDREA: Oh. I’m sorry, he points first. DAVID J. MALAN: Well let’s do
that, just because it is incorrect. That’s fine. OK. Andrea proposed that we point here, but
she just broke the whole linked list. Why? ANDREA: Because there’s
nothing to point at. DAVID J. MALAN: Right. No one is remembering–
what’s was your name again? [? KYONG: ?] [? Kyong. ?] DAVID J. MALAN: No one’s
remembered where [? Kyong ?] was. So you can’t do that. Your left hand has to stay there. So what steps should
happen first instead? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: James
should point at whatever Andrea is pointing at, perhaps? So a little redundantly at
the moment, just like before. OK. Now what happens next? That’s step one. ANDREA: Now I can point. DAVID J. MALAN: Now
you can point at him. OK. You could do that. All right. And so now, this looks
like a complete mess, but if we know that
[? Comey ?] is first, we can follow the breadcrumbs to Eric,
and then to Andrea, and then to James, and then the rest of our
list step by step by step. So it’s a huge amount of like logic now. But what problem have we solved? And I think we identified
it over here earlier. What was the problem first
and foremost with the arrays? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: You have to
decide on their size in advance. And once you do that, if you want
to add an additional element, you have to resize the whole darn thing. Which is expensive because you
have to move everyone around. Now frankly, I’m being
a little greedy here. And every time we’ve
inserted these new elements, I’ve been keeping them in sorted order. So it would seem that if you insert
things in sorted order, big o event, every time. Because in the worst
case, the new element might end up all the way at the end. But what if we relax that constraint? What if I’m not so uptight and need
everything nice and orderly and sorted? What if I just want to keep growing
the list in any random order? And I allocate the number 34. And I’ll play the number 34. Malloc 34. Where is the quickest
place for me to go? Yeah? AUDIENCE: Point to 5, and
then have [INAUDIBLE].. DAVID J. MALAN: OK. I’ll point to 5, and then,
[? Comey, ?] if you could point to me. Done. One– well, two steps. All right. Suppose now, I malloc 17 with
someone else, who’ll we’ll pretend is right here. Where’s the best place for 17 to go? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Right
after [? Comey ?] too. So now, [? Comey ?] can point at 17, 17
can point at me, I can point at Eric, and so forth. And that’s two steps again. Two steps– if it’s the same
number of steps every time, we call that, constant time. And we write it as big o of 1. And so here too, it’s just a trade off. If you want really fast insertions,
don’t worry about sorting. Just put them at the beginning
and deal with it later. If you want a dynamic resizeability,
don’t use an array, use a linked list, and just keep allocating more and
more as you go without wasting a huge amount of space too. Which notice, that’s another
big problem with an array. If you over allocate space, and only use
part of it, you’re just wasting space. So there’s no one solution here. But we do now have the
capabilities, thanks to the structs and pointers to stitch together,
if you will, these new problems. Yes, please. SPEAKER 2: Why can’t
the node [INAUDIBLE]?? DAVID J. MALAN: And
who am I in this story? SPEAKER 2: [INAUDIBLE]. DAVID J. MALAN: Oh, OK. Absolutely. So another very reasonable
idea would be, well, why don’t we just put
the new ones at the end? That’s fine if I keep
track of who is at the end. The problem, is at the
moment in the story, and we’ll ultimately see this in code,
I’m only remembering [? Comey. ?] And from [? Comey ?] am I
getting everywhere else. I could have another
pointer, a second pointer, and literally call it, last,
that’s equivalent to you. Or that’s always pointing at you. I just need then two pointers,
one literally called first, one literally called last. That’s fine. That’s a nice optimization if I want
to throw all the elements at the end. And frankly, I could get really fancy– and to solve the problem
that Andrea cited earlier– if I store not just an int
and a pointer, but instead, an int and two pointers,
I can even have each of these guys pointing with
their left and right hands in a doubly linked list, so as to solve
the problem Andrea identified, which was if I go too far no big deal. Take one step back. I don’t have to think as
hard about that logic. So there too, a trade off. Let’s go ahead and take
a five minute break. I’ll turn on some music. Grab a duck now, if you’d like. And we’ll return with some
fancier data structures still. Thanks. All right. We’re back. So let’s now translate some
of these ideas to code. So that we can actually solve
this problem a little more concretely than just having
humans pointing at each other. So for instance, let’s
try to distill everything we’ve been talking about
into just a goal in code of storing a list of numbers. I would propose that we can take
like three passes at this problem. The first would be, let’s just
decide in advance how many numbers we want to store so we don’t
have to deal with all this complexity with the pointing
and the pointers and all this, and just hard code that
value somehow, and just stop when the user is inputted
that many numbers and no more. Two, we can improve upon that and at
least let the user dynamically resize their array. So that if they decide to input
more numbers than we intend, it’s going to grow, and deal with that. Of course, arrays are
not necessarily ideal because they have to do all that
damn copying from old to new. That’s linear time. It would seem smartest to
get subversion 3, which is actually going to use a linked list. So we’re just more modestly
allocating space for another number, and another number, and another
number, or really a node. One number at a time. So let me go ahead and start as follows. I’m going to go ahead and include
some familiar lines in list 0.c, of the CS50 library, just to make it
easy to get some user input for this. And standard iO dot h, for printdef. And let me go ahead and declare
my main function as usual. And then, in here let’s
do a couple of things. First, let’s ask the user
for the capacity of the array that we’re going to use. Or rather, let’s do this first. Let me first rewind
and say, you know what? Int, numbers, 50. Well, that’s going to be
annoying to type in 50 numbers. We’re going to give the user two numbers
at first, that here, she can type in. Next, let’s go ahead and prompt
the user for those numbers. So let me go ahead and say– let’s do this. Let’s at least clean this up a little
bit so that we can reuse this value. So we don’t have a magic number. This just came up in
discussion actually. So while– do I want to do that? Nope. Let me fix this. This will be my capacity of size 2. And that’s going to give me that size. And then, I’m going to keep
track of how many integers I’ve prompted the user for so far. So initially, the size of this
structure is going to be 0. But it’s capacity, so to speak, is 2. So size means how many things are in it. Capacity means how many
things can be in it. And while the size of the structure
is less than its capacity, let’s go ahead and get
some inputs from the user. Let’s go ahead and ask them for a
number, using our old friend, get int. And just say, give me a number. And then, let me go ahead
and insert the number that they type in into this array
at location size, like this. And then, do size plus, plus. I think. You know, I wrote it pretty quickly. But let’s consider what I just did. I initialized size to 0, because
there’s nothing in it initially. Then I say, while size is less than
the capacity of the whole thing– and capacity is 2 by default– go ahead and do the following. Give me an int from the user. OK. So int number gets int. Then, put at location,
size, in my numbers, array, whatever the human typed in, number. And then, increment
size with plus, plus. All right. So on the first iteration size is 0. So numbers, bracket, 0,
gets the first number. Numbers, bracket, 1,
gets the second number. Then, size equals capacity. So it stops, logically. Any questions on the logic of this code? All right. So once we have those numbers,
let’s just do something simple. Like for int, I gets 0. I is less than the actual
size I, plus, plus. Let’s just go ahead and
print out the number you inputted, percent I, backslash n,
and type out numbers, bracket, I. All right. So if I made no typos in list 0
dot C, then, I’m going to go ahead and do dot, slash, o, dot, C. I’m going
to be prompted for a couple of numbers. Let’s go ahead and do 1, 2. You inputted 1, you inputted 2. All right. So not bad. But this is bad design, arguably, why? Just find one fault. It’s correct. But bad design. AUDIENCE: Repetitive. DAVID J. MALAN: Repetitive, because
I’m using a couple of loops, sure. And it’s fundamentally– it’s
very limited in functionality. Why? Like how useful is this program? AUDIENCE: It’s hard coded at 2. DAVID J. MALAN: Yeah. It’s hard coded at 2. So let’s at least improve
upon this a little bit, and get rid of this hard coding. Why don’t I at least ask the
user for something like this? Well, instead of just declaring the
capacity, let me go ahead and say, you know what? Let’s just replace the 2. Get int, and just say
capacity, for instance. All right. And now if I do this, I’m
going to be prompted– so make list 0. Dot slash list 0. The capacity will be 2. 1, 2, that’s nice. But if I run it again, and
give it a capacity of 3– 1, 2, 3, I get more capacity. So that’s nice. It’s an improvement for sure. There is a bug here. Before I test it further, can anyone
identify a bug or somehow crash this? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Oh, go ahead. AUDIENCE: If you don’t input an integer. DAVID J. MALAN: If I
don’t put an integer. Or– is that same comment up here? AUDIENCE: I was going to say,
what happens if you go back and put in [INAUDIBLE] those other
[INAUDIBLE] will be in the memory. DAVID J. MALAN: Oh. No. Because I’m rerunning it in each time. I don’t need to worry about
previous runs of the program. Yeah? AUDIENCE: In the for
loop, it just goes 1, 2, 3, it doesn’t actually
care what you put it. DAVID J. MALAN: [INAUDIBLE] 1, 2,
3– well, I am iterating up to size, which could be capacity. Because now they do end
up being equivalent. Because I’m filling the whole thing. But let’s try this. If you don’t type in a value. So let me go ahead and rerun this. My capacity shall be duck. All right. So we did handle that. Because getInt does that for me. But I bet I can still break this. Ooh, yeah, let’s always
try something negative. Oh, OK. So bad. Like cryptic looking
message, but clearly, has to do with a negative value. So I should probably be a
little smarter about this. And recall from like,
Week 1, we did do this. With Mario, you might have done this. So I could do something like, do,
while capacity is less than 1. I could go ahead and say,
capacity getInt capacity. So just a little bit of error checking
to close the bug that you identified. All right. So let’s go ahead and recompile this. Make lists 0– oops we’re going
to start hearing that a lot today. Aren’t we [INAUDIBLE]? Make list 0, dot, slash, list 0. Capacity will be 3. 1, 2, 3. Now capacity will be negative 1. Doesn’t allow it. Capacity 0, doesn’t allow it. Capacity 1, yes. So non-exhaustively, I’ve tested it. It feels like it’s in better shape. OK. But this program, while correct,
and while more featureful, still has this fundamental limit. Wouldn’t it be nice to allow the
user to just keep typing numbers, as many as they want, and then quit
once they’re done inputting numbers. Right? If you’re making a program
to compute someone’s GPA, different students might
have taken different courses, you don’t want to have them
to type in all 32 courses. If they’re younger and haven’t
taken all those courses. Like there’s a lot of scenarios
where you don’t know in advance how many numbers the user wants to provide. But you want to support a few
numbers, lots of numbers, or beyond. So let’s do this in a second version. In list 1 dot C, let me go ahead and
improve upon that example as follows. First, let me give my familiar
friends up here CS50 dot for iO, standard iO dot h, and then,
in here, int main void. And then, let’s start writing this. So now, I don’t know in advance,
necessarily, how many numbers the user is going to type in. Like the goal is, I want
them to be able to type in a number, another number,
another number, and then hit the equivalent of like, q, for quit,
when they’re done inputting numbers. Like I don’t want them to have to think
about in advance, how many numbers it is they’re inputting. But how do I do that? Like I can’t just come up with an
array called numbers, and say, 50. Because if the user wants
to type in 51 numbers, I’m going to have to resize that. But how do you resize an array? How do you resize an array? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: What’s that? AUDIENCE: You can’t. DAVID J. MALAN: You can’t. Right. We’ve never seen an instance
where you’ve re-sized an array. We talked about it on
the blackboard here. Well, just like, allocate a
bigger one and copy everything in. And we did identify realloc. But you can’t actually
use realloc on an array. Realloc actually accepts an
address of a chunk of memory that you want to grow, or shrink. So it turns out, if we
now start to harness the sort of fundamental definition of
what an array is, a chunk of memory, we can actually build arrays ourselves. If an array is just a chunk of
memory, or more specifically, it’s like the address of the
first byte of a chunk of memory, it would seem that I could declare
my array, not with square brackets as we’ve been doing for
weeks, but I can say, you know what numbers really
is, it’s really just a pointer. And I’m initially going
to initialize it to null. Because there is no array. But now I have the ability
to point that pointer at any chunk of memory, small or big. Now why is this useful? Well, initially let me
claim that my capacity is 0, because nothing’s going on yet. I haven’t called malloc or anything. And initially, my size is 0 because
there’s nothing in the array. And it doesn’t even have a size. But let me just do this forever. Much like in scratch, we had the forever
block you can use, while true, and C, to just say keep doing this until
the user breaks out of this. And let me go ahead and ask the
user, give me a number, getInt. And just ask them for a number. And then, we just need
a place to put that. So where do I put this number? Well, do I have, at the moment,
any place to put the number? No. And technically speaking,
how do you express that? Like in pseudo code, I want to
say, if no place for number. But technically, I could do this. Well, if the size of the array at
the moment, equals its capacity, that feels like a lower level
way of expressing the same thing. If whatever the capacity is, if the
size is the same, there is no more room. And that simple statement also covers
the scenario where the capacity is 0, the size is therefore, 0. So its the same question. Either we have no space at
all, or we have some space but we’ve used it all– size
equals, equals, capacity. So if the size equals
capacity, or put more casually, if I don’t have enough space. What do I want to do intuitively? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Allocate more memory. And it turns out, you proposed,
or someone proposed earlier, reallocating memory. We can use this function
for the very first time. Let me go ahead and say this– the catch with realloc is you
have to be smart about it, because it returns a pointer. So let me propose this code first. First, just give me a temporary
variable, call it, temp, that’s going to store the following. Actually, no. Let me start this more simply. Let me go ahead and say, numbers
should be reallocated please, realloc by passing its self in. And this time, give me the
size of an int, times– how many ints do I want this time? How many numbers did the
human just input presumably? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Just one. Right. Because literally, we’ve only
called getInt once in this story. So whatever the size of this array
is now, we need to increase it by 1. That’s all. So this line of code here
is saying, hey computer, go ahead and reallocate this array
from whatever its current size is, and make it this size instead. The size of whatever it is, plus
1, times the size of an int. Because that’s what we’re
trying to store, is an int. So we have to do that multiplication. And realloc, as mentioned
earlier, is pretty fancy. It’s going to take an pointer,
whatever chunk of memory you’ve already allocated, and
it’s going to then reallocate a bigger chunk of memory. Hopefully, what’s going
to happen is this– if your chunk of memory
initially looks like this, it’s going to hopefully notice,
oh, this memory is free. Let me just give you
back the same address. So if this is address
100, and you get lucky and this address is also
available, the realloc function’s going to remember that
for the operating system. It’s going to return
the number 100 again. And you’re good to go. You can safely touch memory here. Or if this is in use already, this
chunk of memory, and therefore we can’t fit another byte there
because some other code you wrote is using that memory. But there is twice as much
memory available down here. What realloc will do, is if
you’ve stored the number 50, it will handle the process of
copying 50 to the new value. This is going to be left as a
garbage value for you to deal with. And it’s going to return to you the
address of the new chunk of memory, having done the copying for you. So even though it’s technically
re-allocating the array, it’s not necessarily
just going to grow it. It might relocate it in
memory to a bigger chunk, and then give you the new
address of that memory. Question? AUDIENCE: Is that
process really preferable to just creating extra
memory in it’s place. And then saving the time and energy of
reallocating them [? all at once. ?] DAVID J. MALAN: That’s
a really good question. Honestly, we could avoid this problem
slightly by just doing, you know what, give me at least– go ahead and give me at least
the size of an int, times– I don’t know, most humans are not
going to type in more than 50 numbers. Let’s just pick 50. So you could do this, and that
would indeed save you time. Because the approach
I’m currently taking is pretty inefficient because
every damn time the user calls getInt, and gives an int,
we’re resizing, resizing, resizing. Very expensive. As to what the best value is though– 50? Should it be 25? Should it be 1,000? I’m either going to
under bet or over bet. And it just depends on you to decide
which of those is the worst decisions. AUDIENCE: But, like,
in terms of programs, is it also pretty expensive to
have memory that you’re not using or generally, is it usually more OK? DAVID J. MALAN: Good question. In programs you’re writing, is it better
to have more memory than you’re using, or should you really be conservative? These days, memory is cheap. We all have gigabytes of memory. And so wasting 50 bytes or 200 bytes,
times 4, of memory, not a big deal. Like, just get the job
done quickly and easily. But in resource constrained
devices, maybe, things like phones or little internet of
things style devices that have a lot fewer resources, you
don’t really want to go wasting bytes. But honestly, the CPUs, the
brains in our computers, are so darned fast these days,
even if you’re calling malloc 10 times, 1,000 times, it’s
happening so darned fast that the human doesn’t even notice. So there too. These are what are
called design decisions. And these are the kinds of
things that, in the real world, you might actually debate
with someone at a whiteboard, saying, no, this is stupid
because of this reason. Or he or she might push
back for other reasons. And no one’s necessarily right. The whole goal is to just
that thought process first so you’re at least
confident in what you chose. Yeah? AUDIENCE: When we were writing
to a file in the last PSET, was it storing it in memory first or
putting it right on the hard drive? DAVID J. MALAN: When
you were calling fread, you were by definition in
the forensics problem set reading bytes from disk into memory. When you were calling fwrite, you were
copying bytes from memory back to disk. If that answers the question. OK. Other questions? Yeah? AUDIENCE: Why did you
say, size + 1, in line 16? DAVID J. MALAN: Why do I
say, size + 1, in line 16? Because the whole goal is to make room
in this array for the newly inputted number that the human just typed in. And so whatever the current
size of the array is, I clearly need one more space. AUDIENCE: So that repeats on and on? DAVID J. MALAN: It
does repeat on and on. Because at the moment, I’m
inside of this while loop. So we do need to ask a question,
when is the human done inputting. And it turns out– and
this is not obvious. And it’s not the best user experience
on a keyboard for the human. But we can actually detect
the following sentiments– if user is done inputting numbers,
then let’s go ahead and break. But the question then is, how
do you express that pseudo code? Well, you could in some
programs maybe type q for quit. But is that going to
work when using getInt? Could we detect q? Why not? AUDIENCE: Because getInt immediately
prompts you for another integer. DAVID J. MALAN: Exactly. Because getInt immediately
prompts you for another int. So because of the way we designed
the CS50 library, you can’t detect q, or you can’t have the human type
quit unless you don’t use getInt. You instead use? AUDIENCE: getString. DAVID J. MALAN: We could use getString. And then every time the human
types in a number, we could use, like, A2i to convert it to an int. But if the human types in q or Q-U-I-T– a string also– we could just have an if
condition with string compare and quit. But honestly, then you’re
reimplementing getInt– so trade-off. Anyhow, a common way to
work around this would be, you know that Control-C
quits programs, perhaps, cancels out of your program. There’s another popular
keystroke, Control-D, that sends what’s called end of file. It simulates the end of a file. It simulates the end
of the human’s input. So it’s kind of like the period
at the end of an English sentence. So if you want to signal to a computer
that’s waiting for input from you that you don’t want to quit the
program– that would be Control-C– but you just want to be done
inputting input to the computer, you hit Control-D,
otherwise known as EOF. And the way to express this–
and you would only know this from documentation– would be
to say something like this, if the number the human
typed in equals end of file– but there is no such
thing in this context– you actually do this because
of the CS50 library works. It turns out that if the only
values a function can return are integers, that means you can
return 0, 1, negative 1, 2 billion, negative 2 billion give or take. What humans did for years
with old programming languages is they would just steal
one or a few numbers. For instance, you’d steal the number
two billion and call it intmax– the maximum integer. And you’d just say, you can
never actually type 2 billion, because we’re using that as
a special value to signify that the human hit Control-D. Or
you could do negative 2 billion, or you could do 0, or 50. But at some point, you have to steal
one of the 4 billion available numbers to use as a sentinel
value, a special value that you can then check
for as a constant. So anyhow, this just means, when
the user is done typing input, go ahead and break out
of this while loop. And as an aside, let me fix one thing. It turns out things can
go wrong with realloc. And if realloc fails
to allocate memory, it can return null, a special value that
just means, eh, something went wrong. It’s an invalid pointer. It’s the address 0. And so it turns out there’s a subtle
bug here where, technically, I should actually do this– store realloc’s return value
in a temporary variable. Because if temp=null,
something went wrong. And I should actually go ahead
and quit out of this program. But let me wave my hand at that for
now because it’s more of a corner case. But you’ll see in the online version of
this program we have additional error checking that just checks, in
the rare case that realloc fails, clean it up and return properly. But I’ll wave to the
online code for that. All right. Any questions on that
example before we move on? Yeah? AUDIENCE: So in realloc, when it creates
the new pointer for the [INAUDIBLE],, does it clear the memory
from the original pointer? Does it automatically clear it? DAVID J. MALAN: Good question. When you call realloc and it
ends up allocating more space, does it clear the original memory? No. And that is where garbage
values come from, for instance. Because they’re just left in
memory from the previous use. Other questions? Yeah? AUDIENCE: What does the
user actually type to break? DAVID J. MALAN: Oh, Control-D.
Control-D. And it’s not break. It is to send end of file, end of input. Control-C kills or breaks
out of the program itself. AUDIENCE: And that’s the
same as the intmax kind of? DAVID J. MALAN: Same as intmax? Yes. AUDIENCE: Because you’re not
adding, like, a giant value. DAVID J. MALAN: Correct. In the CS50 library,
intmax, yes, is the symbol. Yes. Yeah? AUDIENCE: Could you also
just ask the user to say, do you want to enter
another number yes or no? DAVID J. MALAN: Absolutely. We could add more logic. And you could use getString. And we could prompt him or her, hey,
do you want to input another number. The only downside of
that would be, now, I have to type in not only my
number, but yes or no constantly. So it’s just a trade-off
user interface-wise. All right. So let me go ahead. And let me go ahead and return 0
here just as my simple solution to this problem of
something going wrong. I’ve just compiled this program. Let me go ahead and run it. I’m going to type in one number,
two numbers, three numbers. And now I’m bored. I don’t want to keep doing this. How do I tell the computer I’m done? AUDIENCE: Control-D. DAVID J. MALAN: Control-D. Oops. Oh, OK. That’s correct behavior
because I forgot a key step. What’s that? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Yeah. I’m not actually doing
anything with the values. I should probably for int
I get 0, I less than size, I + + code we had before. And I should probably print
out You inputted %I, this. Save that. Make list one. So all I did was re-add
the printing code. Now if I rerun this– one,
two three, Control-D– dammit. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Oh. OK. Now I broke my code here. Let me do this. We’re going to get rid
of this error checking because I’m not actually ever resizing. numbers gets realloc. Oh, and maybe someone
chiming in with this– numbers bracket size
gets the user’s input. Size + +– was this a key
detail someone wanted me to do? OK. So I didn’t actually
finish the program earlier. Notice we left off as follows– hey, computer, give me an
array of size 0 initially that’s null– there’s no memory for it. Therefore, the size of this array is 0. Do the following forever. Get a number from the human. If the number equals this
special value, intmax just breakout because the program is done. And actually, sorry. This is why I write
these in advance too. OK. Go ahead and prompt
the user for a number. If they have inputted the Control-D,
just break out of this loop. However, if the size of the array
equals its current capacity, go ahead and reallocate space for this
thing being one number bigger than it previously was. Now, assuming that succeeded
and we have memory, go ahead, and just like our list 0
example, store in the numbers array at the current location, which is 0,
whatever number the human typed in. And then increment the size by
one to remember what we have done. I’m also though going to
need to do capacity + + here to remember that we’ve increased
the capacity of the array. So again, two new measures. capacity is how much
space there is in total. size is how much we’re using. They happen to be
identical at the moment because we’re growing this
thing step by step by step. All right. Let me go ahead and hit Save. Let me go ahead and
compile this one last time. ./list1 and input 1, 2, 3. Control-D. OK. Now it’s just an aesthetic bug. I forgot my /n. So just to prove that I can actually
program, ./list1; 1, 2, 3; Control-D. Phew. All right. So you inputted 1. And the reason it didn’t
move to another line is because Control-D gets sent
immediately without hitting Enter. All right. Phew. That’s all using arrays. Now let’s do the sort of cake baked
already and pull it out of the oven. The third and final
example here is list two. And actually, before we get
there, let me note one thing. Yeah, let’s do one last thing here. Let me go ahead and run, per earlier,
our new friend valgrind on list1. Enter. It’s waiting for me to type in 1, 2, 3. Let me go ahead and hit
Control-D. Interesting. I seem to have a buggy program even
though I claimed a moment ago that I knew what I was doing. 12 bytes in one blocks are definitely
lost in lost record one of one. Again, I don’t understand
most of those words. But 12 bytes definitely lost– probably my fault. Why is it 12? And what are those 12 bytes? Yeah? AUDIENCE: I think you
made three integers. DAVID J. MALAN: Yeah, 1, 2, and 3. AUDIENCE: And each one is 4 bytes. And you never freed them
after you used malloc. DAVID J. MALAN: Exactly. I typed in three numbers– 1, 2, and 3. Each of those is 4
bytes on this computer. That’s 12– 3 times 4. And so I’d never freed them seems
to be the source of the issue. So at the end, let’s
just prove that valgrind can detect correctness as well. Free my numbers, semi-colon. Let me go ahead and rerun make list1. And now let me increase the size
of this and do valgrind again on list1, typing in the same values– 1, 2, and 3. Control-D. All he blocks were freed. No leaks are possible. So again, valgrind is your friend. It finds problems that you
didn’t even necessarily notice. And you didn’t have to read
through your lines of code again and again to identify the source
of the issue unnecessarily. All right. Any questions then on these arrays
that are dynamically allocated and the bugs we find
therein with valgrind? All right. So the last demonstration
of code is going to be this. I have stolen, for this final
example, some of the building blocks that we had on the screen earlier. In my code for list2.c, I
need a structure called node. And that node, as we claimed
earlier with our human volunteers, is going to contain a
number called number, we’ll call it this time, instead of n. And it’s going to contain a ptr
called next to another such node. So that’s copied and pasted earlier,
albeit with the integer renamed to number for clarity. Now, notice in main
what I’m doing first. Go ahead and allocate an
array of no space initially. So this was like when Comey was
holding up first and representing the beginning of our data structure. This is the analog using an array,
that the piece of paper that would be held up here would be numbers. And it’s just pointing at
nothing, null– like left hand down on the floor. Because there is no
memory yet allocated. But then, and while true,
go ahead and get an integer from the user with this code here. Check if the user hit Control-D,
as with this arcane technique. And then our code is
similar in spirit, but we have to stitch these things together. Allocate space for the number. So when I malloc an additional
volunteer from the audience and he or she came down, the
equivalent in code is this– hey, computer, allocate with malloc
enough space to fit the size of a node, then store the results
in a ptr called n. So node *n just means, give me
a pointer to a node, call it n, and store the address that was just
allocated from the audience as before. Why do I have these lines of code
here that I’ve highlighted in blue? What’s that expressing? If bang n, or if not n would
be how you pronounce it– what’s going on there? Yeah? AUDIENCE: If there is no more memory
that you can point to, then it fails. DAVID J. MALAN: Exactly. This isn’t going to
happen all that often. But if the computer is out of
memory, and therefore malloc fails, you don’t want the program
just to crash or freeze. Like, all of us hate when that
happens on Mac OS or Windows. So check for it. If not n, or equivalently,
if n==null, just return 1. Quit gracefully, even though annoyingly. But don’t just crash or
do something unexpected. So you can simplify that check to just
if not n– if n is not a valid ptr, return 1. Now, here’s the code with which we
were implementing the demonstration with our humans. And this is the scariest
looking or most cryptic at least looking code we’re going to see in C. Today is our final day in
C. We’ve been running up a really steep hill of
late, learning about memory, and now data structures and syntax. This is the last of our syntax in C. So what are the symbols to be aware of? This line of code here is how I handed
one of our volunteers a piece of paper. On the right-hand side is the
number that was typed in– 55, or 5, or 20, or
whatever the value is. On the left-hand side is
where you want to put it. n and then literally an
arrow number does this. It has, with malloc a line or
so prior, given me in memory just one of these big rectangles. And again, the top of this in
this example is called the number and the bottom is called next. So that’s our human having stood
up from the back of the room. When I hand that human a number,
like 55, it visually goes there. The line of code with which
you achieve that is this here. Because notice on line 31
here, when I malloc that node, I stored its address
in a variable called n. And that’s a pointer, as drawn
with an arrow, to that big node. Or if we really want to be
nit-picky, if this is in address 100, yes, then the pointer actually
has the value 100 in it. But again, that’s rarely
useful information. So we can abstract away
with just an arrow. So line 31 is what creates
those boxes on the screen. Line 38 is what puts the number– for instance, 55– into the box exactly,
much like I handed a piece of paper over. So what is this? This is the only real
new notation today, even though we’re using
lots of stars elsewhere– arrow This is wonderfully the first time
in C it actually maps to our pictures. If n is the variable and
you do n arrow something, that means follow the arrow– kind of like Chutes and Ladders
if you grew up playing that– and then put the number where the arrow
has led you in the field called number. So as an aside, we can think about this
a different way. n is what data type? What is this thing in blue– n? AUDIENCE: Pointer. DAVID J. MALAN: It’s a pointer. And it’s a pointer to one of these
things that we created earlier. So we’re not doing students
anymore with our structures. We’re implementing nodes, which
have numbers and next pointers. So it turns out that if n
is a pointer to a node– recall that dot notation from before– this is not how you access
number in this case. Because n is not a node itself. It’s a pointer. But if n is a pointer, how
do you go to a pointer? How do you go to an address? With what notation? AUDIENCE: Star. DAVID J. MALAN: Star. So recall from last week, if
we want to go to an address, you could do syntax like this. Ignore the parentheses for a moment. Just *n means if n is an address of
a chunk of memory, *n means go there. Once you’re there, you’re conceptually
right here– top left-hand corner. How do you access individual
fields like number or next? You use dot notation. So if you literally do *n.number, that
means go to the address and access the number field. There is nice syntactic
sugar in C, which is just a fancy way of saying shorthand
notation, where it’s just the arrow. But that’s all it is. This arrow notation
doesn’t do anything new. It just combines, go there, with, access
a field in a struct, all in one breath if you will. And this just looks a little prettier. When I told our volunteers
earlier, point your hand down at the floor, that’s all
that line of code is doing. It’s saying, go to n’s address,
which is here, access the next field, and write in that field
null, which is just the address 0– the default, special
address, like pointing at the floor. This line of code, 40, is
just a quick error check. if (numbers)– what
is that equivalent to? That’s actually just saying,
if numbers, not equals null. So if numbers is legitimate, if malloc
worked correctly, then let’s go ahead and do the following. Phew. This is a mouthful. What is going on here? So this is a for-loop
that’s not using numbers. Well, or is it? Almost every for-loop we’ve written
and you’ve probably written just uses I, J, maybe K, but
just integers probably. But that doesn’t have to be the case. What is a pointer? It’s an address. What is an address? AUDIENCE: A place in memory. DAVID J. MALAN: A place in
memory, or a number really. So you can certainly use for-loops
just involving addresses. But how? So we’ll consider this line of code. This here looks different
today, but it’s everything before that first semi-colon. That’s just where you
initialize a value. So this is like saying, hey,
computer, go ahead and give me a variable called ptr and initialize
it to be the start of my list. Then I’m saying, hey, computer, do this
so long as ptr does not equal null. And then what am I doing? if– and let’s ignore this
for now, it’s an error check– go ahead and– sorry, let
me think for one second. OK. Let’s do this. What are these lines of code doing? This is the code that
was actually suggested at the very end of our human example. Like, what if we wanted to
insert all of the elements at the end of the link list? How do you express that? So in this highlighted lines of
code, we’re asking the question, if the current pointer’s next
field is null, we’ve found the end. Go ahead and update that next
field to equal n and then break. So let me translate this
to an actual picture, but using smaller boxes that makes
clear where something is going. So suppose that this program’s
been running for a little while and we have a length list
that looks like this, where this one is pointing here
and maybe this one’s pointing here. And this says null here. And this points here. And the numbers are, as we’ve
been using today, 42, 50, 13. So the start of this
list is called numbers. This points to the start of the list. What am I doing in this for-loop? I am just implementing the
following logic with this loop– give me a variable called
ptr, as represented in the story by my left
finger, here, and initialize that to be the start of the list. If that node’s next pointer is
equal to null, add a new node here. But this is not null. I want to follow the
bread crumbs to here. And then, oh, we’re at
the end of the list. I want to insert this new thing here. So how do express this
code actually in C? So if I look back up here,
this is the line of code that allocates my left finger
here called ptr and initialize it to equal numbers, which is the same
as pointing at the first element. It’s kind of like Comey was
representing first earlier. But now our array is called numbers. Next, what am I doing? Does ptr equal null? Well, no. If my left hand is pointing here,
it obviously doesn’t equal null. So we don’t have to worry yet. Then what do I want to do? If ptr next equals null,
well, what does that mean? Well, ptr is here. ptr arrow next means here. Does this equal null in this story? I mean, it literally doesn’t. Because null is not written
there. null is way down there. So the condition does not pass. So what do I do next? If ptr is equal to null doesn’t
apply, here’s a weird update. ptr gets ptr next. So it’s cryptic-looking syntax. But if ptr is pointing
here, what is ptr next? That’s just this, right? This is n. This is next. Or this is number. This is next. So ptr next is this. So what is this value? Well, this is a pointer pointing here. So that highlighted block of
code, ptr equals ptr next, has the effect visually of doing this. Why? If the arrows are a little too magical,
just think about these being addresses. If this is saying, the next
address is location 100, ptr equals ptr next is like
saying, well, this also equals 100. Whatever 100 is, for
instance, over here is what both arrows should now point out. And if you now repeat this
process and repeat this process, eventually that question we
asked earlier is going to apply– if ptr next equals null,
what do I want to do? Well, if ptr x equals null, there’s
two lines going on. ptr next equals n. So ptr next is no longer null. It should instead be pointing
at n, which is the new node. And then that’s it. Because this was already
initialized to null. And let’s suppose this was 55. And we’re done. So much easier to do, obviously,
in person with just humans, and moving around, and
pointing with their left hands. But in code, you just have to think
about the basic building blocks. What is each of these values? Where is each of it pointing? And which of those fields
do you need to update? And the only new code here– even though
we’re kind of combining it all in one massive example– is this. We are actually using arrow
notation to say, go to that address and access some value therein. And this condition down here, which
I’ll wave my hand out for now, just handles this situation where
the list is initially empty. Any questions on this thus far? All right. So let’s take a look more graphically
at some final problems we can solve. And what you’ll see in the
days ahead is the following when it comes to these
linked lists and more. We now have the ability to actually
allocate things in memory dynamically. We don’t necessarily know in
advance how many numbers we have or, in the case of the next problem
set, how many words we have. We have the ability though to use
malloc, and maybe even realloc, to grow and grow our
data structure in memory. And we have the ability
in code to actually traverse those values
in such a way that we can access memory that’s
all over the board now and not necessarily
back to back to back. But what happens if we want to combine
these ideas into fancier solutions still? Well, let’s take a look at that. In particular, if I go let’s
say over here to the following, let’s consider a problem
we might now solve. If I wanted to store everyone’s name
in this room in a data structure, I could do what? Well, we could use an array. So I could actually decide how
many people are in the room– let’s call it n– and actually draw n boxes on the
board, and then iteratively ask everyone for their name,
and actually write it down. If I then wanted to take attendance
thereafter and say, oh, is Alice here, or is Bob here, or is
Kareem here, or Brian, I could just look through that array
and say yes or no, that human is here. But what’s the running
time of that algorithm? How long would it take to look
up a name in a data structure where I’ve just drawn it as an
array, a big list on the board? AUDIENCE: A big O of n. DAVID J. MALAN: What’s that? AUDIENCE: A big O of n. DAVID J. MALAN: A big O of n, right? Because if it’s just a list of
names, it’s going to take big 0 of n. And frankly, that seems a little slow. How could I do an optimization? Well, what if we combined
some of these ideas? Arrays are nice because
they give me random sort of instant access to memory locations. But linked lists are nice because they
allow me to dynamically add or subtract elements even if I want from the list. So you know what? Instead of writing down everyone’s
names, like Alice, and Bob, and Charlie, like this in just one big
array of some fixed size that might paint me into a corner– now I
only have room for one more name– what if I instead do things
a little more cleverly? So when I’m actually jotting down
everyone’s name in the room, what if I instead did, OK, is Alice here. All right. Alice is here. And then Brian is here. I’m going to put Brian here. And then maybe Charlie is here. All right. So Charlie. And then maybe Arnold is here. Where should I put Arnold? So also starts with A. You know what? Let’s just put Arnold here. Arnold. And Abby is here. So you know what? Let’s just put Abby up here as well. Bob came as well. So Bob– so what’s the pattern
I’m obviously following as I’m hearing names called out? AUDIENCE: Alphabetically sorted. DAVID J. MALAN: Alphabetically sorted– kind of. Like, Abby kind of ended
up in a weird place here. But that’s fine because I
didn’t hear her name first. But I did kind of bucketize people
into different rows of the board. In other words, all
of the A names I seem to just write down for
convenience at the top, and then all of the B names
together, and C names. And probably if I kept going,
I could do this all the way through Z in the English alphabet. So what’s nice about this is that,
yeah, I’m making lists of names, but how long is each of those lists? If there’s n people in
the room, each of my lists is not going to be n
long, which is slow. It’s going to be what? n
divided by 26, give or take. If we assume that there’s
an equal number of people with Z names and A names, it’s going
to be roughly n divided by 26 so that I have these chains
of human names, but they’re much shorter than they would have been
if I just grouped everyone together. And this is a fundamental technique
in programming called hashing. It turns out there are things in
this world called hash functions. These are just mathematical, or
verbal, or code-implemented functions that take as input something and produce
as output a number typically– a number from 0 to, say, 25, or from 1 to 26. But they can also output strings
in other contexts as well. So my hash function here in my
mind is, if you hand me a name, I’m going to look at the
first letter in your name. And if it’s A, I’m
putting you in location 0. If it’s B, I’m going to
put you in location 1. If it’s a Z, I’m going to put
you in location 25 at the end. So these are all buckets
I’ve got, so to speak, in computer science–
like 26 buckets or room on the board that represent
the starts of people’s names. So what is that? Well, it would seem that if I don’t
know in advance how many A names I have, that’s kind of like drawing this
as a linked list, if you will, that might just get longer and longer. But I do know that I only have a
finite number of first letters. So that– at the risk of
drawing a little messily– is kind of like drawing
what data structure? AUDIENCE: An array. DAVID J. MALAN: Yeah. It’s kind of like drawing an
array that just has 26 spots. And what’s nice about an array
is that I have random access. I can jump right to any letter of the
alphabet in constant time, one step. And once I get there, I’m still
going to see a list of names. Thankfully, thanks to linked lists,
that list can be short or long. But on average, let’s
say it’s going to be 126th the length that it would have been
if I just used one array or one linked list. So this technique of using a
hash function– which, again, I’ve defined as you give me
a name; I take that as input; I look at the first letter; and I
return as output a number from 0 to 25– a hash function lets
you create a hash table. And there’s different ways
to implement hash tables, but perhaps one of the most
common is indeed like this. You decide in advance
on the size of an array. But that array does not contain
the strings or the humans’ names. That array actually
contains linked lists. And it’s the linked lists
that contain the names. So we borrow ideas from, like, week two. We merge them with an idea today
from week four of adding arrays to linked list respectively. And we kind of get the
best of both worlds. Because I can immediately jump to any
letter of the alphabet super fast. And once I’m there,
yeah, there’s a list, but it’s not nearly as long as it would
have been if I didn’t use this trick. So what’s the running
time of all of this? Well, it turns out that a
hash table in the worst case might still take you how many steps
to find someone’s name once it’s been added to the list? In the very worst case, how many
steps, if there’s n people in the room? AUDIENCE: n. DAVID J. MALAN: Maybe n. Why? It’s kind of a perverse situation. But can you contrive
a scenario in which, even though we’re doing
this fanciness, it still takes me n steps to confirm
or deny that someone’s here? Yeah? AUDIENCE: Everyone’s name
starts with the same letter. DAVID J. MALAN: Everyone’s name
starts with the same letter for some weird reason. Now, it’s a little silly
in the human world. But it could happen
if you’re just talking data or whatever in the computer world. This can devolve into, sure, an array
with just one really linked list. But in practice, that’s not
likely going to happen, right? If we actually spent the time here
and asked everyone for their name, we’d probably get a reasonably
uniform distribution of letters, at least as is statistically
likely with just human names. So that would kind of spread things out. And so there’s this fundamental
distinction between sort of real-world running time, or wall clock time–
how many seconds are actually spinning on the clock– versus asymptotic running time. We’ve talked for a couple of weeks now
about running time as being big O of n. And that might be still the case, that
a hash table– yes, in the worst case, it’s still a big O of n data structure. Because in the worst case,
it’s going to take n steps. But in the real world, big O of n
is really big O of n divided by 26, even though we always ignore
those lower-order terms. But when it’s you, the human, running
the code and analyzing the data, running 26 times faster is
actually real time saved, even though a mathematician might say,
ah, that’s the same fundamentally. And indeed, one of the problems
ahead for the next problem set is going to be to suss out
exactly what the implications are in your own code for actual
wall clock running time. And making smarter design
decisions, like something like this, can actually really speed up your
code to be 26 times as fast, even though, yes, a
theoretician would say, ah, but that’s still asymptotically
or mathematically equivalent to just something linear. So it’s this fine tuning that will
make your code even better and better. Now, frankly, hashing
on first names probably isn’t the smartest thing alone, right? Like, does anyone’s– and
this is going to be hard. Does anyone’s name start with X here? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: [INAUDIBLE] is not here. But thank you for that
perfect counter-example. But she’s not here. So look, there’s no Zs. So now we’re down to 25 possible values. And I could probably pick
some less common letters too. The point is there’s probably
a few more As than there are Zs or a few more B’s than there are
Q’s just by nature of human names. So maybe just using the first
letter isn’t good enough. And frankly, with 26 names– suppose
we did this for all of Harvard and had thousands of names. Each of my chains might still have
hundreds or thousands of names. So another design question is going to
be, well, how many buckets should you have, how big should the array be. Maybe you shouldn’t look
at the first letter. What if you look at the first and the
second letter together– so AA, and AB, and AC, and then dot dot dot,
BA, BB, BC, so you could come up with more and more buckets? But what else? How else might we kind of
uniformly distribute people? What do all of you have that we could
use as input to a hash function? AUDIENCE: A last name. DAVID J. MALAN: OK. Well, you could do last
name, which might give us a different or similar distribution. Yeah? AUDIENCE: ID number. DAVID J. MALAN: Whats that? AUDIENCE: ID number. DAVID J. MALAN: Yeah. We could use your ID number and actually
look at the first digit of your ID. And odds are, it’s 0 through 9. So we could probably at least
get 10 buckets that way. And that’s probably
uniformly distributed. I’m not sure. We could use birth dates in some way. Like, we could put all of
the freshmen in one bucket, all the seniors in another
bucket, and everyone else, and so forth, in their own buckets,
which would also give us some input. So again, a hash function is entirely
up to you to program and design. The goal though is to smooth things out. You want to have roughly the same
number of things in each linked list just so that you have
about the same performance across all of these various inputs. So let’s take a look at a
couple of other data structures, again, in this abstract way. Now that we know that, even though
it’s not obvious at first attempt, we know how to construct arrays. We kind of know now how
to construct linked lists. It stands to reason we could
implement them together in code. What else could we do now
with these building blocks? So for instance, this structure here
is a very common one, known as a tree. A tree like a family tree, where
there’s one patriarch or matriarch at the top, and then their children,
and then their grandchildren, and great grandchildren, and so forth. And what’s nice about a tree structure
is that, if you’re storing data, you can actually store the data
in clever ways to the left child, to the right child, and
so forth, as follows. Notice here, there’s something curious
about all the numbers in this data structure. What is noteworthy about them? What is noteworthy? Yeah? AUDIENCE: Multiples of 11. DAVID J. MALAN: What’s that? AUDIENCE: They’re multiples of 11. DAVID J. MALAN: They
are multiples of 11. That was just to make them look
pretty though by the author here. Yeah? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Yeah. There’s a mathematical significance too. Like, no matter what node or
circle you look at, the value in it is bigger than the left child and
it’s smaller than the right child. So it’s kind of in-between. Any circle you look at, the
number to the left is smaller, the number to the right is bigger. And I think that applies
universally all over the place. Yes? So what does that mean? We’ll recall from, like, week 0 when we
had a whole bunch of phone book pages that we were searching– 1, 2, 3, 4, 5, 6. Let’s give ourselves a 7th one. Recall that when we did divide
and conquer, or binary search, we did it on an array. And what was nice about binary
search was we started in the middle, and then we maybe went left,
or we maybe went right, and we kind of divided
and divided and divided and conquered the problem much more
efficiently in logarithmic time than it would have been
if we did it linearly. But we know now weeks later that
arrays are kind of limiting, right? If I keep storing all of
my values in an array, what can I not do with the array? Make it bigger, right? I can’t add an element to it
without copying every darn element, as we’ve discussed thus far today. But what if I was a
little smarter about it? What if I stored my values,
not just in an array, but I started storing
them in these circles– let’s call them nodes– and each of those nodes is really just
an integer plus two additional values? How would we implement this
data structure in memory? Well, here’s an int n– could
represent the number in question. And we could put that
in a data structure called a node that just has the
same syntax as earlier today, but I’ve left room for two more fields. What is it that I want
to represent in code if I want to start storing my numbers,
not in this old-school week 0 array, but in a tree? AUDIENCE: Two pointers. DAVID J. MALAN: Two– AUDIENCE: Pointers. DAVID J. MALAN: Two pointers. Right? A tree, as drawn here
literally with arrows, is just like saying every
one of these nodes or circles has a left child and a right child. How do you implement children? Well, you can literally just use
pointer notation as well here. A left child is just a pointer
to another struct on the left. And a right child is just another
pointer to the child on the right. And what’s nice about this
ultimately is that we can now traverse this tree just as efficiently
as we can traverse this array. Because notice if I want to
search for the number 66, how many steps does it take
me if I start at the top? Just like Comey represented
the start of our linked list, so in the world of a tree does the
root have special significance. And that’s where we always begin. So how many steps does it take
me to find 66 given the top? AUDIENCE: Three. AUDIENCE: Two. DAVID J. MALAN: It looks like– yeah, two or three, right? I start at the top. I look at it and say, hmm,
55, which way do I go. I go to the right. Then I see 77. OK. Which way do I go? I go to the left. So it’s the same logic as week 0 in
dividing and conquering the phone book or an array a couple of weeks later. But we get to the number we
care about pretty quickly. And it’s not linear. And in fact, if we actually
did out the math, what’s really cool about a binary search
tree is that if you have n elements, n circles, the height of that tree is
by definition mathematically log n. So the height of the
tree just so happens to correspond to exactly how
many times you can take n and divide it, divide it,
divide it, divide it in two. And you can actually see this if you
think about it the reverse direction. On the bottom row, there
are how many elements? All right? And on the middle row, there is? AUDIENCE: Two. DAVID J. MALAN: Two. And on the top row, there’s one. So you can actually see it
in the reverse direction. This is like divide and conquer,
but in a different conceptual way. Every row in the tree has half as
many elements as the one below it. And so the implication of that is just
like from week 0 in the phone book when we’re dividing, and dividing, and
dividing in half, and half, and half. So this is only to say, now that
we have structures and pointers, we can build something like this. But let’s try one
other example here too. This is a crazy looking example. But it’s kind of amazing. Suppose that, if we wanted to
store a dictionary of words– so not humans’ names this
time, but English words. So Merriam Webster or Oxford
English Dictionary has what? Thousands, hundreds
of thousands of words these days in English for instance? How do you actually store those? Well, if you just look up words in
a dictionary back in yesteryear, that is linear. You have to start at the
beginning and look through it page by page, looking for words. Or you could be a little smarter. Because the words in any dictionary
are hopefully alphabetized, you can do the Mike Smith-style divide
and conquer by going to the middle, then the middle of the
middle, and so forth– log of n. But what if I told you, you could
look up words in constant time– some fixed number of steps? None of this divide
and conquer complexity. No log n. Just constant time– you want
a word, go get it instantly. That’s where this last structure
comes in, which is called a trie– T-R-I-E– short for retrieval, even
though it’s pronounced the opposite. So a trie is a tree each
of whose nodes is an array. So it’s like this weird Frankenstein’s
monster kind of data structure. We’re just really combining lots
of different ideas, as follows. And the way a trie works, as is implied
by this partial diagram on the board, is that if you want to store
the name Brian, for instance, in your dictionary–
it’s the first word– what you do is you start by
creating a tree with just one node. But that node is effectively an array. That array is of size, let’s
say for simplicity, 26. So A through Z. This location here
therefore represents B for Brian. So if I want to insert Brian into this
tree, I create one node at the top. And then for the second
letter in his name, R, I create another node,
also an array, A through Z. And so here, I put a
pointer to this node here. B-R-I. So I should have
drawn some more boxes. A, B, C, D, E, F, G, H, I. So here, I’m
going to draw another pointer to B– wait. Bian. [LAUGHTER] OK. That’s wrong. Billy shall be our name. Billy is at B. Wait. No. Dammit. B, B. B-I-A– yes, this works. This works. OK. Sorry. So here we go. We’re inserting Billy into
this fancy data structure. So the first node
represents the first letter. The second node represents
the second letter. The third node represents
the third letter. And so forth. But what’s cool about
this is the re-usability. So notice if this is the second letter
and I counted this out correctly, I, this is going to lead
to a third node deeper in the tree where it’s L that we
care about next, and then another one down here which represents another L. And I’ll start drawing the letters. L. This is B. This is I.
L. And we’ll call this L. And then, finally, another one
over here, which is a Y. And this gets pointing down here. This gets pointing here. And so forth. So in short, we have
one node essentially for every letter in the word that we’re
inserting into the data structure. Now, this looks stupidly
inefficient at the moment. Because to store B, I, L, L, Y,
how much memory did I just use? 26 plus 26 plus 26 plus 26 plus 26. Just to store five
characters, I use 26 times 5. But this is kind of thematic
in computer science– spend a little more space, and I bet
I can decrease the amount of time it takes to find anyone. Because now no matter how many other
students are in this data structure– and for instance, let’s do another one. If we had another one, like Bob– so B is the same first letter. That leads us to this second node. O is somewhere else in
this array, say, over here. So this represents O. And
then Bob has another one. So there’s going to
be another array here. And this is why the picture
above draws this so succinctly. This is how we might store Bob. So B, I, L, L, Y. Or you can
follow a different route, B, O, B. So we can start to reuse
some of these arrays. So there’s where you start to
get some of the efficiency. Any time names share a few letters,
then you start reusing those same nodes. So it’s not super, super wasteful. But the question now is, if there’s
like 1,000 students in the class, or 1,000 students in the room,
we’re going have a lot of nodes there on the board. But how many steps does
it take to find Billy, or Bob, or any name with this data
structure, and to conclude yes or no that student is in the class? So, like, five for Billy, three for Bob. And notice none of that
math has any relationship to how many students are in the room. If we instead wrote out a long list
of 1,000 names, in the worst case, it might take me 1,000
steps to find Billy or Bob. Maybe I could be a little
smarter if I sort it. But in the worst case,
big O of n, it’s linear. Or if I used a hash table
before, and maybe there’s 1,000 students in the
room, but, OK, there’s 26 letters in the English
alphabet at least. So that’s 26 buckets. So maybe it’s 1,000
divided by 26, worst case, if I’m using those linked
lists inside my array. But wait a minute. If I’m using this structure, a
trie, where every node in the tree is just in an array that leads me to the
next node, ala breadcrumbs, B, I, L, L, Y is 5 and always 5. B, O, B is always 3. B, R, I, A, N would have been 5 as well. None of these totals has
any impact or any influence from the number of total
names in the data structure. So a trie in some sense
is this amazing holy grail in that, by combining these various data
structures, now you get constant time, but you do pay a price. And just to be clear, what is
the price we seem to be paying? AUDIENCE: Memory. DAVID J. MALAN: Memory. And in fact, this is why I’m
not really drawing it much more. Because it just becomes a big
mess on the screen because it’s hard to draw such wide data structures. It’s taking a huge amount of memory. But theoretically, it’s coming faster. Yeah? Question. AUDIENCE: So would you deal with
a case if someone is in the Bob, but then the other kid is in the Bobby? DAVID J. MALAN: Good question. So it’s a bit of a simplification. If you were storing both Bob and
Bobby, you would actually keep going. So each of these elements
is not just one letter. You also have essentially a node
there or some other data structure that says either stop here or continue. And you’ll see actually
in the problems that we’ll propose to you how you can
represent that idea if you choose to go this route. Indeed, the challenge ahead ultimately
is something quite like this. You will implement your
very own spell checker. And we will give you code that
gets you started with this process. And of course, a spell checker
these days in Google Docs and Microsoft Word just underlines
in red misspelled words. But what’s going on? And how is it that
Word or Google Docs can spell check your English or
whatever language so quickly? Well, it has a dictionary in memory,
probably with tens of thousands or hundreds of thousands of words. And all they’re doing constantly
is, every time you type a word and hit the Spacebar,
or Period, or Enter, it’s quickly looking up that new
word or those words in its dictionary and saying, yes or no, should I squiggle
a red line underneath this word. And so what we’re going to do is
give you a big text file, ASCII text, containing 100-plus thousand words. You’re going to have to
decide how to load those into memory, not just correctly,
but in a way that’s well designed. And we’ll even give you a
tool, if you choose to use it, that times how long your code takes. And it even counts how much
RAM you’re actually using. But the key goals for this
week and our final week in C is to take some of these
basic building blocks, like arrays, and
pointers, and structures, and decide for yourselves how you’re
most comfortable stitching them together, to what extent you want to
really fine tune your code beyond just getting it correct, and to give you
a better sense of the underlying code that people have had to
write for years in libraries to make programming doable, ala Scratch. Because in just a few weeks, we’re
going to transition to Python. And the dozens of lines of
code you’ve been writing now are going to be whittled
down to one line, two line, because we’re going to get a lot
more features from these newer, fancier languages. But you’ll hopefully have an
appreciation of what is actually going on underneath that hood. So I’ll stick around for
any one-on-one questions. Let’s call it a day. Take a duck on your way
out for roommates as well. And we’ll see you next time.

54 thoughts to “CS50 2018 – Lecture 4 – Data Structures”

  1. 15:48 this is quite no well explained, the short part is : when a program exits, the memory is free automaticly by the kernel, so most of the time you don't need to worry about freeing the memory, at least of course your program runs forever that is common use in servers, but at the end you will be using a language that have garbage collector so you don't need to worry about that, the large part: https://stackoverflow.com/questions/2975831/is-leaked-memory-freed-up-when-the-program-exits

  2. This guy is amazing! The way he simplifies and explains data structures is a gift. I've been terrified of learning data structures and big O notation, but he makes it so easy to understand. Not just how they work, but the applications as well. Thank you so much David! Thank you Harvard for open sourcing this material.

    I hope I'm smart and dedicated enough to become a Developer or Software Engineer one day. If I do, I'll be sure to give back to Harvard and the world in general in the spirit of open source learning.

  3. Hi David, btw fantastic lectures and your energy is contagious. As a beginner in programming, i just want to know which lectures should i watch in CS 50?, since there are a lot of them. please advise me the order of all playlists for me to comprehend in a coherent way. thanks in advance.

  4. 11:38
    It's not that lucky all the time.
    my Debian core dumped when I did that locally.
    memory: malloc.c:2406: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize – 1)) == 0)' failed.
    Aborted (core dumped)
    Reading: https://stackoverflow.com/questions/2987207/why-do-i-get-a-c-malloc-assertion-failure

  5. guys! thank you!!! this kind of videos been a great help! I'll surely repay your kindness in some other way, someday..

  6. In the Tries algorithm, when he "gets to billy" or "gets to bob" what is he actually doing? traversing the arrays, and the last array points to a value (i.e. present, or not present?) and why couldn't bob use billys second array (containing the letter 'i')? since every array contains every letter of the alphabet.

  7. Class moved to new location due to class size, happened a ton at UCF, otherwise known as "optimization." Nah, but seriously, people can't hang. Seriously though David, the most comprehensive lesson on programming I've ever seen (including MIT) and I've been developing games and writing in C# (some C++) for a few years now.

  8. Nooo!! Harvard students get to take ducks for roommates as well!!…
    We don't even get a glass of water ;(

  9. Thanks David , I am preparing to get my Masters in Computer Science coming from a physics background as an undergraduate and you are really helping me.

  10. Excellent professor. He's mastered the art of explaining concepts, while delivering the information with enough energy to keep it exciting. You definitely won't fall asleep watching this man!

  11. At 56:53 , David says that we cannot point the pointer backwards. That is true. But can't we achieve that by using "&" operator. Suppose we want to delete an element from linked list. So we can write, say, int *ptr = &(the no. to be deleted) and eventually use free(ptr).( Yeah we won't do it directly this way, else linked list would be broken.) But, my point is just that we can still access the pointer pointing the node containing this element( to be deleted)
    Edit: well I think it is wrong because & operator might not necessarily give you the correct address also of the specific element. Like if 12 is stored in more than 1 variables, then obviously & operator may not give the desired value.

  12. At 33:00 how can we implement the same code if we don't have access to functions like get_string but instead wanna pull it off using scanf.
    I tried but the code just got really messy and would produce some error every time.

  13. These lectures are saving my butt in CS school. Having a huge overview class at the beginning that gives you a little taste of everything is SO HELPFUL! It's so hard understanding what is going on when the classes are segmented and there isn't one class to link it all together.

Leave a Reply

Your email address will not be published. Required fields are marked *