87 thoughts to “Multithreading Code – Computerphile”

  1. 12:03 What's the point of running two threads if one has to wait for the other? Just give them their own registers, add the registers at the end for the final value, instead of sharing a register…

    There. I just nearly doubled the running speed of the code. You're welcome 😀

  2. If you spend any amount of time at the command prompt, you should check out iTerm 2. The Mac's built-in Terminal app is super basic.

  3. Not sure if this is an actual educational video or just the editor and Steve flexing on their editing and acting skills respectively.

  4. Online game AIs are really great examples of individual systems co-ordinating through classic multi-threading.
    Inline quantum AI's are real and grate exAmples of indiviDual psystems co-cardinating thRough
    toric multi-brAiding.

  5. Couldn't you just have each thread save to its own unique memory location, do one mutex for each thread when both are done, and add their final values?

  6. Oooh! Fourier transforms, or FFTs. Or rather, FFT window functions.
    When running FFTs continually, you usually don't wait for a whole new set of samples before creating the next transform. You are more likely to have an overlap (say 25%): the last 25% of the first set becomes the first 25% of the second set. A new thread could start on the 2nd transform before the 1st is finished.

    Problem: some FFT algorithms change the order of samples while doing their "butterfly" transform. Need to be careful there.
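
    One way around that: copy each overlapping window into a buffer the worker thread owns before dispatching it. A minimal sketch, assuming a hypothetical in-place fft_in_place() and illustrative window/overlap sizes:

    ```
    #include <string.h>
    #include <pthread.h>

    #define WINDOW 1024
    #define HOP    (WINDOW * 3 / 4)   /* advance 75% per window = 25% overlap */

    typedef struct { double buf[WINDOW]; } Job;

    void *fft_worker(void *arg)
    {
        Job *job = arg;
        /* fft_in_place(job->buf, WINDOW);  placeholder: an in-place
           "butterfly" may reorder buf, which is harmless here because
           each job owns a private copy of its samples */
        return NULL;
    }

    /* Copy the window starting at samples[offset] into the job's private
       buffer and start a thread on it; the shared sample stream is never
       reordered. Callers advance offset by HOP between successive windows. */
    pthread_t dispatch_window(const double *samples, long offset, Job *job)
    {
        pthread_t tid;
        memcpy(job->buf, samples + offset, sizeof job->buf);
        pthread_create(&tid, NULL, fft_worker, job);
        return tid;
    }
    ```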

  7. Just a reminder that the point of the video wasn’t to optimize the code for finding the sum of all the numbers up to 1,000,000, but to break down how multi-threading works and some solutions you would use in more common scenarios. Great video!

  8. Keeping a shared memory location coherent is what read-modify-write instructions are for; they are used extensively in symmetric multiprocessors, for example to implement semaphores.
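
    For instance, a binary semaphore (spinlock) can be built from a single test-and-set – a minimal sketch, assuming C11 <stdatomic.h>:

    ```
    #include <stdatomic.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;

    void acquire(void)
    {
        /* test-and-set is one indivisible read-modify-write: it returns
           the old value and writes "set" in the same step */
        while (atomic_flag_test_and_set(&flag))
            ;   /* spin until the current holder clears the flag */
    }

    void release(void)
    {
        atomic_flag_clear(&flag);
    }
    ```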

  9. Would it be possible to introduce the concepts of memory coherence and atomicity in a follow-up video to explain how this can be done without acquiring a lock? Atomic counters and interlocked instructions are great stuff that everyone should be exposed to.

    The "barrier to entry" (pun intended) to introduce these concepts is a little high, but the speed gained is something that might blow peoples minds. We cannot do stuff like modern GPGPU without these concepts, locks just don't work in a sensible way when you have thousands of threads running at once!

  10. I just had the thought that god uses multiple threads for generational learning, which we call alternate universes, and this video pops up. Nice one YouTube, nice one. And this multi-Steve nailed it.

  11. To me it would seem the most efficient way to deal with this problem is to have one thread compute A and one thread compute B; when they have both finished, A += B, so no locking is required.
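
    Something like this minimal sketch (the names and range split are illustrative); because the final add happens after pthread_join, no lock is needed:

    ```
    #include <stdio.h>
    #include <pthread.h>

    long partial[2];   /* each thread writes only its own slot */

    void *sum_range(void *arg)
    {
        long idx = (long)arg;                 /* 0 or 1 */
        long lo = idx ? 500001 : 1;
        long hi = idx ? 1000000 : 500000;
        for (long i = lo; i <= hi; i++)
            partial[idx] += i;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, sum_range, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        /* the join is the synchronisation point, so this add is race-free */
        printf("%ld\n", partial[0] + partial[1]);
        return 0;
    }
    ```

    (In real code you would pad the slots to separate cache lines so the two threads don't false-share.)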

  12. Nice explanation. We mastered sequential processing a long time ago, and we have very good mathematical models for it, but even now in 2019 we lack a good mathematical model for concurrent processing. Taking a huge, complex program and trying to split it into multiple threads is a nightmare. I think that because we have only two hands, our minds are not good at thinking concurrently.

    I would really like to see a programming language where the source file has "normal code – handled by programmers" and on top of it "synthetic code – handled by AI"; the synthetic code is invisible to the programmer but lives in the same source file. The human's job is to write the code, and the AI's job is to split that same code automatically, so as humans we don't need to know that the code ran in a complex concurrent fashion. This language could have multiple layers (a human layer plus many AI layers). It is quite difficult to make sequential-looking code (text) act concurrently – it just makes the code look complicated.

  13. Hadn't they commented out the multithreaded code before the last part? With true parallelism the real time is less than the user (total CPU) time, which is not the case in either experiment. The theory was good, but we just ended up seeing the single-threaded version 3 times xD

  14. I'm surprised that your program without the lock didn't crash. Two threads writing to the same location is a perfect recipe for an "Access Violation" error.

  15. Question 1: So, a token to prevent one thread from accessing a variable until the other is done with it?

    Question 2: Do threads share memory space when they are running on a single CPU?

  16. This summarizes the difficulties in group work. All parties are working on the same thing, slowing down efficiency due to mutex locking.

    Of course, that assumes there is even a mutex, because most of the time it's just a single guy doing all the work, with the other parties blackmailing that thread into crediting them equally.

  17. Wait… there's something very wrong here. Since the variable is volatile, the compiler must (if it wants to adhere to the volatile qualifier) add the local value into the global memory address directly, and in x86 most operations that alter memory lock that memory address until they're done. Why isn't gcc doing it here? Does it have to be compiled with optimizations turned on?

    What gcc is probably doing:

    ```
    mov eax, [i]
    cdqe
    mov rdx, [a]
    add rdx, rax
    mov [a], rdx
    ```

    While it should be doing:

    ```
    mov eax, [i]
    cdqe
    add [a], rax
    ```

  18. Anybody else impressed by Brady's guess of 500 billion? 500 000 000 000 ~= 500 000 500 000 – that's like 0.0001% error.

  19. 1:46 (about processes) "If you changed one memory location, it updated in the other side […]" – NO! THAT IS WRONG!!! The defining feature of processes is memory isolation: no changing memory in other processes unless both processes ask for it (look up shared memory). It is true that the process's memory is not necessarily copied during the fork(). The CPU's support for virtual memory allows using page faults and what is known as copy-on-write, or COW for short. This means both processes continue to use the same memory, but set as read-only in the page tables managed by the OS, until one of them tries to write to that memory. That causes a page fault (writing into a read-only page), which is transparently handled by the OS: it allocates a new physical page, copies the data, attaches it to the writing process as read/write, and then resumes the "offending" process. But this is an optimization by the OS and does not change the concept of memory isolation.

    I got worked up over this – sure, people make mistakes, but memory isolation versus a common address space is basically the defining difference between processes and threads. E.g. here is an excerpt from the documentation of the GNU/Linux function clone():

    Unlike fork(2), clone() allows the child process to share parts of
    its execution context with the calling process, such as the virtual
    address space, the table of file descriptors, and the table of signal
    handlers. (Note that on this manual page, "calling process" normally
    corresponds to "parent process". But see the description of
    CLONE_PARENT below.)

    One use of clone() is to implement threads: multiple flows of control
    in a program that run concurrently in a shared address space.

    Note how they define threads as "multiple flows of control … in a shared address space"? I'll admit this is tricky stuff, e.g. the distinction between threads sharing the table of file descriptors whereas fork() effectively copies that table. This means that, after either a fork() or a clone(), both processes will have the same files open with the same file descriptors (these are integer IDs, usually), but changes to the table will only be seen by a clone, not by a forked process – e.g. if one forked process closes a file descriptor, it will remain open in the other.

    TL;DR: Your explanation of the difference was like saying "a bike is different from a trike in that it has three wheels", which is horrible, which in turn should explain my reaction.
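
    A minimal demo of that isolation with plain fork() (nothing exotic): the child's write triggers copy-on-write, so the parent never sees it:

    ```
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int shared = 42;

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            shared = 1000;                       /* child's write triggers COW */
            printf("child sees  %d\n", shared);  /* 1000 */
            return 0;
        }
        wait(NULL);
        printf("parent sees %d\n", shared);      /* still 42: isolation */
        return 0;
    }
    ```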

  20. Using the token, doesn't that create an "additional" queue, as the multithreading already causes a queue-ish scenario? Why not split the work across n threads and then combine the results as a finishing task? Ah, 13:40 explains one reason. Still, would my approach work?

  21. Not the best example. If you look at the assembly emitted for an optimised, simple, single-threaded implementation, you see that a decent compiler will calculate the result at build time. Multithreading is just going to slow stuff down.

    movabsq $500000500000, %rsi

    The count variable was declared volatile to stop that kind of thing, but the reason for it wasn't explained very well. Just some hand-waving about 'more problems'.
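
    To see it for yourself (the compiler behaviour described is my own observation with gcc -O2, not something from the video): drop the volatile below and the whole loop folds to that movabsq constant; keep it and the adds really execute:

    ```
    #include <stdio.h>

    volatile long a;   /* remove "volatile" and gcc -O2 emits the
                          precomputed constant instead of a loop */

    int main(void)
    {
        for (long i = 1; i <= 1000000; i++)
            a += i;
        printf("%ld\n", a);
        return 0;
    }
    ```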

  22. Great video as usual, though this one gets a prize for cool editing effects. I was wondering: how about using two separate accumulators, one for each thread, and adding them together after the join?

  23. The problem with the demo was that adding two numbers takes around the same amount of time as acquiring a lock. Giving each thread some real, time-consuming work before needing to get the lock would show a significant speedup.

  24. Doesn't using a lock, which makes you use one core at a time, render multithreading useless? Or is it just because of this small example?

  25. Find me a better channel on YouTube. I'll wait.

    Every time I come to Computerphile, it blows my mind open.

  26. One note is that multi-threading for raw computation rarely produces benefits. Threading is better suited to IO-bound operations.

  27. It was the optimiser. It sees you are adding a fixed range of numbers to a variable, so it converts it to an a + sum(stuff) form and puts the mutex around that auto-magically.

  28. ==== pthreads-cp.c ====

    ```
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>

    volatile long int a = 0;   /* shared accumulator */
    pthread_mutex_t aLock;     /* protects a */

    /* Sum 0..499999 into a local variable, then fold it into a under the lock. */
    void *threadFunc1(void *arg)
    {
        int i;
        long int localA = 0;

        for (i = 0; i < 500000; i++)
            localA = localA + i;

        pthread_mutex_lock(&aLock);
        a = a + localA;
        pthread_mutex_unlock(&aLock);
        return NULL;
    }

    /* Sum 500000..1000000 the same way. */
    void *threadFunc2(void *arg)
    {
        int i;
        long int localA = 0;

        for (i = 500000; i <= 1000000; i++)
            localA = localA + i;

        pthread_mutex_lock(&aLock);
        a = a + localA;
        pthread_mutex_unlock(&aLock);
        return NULL;
    }

    int main(void)
    {
        pthread_t one, two;
        a = 0;

        pthread_mutex_init(&aLock, NULL);
        /* the thread functions now match the void *(*)(void *) signature
           pthread_create expects, so no function-pointer cast is needed */
        pthread_create(&one, NULL, threadFunc1, NULL);
        pthread_create(&two, NULL, threadFunc2, NULL);

        pthread_join(one, NULL);
        pthread_join(two, NULL);

        pthread_mutex_destroy(&aLock);
        printf("%ld\n", a);   /* 500000500000 */
        return EXIT_SUCCESS;
    }
    ```
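
    (To build it: cc -pthread pthreads-cp.c – the -pthread flag covers both the compile-time and link-time parts of using pthreads.)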
