C++11 multithreading tutorial

The code for this tutorial is on GitHub: https://github.com/sol-prog/threads.

In my previous tutorials I've presented some of the newest C++11 additions to the language: regular expressions, raw strings and lambdas.

Perhaps one of the biggest changes to the language is the addition of multithreading support. Before C++11, it was possible to target multicore computers using OS facilities (pthreads on Unix-like systems) or libraries like OpenMP and MPI.

This tutorial is meant to get you started with C++11 threads and not to be an exhaustive reference of the standard.

Creating and launching a thread in C++11 is as simple as including the thread header in your C++ source and constructing a std::thread object. Let's see how we can create a simple HelloWorld program with threads:

#include <iostream>
#include <thread>
 
//This function will be called from a thread
 
void call_from_thread() {
    std::cout << "Hello, World!" << std::endl;
}
 
int main() {
    //Launch a thread
    std::thread t1(call_from_thread);
 
    //Join the thread with the main thread
    t1.join();
 
    return 0;
}

On Linux you can compile the above code with g++ (newer versions also accept -std=c++11):

`g++ -std=c++0x -pthread file_name.cpp`

On a Mac with Xcode 4.x you can compile the above code with clang++:

`clang++ -std=c++0x -stdlib=libc++ file_name.cpp`

On Windows you could use a commercial library, just::thread, to compile multithreaded code. Unfortunately they don't supply a trial version of the library, so I wasn't able to test it.

In a real-world application the "call_from_thread" function will do some work independently of the main function. In this particular code, the main function creates a thread and waits for it to finish at t1.join(). If you forget to join (or detach) a thread before its std::thread object is destroyed, the destructor calls std::terminate and the program aborts, regardless of whether "call_from_thread" has finished or not.

Compare the relative simplicity of the above code with equivalent code that uses POSIX threads:

#include <iostream>
#include <pthread.h>
 
//This function will be called from a thread
 
void *call_from_thread(void *) {
    std::cout << "Launched by thread" << std::endl;
    return NULL;
}
 
int main() {
    pthread_t t;
 
    //Launch a thread
    pthread_create(&t, NULL, call_from_thread, NULL);
 
    //Join the thread with the main thread
    pthread_join(t, NULL);
    return 0;
}

Usually we will want to launch more than one thread at once and do some work in parallel. To do this we can create an array of threads instead of a single thread as in our first example. In the next example the main function creates a group of 10 threads that will do some work and waits for them to finish (there is also a POSIX version of this example in the GitHub repository for this article):

...
static const int num_threads = 10;
...
int main() {
    std::thread t[num_threads];
 
    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread);
    }
 
    std::cout << "Launched from the main\n";
 
    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
 
    return 0;
}

Remember that the main function also runs in a thread, usually called the main thread, so the above code actually runs 11 threads. This allows us to do some work in the main thread after we have launched the threads and before joining them. We will see this in an image processing example at the end of this tutorial.

What about using a function with parameters in a thread? C++11 lets us pass as many parameters as we need to the thread constructor. For example, we could modify the above code so that the function receives an integer as a parameter (you can see the POSIX version of this example in the GitHub repository for this article):

#include <iostream>
#include <thread>
 
static const int num_threads = 10;
 
//This function will be called from a thread
 
void call_from_thread(int tid) {
    std::cout << "Launched by thread " << tid << std::endl;
}
 
int main() {
    std::thread t[num_threads];
 
    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }
 
    std::cout << "Launched from the main\n";
 
    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
 
    return 0;
}

The result of the above code on my system is:

Sol$ ./a.out
Launched by thread 0
Launched by thread 1
Launched by thread 2
Launched from the main
Launched by thread 3
Launched by thread 5
Launched by thread 6
Launched by thread 7
Launched by thread Launched by thread 4
8L
aunched by thread 9
Sol$

You can see in the above result that threads do not run in any particular order once they are created. It is the programmer's job to ensure that a group of threads won't block trying to modify the same data. Also, the last lines are somewhat mangled because thread 4 had not finished writing to stdout when thread 8 started. In fact, if you run the above code on your system you can get a completely different result, or even some mangled characters. This is because all 11 threads of this program compete for the same resource, stdout.

You can avoid some of the above problems by using a lock such as std::mutex, which lets you synchronize the way a group of threads shares a resource, or you could try to use separate data structures for your threads, if possible. A full treatment of mutexes is beyond the scope of this tutorial; you can read more about them in one of the references suggested at the end of this post.

In principle we have all we need to write more complex parallel code using only the above syntax.

In the next example I will try to illustrate the power of parallel programming by tackling a slightly more complex problem: removing the noise from an image with a blur filter. The idea is that we can smooth out the noise in an image by replacing each pixel with some form of weighted average of the pixel and its neighbours.

This tutorial is not about optimal image processing, nor is the author an expert in this domain, so we will take a rather simple approach here. Our purpose is to illustrate how to write parallel code, not how to efficiently read/write images or convolve them with filters. For example, I've used the definition of the spatial convolution instead of the more performant, but slightly more difficult to implement, convolution in the frequency domain using the Fast Fourier Transform.

For simplicity we will use a simple uncompressed image file format, PPM. Next we present the header file of a simple C++ class that allows you to read/write PPM images and to store them in memory as three arrays (for the R, G, B colours) of unsigned characters:

class ppm {
    bool flag_alloc;
    void init();
    //info about the PPM file (height and width)
    unsigned int nr_lines;
    unsigned int nr_columns;
 
public:
    //arrays for storing the R,G,B values
    unsigned char *r;
    unsigned char *g;
    unsigned char *b;
    //
    unsigned int height;
    unsigned int width;
    unsigned int max_col_val;
    //total number of elements (pixels)
    unsigned int size;
 
    ppm();
    //create a PPM object and fill it with data stored in fname
    ppm(const std::string &fname);
    //create an "empty" PPM image with a given width and height; the R,G,B arrays are filled with zeros
    ppm(const unsigned int _width, const unsigned int _height);
    //free the memory used by the R,G,B vectors when the object is destroyed
    ~ppm();
    //read the PPM image from fname
    void read(const std::string &fname);
    //write the PPM image in fname
    void write(const std::string &fname);
};

A possible way to structure our code is:

Load an image to memory.

Split the image into as many chunks as the number of threads you want to use, typically the maximum number of threads your system can run concurrently; e.g. on a quad-core computer we could use 8 threads.

Launch number of threads - 1 threads (7 for a quad-core system); each one will process its chunk of the image.

Let the main thread deal with the last chunk of the image.

Wait until all threads have finished and join them with the main thread.

Save the processed image.

Next we present the main function that implements the above algorithm (many thanks to wicked for suggesting some code improvements):

int main() {
    std::string fname = std::string("your_file_name.ppm");
 
    ppm image(fname);
    ppm image2(image.width, image.height);
 
    //Number of threads to use (the image will be divided between threads)
    int parts = 8;
 
    std::vector<int> bnd = bounds(parts, image.size);
 
    std::thread *tt = new std::thread[parts - 1];
 
    time_t start, end;
    time(&start);
    //Launch parts-1 threads
    for (int i = 0; i < parts - 1; ++i) {
        tt[i] = std::thread(tst, &image, &image2, bnd[i], bnd[i + 1]);
    }
 
    //Use the main thread to do part of the work !!!
    for (int i = parts - 1; i < parts; ++i) {
        tst(&image, &image2, bnd[i], bnd[i + 1]);
    }
 
    //Join parts-1 threads
    for (int i = 0; i < parts - 1; ++i)
        tt[i].join();
 
    time(&end);
    std::cout << difftime(end, start) << " seconds" << std::endl;
 
    //Save the result
    image2.write("test.ppm");
 
    //Clear memory and exit
    delete [] tt;
 
    return 0;
}

Please ignore the hard-coded image file name and number of threads; in a real-world application you should let the user specify these parameters.

Now, in order to see parallel code at work we need to give it a significant amount of work; otherwise the overhead of creating and destroying threads will cancel out our effort to parallelize the code. The input image should be large enough to actually see an improvement in performance when the code is run in parallel. For this purpose I've used an image of 16000×10626 pixels, which occupies about 512 MB in PPM format:

I've added some noise over the above image in GIMP. The effect of the noise addition can be seen in the next detail of the above picture:

Let's see the above code in action:

As you can see from the above image, the noise level has been reduced.

The results of running the last example code on a dual-core MacBook Pro from 2010 are presented in the next table:

Compiler	Optimization	Threads	Time	Speed up
clang++	none	1	40 s
clang++	none	4	20 s	2x
clang++	-O4	1	12 s
clang++	-O4	4	6 s	2x

On a dual-core machine this code achieves a perfect 2x speed-up when run in parallel compared with running in serial mode (a single thread).

I've also tested the code on a quad-core Intel i7 machine running Linux. These are the results:

Compiler	Optimization	Threads	Time	Speed up
g++	none	1	33 s
g++	none	8	13 s	2.54x
g++	-O4	1	9 s
g++	-O4	8	3 s	3x

Apparently Apple's clang++ is better at scaling a parallel program. However, this could be due to a combination of compiler and machine characteristics; it could also be because the MacBook Pro used for the tests has 8 GB of RAM versus only 6 GB for the Linux machine.

If you are interested in learning the new C++11 syntax I would recommend reading Professional C++, 2nd edition, by M. Gregoire, N. A. Solter and S. J. Kleper, or C++ Primer Plus by Stephen Prata.

A good book for learning about C++11 multithreading support is C++ Concurrency in Action: Practical Multithreading by Anthony Williams.

Source : http://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/
