The Fibers of Threads

For the last several months in this column, we've been looking at programming with Linux's threads library, pthreads. However, we have taken for granted the work that is actually done under the covers by the pthreads libraries. So this month's Compile Time will dissect Linux pthreads themselves to discover exactly what it is that makes them tick.

For the last several months in this column, we’ve been looking at programming with Linux’s threads library, pthreads. However, we have taken for granted the work that is actually done under the covers by the pthreads libraries. So this month’s Compile Time will dissect Linux pthreads themselves to discover exactly what it is that makes them tick.

Before we dive in and start looking at this topic though, you’ll probably find it a lot easier to follow if you are already familiar with a couple of other concepts. First, we assume that you have an understanding of the difference between code that runs in “privileged mode” (i.e., kernel space) and code that runs in “user mode” (i.e., user space). It’s also important that you understand how system calls work and how they switch between user mode and privileged mode. If you need to brush up on these topics, a good introduction can be found in the May 1999 Compile Time, located on the Web at: http://www.linux-mag.com/1999-05/compile_01.html.

There are three basic ways that threads can be implemented on a system: many-to-one, one-to-one, and many-to-many. Each method requires a different amount of support from the operating system and has its own strengths and weaknesses. The names of the methods refer to the number of threads the application ‘sees’ in user space compared to the number of threads (i.e., processes) the operating system sees in kernel space.


In a many-to-one implementation of threads, the threads are completely written in user space. From the kernel’s point of view, there is only one process executing. Although that process is multi-threaded, to the kernel it looks the same as any other non-threaded process. The advantage of this method is that it requires no additional support from the kernel and therefore can be implemented on any system. All the thread management issues (including scheduling) are handled in user space by the application.

Interestingly, the many-to-one implementation’s greatest strength just happens to be its greatest weakness as well. Because the operating system views the multi-threaded application as a single process, that application can never take advantage of additional processors if they exist. The multi-threaded application is guaranteed to run only on a single processor even if additional resources are available.

In fact, in many cases, the multi-threaded application will not even act like a multi-threaded program at all. For example, if one of the threads is waiting for a system call to return, the entire application must also wait! When a system call executes, the calling process is required to block until it completes. Therefore, in a many-to-one implementation, all threads will stop executing when any one of them needs to switch into kernel space. This is intolerable behavior in a threaded application (e.g., imagine if all the windows in your GUI froze up when the application was waiting for user input!) and, therefore, this implementation is not really a solution to the problem of implementing threads.


The one-to-one implementation of threads solves the two problems of the many-to-one implementation listed above. In this implementation, every thread in a multi-threaded application is its own process in the kernel. This implementation is often simpler than the many-to-one implementation because it allows the kernel to perform the scheduling of the threads just as it does for other processes. Also, if the machine on which the multi-threaded application is running has multiple processors, the operating system can schedule each thread to run concurrently on different processors.

Of course, since each thread is its own process, they need to not be blocked by the operating system when another one of them switches into kernel space. Unlike the many-to-one method, this implementation requires some support from the operating system. Recall that threads in a multi-threaded application share the same memory while executing. Therefore, the kernel must provide a means for creating new processes that share memory.

Finally, because each thread is its own process, many of the synchronization techniques (e.g., semaphores and condition variables) carry a large overhead cost due to context switching between user space and kernel space. For example, when waking up a set of threads that are waiting on the condition, a switch between kernel space and user space and back again must occur to wake up each thread. The associated overhead of this operation is two times the number of threads waiting times the cost of a context switch. Not cheap.


The many-to-many implementation of threads is nice because it takes the best ideas from both the one-to-one and many-to-one methods while avoiding many of their disadvantages. In the many-to-many model, many kernel processes execute, and each process represents a scheduler that picks a thread to run. Basically, each process acts as its own many-to-one implementation of threads by scheduling different threads to run within that kernel process.

Having multiple processes coordinate to do this solves the problem of all threads being blocked when one thread makes a system call and switches into kernel space. In the many-to-many model, the other kernel processes can still schedule threads to run. Also, the problems with context-switch overhead are mitigated because each process performs the switches between execution of threads. As a result, this switching happens entirely in user space.

Of course, all this functionality comes with a price — this method is the most complicated to implement of the three and requires kernel support for the relationship between kernel threads and user threads. It is the most common implementation on commercial Unix systems (e.g., Solaris, Digital Unix, and IRIX).

The Linux Way

The Linux implementation of pthreads follows the one-to-one method. The kernel does not support the functionality required for a many-to-many implementation, and since context-switches in Linux are relatively fast, that disadvantage of the one-to-one method is much less of a problem. Linux also provides the mechanism necessary to create a new process that shares a memory space. You are probably familiar with the fork() system call that creates a new process based upon the calling process. Linux provides another system call, __clone(), that is more general than fork().

Where fork() only gives the child process a copy of the state of the parent process, clone() can be used to create a new process that shares or copies resources with an existing process. You can share or copy the memory map, file systems, file descriptors, and signal handlers of the existing process. The fork() system call is essentially a special case of clone() where none of the resources are shared. Let’s take a closer look at the clone() system call. Its prototype, as found in <sched.h>, is listed in Figure Two.

Figure Two: Prototype for clone()

int __clone (int (*fn) (void *arg), void *child_stack, int flags, void *arg)

Notice that, unlike fork(), clone() takes in a function, fn, to be executed. This is the first argument to the call. When clone() successfully completes, fn will be executing simultaneously with the calling function.

The child_stack argument specifies where the child process should start its stack. Since it is possible for the child and the parent to share a memory map, the parent is responsible for allocating space for the child process’s stack. The parent and child can execute different functions after the call to clone(), so they cannot share a single stack.

The flags argument indicates exactly which parts of the parent should be shared with the child. Table One shows a list of the flags that may be bitwise-or’ed together and passed in the call to clone(). Besides the sharing flags, the lowest byte of the flags argument represents the number of the signal to be sent to the parent when the child process dies.

Table One: Flags for clone()

CLONE_VMShare the memory between the parent and child process. If this flag is not set, the memory will be copied as is the case with fork()
CLONE_FSShare the file system information. A call to a function like chdir() will have an effect on both the child and parent process regardless of the caller.
CLONE_FILESShare file descriptors. Any newly created file descriptors after the clone() will be vaild in both child and parent processes.
CLONE_SIGHANDShare signal handlers. All signal handlers set up by either parent or child after the clone() are common to both.
CLONE_PIDShare the process ID. The child and parent process will have the same pid. Should not be used with Linux kernel versions after 2.1.97.

The last argument to clone(), arg, represents the argument that is to be passed to the function, fn, that clone() will execute. The __clone() function returns the process ID of the child process when it succeeds. It will return -1 when it fails. See the man page for more information about the failure cases.

If you’ve been following this column’s series of articles on threads over the past few months, you may be thinking that the functionality of clone() seems awfully familiar… Well, that’s because it is pretty much how the pthread_ create() function behaves, and the similarity is no coincidence. As stated in the man page for clone(), “the main use of __clone() is to implement threads.” The Linux kernel developers felt that the one-to-one method of implementing threads was the way to go, and therefore designed the kernel to enable threads support in that way. The implementors of the pthreads library simply followed their lead.

The Structure of Threads

Now that we understand the clone() system call, let’s return to our discussion about the implementation of threads under Linux. In the following, we will be referring to the implementation of pthreads found in glibc version 2.1.3. All of the files mentioned are in the linuxthreads directory unless otherwise specified.

The basic structure of the implementation of threads is as follows: when the first thread is created (i.e., in the first call to pthread_create()), a manager thread is created to manage the subsequent creation and termination of threads. The manager will perform the actual creation of new threads. Once the manager is created and initialized, the thread that wishes to create a new thread sends a message to the manager via a pipe and then suspends itself until the manager satisfies the request and wakes it up.

The thread manager spaces each of the thread’s stacks 2 MB apart. At the top of each thread’s stack resides a struct containing the relevant information about that thread (e.g., it’s process ID, etc.) for use by the manager and other threads. The definition of that struct is found in internals.h, and the first few fields can be seen in Listing One (there are more than 40 fields in this struct).

Listing One: struct Defining a Thread

struct _pthread_descr_struct {
pthread_descr p_nextlive, p_prevlive;
/* Double chaining of active threads */
pthread_descr p_nextwaiting; /* Next element in the queue holding the threads */
pthread_descr p_nextlock; /* can be on a queue and waiting on a lock */
pthread_t p_tid; /* Thread identifier */
int p_pid; /* PID of Unix process */
int p_priority; /* Thread priority (== 0 if not realtime) */
struct _pthread_fastlock * p_lock; /* Spinlock for synchronized accesses */

/* New elements must be added at the end. */
} __attribute__ ((aligned(32))); /* We need to align the structure so that
doubles are aligned properly. This is 8
bytes on MIPS and 16 bytes on MIPS64.
32 bytes might give better cache
utilization. */

stack graphic
Figure One: A stack after a few threads have been created.

To get a better idea for how threads are laid out in memory, let’s look at a picture of the stack after a few threads are created (see Figure One,). Since stacks grow down on most architectures, we draw our picture assuming that this is the case in our example. Notice that the initial thread lies at the top of the stack. This is the thread that represents the main() function in an application. The thread descriptor for this thread is kept in the space for global variables and is defined in pthread.c. Below the initial thread on the stack is the thread manager’s stack (it’s descriptor is also in global space and defined in pthread.c), followed by the struct for each thread and its stack. The manager keeps track of each thread by keeping track of the top of the thread’s stacks and, therefore, a reference to their respective structs.

Creating New Threads

Now that we have a clear picture of the memory layout, we can look at the code that does the work of creating new threads (see Listing Two). When a program calls pthread_create(), the __pthread_create_ 2_1() function is called in the glibc library — the file called pthread.c. First, the function gets the reference to itself (thread_self() returns a pointer to the struct_ pthread_descr_struct).

Listing Two: The pthread_create() Function in pthread.c

/* Thread creation */

int __pthread_create_2_1(pthread_t *thread, const pthread_attr_t *attr,
void * (*start_routine)(void *), void *arg)
pthread_descr self = thread_self();
struct pthread_request request;
if (__pthread_manager_request < 0) {
if (__pthread_initialize_manager() < 0) return EAGAIN;
request.req_thread = self;
request.req_kind = REQ_CREATE;
request.req_args.create.attr = attr;
request.req_args.create.fn = start_routine;
request.req_args.create.arg = arg;
sigprocmask(SIG_SETMASK, (const sigset_t *) NULL,
__libc_write(__pthread_manager_request,(char*) &request,sizeof(request));
if (THREAD_GETMEM(self, p_retcode) == 0)
*thread = (pthread_t) THREAD_GETMEM(self, p_retval);
return THREAD_GETMEM(self, p_retcode);

Next, the function creates a request to be sent to the thread manager. Notice that the request is of type REQ_ CREATE. The request also contains the function to be run (start_routine), its argument (arg), and the attributes (attr) the new thread should have. After indicating that the current thread should not be disturbed by any signals through the call to sigprocmask(), it writes the create request to the write end of the pipe to communicate that information with the manager. Then the thread suspends itself, waiting for the manager to wake it up once the request has been completed. If the manager successfully created the new thread, the argument passed into pthread_ create() is set, and the return code is returned.

The code for the thread manager is found in the file manager.c. The main function of this file is __pthread_ manager(), which sits in a while() loop, polling the read end of the communication pipe. Whenever it sees that there is information to be read, it reads the request, calls the appropriate handler, and continues to poll. The snippet of code that handles the REQ_CREATE request is shown in Listing Three. Note that the function that does the creation is pthread_ handle_create(), which is also in “manager.c” and is shown in Listing Four.

Listing Three: Code That Handles the REQ_CREATE Request

/* The server thread managing requests for thread creation and termination */

int __pthread_manager(void *arg)
… /* Initialization and signal handling omitted for brevity. */
/* Enter server loop */
while(1) {
n = __poll(&ufd, 1, 2000);
… /* Other checks omitted for brevity. */

/* Read and execute request */
if (n == 1 && (ufd.revents & POLLIN)) {
n = __libc_read(reqfd, (char *)&request, sizeof(request));
ASSERT(n == sizeof(request));
switch(request.req_kind) {
request.req_thread->p_retcode =
pthread_handle_create((pthread_t *) &request.req_thread->p_retval,
/* Other cases */


Listing Four: The pthread_handle_create() Function

static int pthread_handle_create(pthread_t *thread, const pthread_attr_t *attr,
void * (*start_routine)(void *), void *arg,
sigset_t * mask, int father_pid,
int report_events,
td_thr_events_t *event_maskp)
/* Initialization of local variables. */

/* Find a free segment for the thread, and allocate a stack if needed */
for (sseg = 2; ; sseg++)
return EAGAIN;
if (__pthread_handles[sseg].h_descr != NULL)
if (pthread_allocate_stack(attr, thread_segment(sseg), pagesize,
&new_thread, &new_thread_bottom,
&guardaddr, &guardsize) == 0)
/* Allocate new thread identifier */

/* Initialize the thread descriptor. Elements which have to be
initialized to zero already have this value. */

/* Initialize the thread handle */

/* Determine scheduling parameters for the thread */

/* Finish setting up arguments to pthread_start_thread */
new_thread->p_start_args.start_routine = start_routine;
new_thread->p_start_args.arg = arg;
new_thread->p_start_args.mask = *mask;
/* Raise priority of thread manager if needed */

/* Do the cloning. We have to use two different functions depending
on whether we are debugging or not. */
pid = 0; /* Note that the thread never can have PID zero. */

if (pid == 0)
pid = __clone(pthread_start_thread, (void **) new_thread,
__pthread_sig_cancel, new_thread);
/* Check if cloning succeeded */
if (pid == -1) {

/* Insert new thread in doubly linked list of active threads */
new_thread->p_prevlive = __pthread_main_thread;
new_thread->p_nextlive = __pthread_main_thread->p_nextlive;
__pthread_main_thread->p_nextlive->p_prevlive = new_thread;
__pthread_main_thread->p_nextlive = new_thread;
/* Set pid field of the new thread, in case we get there before the
child starts. */
new_thread->p_pid = pid;
/* We’re all set */
*thread = new_thread_id;
return 0;

Much of pthread_handle_ create() has been omitted for space purposes. For parts that are left out, the comments have been kept so you can see what occurs there. The code that remains is the guts of the thread creation mechanism. First, this function allocates a new stack for the new thread through the pthread_allocate_ stack() function. It then sets the necessary fields in the thread descriptor (including the start function and the argument to it). Next, it calls the __clone() function with the flags discussed above to share all attributes with the child. The pthread_start_thread() function that is run from clone() essentially performs some more initialization of the thread and finishes by calling the start_routine() function. And thus, the new thread is born.

There’s Always More…

Hopefully, you have enjoyed this whirlwind tour of the implementation of threads under Linux. If you would like to learn more about the clone() system call and how it relates to fork(), take a look at the book Linux Kernel Internals, by Beck, Hohme, Dziadzka, Kunitz, Magnus, and Verworner. As you might imagine, the implementations of the two functions are very similar to each other.

If you are interested in learning more about the implementation of threads (including the other pthreads functions), the code in glibc/linuxthreads is very well documented and is quite easy to read. Also, that directory contains a README and a FAQ.html that provide even more information about the use of threads. In the meantime, happy hacking!

Benjamin Chelf is a freelance author and engineer at CodeSourcery, LLC. He can be reached at chelf@codesourcery.com.

Comments are closed.