Operating system kernels have been the object of my attention for the last few days. I’m really not sure of the reason behind this new found affection for everything ring-0 - but it’s sure been fun.

However, unlike most folks, my adventures in kernel-mode have all been in Windows kernels - in specific, the WinCE kernel and the Windows NT kernel. For Windows CE, I’m using the source code that ships with Platform Builder (and is also available as a free download under shared source licensing).

For NT, I’ve been using the code from the Windows Research Kernel effort. This surprises people - why use that when I have access to the current Windows source tree? Well, for the simple reason that the WRK is a whole lot easier to build! Besides, since it contains only the stuff required to build the kernel itself, it suits my needs perfectly. Also, this lets me demo any kernel mode work in academia. Showing the live NT code on stage would be a quick route to getting fired :-)

The best part of the Windows Research Kernel offering, more than the source code, is access to the original NT specs, written by Dave Cutler, Lou Perrazoli, Kimura and others. I recently laid my hands on them internally and felt a sense of awe when reading through them. How many people get to write the specifications for such an insanely successful product?

This (and the following blog posts) are an attempt to document my wanderings through the kernel sources. In this post, I’ll walk through the schedulers of the 2 kernels and try to guess at what they do and why they do things differently. For a thorough description of the NT scheduler, see this article from Mark Russinovich

Schedulers

Schedulers, in essence, ‘schedule’ time on your operating system (_Duh!_)). If you want to learn more about the fundamentals behind how these beasts work, I suggest reading a good operating systems book - like the one from Tannenbaum.

Before I dig into the internals of the NT scheduler, let’s define some terms

Thread - Basic unit of execution on NT or CE. Corresponds 1:1 to what you get when you make a CreateThread call. When you create a process, you get a thread for free

Process - Corresponds to one or more threads. More importantly, it usually corresponds to an executable on your machine (e.g iexplore.exe, firefox.exe, etc). Note that I said ‘mostly’. The default process (called ‘System’ by task manager) loosely corresponds to ntoskrnl.exe.

Quantum - A unit of scheduling time for which a thread is given control over the processor. Usually expressed in 3 * clock ticks. Ranges between 10 milliseconds to 120 milliseconds. Why does it vary? We’ll see that below.

Now, there is one fundamental difference between how the Windows* kernels work when compared to most other kernels. Both NT and CE schedule ** threads** rather than processes i.e, the scheduler schedules the processor between threads rather than the processes which own these threads. One side effect of this that a process with more threads will tend to hog more of the CPU - there is no concept of ‘fairness’ between processes, only threads 1

NT

Let’s look at the NT kernel first. Most of the scheduler was written by Dave Cutler himself and until a week ago, he was still the owner of all the scheduling related code.2

Priority

Threads have different priorities. Theoretically, NT has 32 priorities but in practice, only 31 as priority 0 is reserved for the thread that zeroes out memory. Normal threads have priorities between 1 and 15. Priorities above 15 are called ‘realtime’ and are meant for time critical stuff. The realtime threads are also interesting in that they don’t get boosted or some of the other special treatment that normal threads too. Of course, they don’t need this special treatment given that they get so much preference in the processor.

When threads are created, they start off at the priority mapping to the priority class of their process. If you have access to the WRK, you can look at the gory details for yourself in base/ntos/ke/thredobj.c (KeStartThread)

Picking a thread to run

The entire section below comes from _base/ntos/ke/thredsup.c__ and _ _balmgr.c_

Threads are assigned quantums - the unit of time for which they can run on a processor. At the end of the quantum, the kernel needs to decide which thread to run next. This is not the only time the kernel needs to decide - the kernel needs to make similar scheduling decisions when a thread is waiting on something and that something becomes available. Or if the thread or some other thread decide to play around with thread priorities.

The kernel makes this scheduling decision based on a datastructure called the the ‘Dispatcher Ready List’. Shown below, the DRL basically lets the kernel do a weighted-round-robin algorithm. With multiple processors, there is one DRL per processor. The kernel stores the list heads of the individual thread queues in in the processor control block (PRCB).

The DRL is basically a 32 element array whose elements are the heads of a set of linked lists. These linked lists contain the list of ready threads at that particular priority. Below, you see 2 threads ready to execute at priority 1.

Update: I updated this blogpost to reflect that the array contains the list heads themselves rather than pointers to the list heads. Therefore, the diagram below is incorrect (the priority 1 box should be a node too and the circular link list should cycle back to that node. Will update the image sometime

Dispatcher Ready List

If you want to write some scheduler code and iterate through the DRLs, you would want to do something like this.

PRLIST_ENTRY ListHead;  
PRLIST_ENTRY NextEntry;  
  
for(int Priority=1; Priority<32;Priority++)  
{  
  ListHead = &Prcb->DispatcherReadyListHead[Priority];  
  NextEntry = ListHead->Flink;  
  
 do  
 {  
  
 //  
 // Do something interesting with the thread.Implement your new super-cool scheduling algorithm  
 //  
  
 NextEntry = NextEntry->Flink;  
  
 }while(NextEntry!=ListHead);  
  
  }  

NT picks the highest priority thread which is ready to run. Sounds simple and efficient doesn’t it? Well, it is - except that this doesn’t work out so well in some cases.

* In this mode, lower priority threads will never get a chance to execute (starvation) unless the higher priority threads finish executing. * An application running in the background receives as much of the processor as an application running in the foreground. You could have a game running full screen but it would receive the same importance as a background downloader. Not a good state of affairs!

Boosting

To make sure your operating system acts all snappy in such situations, NT uses a concept of ‘boosting’ and ‘decay’ to vary thread priorities at runtime. Note that thread priorities can only be varied if it is in the dynamic range (<=15). If your thread is above 15, then it’s like a kid at a party who has already got his share of the birthday cake but wants more - it doesn’t get any more special treatment.

NT boosts threads when an event happens. An event could be a mouse input or keyboard input. A few versions back, NT used to boost the threads of the foreground applications. However, that has now been switched to increasing the quantum of the foreground threads. They won’t run at higher priority - they’ll just run longer without interruption (possibly).

There’s still a problem here. What if you’re part of a background thread merrily doing some work and not getting any events? Do you wither away in a corner of the processor due to lack of scheduler time? This is where the balance set manager comes into the picture. One among its tasks is to come to the rescue of all such neglected threads.

The Balance Set Manager (see balmgr.c) is a kernel component that ensures fairness in the realm of the kernel. Once every second, it looks through the dispatcher ready lists for any thread that hasn’t run in around 4 seconds or so.3 . When it finds one such thread, it assigns it the highest dynamic priority available throughout the realm and puts it at the front of the ready list for that priority (priority 15). This pretty much ensures that the execution of that thread next in the absence of real-time threads. The balance set manager also helps to avoid priority inversion this way.

In a multiprocessor scenario, the decision-making process is slightly more complicated.

Win CE

The Windows CE kernel is an entirely different beast altogether. It has a completely different set of requirements when compared to the NT kernel. NT runs on desktops and serves with oodles of processor and RAM. WinCE runs on tiny devices with very little RAM, slow processors and needs to be ‘more’ real-time than the NT kernel when needed.

The Windows CE code is also available under a Shared Source license for academia. If you have Platform Builder installed, you already have the source on your machine. All the interesting stuff for this blogpost is under %_WINCEROOT%\private\winceos\coreos\nk\kernel\schedule.c .

The CE scheduler is similar to the NT scheduler in that it schedules threads instead of processes. However, there are a lot of differences, some of them quite subtle

* CE threads have 256 possible priorities. You can access this bigger set using CeSetThreadPriority and CeGetThreadPriority. The scheduling among the processes is done using the same weighted round-robin like mechanism that NT uses.See KcNextThread in schedule.c * Lower numbers mean higher priority i.e a thread at priority 0 is higher-priority than one at priority 1. * The thread priority has no relation to the process priority class. Unlike NT where a thread in a process can only be at a delta of 2 from the process’ priority class, you could have a process in CE which has one thread at priority 1 and one at priority 240. I’m not sure of the rationale behind this design - I need to ask someone on the CE team about this

However, unlike the NT kernel, it doesn’t do the same amount of boosting. It does boosting to some extent to avoid priority inversion, but it doesn’t have a balance set manager equivalent.

The reasoning is obvious - it makes sense to keep the CE code smaller given the more restricted hardware it runs on. Especially given the limited cache on these machines, you don’t want your scheduler code to cause cache misses!

But given that modern Windows Mobile operating systems are moving closer to the desktop model and with mobile processors becoming faster and with more RAM, it might make sense to bring more desktop scheduling concepts down on to Windows CE. This might be an interesting research project for the operating systems enthusiasts out there :-)

Notes:

  1. So what does process priority mean in NT (the column you see in task manager, the flags to CreateProcess , etc)? Process priority determines the priority class of the process. To over-simplify, it determines what priority the threads within the process start off with.

  2. Now that he is moving over to the Live effort, I’m not sure whether he’ll continue to own the code.

  3. This ‘4’ second limit is hard coded into the kernel. I’m not sure of the logic behind this number. My bet is that the number was arrived at after some empirical analysis


#
Follow @sriramk