On a recent mailing list discussion, Veer Wangoo asked: “DOT NET CLR only uses thread-based scheduling and does not support Fiber-mode scheduling; however, SQL Server can use Fiber-mode scheduling as well. How does SQL 2005 Handle this Limitation?” Answering this question turned into quite a deep dive, and I finally managed to answer it in about 2500 words over the weekend. Srini thought it was an interesting discussion and that I should share it with a larger group. So I am posting it here (after some small changes). Enjoy!
AFAIK, CLR 2.0 was supposed to support Fiber mode, but the feature did not make it into the release because of time constraints, and Fiber-mode support was dropped from the CLR. So SQLCLR does not have Fiber mode support either. Details on this can be found at http://blogs.msdn.com/dinoviehland/archive/2005/09/15/469642.aspx.
However, this question does lead to some interesting discussions around scheduling, fiber-mode and CLR hosting. Thinking about why they exist and how they work gives some fascinating insights into the working of SQL Server. But before I launch into this discussion, a caveat: Whatever I say here is my personal view and most of this is construction from what I have read / heard over the years. This may not be the real picture. Take this as a hypothesis.
The first question is, why does SQL Server have its own scheduler and that too non-preemptive?
Let’s think about life as SQL Server. As SQL Server, I get several requests to process. How do I process these requests? One model could be that I create one thread per request. Another model could be using a thread pool. It obviously does not make too much sense to go with the one thread / request model, since that can quickly get out of control. A thread-pool sounds like a good idea. But then, how big a thread-pool should I have? What’s an optimum number? Remember that the default stack size on 32-bit Windows is 1 MB, and the user address space is 2 GB. Assuming you do nothing else, that means around 2000 threads. With the 3 GB support, it can go up to 3000 threads. If I reduce the stack size, I can get more threads. (To find out how to do this for SQL, see http://support.microsoft.com/kb/q160683/) For a stack size of 256 KB, it would mean around 12000 threads with the 3 GB switch. So now if you are the guy writing SQL Server thread management, you think: Ok, so on the 32-bit platform, I will maybe hit a max limit of 12k threads, and a max of 32 CPUs. What is the ideal number of threads to work with and which CPUs do I schedule them on?
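The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes the entire user address space is available for thread stacks, which is never true in practice, and `max_threads` is a name made up for this illustration.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def max_threads(address_space_bytes, stack_size_bytes):
    """Upper bound on threads if the address space held nothing but stacks."""
    return address_space_bytes // stack_size_bytes


limits = {
    "2 GB address space, 1 MB stacks": max_threads(2 * GB, 1 * MB),
    "3 GB address space, 1 MB stacks": max_threads(3 * GB, 1 * MB),
    "3 GB address space, 256 KB stacks": max_threads(3 * GB, 256 * KB),
}

for label, count in limits.items():
    print(f"{label}: about {count} threads")
```

The exact figures (2048, 3072 and 12288) are where the rounded "2000, 3000 and 12000" come from.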
To answer this question, let’s see how the OS handles threads. To keep things simple, we can imagine that a thread can be in three states: Running, Waiting or Ready. If we don’t take hyper-threading and multi-core into account, on a single CPU, there is at any given time a single thread in the “Running” state. (This does not mean that the thread is necessarily doing some processing. It may simply be waiting for user-input, but it is still occupying the CPU). Per CPU, there is a “Waiting” queue and a “Ready” queue (Actually there are 32 “Ready” queues. More on that in a minute). A thread enters the “Waiting” queue when it voluntarily gives up CPU time (waiting for an event to be signaled, going to sleep, etc.). A thread enters the “Ready” queue (at the tail) when its time-slice, called its quantum, expires (I think one quantum = 2 clock ticks on XP and 12 on Server by default). Now the scheduler never preempts a thread of higher priority in favor of a thread of lower priority. Threads of the same priority are scheduled using a FIFO “Ready” queue, so within a priority level this is round-robin. Since there are 32 levels of thread priority, there are 32 Ready queues / CPU.
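To make the queueing model concrete, here is a toy simulation of one CPU with 32 FIFO “Ready” queues. It is a deliberately simplified sketch, not how the Windows dispatcher is implemented: there is no “Waiting” state, no priority boosting, and every thread uses its full quantum. The class and function names are invented for the example.

```python
from collections import deque

PRIORITY_LEVELS = 32


class ReadyQueues:
    """One FIFO queue per priority level for a single CPU."""

    def __init__(self):
        self.queues = [deque() for _ in range(PRIORITY_LEVELS)]

    def make_ready(self, name, priority):
        # A thread whose quantum expired goes to the tail of its own level.
        self.queues[priority].append(name)

    def pick_next(self):
        # Always serve the highest non-empty priority level first.
        for priority in range(PRIORITY_LEVELS - 1, -1, -1):
            if self.queues[priority]:
                return self.queues[priority].popleft(), priority
        return None


def run(ready, quanta):
    """Run for a number of quanta and return the order in which threads ran."""
    order = []
    for _ in range(quanta):
        picked = ready.pick_next()
        if picked is None:
            break
        name, priority = picked
        order.append(name)
        ready.make_ready(name, priority)  # quantum over: back to the tail
    return order


ready = ReadyQueues()
for name in ("A", "B", "C"):
    ready.make_ready(name, 8)
ready.make_ready("low", 4)

schedule = run(ready, 6)
print(schedule)
```

In this toy model the three threads at priority 8 take turns in round-robin order, and the thread at priority 4 never runs as long as they stay ready.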
From the perspective of a server, typically most requests would be served on the same thread priority level. This means that there will be round-robin preemption. So the scene is as follows: as requests arrive, the server creates a number of threads. One of the threads is executing, doing useful work, when the OS decides to preempt it and put it away, replacing it with some other thread (assume from the same server). Is this good or bad? From an OS perspective, it is doing the right thing. But from the server’s perspective, since it owns all threads, it would **NOT** like a thread to be preempted by another of its own threads. Why? Because the server knows that the thread is doing useful work. There is no point creating contention among your own threads. You incur the overhead of context-switching. For a small number of threads, this may still be negligible. But when you are looking at hundreds of threads, context-switching can murder perf.
Now why is context-switching such a perf killer? The issue is this: as more and more memory becomes available to the CPU, accessing that memory costs more and more CPU cycles. One of the ways CPUs address this problem is by creating caches on the CPU. The cache is a smaller, faster memory which stores copies of the data most frequently accessed from main memory. When the CPU wants to read / write data in main memory, it first checks whether the data is already in the cache. If the data is there, you have a cache-hit, otherwise it is a cache-miss. In the case of a hit, the data is read from or written to the cache line immediately. In the case of a miss, most caches first copy the data from main memory into the cache, which takes time. So in the case of a single cache, the data in the cache is typically the data being read / written by the thread executing on the CPU. When this thread is preempted and a new thread runs, the cached data is of little use to the new thread, so the cache is effectively invalidated. Obviously, if several threads take turns on the CPU, the cache-miss rate will be very high and this would hit perf in a big way.
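The effect is easy to reproduce in a toy model. The sketch below assumes a single, tiny, fully associative cache with LRU replacement (real caches are set-associative and much larger) and two threads that each loop over their own six-line working set. All names and sizes are made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny fully associative cache that only counts misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # hit: mark as most recently used
            return
        self.misses += 1
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line


def count_misses(schedule, capacity=8):
    """Replay a list of working-set passes and return the number of misses."""
    cache = LRUCache(capacity)
    for working_set in schedule:
        for address in working_set:
            cache.access(address)
    return cache.misses


thread_a = [f"a{i}" for i in range(6)]
thread_b = [f"b{i}" for i in range(6)]
PASSES = 4

# Each thread runs all its passes before the other one starts.
run_to_completion = [thread_a] * PASSES + [thread_b] * PASSES
# The threads are switched after every pass over their working set.
interleaved = [thread_a, thread_b] * PASSES

misses_without_switching = count_misses(run_to_completion)
misses_with_switching = count_misses(interleaved)
print(misses_without_switching, misses_with_switching)
```

Both schedules perform the same 48 accesses. Without switching, only the first pass of each thread misses (12 misses); with a switch after every pass, each thread keeps evicting the other's lines and, in this model, every access misses (48 misses).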
The story does not end here: modern-day CPUs come with multiple levels of cache. The reason is that there is a trade-off between cache size and speed. So most CPUs have a small, fast cache backed up by larger, slower caches. The idea is that the CPU checks the smallest, fastest cache (L1) first, and in case of a miss moves on to L2 and so forth. At the end of the day, your memory is a hierarchy going from L1, L2, … main memory. The more context switching there is, the further down this hierarchy you will have to go.
For a multi-CPU machine, cache invalidation can be even higher since threads may be scheduled on any of the processors. So T1 ran on CPU1 and built up a cache, and then got preempted by T2. Meanwhile CPU2 becomes free and T1 gets scheduled on CPU2. You get cache-misses, which reduce perf. With NUMA this gets worse: it’s not just about caches, it’s also about memory on one node vs. memory on another node. To prevent this, threads can be made to run with affinity. Soft affinity means the OS attempts to run T1 on CPU1, but would schedule T1 on CPU2 if CPU1 is unavailable. Hard affinity means T1 will only run on CPU1.
Bottom-line: Preemptive scheduling is great for giving every thread a chance to run and does not allow any thread to monopolize the system, but at the cost of context-switching which hits perf badly as the number of threads and CPUs increase.
So how does a server go about maximizing perf? The thing to remember is that you cannot do anything about contention with threads spawned by other processes. You need to be optimal about your own threads. In any case, at the scale at which these things start to matter, you are typically running on a dedicated box and consuming 95+% of CPU time. So if you are smart about managing your own threads, your perf would certainly improve.
Now what is the smartest way of avoiding context-switching? Simple, don’t make all threads schedulable. So let’s say you have 1000 threads, and 8 CPUs. You can choose to schedule just 8 threads, one for each CPU, and keep the remaining 992 in the “Waiting” state. Each scheduled thread is bound to run and not get preempted because there is no other ready thread on that CPU!! This also takes care of the affinity issue. For this, you would wrap the OS synchronization primitives in your own primitives, which your threads would use instead. So when a thread is about to block or is waking up from a block, you would be aware of it, and that would help you in releasing a thread to, or holding it back from, the CPU’s “Ready” queue.
Now it can happen that the thread may be waiting on something without having entered the “Waiting” queue. So you would typically write your server to release threads as a multiple of the number of CPUs. Too low a multiple can lead to the CPU not being utilized, and too high a number can lead to excessive context switching.
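One simple way to picture "releasing threads as a multiple of the number of CPUs" is a counting gate in front of the work. The sketch below is an assumption-laden illustration using Python's `threading` module, not how SQL Server does it: `run_gated`, the workload and the thread counts are all invented, and Python threads are only a stand-in for native server threads.

```python
import threading


def run_gated(total_threads, cpus, multiple=1):
    """Start many threads but let only cpus * multiple of them run at once.

    Returns the peak number of threads that were inside the gate together.
    """
    gate = threading.BoundedSemaphore(cpus * multiple)
    lock = threading.Lock()
    running = 0
    peak = 0

    def worker():
        nonlocal running, peak
        with gate:  # everyone else stays blocked here, i.e. "Waiting"
            with lock:
                running += 1
                peak = max(peak, running)
            sum(range(1000))  # stand-in for useful work
            with lock:
                running -= 1

    threads = [threading.Thread(target=worker) for _ in range(total_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak


peak_running = run_gated(total_threads=50, cpus=4, multiple=2)
print(f"peak runnable threads: {peak_running}")
```

However many threads exist, at most `cpus * multiple` of them are ever runnable at the same time; the `multiple` parameter is the knob discussed above, traded off between idle CPUs and extra context switches.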
SQL Server takes a similar approach. The UMS (User Mode Scheduler) is non-preemptive (since all threads are SQL spawned, they are “well-behaved” and include code that prevents them from monopolizing the system). There is one scheduler / CPU. The scheduler’s job in life is to ensure that there is one unblocked thread executing on that CPU at any given point of time, and all other threads are blocked. Threads are affinitized to one scheduler / CPU combo. Of course it is more complicated than this. If you want to understand this better, the best resource is Ken Henderson’s “The Guru’s Guide to SQL Server Architecture and Internals.”
Right, so now we understand why SQL does its own scheduling and what approach it takes. But what has that got to do with Fiber Mode and CLR? To answer this question, we need to understand why a Fiber Mode is there in the first place.
2. The Need for Fiber Mode:
First, what is a fiber? A Fiber is a user-mode, light-weight execution unit with its own stack, its own saved registers and, since Windows XP / 2003, its own Fiber Local Storage. The scheduling of Fibers is done by the app which creates the fibers. Multiple fibers can run on the same thread. Fibers were introduced for Unix programs which used their own threading libraries, and for servers like SQL which are very sensitive to context-switching.
Coming back to why we need the Fiber mode, we know that a server can ensure that there is only one unblocked thread running on a CPU at a given point of time. The question is, what happens to the other threads? They are in the “Waiting” queue from an OS perspective. What is the impact of a “Waiting” thread on the OS scheduler? Minimal. But if you are looking at thousands of threads over 8 CPUs or more, this overhead does not stay minimal. This is where fiber mode comes in.
When you switch to Fiber mode, you tell SQL Server that you want requests to be dispatched to fibers instead of OS threads. The moment this happens, the number of threads that the OS has to consider drops dramatically. In the idealized world we created earlier, there was one running thread / CPU. In Fiber mode, this is still the case: there is still one running thread / CPU, but now this thread has several stacks / contexts. When a stack is ready to run (and this is controlled by the user-mode scheduler in the server), the thread switches from the earlier stack to the new stack. The kernel mode scheduler in the OS is of course not aware of this and continues to execute the same thread. So irrespective of the Fiber mode or Thread mode, the UMS is non-preemptive. The difference is whether the OS is aware of the other tasks (yes in Thread mode, which leads to the overhead) or not (no in Fiber mode, which leads to better perf).
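The "one thread, several stacks" idea can be imitated with Python generators, each of which keeps its own suspended execution context. This is only an analogy: it does not use the Win32 fiber APIs (`ConvertThreadToFiber`, `CreateFiber`, `SwitchToFiber`), and `request` and `run_fibers` are hypothetical names for this sketch.

```python
import threading
from collections import deque


def request(name, steps, log):
    """A 'fiber': does one step of work, then voluntarily yields."""
    for step in range(steps):
        log.append((name, step, threading.get_ident()))
        yield  # cooperative switch point; nothing preempts us before this


def run_fibers(fibers):
    """User-mode, non-preemptive scheduler running on the calling thread."""
    ready = deque(fibers)
    while ready:
        fiber = ready.popleft()
        try:
            next(fiber)  # switch to this fiber's saved context
        except StopIteration:
            continue  # fiber finished, drop it
        ready.append(fiber)  # it yielded: back to the tail of the queue


log = []
run_fibers([request("r1", 2, log), request("r2", 2, log), request("r3", 1, log)])

order = [(name, step) for name, step, _ in log]
os_threads = {ident for _, _, ident in log}
print(order)
print(f"OS threads involved: {len(os_threads)}")
```

Three requests make progress in an interleaved fashion, yet the OS only ever sees a single thread, which is the property that makes Fiber mode cheaper for the kernel scheduler.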
So the thing to remember with Fiber Mode is that it can improve performance for a highly loaded server, but that applies only to large multi-CPU boxes.
3. What happens with SQLCLR?
Now what happens when SQL hosts the CLR in Fiber Mode? First let’s understand how CLR hosting works with CLR 2.0. The basic idea is as follows:
1) Load CLR into the process. (This is done by calling CorBindToRuntimeEx, which returns a pointer to ICLRRuntimeHost)
2) Tell CLR which aspects of code execution you would like the host to control: for example, memory allocation, thread scheduling and synchronization, assembly loading, garbage collection, etc. (This is done by implementing an interface called IHostControl. You pass your implementation of IHostControl to the CLR by calling ICLRRuntimeHost::SetHostControl)
3) For each of the tasks that you want the host to control, implement the corresponding managers. The managers are nothing but a set of interfaces that allow the CLR and the host to work with each other. This needs functionality inside the host which CLR could call to when it needs to do something – these are host-implemented interfaces. Similarly, the host may want CLR to do something, and that functionality is provided by CLR-implemented interfaces. Interfaces which are implemented by CLR are named as ICLRXXXX and the ones implemented by the host are called IHostXXXX.
So, for example, If a host wants to control how CLR works with memory, it needs to:
a) Inform the CLR that it intends to control memory by implementing IHostControl and calling ICLRRuntimeHost::SetHostControl with that implementation. CLR would discover whether you implement memory management when it calls IHostControl::GetHostManager with the IID of IHostMemoryManager.
b) Implement the IHostMemoryManager and IHostMalloc interfaces.
Now when CLR needs to do some memory management tasks, it would call the methods under IHostMemoryManager and IHostMalloc. So for example, if CLR needs to allocate virtual memory, instead of calling the Win32 VirtualAlloc, it would call IHostMemoryManager::VirtualAlloc. Or, if the CLR needs to know the memory load on the system, instead of calling the Win32 GlobalMemoryStatus, it will call IHostMemoryManager::GetMemoryLoad. This allows the host to control resources.
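The shape of this handshake can be sketched in a few lines of Python. To be clear, this is a conceptual model only: the real hosting API is a set of COM interfaces, and the classes below (`HostControl`, `HostMemoryManager`, `Runtime`), the string interface ids and the byte budget are all invented stand-ins for IHostControl, IHostMemoryManager and the CLR.

```python
class HostMemoryManager:
    """Stand-in for a host's IHostMemoryManager implementation."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.allocated = 0

    def virtual_alloc(self, size):
        # The host decides whether the runtime may have this memory.
        if self.allocated + size > self.budget_bytes:
            return None
        self.allocated += size
        return bytearray(size)

    def get_memory_load(self):
        # Memory load as the host sees it, as a percentage of its budget.
        return 100 * self.allocated // self.budget_bytes


class HostControl:
    """Stand-in for IHostControl: tells the runtime which managers exist."""

    def __init__(self, managers):
        self._managers = managers

    def get_host_manager(self, interface_id):
        return self._managers.get(interface_id)  # None means "not implemented"


class Runtime:
    """Stand-in for the CLR side of the contract."""

    def __init__(self):
        self._memory = None

    def set_host_control(self, host_control):
        # Discover once whether the host wants to control memory.
        self._memory = host_control.get_host_manager("IHostMemoryManager")

    def allocate(self, size):
        if self._memory is not None:
            return self._memory.virtual_alloc(size)  # ask the host
        return bytearray(size)  # no host manager: allocate directly


host_memory = HostMemoryManager(budget_bytes=1000)
runtime = Runtime()
runtime.set_host_control(HostControl({"IHostMemoryManager": host_memory}))

first = runtime.allocate(400)
second = runtime.allocate(700)  # would exceed the host's budget
print(first is not None, second is None, host_memory.get_memory_load())
```

The point of the indirection is that the host, not the runtime, gets the final say: the second allocation is refused because it would exceed what the host is willing to hand out.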
Ok, that was your three-step guide to CLR hosting. Back to SQL-CLR in Fiber Mode.
Prior to CLR 2.0, CLR used to assume that the execution unit was always a thread and that the scheduling was done by the OS and was preemptive. With CLR 2.0, the design goal was to support fibers too, since a lot of hosts including SQL Server use Fibers to process requests. For this, the CLR has to be insulated from the details of the unit of execution (thread / fiber) and its scheduling (preemptive / co-operative). To address this, an abstraction called “Task” is provided, which is formalized in the IHostTask / ICLRTask pair of interfaces.
IHostTask allows the *CLR to ask the host* to start, join or awaken a task. It also provides a method to associate an ICLRTask instance with the IHostTask instance. ICLRTask allows the *host to ask the CLR* to abort a task, exit a task, get info about the task, let the CLR know that the task is entering or leaving an operable state, etc. Note that we use the term Task, and it applies both to Threads and Fibers. It is up to the Host whether the task is a Fiber or a Thread; the CLR does not know. The interfaces which allow a host to work with Tasks are IHostTaskManager and ICLRTaskManager. IHostTaskManager allows the CLR to ask the host to create tasks and provides equivalents of Windows APIs for managing these tasks. In order to support Threads as well as Fibers, a host will typically have two implementations of each of the host-side interfaces described above: one for Fibers and the other for Threads. So you may have CHostThread and CHostFiber for IHostTask.
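A rough Python sketch of the Task abstraction follows. It is a hypothetical model, not the COM interfaces themselves: `HostTask`, `ThreadTask`, `FiberTask` and `HostTaskManager` loosely play the roles of IHostTask, CHostThread, CHostFiber and IHostTaskManager, and the "fiber" is simply work run cooperatively on the calling thread.

```python
import threading
from abc import ABC, abstractmethod
from collections import deque


class HostTask(ABC):
    """What the runtime sees: a task it can start and join, nothing more."""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def join(self): ...


class ThreadTask(HostTask):
    """Thread mode: each task is backed by its own OS thread."""

    def __init__(self, entry):
        self._thread = threading.Thread(target=entry)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()


class FiberTask(HostTask):
    """Fiber mode stand-in: tasks run cooperatively on the joining thread."""

    def __init__(self, entry, ready):
        self._entry = entry
        self._ready = ready
        self.done = False

    def start(self):
        self._ready.append(self)  # just mark the task as ready to run

    def join(self):
        while not self.done:  # drive ready tasks until ours has finished
            task = self._ready.popleft()
            task._entry()
            task.done = True


class HostTaskManager:
    """The host decides what a 'task' really is."""

    def __init__(self, fiber_mode):
        self.fiber_mode = fiber_mode
        self._ready = deque()

    def create_task(self, entry):
        if self.fiber_mode:
            return FiberTask(entry, self._ready)
        return ThreadTask(entry)


def runtime_work(task_manager):
    """'CLR side' code: identical regardless of the host's choice."""
    seen = []
    tasks = [
        task_manager.create_task(lambda: seen.append(threading.get_ident()))
        for _ in range(3)
    ]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return seen


thread_ids = runtime_work(HostTaskManager(fiber_mode=False))
fiber_ids = runtime_work(HostTaskManager(fiber_mode=True))
print(len(thread_ids), len(set(fiber_ids)))
```

`runtime_work` never learns which kind of task it was given: in thread mode the work runs on other OS threads, while in fiber mode all three tasks run on the caller's thread.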
As we discussed earlier, a host which decides to do its own scheduling would also want to manage synchronization. For this purpose there is IHostSyncManager. This provides equivalents of the Win32 synchronization primitives: auto-reset events, critical sections, manual-reset events, semaphores, etc. So if the CLR wants to create a critical section and our host implements IHostSyncManager, the CLR would call IHostSyncManager::CreateCrst instead of the Win32 API. There is also a corresponding ICLRSyncManager, which allows the host to get info about the requested tasks and detect deadlocks.