SMP: Two Processors and Beyond

Adding CPUs can be successful if handled with care

By Mario Charest

Sometimes you just need more power than a single processor can provide at a reasonable cost. Symmetric multiprocessing allows you to add a second CPU to your system, or perhaps even a third or a fourth, to increase the available processing power. But for this strategy to work properly, you need to design your software to access the additional power.

Symmetric multiprocessing (SMP) is becoming an extremely cost-effective way of increasing a system's computing power. Often, two slower processors cost significantly less than a single processor twice as fast, and multiprocessing has the potential to make those two processors perform as well as the faster one when running multiple tasks. In the past, SMP hardware hasn't been well suited to embedded systems because of the extra space and heat dissipation associated with the second processor. With the appearance of single chips carrying multiple CPUs, however, this is changing.

There are different ways of handling multiprocessing. A networked system, for instance, can share tasks among the different computers in the network. But if there is a lot of information to be shared, you can end up handling more data than the network can support. In an SMP system, each CPU has access to the same memory, I/O ports, and other hardware, so sharing information is easy. In general, however, SMP won't make a single process run faster unless it was written specifically for SMP.

The first thing you need for SMP, then, is an operating system that can support the technology. Fortunately, there are a good number that are up to the task, including Windows NT, Windows 2000, Linux, VxWorks, and QNX Neutrino. Other operating systems may not support SMP, however. Win98 is not SMP-capable, for instance; even if you have 16 CPUs in a Win98 machine, only the first CPU will be used.

The operating system alone isn't the whole answer, however. There are hardware and software design considerations that can affect the performance gains achieved by adding processors. One prime concern is the memory bottleneck: two CPUs in an SMP system cannot access main memory at the same time. They either have to take turns with access, or rely on memory caching to reduce contention.

Memory bottlenecks

The program in Figure 1 is a very crude example that demonstrates the memory-bottleneck effect. The program fills 50 Mbytes of RAM with zeros 50 times. When this program is run once on a dual-500 MHz Celeron processor system, execution time is 14.3 seconds with a memory throughput of 175 Mbytes/sec. Running the program twice, once on each of the two processors simultaneously, the respective execution times are now 24 seconds with a throughput of 100 Mbytes/sec each. In that case, the overall throughput is 200 Mbytes/sec - a negligible 12.5 percent increase.

Figure 1: This simple routine overfills the CPU cache to help illustrate the memory bottleneck in multiprocessing systems.
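Figure 1 is not reproduced with this text. Based on the description, a minimal sketch of such a routine might look like the following; the function name, parameter names, and exact sizes are assumptions, not the article's original code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Crude memory-throughput probe in the spirit of Figure 1: fill a
 * block of fill_size bytes with zeros, loop_count times, and report
 * the resulting write throughput in Mbytes/sec.  A block much larger
 * than the CPU cache forces main-memory traffic on every pass. */
double fill_throughput(size_t fill_size, int loop_count)
{
    char *block = malloc(fill_size);
    double secs, mbytes;

    if (block == NULL)
        return -1.0;

    clock_t start = clock();
    for (int i = 0; i < loop_count; i++)
        memset(block, 0, fill_size);    /* touch every byte */
    secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(block);

    mbytes = (double)fill_size * loop_count / (1024.0 * 1024.0);
    return mbytes / (secs > 0.0 ? secs : 1e-9);
}
```

The article's first test corresponds roughly to `fill_throughput(50 * 1024 * 1024, 50)`; shrinking `fill_size` until the block is cache-resident reproduces the best-case scenario discussed below.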

It’s clear that SMP is of little benefit here. This is a worst-case scenario with a data block that won’t fit into the Celeron processor’s cache, resulting in few cache hits and forcing the two processors to share access to the main memory.

If we go to a best-case scenario, however, say by setting the fill size to 50,000, the data block should now fit in the Celeron's cache and the system will not be overly affected by the memory bottleneck. To force the program to run for the same duration as in our first test, we have to increase the loop count to 250,000. Running a single instance of the program now takes 10.6 seconds at 1180 Mbytes/sec. When run simultaneously, two programs take 10.6 sec at 1180 Mbytes/sec, for a total throughput of 2360 Mbytes/sec. In this case, because each CPU needed very little main memory access, the two programs running side by side had no effect on one another and performance doubled.

Of course these are extreme examples that rarely happen in real life. They do, however, provide some insight into the basic nature of SMP and how it can help you with your design. They also demonstrate that it’s next to impossible to estimate the performance gain that will be obtained by moving to SMP. The gain is always software dependent.

One of the prime factors in determining the gain is how much of the software can fit into the CPU cache. The cache’s job is to hold the data most recently accessed, or most likely to be needed. When it comes to SMP, the kernel has to make intelligent use of that cache to reduce the chances of a memory bottleneck without working against other multiprocessing activities.

For example, one thing the kernel can do is to move a thread based on CPU availability in an attempt to keep every CPU busy. However, thread migration can cause cache problems. If a thread's data is in the first CPU's cache when the thread moves to a different CPU, that data has to be invalidated and written back to memory. The second CPU's cache then holds nothing related to the thread, so the thread's data has to be fetched from main memory all over again.

Usually, system designers don't really have to deal with cache details; those intricacies are left to the kernel and compiler. Kernels such as the Neutrino SMP kernel, for instance, use heuristics to find the right balance between CPU utilization and thread-migration cache problems. Still, there are things designers can do during software development that can help avoid problems.

Simple tricks

Let's go back to the example. The array is 256,000 bytes; it won't fit in a Celeron's cache. When memset() executes, only a portion of the array can sit in the cache. By the time memset() is setting the last byte of the array (array[255999]), the byte at array[0] is no longer in the cache. The for loop then starts, and its first step is to access array[0] - but array[0] is not in the cache anymore. A cache miss occurs and time is lost.

The same principle applies to smaller blocks of data that may not fit in the cache's available space at any given moment. In an OS like Neutrino, you have other things happening, like interrupt handlers, threads, and so on. A 16K block of data may have been half flushed to make room for an interrupt handler. Even if the data should fit in the cache, it may not always all be there.

One way to avoid this problem is to write and use a memset_r() function that fills the data backward, ensuring that the first array access after the fill is a cache hit. It's a simple fix: to apply it to existing code, just replace the existing memset() calls with the new memset_r().
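Such a memset_r() could be implemented as follows (the name is the article's; the body is a straightforward sketch):

```c
#include <stddef.h>

/* memset_r: like memset(), but fills the block from the end toward
 * the beginning.  After it returns, the start of the block is the most
 * recently touched data, so a loop that immediately reads buf[0]
 * begins with a cache hit instead of a miss. */
void *memset_r(void *buf, int c, size_t n)
{
    unsigned char *p = (unsigned char *)buf;

    while (n > 0)
        p[--n] = (unsigned char)c;
    return buf;
}
```

Because it takes the same arguments and returns the same pointer as memset(), it is a drop-in replacement.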

Here's another cache tip: try to process data in small packets. With multiprocessing, it's typically faster to process smaller blocks of data many times than to process one huge block, because of the reduced chances of a cache miss. It's worth mentioning that in an SMP system the two processors together have twice the amount of cache as the single, faster processor. More cache is always better.

One hidden challenge in multiprocessing lies in the scheduling of tasks. Multi-tasking is often conceptualized as running multiple programs at the same time, but on a single processor that's not actually the case - the CPU is simply shared among the different programs. On SMP machines, however, threads really do run at the same time. This can cause unexpected problems. For instance, a thread at priority 1 can run at the same time as a thread at priority 63, because each runs on a different processor. Thus, you cannot rely on priority to create critical sections in an SMP design - you'll get bitten.
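On SMP, only an explicit synchronization primitive protects shared data. As a minimal sketch (the names here are illustrative, not from the article), a mutex does what priority cannot:

```c
#include <pthread.h>

/* Shared counter protected by a mutex.  On a uniprocessor a designer
 * might have relied on the updating thread's higher priority to keep
 * the update atomic; on SMP both threads can execute simultaneously,
 * so only the lock guarantees mutual exclusion. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count = 0;

void bump_count(void)
{
    pthread_mutex_lock(&lock);    /* works regardless of priority or CPU */
    shared_count++;
    pthread_mutex_unlock(&lock);
}

long read_count(void)
{
    long v;

    pthread_mutex_lock(&lock);
    v = shared_count;
    pthread_mutex_unlock(&lock);
    return v;
}
```

Any thread on any CPU that calls bump_count() is serialized through the mutex, no matter what priority it runs at.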

Another example of an SMP scheduling challenge is the FIFO scheduling algorithm of QNX 6. If a group of programs is running at the same priority in FIFO mode, then while one of those programs is executing on the CPU, the others in the group will not get CPU time until the running program relinquishes it. A designer might rely on this attribute of FIFO to implement data protection with no extra overhead. On an SMP machine, however, FIFO is your worst enemy. It simply will not work across multiple CPUs. Two programs of the same priority, running in FIFO mode, stand a very high chance of being scheduled on different CPUs and thus running at the same time.

Unexpected behaviors

Of course, there is a way to get around this problem. With Neutrino, it’s possible to force a process to run on a specific CPU. Thus a group of programs running in FIFO mode could be forced to run on a common CPU. Unfortunately, this approach has negative side effects; it can introduce behaviors that you might misinterpret as bugs.
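On Neutrino the binding is done with ThreadCtl() and a run mask. As a compilable illustration of the same idea, here is the equivalent operation using the Linux thread-affinity call - a sketch of the concept, not Neutrino code, and the function name is invented:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU so that, for example, a group
 * of FIFO-scheduled threads all compete for the same processor and the
 * uniprocessor FIFO guarantees hold again.  Returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Forcing every member of the FIFO group through pin_to_cpu() restores the "one runs, the others wait" behavior, at the cost of giving up one CPU's worth of parallelism for that group.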

I once spent three days searching for a bug that didn't exist. I wrote a program to handle data logging. To make sure logging wouldn't interfere with the simulation, the logger was composed of two threads. The main thread was a resource manager handling the write operations performed by the client programs; it stuffed the data into a circular buffer. The other thread, set at a low priority, flushed the circular buffer and wrote the contents to the hard disk. That way, clients would never block waiting for the logger to write to disk; they could only be forced to wait if the circular buffer filled up.
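The logger's core data structure can be sketched as follows. All names and sizes are hypothetical (the real program was a QNX resource manager), but the division of labor is the same: one thread fills the ring, the other drains it:

```c
#include <pthread.h>

/* Minimal circular buffer shared by two threads: the resource-manager
 * thread calls log_put() with client data; the low-priority writer
 * thread calls log_get() and writes whatever it drains to disk. */
#define BUF_SIZE 4096

static char ring[BUF_SIZE];
static size_t head, tail;        /* head: next write slot, tail: next read slot */
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns bytes accepted; 0 means the buffer is full and the client waits. */
size_t log_put(const char *data, size_t len)
{
    size_t stored = 0;

    pthread_mutex_lock(&ring_lock);
    while (stored < len && (head + 1) % BUF_SIZE != tail) {
        ring[head] = data[stored++];
        head = (head + 1) % BUF_SIZE;
    }
    pthread_mutex_unlock(&ring_lock);
    return stored;
}

/* Drains up to len bytes into out; returns how many were drained. */
size_t log_get(char *out, size_t len)
{
    size_t drained = 0;

    pthread_mutex_lock(&ring_lock);
    while (drained < len && tail != head) {
        out[drained++] = ring[tail];
        tail = (tail + 1) % BUF_SIZE;
    }
    pthread_mutex_unlock(&ring_lock);
    return drained;
}
```

On a uniprocessor the low-priority drain thread only runs when the system is otherwise idle, so the buffer fills under load; on SMP, as the story below shows, the drain thread runs continuously on the second CPU.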

I started to test the program piece by piece; everything looked good and I wasn’t expecting any trouble. It was, after all, a simple program. Then it came time to test the “buffer full” case and to make sure my index math was correct. I started putting printfs here and there to watch over the indexes before writing to disk. This allowed me to test the overhead of the threads, compared to having the client writing directly to the hard disk.

The indexes didn’t look right. I checked and rechecked everything, assuming I had forgotten to create a critical section, but to no avail. After much wasted effort, I recognized the problem. Because the program was on an SMP machine, both the resource manager and the disk-write threads were running at the same time. The circular queue never held more than a few kilobytes of data because the data in the buffer was processed (thus removed) as fast as it could be filled.

Facing reality

Like any other tool, SMP is not the solution to all performance-related problems, but it’s sure worth investigating. Just make sure in the process that the operating system supports SMP and the software tolerates the simultaneous execution of different threads. Handled correctly, SMP can definitely boost system performance at a minimal cost.

Mario Charest is a software consultant working in Quebec, Canada. He has ten years of experience with multiprocessing and QNX4, and has been working with QNX6 since the early days of Neutrino 1.0 beta. He also has extensive experience deploying systems in industrial environments throughout the world.