Leverage Multi-Core Processors for High-Performance Embedded Systems
To choose the best multiprocessing platform, designers must focus on the problems that they're trying to solve while using an OS that adequately supports every model.
By Paul N. Leroux and Roger Craig, QNX Software
Much ink has been spilled over how multi-core processors will usher in a new era of performance for desktops, laptops, corporate servers, and home media centers. If anything, however, the benefits for embedded systems are even greater. Network elements, medical test systems, industrial-control applications, and even in-car infotainment units are all growing in complexity. They have a voracious appetite for computational power. Nevertheless, many of these systems also must satisfy rigorous requirements for low weight, low power consumption, low heat dissipation, or any combination thereof. Multi-core processors, such as the Intel Core 2 Duo processor, address these requirements head on. Compared to their uniprocessor counterparts, they provide much greater processing capacity per ounce, watt, and square inch.
In effect, multi-core processors are multiprocessing systems-on-a-chip. Unfortunately, few embedded developers have experience in multiprocessing systems. Plus, the vast majority of legacy code in embedded devices was designed for uniprocessor--not multiprocessor--environments. Consequently, embedded developers face a migration challenge. They must graduate from a serial execution model, in which software tasks take turns running on a single processor, to a concurrent execution model. In the concurrent model, multiple software tasks run simultaneously. The more concurrency that developers can achieve, the better their multi-core systems will perform.
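To make the shift from serial to concurrent execution concrete, here is a minimal sketch in C using POSIX threads. It splits one summation into slices that can run simultaneously on separate cores. The names parallel_sum, sum_slice, struct slice, and MAX_WORKERS are illustrative assumptions and do not come from any QNX API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16   /* hypothetical upper bound for this sketch */

/* One slice of the input plus a slot for that slice's partial result. */
struct slice {
    const int *data;
    size_t     count;
    long       partial;
};

/* Worker: sums its own slice and writes only its own result slot. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long total = 0;

    for (size_t i = 0; i < s->count; i++)
        total += s->data[i];
    s->partial = total;
    return NULL;
}

/* Sums data[0..n-1] using up to MAX_WORKERS concurrent threads. */
long parallel_sum(const int *data, size_t n, int workers)
{
    pthread_t    tid[MAX_WORKERS];
    struct slice slices[MAX_WORKERS];
    long         total = 0;

    if (workers < 1)
        workers = 1;
    if (workers > MAX_WORKERS)
        workers = MAX_WORKERS;

    size_t chunk = n / (size_t)workers;
    for (int i = 0; i < workers; i++) {
        size_t start = (size_t)i * chunk;

        slices[i].data = data + start;
        /* The last worker also takes any remainder. */
        slices[i].count = (i == workers - 1) ? n - start : chunk;
        slices[i].partial = 0;

        int rc = pthread_create(&tid[i], NULL, sum_slice, &slices[i]);
        assert(rc == 0);
        (void)rc;
    }

    /* Joining each worker makes its partial result safe to read. */
    for (int i = 0; i < workers; i++) {
        pthread_join(tid[i], NULL);
        total += slices[i].partial;
    }
    return total;
}
```

A call such as parallel_sum(values, 1000, 4) divides the work among four threads. Because no worker touches another worker's slot, the slices need no locking, which is the kind of concurrency a multi-core system can exploit most easily.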
To begin, developers must select the appropriate form of multiprocessing for their application requirements. More than anything else, this choice will determine how easily both new and existing code can achieve maximum concurrency. As Table 1 illustrates, developers have three basic forms to choose from: asymmetric, symmetric, and bound multiprocessing.

AMP: A Familiar Environment
Asymmetric multiprocessing (AMP) provides an execution environment similar to that of conventional uniprocessor systems. Consequently, it offers a relatively straightforward path for porting legacy code. It also allows developers to directly control how each CPU core is used. Typically, it also works with standard debugging tools and techniques.
AMP can be homogeneous, where each core runs the same type and version of operating system (OS). Or it can be heterogeneous, which means that each core runs either a different OS or a different version of the same OS. In a homogeneous environment, developers can make best use of the multiple cores by choosing an OS, such as the QNX Neutrino real-time operating system (RTOS), that supports a distributed programming model. If it is properly implemented, the model will allow applications running on one core to communicate transparently with applications and services (device drivers, protocol stacks, etc.) on other cores. Yet it will eliminate the high CPU utilization imposed by traditional forms of interprocessor communication.
A heterogeneous environment has somewhat different requirements. In this case, the developer must either implement a proprietary communications scheme or choose two OSs that share common protocols (likely IP-based) for interprocessor communications. To help avoid resource conflicts, the two OSs should provide standardized mechanisms for accessing shared hardware components.
AMP is useful for many applications--especially legacy code. Yet it can result in the underutilization of processor cores. In most cases, for example, if one core becomes busy, applications running on that core cannot migrate to a core that has more CPU cycles available. While such dynamic migration is possible, it typically involves complex checkpointing of the application's state. In addition, it can result in a service interruption while the application is stopped on one core and restarted on another. This migration becomes even more difficult--if not impossible--if the cores use different OSs.
In AMP, neither OS "owns" the whole system. Consequently, the application designer--not the OS--must handle the complex task of managing shared hardware resources. Such resources include physical memory, peripheral usage, and interrupt handling. Resource contention can crop up during system initialization or normal operations as well as on interrupts and when errors occur. The application designer must design the system to accommodate all of these scenarios. The complexity of this task increases significantly as more cores are added, making AMP ill suited for processors that integrate more than two cores.
SMP: Transparent Resource Management
Allocating resources in a multi-core design can be difficult--especially when multiple software components have no knowledge of how other components use those resources. Symmetric multiprocessing (SMP) addresses this issue by running only one copy of an OS on all of the chip's cores. Because the OS has insight into all system elements at all times, it can:
- transparently allocate shared resources on the multiple cores with little or no input from the application designer
- dynamically schedule any thread or application to run on any available processor core, thereby allowing every core to be utilized as fully as possible
- provide dynamic memory allocation, allowing all cores to draw on the full pool of available memory without a performance penalty

Because a single OS controls every core, all intercore IPC is considered local. This approach can improve performance dramatically, as the system no longer needs a networking protocol to implement communications between applications running on different cores. Communications and synchronization can take the form of simple POSIX primitives, such as semaphores, or a native local transport capability like QNX distributed processing. Both forms offer higher performance than networking protocols.
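As a rough illustration of synchronization with a simple POSIX primitive, the sketch below hands one integer from a producer thread to the calling thread by way of an unnamed semaphore. Under SMP the two threads may well run on different cores, yet no networking protocol is involved. The names struct mailbox, producer, and handoff are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical one-slot mailbox shared by two threads. */
struct mailbox {
    sem_t ready;     /* posted once payload holds valid data */
    int   to_send;   /* value the producer will publish */
    int   payload;   /* slot read by the consumer */
};

/* Producer: publishes the value, then signals the consumer. */
static void *producer(void *arg)
{
    struct mailbox *box = arg;

    box->payload = box->to_send;
    sem_post(&box->ready);
    return NULL;
}

/* Starts a producer thread, waits for its signal, returns the payload. */
int handoff(int value)
{
    struct mailbox box = { .to_send = value, .payload = 0 };
    pthread_t tid;
    int rc;

    rc = sem_init(&box.ready, 0, 0);   /* thread-shared, initially 0 */
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, producer, &box);
    assert(rc == 0);
    (void)rc;

    /* Block until the producer posts; retry if a signal interrupts. */
    while (sem_wait(&box.ready) == -1 && errno == EINTR)
        continue;

    int received = box.payload;

    pthread_join(tid, NULL);
    sem_destroy(&box.ready);
    return received;
}
```

Calling handoff(7) returns 7 once the producer has posted the semaphore. The semaphore both orders the two threads and makes the producer's write visible to the consumer, so the same code behaves correctly whether the threads share a core or not.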
In Figure 1, the control plane for a network element is running in SMP mode. The OS can dynamically schedule any process, such as the command-line interface (CLI), on any core. Nonetheless, a well-designed SMP OS will always try to dispatch a thread to the core where the thread last ran. That way, the core can often fetch the thread's instructions directly from the L1 cache, rather than having to reload them from the L2 cache or main memory.
As an added benefit, SMP allows system-tracing tools to gather operating statistics for the multi-core chip as a whole. This aspect gives developers valuable insight into how to optimize and debug applications. For instance, the system profiler in the QNX Momentics development suite can track thread migration from one core to another as well as OS primitive usage, scheduling events, core-to-core messaging, and other information--all with high-resolution time stamping. In AMP, developers have to gather this information separately from each core and then somehow combine it for analysis. Figure 2 shows the system profiler being used to analyze a quad-core SMP system.
Though it offers many advantages, SMP isn't a panacea. In particular, legacy applications with poor synchronization among threads may work incorrectly in the truly concurrent environment provided by SMP. This issue may not present a problem with software developed in house. Yet it can create difficulties when a system must support software from multiple third-party suppliers.
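The synchronization problem is easiest to see with a shared counter. On a uniprocessor, legacy code can sometimes get away with an unprotected read-modify-write because only one thread runs at any instant. Under SMP, two threads can execute the update at the same moment on different cores. The sketch below shows the conventional remedy, an explicit mutex around the shared update. The names count_with_lock, bump, struct shared_counter, and MAX_THREADS are illustrative assumptions.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16   /* hypothetical upper bound for this sketch */

struct shared_counter {
    pthread_mutex_t lock;
    long            value;
    int             iterations;
};

/* Each thread increments the shared counter under the mutex. */
static void *bump(void *arg)
{
    struct shared_counter *c = arg;

    for (int i = 0; i < c->iterations; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;                 /* read-modify-write, now serialized */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Runs `threads` concurrent incrementers and returns the final count. */
long count_with_lock(int threads, int iterations)
{
    pthread_t tid[MAX_THREADS];
    struct shared_counter c;
    int rc;

    if (threads < 1)
        threads = 1;
    if (threads > MAX_THREADS)
        threads = MAX_THREADS;

    rc = pthread_mutex_init(&c.lock, NULL);
    assert(rc == 0);
    c.value = 0;
    c.iterations = iterations;

    for (int i = 0; i < threads; i++) {
        rc = pthread_create(&tid[i], NULL, bump, &c);
        assert(rc == 0);
    }
    (void)rc;

    for (int i = 0; i < threads; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&c.lock);
    return c.value;
}
```

With the lock in place, count_with_lock(4, 100000) yields exactly 400000 on any number of cores. Without the lock, the total could fall short whenever two cores interleave their updates, which is the kind of latent defect that surfaces when poorly synchronized code moves to SMP.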

BMP: Transparent Management Plus Developer Control
Bound multiprocessing (BMP) is a new approach pioneered by QNX Software Systems. It combines the transparent resource management of SMP with the developer control of AMP. Like SMP, BMP uses a single copy of the OS to maintain an overall view of all system resources. BMP goes beyond SMP, however, by allowing developers to "lock" any application to a specific core. This approach does the following:
- allows legacy applications written for uniprocessor environments to run correctly in a concurrent multi-core environment without modifications
- allows legacy applications to coexist with newer applications that take full advantage of the concurrent processing and dynamic load balancing enabled by multi-core hardware
- eliminates the processor-cache "thrashing" that can sometimes reduce performance in an SMP system by allowing applications that share the same data to run exclusively on the same core
- enables simpler application debugging than traditional SMP by restricting all execution threads within an application to run on a single core
In Figure 3, a medical system is running in BMP mode on a quad-core processor. One core handles data acquisition while another core handles graphics rendering. Yet another core handles the human-machine interface while the fourth core tackles other database and data-processing operations. While these applications are each locked to a specific core, BMP also allows non-locked applications to be dynamically scheduled on whichever core has the most available CPU cycles.
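The idea of locking software to a core can be sketched with CPU affinity calls. The example below is not the QNX BMP interface. It uses the Linux-specific sched_getaffinity and sched_setaffinity calls purely as a stand-in, and the function names pin_to_first_allowed_cpu and allowed_cpu_count are hypothetical.

```c
#define _GNU_SOURCE       /* Linux/glibc affinity API; an assumption of this sketch */
#include <assert.h>
#include <sched.h>

/* Returns how many CPUs the calling thread may run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t allowed;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    return CPU_COUNT(&allowed);
}

/*
 * Restricts the calling thread to the first CPU it is currently
 * allowed to use. Returns that CPU's index, or -1 on failure.
 */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    /* A running thread always has at least one permitted CPU. */
    assert(CPU_COUNT(&allowed) > 0);

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            /* pid 0 means "the calling thread". */
            if (sched_setaffinity(0, sizeof(target), &target) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

After pin_to_first_allowed_cpu() succeeds, allowed_cpu_count() reports 1, and the scheduler keeps the thread on that core while leaving unpinned threads free to run wherever cycles are available. That mirrors, in a loose way, the mix of locked and non-locked applications described above.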

A Matter of Choice
Should a developer choose AMP, SMP, or BMP? The answer depends, of course, on the problem that the developer is trying to solve. It's therefore important that an OS offers robust support for each model, giving developers the flexibility to choose the best form of multiprocessing for the job at hand. Although AMP works well with legacy applications, it has limited scalability beyond two cores. SMP offers transparent resource management, but it may not work with software designed for uniprocessor systems. BMP offers many of the same benefits as SMP, but it also allows uniprocessor applications to behave correctly, simplifying the migration of legacy software. As Table 2 illustrates, an OS that supports all three models enables developers to strike the optimal balance between performance, scalability, and ease of migration.

Dr. Robert Craig is Senior Software Engineer for the OS Kernel Group at QNX Software Systems. Craig holds a Bachelor of Science in Computer Science and Physics and a Doctorate in Physics with a focus on optical computing technologies.

Paul Leroux is a Technology Analyst at QNX Software Systems. His interests include high-availability design, multiprocessing systems, and OS-kernel architecture.












