Multi-Coprocessing Can Complement Multi-Processing Platforms

Multi-coprocessor architectures provide a way for microprocessor experts and novices alike to develop derivative designs predictably.

By Richard Taylor

Many software-development considerations affect multiprocessor system design. An alternative approach—namely application-optimized programmable coprocessing—promises to meet or beat the performance of multiprocessor architectures. At the same time, it maintains a standard, single-processor software-development methodology. Application-optimized programmable coprocessing enhances system performance by adding both instruction-level parallelism and parallel hardware resources that speed the execution of compute-intensive software. An application-optimized coprocessor is easily synthesizable. It requires no microprocessor design expertise. In addition, it obviates the software redevelopment and hardware implementation effort necessitated by the deployment of additional microprocessors. The coprocessor operates under the control of the central processing unit (CPU) and the CPU’s real-time operating system (RTOS). Moreover, the programmable coprocessor’s performance often eliminates the need to laboriously convert some compute-intensive software routines to fixed-function hardware implemented in either an on-board FPGA or on-chip functional blocks.

Multiprocessor Software
The variable efficacy of traditional multiprocessor architectures is illustrated by data recently published by a leading Japanese semiconductor manufacturer. A software routine executed 1.95 to 2.83 times faster on four microprocessors than on one--a variation of 45%. Naturally, the variation depended on the efficiency of the software partitioning and the communications overhead.
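These figures are consistent with Amdahl's law, which caps the speedup on N processors when only a fraction p of the runtime parallelizes. The C sketch below is purely illustrative--it folds partitioning efficiency and communications overhead into p--and inverts the law to recover the parallel fraction implied by each published speedup:

#include <stdio.h>

/* Amdahl's law: speedup on n processors when a fraction p of the
   runtime is perfectly parallelizable and all other losses are
   ignored. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Inverted: what parallel fraction p explains an observed
   speedup s on n processors? */
static double parallel_fraction(double s, int n)
{
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / (double)n);
}

int main(void)
{
    double p_low  = parallel_fraction(1.95, 4);   /* ~0.65 */
    double p_high = parallel_fraction(2.83, 4);   /* ~0.86 */

    printf("implied parallel fractions: %.2f to %.2f\n", p_low, p_high);
    printf("round trip: %.2fx to %.2fx on 4 CPUs\n",
           amdahl_speedup(p_low, 4), amdahl_speedup(p_high, 4));
    return 0;
}

The two published speedups correspond to parallel fractions of roughly 65% and 86%--a concrete measure of how strongly multiprocessor throughput depends on partitioning quality.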

Such variability makes it difficult to predict the required processing resources. It also introduces the risk that additional resources must be deployed after fundamental architectural decisions have been implemented. Moreover, the reuse of legacy software doesn’t reduce this risk if it’s being migrated from single-processor to multiprocessor operation. After all, software redevelopment for multi-threaded execution introduces new uncertainties. In any case, the addition of new software generally necessitates re-partitioning of the entire software suite to achieve the requisite throughput, thus introducing further unpredictability.

In the kind of symmetric-multiprocessor (SMP) architectures utilized in system-on-a-chip (SoC) designs, software partitioning is further complicated by the microprocessors’ shared use of a single cache memory to minimize power consumption. This architecture greatly increases the potential for cache-access contention and misses, with adverse effects on chip functionality and performance.

Of course, even if the design team conquers the foregoing problems, it still needs an RTOS that supports multi-threading and multi-tasking. But most of the RTOSs available today are designed for single-processor operation. Using an asymmetric-multiprocessing (AMP) architecture could potentially solve this problem. Software partitioning across independently scheduled processors is even more difficult than in SMP, however, resulting in even greater unpredictability.

Multiprocessor architectures introduce a complexity and an unpredictability that make it difficult to develop software that reliably meets function and performance specifications. This software-development challenge also makes the multiprocessor architecture an unsuitable platform for derivative design. After all, the benefit of derivative design is that incremental changes can be made to a robust, well-proven platform. But a platform that adds processors with each design generation isn’t well proven. Plus, constant software porting and repartitioning with difficult-to-predict outcomes is not an incremental change.

Multiprocessor Hardware
Now, many compute-intensive algorithms must be accelerated by a factor considerably greater than 1.95 to 2.83. Modern algorithms, such as those used for video, audio, error correction, and encryption, require a 5X to 15X boost over single-processor operation. This would require the deployment of between 8 and 22 microprocessors--optimistically assuming a linear scalability that wasn’t achieved with the original four processors. It also presents significant power-consumption and communications challenges and is economically unviable in many systems. Consequently, it may be necessary to implement some of the algorithm in fixed-function hardware, such as an on-board FPGA or on-chip functional block. Of course, the flexibility of a software solution is thereby lost.
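The 8-to-22 figure follows from scaling the best published result--2.83X on four processors--linearly up to the 5X and 15X targets. A quick sanity check in C, under that same optimistic assumption:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Best published result: 2.83x on 4 processors. Optimistically
       assume the per-processor contribution scales linearly. */
    const double per_cpu = 2.83 / 4.0;
    const double targets[] = { 5.0, 15.0 };

    for (int i = 0; i < 2; i++)
        printf("%.0fX target needs %.0f processors\n",
               targets[i], ceil(targets[i] / per_cpu));
    return 0;
}

The program prints 8 and 22 processors for the 5X and 15X targets, respectively.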

The traditional multiprocessing approach relies upon adding general-purpose (GP) microprocessors to the design. Clearly, it does not deliver the desired results, is not scalable, and introduces significant unpredictability and risk. Why couldn’t the four processors deliver a performance boost greater than a mere 2.83X? GP microprocessors are essentially serial engines that lack the instruction-level parallelism and parallel-processing resources necessary to accelerate compute-intensive software. Plus, the performance gains are partially lost in communications overhead.

Increasing the clock rate may partially compensate for the architectural deficiencies of the GP processor. However, it’s not an easily extensible approach--especially in SoC, where power consumption is critical. In any case, an appreciable portion of any local performance improvement can be lost in memory latency and system traffic jams, such as those between processors and cache memory. Add the effects of interrupts, and real-time system behavior may become impossible to develop and predict within acceptable tolerances. Dual-core approaches simply put all of the problems described above into one package instead of two. So the solution to the acceleration problem must be to enhance the system with additional instruction-level parallelism and parallel-processing resources while optimizing the communications overhead.

Coprocessor Synthesis
A more effective approach is to add one or more application-optimized parallel-processing extensions to the CPU that operate under its control and execute the software developed for that processor without redevelopment. Because such an extension is devoid of the kind of control circuitry required in a GP processor, optimal parallel resources can be synthesized without the need for microprocessor design expertise.

The Cascade Coprocessor Synthesis technology from CriticalBlue, for example, creates a loosely coupled programmable coprocessor. That coprocessor accelerates the execution of compiled binary-executable software code offloaded from the CPU. The coprocessor thus requires no compiler. It supports the continued use of the established CPU and its associated investment in design tools and infrastructure. Moreover, the coprocessor’s direct memory access (DMA) enables it to execute algorithms autonomously. It also can store results with minimal communication with the CPU. A single coprocessor may be optimized to process multiple algorithms provided that they aren’t required to execute simultaneously.
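Cascade's actual CPU-coprocessor interface isn't documented in this article, so the following C sketch is only a plausible shape for invoking a loosely coupled, DMA-equipped coprocessor; every register name, address, and command code in it is hypothetical:

#include <stdint.h>

/* Hypothetical memory-mapped coprocessor registers. The real
   Cascade-generated interface is not shown here; these names,
   addresses, and codes are illustrative only. */
#define COPRO_BASE   0x40000000u
#define COPRO_ARG    (*(volatile uint32_t *)(COPRO_BASE + 0x0))
#define COPRO_CMD    (*(volatile uint32_t *)(COPRO_BASE + 0x4))
#define COPRO_STATUS (*(volatile uint32_t *)(COPRO_BASE + 0x8))

#define CMD_RUN      1u
#define STATUS_DONE  1u

/* Hand the coprocessor a job descriptor and wait for completion.
   Because the coprocessor fetches its input and stores its results
   itself via DMA, the CPU passes only a descriptor address. */
void copro_run(uint32_t descriptor_addr)
{
    COPRO_ARG = descriptor_addr;   /* physical address of the job */
    COPRO_CMD = CMD_RUN;           /* start autonomous execution  */
    while (!(COPRO_STATUS & STATUS_DONE))
        ;                          /* poll; an interrupt would do */
}

In a real system, the RTOS would block the calling task on a completion interrupt rather than spin, so an offloaded routine looks to the rest of the software like an ordinary function call.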

Essentially, the user specifies performance requirements and resource constraints. The technology synthesizes a coprocessor that maximizes the execution speed of legacy software “as is”--a true software re-use methodology. Or the user co-optimizes the coprocessor architecture and software to maximize overall performance. The basic design flow is illustrated in Figure 1.

The design team first identifies offload candidates by profiling the software running on the CPU using its usual profiling tools. The technology then analyzes the offloaded code to identify inherent parallelism. It synthesizes the coprocessor architecture--including cache-memory architecture and size--and estimates performance. This procedure--together with automatically generated instruction- and bit-accurate C functional models--enables an extensive design-space exploration to determine the optimum architecture. The technology then synthesizes the coprocessor and system-bus-interface RTL, generates the RTL testbench, and automatically modifies the offloaded microcode to manage coprocessor communication. The solution maximizes system performance by automatically optimizing the cache design with data pre-fetch capability to minimize memory/system latency. It also automatically minimizes bus communication overhead.
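The selection criterion for offload candidates is simply share of runtime. The article leaves the profiler unspecified, so the C sketch below--with stand-in routines in place of production code--shows the measurement in its crudest form:

#include <stdio.h>
#include <time.h>

/* Stand-ins for production routines; a real flow would profile
   the actual application with the CPU's usual tools. */
static volatile unsigned sink;

static void hot_candidate(void)        /* suspected compute kernel */
{
    for (unsigned i = 0; i < 50000000u; i++)
        sink += i * i;
}

static void rest_of_workload(void)     /* everything else */
{
    for (unsigned i = 0; i < 10000000u; i++)
        sink += i;
}

int main(void)
{
    clock_t t0 = clock();
    hot_candidate();
    clock_t t1 = clock();
    rest_of_workload();
    clock_t t2 = clock();

    /* A routine worth offloading dominates total runtime. */
    printf("candidate share of runtime: %.0f%%\n",
           100.0 * (double)(t1 - t0) / (double)(t2 - t0));
    return 0;
}

A routine that dominates total runtime and contains loop-level parallelism--filter kernels, hash rounds, transform butterflies--is the natural candidate for offload.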


Customer Results
In one project, a coprocessor was generated to accelerate the SHA-1 Secure Hash Algorithm. The algorithm generates a 160-bit digest from a message with a maximum size of 2^64 bits. It is designed to make unauthorized tampering computationally expensive. The algorithm consisted of eight functions implemented in about 120 lines of code that couldn’t be modified. The coprocessor met the customer’s requirement of 5X acceleration. The design was completed in only two engineer-days.
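The customer's code isn't reproduced here, but the standard SHA-1 round structure (FIPS 180-1) shows why the algorithm rewards a parallel coprocessor: each of the 80 rounds is a short chain of rotates, adds, and boolean functions with ample instruction-level parallelism. A minimal C sketch of the round update:

#include <stdint.h>

/* The three distinct SHA-1 round functions (FIPS 180-1); the
   parity function is used in two of the four round groups. */
static uint32_t rotl(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

static uint32_t f_ch(uint32_t b, uint32_t c, uint32_t d)     { return (b & c) | (~b & d); }
static uint32_t f_parity(uint32_t b, uint32_t c, uint32_t d) { return b ^ c ^ d; }
static uint32_t f_maj(uint32_t b, uint32_t c, uint32_t d)    { return (b & c) | (b & d) | (c & d); }

/* One of the 80 rounds updating the five working variables
   v[0..4] = a..e, given the message word w_t and constant k_t. */
static void sha1_round(uint32_t v[5], uint32_t w_t, uint32_t k_t,
                       uint32_t (*f)(uint32_t, uint32_t, uint32_t))
{
    uint32_t tmp = rotl(v[0], 5) + f(v[1], v[2], v[3]) + v[4] + w_t + k_t;
    v[4] = v[3];
    v[3] = v[2];
    v[2] = rotl(v[1], 30);
    v[1] = v[0];
    v[0] = tmp;
}

A GP processor issues these operations largely serially; a synthesized datapath can schedule the rotates and boolean terms of each round in parallel.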

A second project demonstrated that the incremental area impact of a multi-function coprocessor is very low. A coprocessor was developed to execute two ADSL security algorithms. The gate count was 43% less than that of two coprocessors developed to execute the algorithms individually--with no loss of acceleration.


Multi-Coprocessor Solution
A multi-coprocessor platform solves the most intractable problems that are inherent in multiprocessing (see Figure 2). Each coprocessor executes software developed for single-processor use without modification. The coprocessors’ optimized parallelism, data access, and communications overhead--together with their autonomous operation--simplify both software partitioning and hardware synchronization. New software can be ported by adding a new coprocessor or re-programming an existing one, thereby eliminating the need to repartition the entire software suite. The approach meets or beats the performance of multiprocessor architectures while maintaining a standard, single-processor software-development methodology.



Richard Taylor is CTO of CriticalBlue. He can be reached at richard.taylor@criticalblue.com.