Parallelism Primer

By Max Domeika - Intel

The typical goal of threading is to improve performance by either reducing latency or improving throughput. Reducing latency is also referred to as reducing turnaround time and means shortening the time period from start to completion of a unit of work. Improving throughput is defined as increasing the number of work items processed per unit of time.


A thread is an OS entity that contains an instruction pointer, stack, and a set of register values. To help in understanding, it is good to compare a thread to a process. An OS process contains the same items as a thread such as an instruction pointer and a stack, but in addition has associated with it a memory region or heap. Logically, a thread fits inside a process in that multiple threads have different instruction pointers and stacks, but share a heap that is associated with a process by the OS.

Threads are a feature of the OS and require the OS to share memory which enables sharing of the heap. This is an important fact because not all embedded OSes support shared memory and as a result, not all embedded OSes support threads. A clarification is that the type of threads discussed here is user level software threads. Hardware threads are a feature of many microprocessors and are quite distinct in the meaning and capability. For example, the term simultaneous multi-threading is a microprocessor feature that enables one processor core to appear and function as multiple processor cores. For this discussion, hardware threads are relevant only in that they may provide the processor cores that the software threads execute upon.

One other clarification is that this discussion focuses on user level threads only. We do not include discussion of kernel level threads or how different OSes map user level threads to kernel threads.


Effectively threading an application requires a plan for assigning the work to multiple threads. Two categories of dividing work are functional decomposition and data decomposition and are summarized as:

1. Functional decomposition – division based upon the type of work.

2. Data decomposition – division based upon the data needing processing.

Functional decomposition is the breaking down of a task into independent steps in your application that can execute concurrently. For example, consider an intrusion detection system that performs the following checks on a packet stream:

  • Check for scanning attacks
  • Check for denial of service attacks
  • Check for penetration attacks

As long as each step above was an independent task, it would be possible to apply functional decomposition and execute each step concurrently. Figure 6.1 shows sample OpenMP code that uses the section directive to express the parallelism.

When this code is executed, the OpenMP run-time system executes the function in each OpenMP section on a different thread and in parallel (as long as the number of processor cores exceeds the number of threads executing).

In practice, attempting to execute multiple threads on the same data at the same time may result in less than ideal performance due to the cost of synchronizing access to the data.

Pipelining is a category of functional decomposition that reduces the synchronization cost while maintaining many of the benefits of concurrent execution. A case study will be presented later that employs pipelining to enable parallelism.

Data decomposition is the breaking down of a task into smaller pieces of work based upon the data that requires processing. For example, consider an image processing application where multiple images need to be converted from one format to another format. Each conversion takes on the order of seconds to complete and the processing of each image is independent of the processing of the other images. This application lends itself quite naturally to data decomposition. Figure 6.2 shows sample OpenMP code using the parallel for directive.

When this code is executed, the OpenMP run-time system will divide the processing of the images between the allocated threads for parallel execution (assuming the number of processor cores is greater than 1). If the processing of each individual image consumed a great deal of time, it may make sense to multithread the processing of the individual image and execute the processing of the subimages by different threads. A case study will be presented later that employs data decomposition in order to multi-thread image rendering.

In general, it is easier to scale applications by employing data decomposition than it is using functional decomposition. In practice you may find a combination of different decompositions works well for your particular application.


Scalability is the degree to which an application benefits from additional processor cores.

As the number of cores the application uses is increased, it would be nice if performance of the application increased as well. There are natural limits to how much of an application can be executed in parallel and this limits the obtained performance benefit of parallelism. Amdahl’s Law is used to compute the limits of obtained performance and is expressed [2]: where Fraction e = the amount of the application that executes in parallel; and Speedup e = how many times faster the parallel portion executes compared to the original. For example, consider an application that executes in parallel 50% of the time. Also, assume the application executes on a system with four processor cores and was threaded in such a way that the performance scales with the number of processor cores. In this example, Fraction e = 0.5 and Speedup e = 4, and therefore Speedup = 1.6.

Efficiency is a measure of how effectively the total number of processor cores is being employed in running the application in parallel; the goal is a 100% measure of efficiency. In the previous example, three processor cores are idle for 4/5 of the execution time, which means 12/20 of the four processor cores ’ time is idle and thus 8/20 of the time the processor cores are active with 40% efficiency.

Consider another example involving the aforementioned image processing problem with the following constraints:

  • Ten seconds of initialization that must run serially
  • One second to process one image (processing of different images can be accomplished in parallel)
  • Ten seconds of post-processing that must run serially

Table 6.1 shows the calculations of scalability and efficiency for a number of different processor cores. One observable trend is that as the number of processor cores increases, the corresponding decrease in execution time is not as significant. In other words, the speedup does not scale linearly with the number of processor cores. For example, the scalability with 32 processor cores is 10.89 and with 300 processors is 15.24, not 108.9 (10 X 10.89). This trend occurs because the serial portion of the application is beginning to dominate the overall execution time. With 16 processor cores, the execution time of the image processing step is 300/16 = 18.75 s. With 300 processor cores, the execution time of the image processing step is 300/300 = 1 s. Furthermore, 299 processor cores are active for only 1 s out of the 21 s of total execution time. Thus, efficiency is 5.1%. The conclusion is: maximize the benefits of parallelism by parallelizing as much of the application as possible.

One other point worth mentioning is that scalability should always be compared against the best achievable time on one processor core. For example, if the use of a new compiler and optimization settings resulted in a decrease in execution time when run on one processor core, the scalability numbers and efficiency percentages should be recalculated.

Max Domeika is a senior staff software engineer in the Software Products division at Intel®, creating software tools targeting the Intel architecture market.

The reader’s discount offer is as follows: Order this book today and you will receive an additional 15% discount. Click here and be sure to type in 92836 when ordering this book. Or call 1-800-545- 2522 and be sure to mention 92836 when ordering this book. Offer expires 7/31/2008.