Optimization Techniques Aim For Multi-Core Intel® Architecture Processors

Intel® Software-Development Products Help Developers Program and Optimize for Multi-Core Intel® Architecture Processors.

By Max Domeika, Technical Consulting Engineer and Lerie Kane, Technical Marketing Engineer

Microprocessor design is shifting away from a predominant focus on pure performance to a balanced approach that optimizes for power as well as performance. To continue this trend, Intel has introduced multi-core processors that are capable of sharing work and executing tasks concurrently on independent execution cores. Such multi-core processors have evolved from multiprocessor systems. They take advantage of the increasing number of transistors that are available to provide multiple execution cores in one package. In many cases, taking advantage of these processor-performance benefits will require that developers thread their applications. This effort can be reduced by understanding and applying software-development tools from Intel.

Although it's easy to say that threading is the best way to take advantage of multi-core processors, realizing threaded code may not be as straightforward. But following the few simple steps outlined in this paper will provide some guidance on developing threaded code. The Figure provides a graphical overview of the optimization process, which is detailed in the following sections.

One common software adage is that 80% of an application's execution time is actually spent in only 20% of the code. Because this code consumes the bulk of the execution time, it is where threading can provide the greatest performance improvements. These key locations can be found using the Intel® VTune™ Performance Analyzer. This performance-analysis tool utilizes hardware interrupts to give the developer a true picture of how an application is performing. Available on Microsoft Windows* and various flavors of Linux*, it employs two technologies that are useful when analyzing code for threading opportunities: sampling and call graph.

When looking for the hot spots in an application using the VTune analyzer, the best place to start is sampling on clockticks. It will reveal how much time is being spent in each function of the program. Once the most time-consuming functions are identified, it's time to drill down to the source code to see if threading can be effectively implemented. Some functions may be very resource-intensive, but don't lend themselves to being threaded. The reason could be a lack of parallelizable loops. Or perhaps the code is so simple that it cannot be split into multiple threads.

If a developer finds him- or herself faced with a hot spot that cannot be threaded, the next step is the VTune analyzer call-graph technology. The call graph graphically depicts the call tree through an application. Even when a hot spot isn't amenable to threading, this technology may be able to identify a function further up the call chain that can be threaded. Threading a preceding function can improve performance by allowing multiple threads to call the hot function simultaneously.

The implementation of parallelism in a system can take many forms. One commonly used type is shared-memory parallelism, which implies the following:
  • Multiple threads execute concurrently.
  • The threads share the same address space. In contrast, multiple processes can execute in parallel. But each one has a different address space.
  • Threads coordinate their work.
  • Threads are scheduled by the underlying operating system (OS) and require OS support.
The keys to parallelism are generalized as follows:
  • Identify the concurrent work.
  • Divide the work evenly.
  • Create private copies of commonly used resources.
  • Synchronize access to unique shared resources.
Three classifications of parallel technologies are thread libraries, message-passing libraries, and compiler support. Thread libraries, such as POSIX threads and Windows application-programming-interface (API) threads, enable very explicit control of threads. Message-passing libraries like the Message Passing Interface (MPI) enable one application to take advantage of several machines that don't necessarily share the same memory space. The third classification is compiler support, which the Intel® compilers provide in the form of OpenMP* and automatic parallelization.

Intel® C++ Compiler 9.0 for Linux*
The Intel® C++ Compiler is an optimizing compiler offered on several operating systems, such as Windows and Linux. It also is available on several architectures, such as IA-32 and Intel® Itanium®, and systems with Intel® EM64T. The strongest advantage of the Intel compiler is its optimization technology and performance feature support, which includes OpenMP and automatic parallelization. (For further information on the Intel compiler, please see the product web page.)

OpenMP is an open, portable, shared-memory multiprocessing API. It is supported by multiple vendors on several operating systems for C and C++. OpenMP simplifies parallel application development by hiding many of the details of thread management and thread communication behind a simplified programming interface. Developers specify parallel regions of code by adding pragmas to the source code. These pragmas also communicate other information, such as the properties of variables and simple synchronization. For example, look at the following pragma:

  #pragma omp parallel for private(x) reduction(+: sum)

It specifies that the for loop that follows in the code should be executed by a team of threads. Additionally, the temporary partial results represented by the sum variable should be aggregated at the end of the parallel region by addition. Finally, the variable x is private. In other words, each thread gets its own private copy.

Automatic parallelization, also called autoparallelization, analyzes loops and creates threaded code for the loops that are determined to be profitable to parallelize. It is a good first technique to try, as the effort required is fairly low. The compiler will only parallelize loops that it can prove are safe to parallelize.

Parallel-Code Correctness and Performance
Once threading has been added to an application, the developer is potentially faced with a new set of programming bugs. Many of these bugs are difficult to detect. To ensure a correctly running program, extra time and care are required. A few of the more common threading issues include:
  • Data race
  • Synchronization
  • Thread stall
  • Deadlock
  • False sharing
A data race occurs when two or more threads access the same resource at the same time without synchronization. The result can be inconsistent behavior in the running program. In a read/write data race, for example, one thread writes to a variable while another thread reads it. The reading thread will get a different result depending on whether or not the write has already occurred. A data race is non-deterministic: a program could run correctly 100 times in a row, but fail the next time.

A data race can be corrected with synchronization. One way to synchronize access to a common resource is through a critical section. If a critical section is placed around a block of code, the threads will be alerted that only one may enter that block of code at a time. Although synchronization is a necessary and useful technique, care should be taken to limit unnecessary synchronization. After all, it will affect the performance of the application. Because only one thread is allowed to access a critical section at a time, any other threads that need to access that section are forced to wait. The result is a potential impact on performance.

A lock is another method of ensuring that shared resources are correctly accessed. In this case, a thread will lock a specific resource while it's in use. This "locking" also denies access to other threads. Two common threading errors can occur when using locks, however. The first is a thread stall. Here, one thread locks a certain resource and then moves on to other work in the program without first releasing that lock. When a second thread tries to access that resource, it is forced to wait for an indefinite amount of time. A developer should make sure that threads release their locks before continuing.

The second threading error is a deadlock. Although it is similar to a stall, a deadlock occurs when multiple locks are involved. Say, for example, Thread 1 locks variable A and then tries to lock variable B, while Thread 2 simultaneously locks variable B and then tries to lock variable A. The threads deadlock because each is waiting for a lock the other holds. In general, complex locking hierarchies should be avoided if possible. In addition, the developer should make sure that every thread acquires multiple locks in the same order.

The final issue is false sharing, which isn't necessarily an error in the program, but it is likely to affect performance. It occurs when two threads are manipulating data that occupies the same cache line. On an SMP system, cache coherency is maintained: modifications to shared cache lines must be flagged to the memory system so that each processor is aware of the change. When one thread changes data on a cache line, the copy of that line in the other processor's cache is invalidated, and the second thread must wait while the line is reloaded from memory. If this happens repeatedly, it will severely affect performance. VTune analyzer sampling on L2 cache misses can be used to detect false sharing. If this event occurs frequently in a threaded program, false sharing is a likely culprit.

Intel® Thread Checker
Although debugging threaded programs may seem to be a large burden, Intel® Thread Checker aims to ease the effort. This tool is available as a plug-in to the VTune analyzer. It detects threading errors while a program is running. It then displays the errors and correlates them to the offending lines of source code. An error doesn't have to occur in order for it to be detected. For example, the data races mentioned previously are non-deterministic. This aspect makes them very difficult to detect. Intel Thread Checker will pinpoint where a data race can possibly occur even if the code happened to execute correctly while the tool was examining it.

The key to effectively using Intel Thread Checker is ensuring good code coverage when running the program. The Thread Checker cannot detect an error if the region of code containing the error is never executed. It is therefore important to make sure that all of the functions in the program are exercised. (For more information on using Intel Thread Checker, please visit the product web page.)

Intel® Thread Profiler
Once correctness issues are solved, performance tuning can occur. Intel® Thread Profiler leverages the instrumentation technology of the VTune Performance Analyzer to aid in the tuning of applications that are threaded using OpenMP, Windows API, or POSIX threads. With this tool, developers can visually inspect the performance of their threads to answer questions like:
  • Is the work evenly distributed between threads?
  • How much of the program is running in parallel?
  • How does performance increase as the number of processors employed increases?
  • What is the impact of synchronization between threads on execution time?
The answer to these questions can help the developer to further optimize his or her application. For example, say that the developer has determined that the workload wasn't balanced evenly between threads. He or she could implement code changes and iteratively test the application until a balance is confirmed. If synchronization time was observed to be excessive, the code could be analyzed to see how to simplify or safely remove some of the synchronization. Although such techniques are outside of the scope of this article, the main point is that Intel Thread Profiler allows the designer to monitor the effects of the optimization while tuning. (Please consult the getting-started guide for further details on Intel Thread Profiler.)

To deliver performance headroom, the semiconductor industry is moving to multi-core processors. In order to take advantage of these performance gains, it is recommended to increase the parallelism of application software. The best way to extract the full potential from a multi-core processor is through threading. The software tools created by Intel promise to ease this transition and help ensure that applications are well tuned for the hardware that powers them.

Max Domeika is a Senior Staff Software Engineer in the Software Products Division at Intel. He creates software tools targeting the Intel Architecture market. Domeika earned a B.S. in Computer Science from the University of Puget Sound and an M.S. in Computer Science from Clemson University.

Lerie Kane is a Technical Marketing Engineer at Intel specializing in software tools for the Embedded Intel Architecture market. Kane earned a B.S. in Computer Science from Portland State University and a Masters in Business from Arizona State University.