Optimize Intel® Pentium® M Processor-Based Code

Performance analysis and compiler optimization techniques can extract even greater benefits from the Intel® Pentium® M processor.

By Max Domeika

As computer architectures become increasingly complex, more sophisticated analysis methods and optimization tools are required to harness their full performance. Technologies such as event-based sampling and expert systems now augment traditional performance analysis methods based on profile and call-graph tools. Understanding the basics of performance analysis, as well as current state-of-the-art software optimization technologies, enables developers to pinpoint and implement solutions to application performance issues.

One sophisticated processor, the Intel® Pentium® M processor, is seeing growing use in embedded applications due to its high performance and low power consumption. The Intel Pentium M processor features Intel MMX™ technology and Streaming SIMD Extensions (SSE, SSE2) that enable higher performance through parallel computation. Getting the most out of the processor, however, requires that developers take full advantage of these built-in performance enhancements.

Software optimization technology offered by advanced compilers utilizes the enhancements in Intel Pentium M processors in a fashion conducive to embedded development. Compiler technology provides access to these extensions with low development investment while maintaining backward compatibility and minimal code size, two critical challenges in embedded software development. The key to focusing the optimization process, however, is performance analysis.

Performance analysis is the study of application performance on hardware, with the end goal of understanding issues and recommending enhancements. Amdahl's Law states that overall performance improvement is limited by the fraction of execution time spent in the improved region, and it motivates the following two key insights of performance analysis:

  • Optimize the most frequently executed regions - the best return on investment for performance enhancement is the optimization of these regions.
  • Know when to stop - calculating the limit on overall performance gain balances tradeoffs between meeting performance goals and effort to optimize.

For example, if an application comprises two phases that execute in the same amount of time, optimization efforts aimed at only one phase can never achieve a two-times overall speedup: even if that phase were reduced to zero time, the other phase would still take half of the original execution time. If greater performance is desired, optimization efforts should also target the second phase.

Finding the most frequently executed regions of application code commonly employs profiling and typically involves some kind of runtime monitoring of the application. Traditional profiles return metrics such as the amount of time spent in individual functions and the number of times each function is called. The Intel Pentium M processor features a set of built-in performance monitoring counters that can generate profiles based upon processor events such as instructions retired, branch mis-predictions, and cache misses. To collect these event-based profiles:

  1. The user specifies events to monitor and an interval at which to collect an event sample during the application execution.
  2. The processor executes the application. The application may be the fully optimized build, but mapping events to individual lines of source code requires a build with debug information.
  3. While the application runs, performance monitoring counters on the Intel Pentium M processor keep a running total of the specified events as they occur.
  4. When the performance monitoring counter reaches a predetermined number, it posts an interrupt that the operating system will service.
  5. The performance monitoring interrupt handler records the specified event and the location of the instruction pointer when the interrupt occurred.
  6. After the profiling of the application has ended, the tool aggregates the recorded samples into a profile.

Several tools, such as the Intel® VTune™ Performance Analyzer, offer the ability to create event-based profiles. An event-based profile, shown in Table 1, displays clock-tick events correlated to source line for a sample code segment. Event-based profiling offers two significant advantages over traditional profiling techniques. One is that it locates performance events; the tool maps events to specific assembly instructions and line numbers in your application code. The other is that it offers lower overhead. Event-based profiling employs a statistical sampling method that has a minimal impact on application runtime.

Table 1 - Event-based profilers map events to specific lines of source code.

Event-based profiles provide invaluable insight into application performance issues. A profile revealing cache misses, for example, can help developers identify code sections that execute more slowly than necessary due to poor memory access patterns that negate the advantage of cache locality. Similarly, it can reveal a need to restructure code or data structures to improve memory performance. This sort of insight is only revealed by event-based profiling.

Performance analysis requires a breadth and depth of knowledge on topics such as microprocessor architecture, assembly language programming, and computer science. To ease the developer’s burden, some performance analysis tools that allow the collection of application performance data have been augmented with complex expert systems that provide:

  • Automatic recommendation and collection of basic performance statistics such as instructions retired per cycle.
  • Automatic recommendation of what event-based profiles to collect based upon known performance issues for the Intel Pentium M processor.
  • Recommended solutions to common performance issues observed in the subject application code.
  • An encyclopedia of performance analysis and microprocessor terminology.

The combination of event-based profiling and expert systems available in current performance analysis tools allows detailed performance analysis of the Pentium M processor. Tools such as the Intel VTune Performance Analyzer support these features with a detailed knowledge base to facilitate developer understanding.

Once performance analysis is complete, developers need to tune their code. Part of the tuning process is effectively utilizing unique instruction extensions a processor may have.

The Intel Pentium M processor, for instance, offers a set of instruction extensions termed MMX, SSE, and SSE2 that enable capabilities such as data prefetching and parallel execution. For higher-level languages like C++, compiler technology provides access to these instructions in several forms, including inline assembly language, a C intrinsic library, a C++ class library, and vectorization technology.

One of these forms, vectorization, is an advanced optimization that analyzes loops and determines when it is safe and effective to execute several iterations of a loop in parallel using MMX, SSE, and SSE2. Figure 1 illustrates a vectorized loop in which four iterations are computed with one SSE2 operation. Vectorization helps application code take advantage of these extensions when running on the Pentium M processor.

Figure 1 - Vectorization takes advantage of Intel® Pentium® M processor instruction extensions to convert loops into single instructions operating on a vector of values.

Backwards Compatibility

One challenge with extended instructions in embedded applications is satisfying the desire to support several generations of IA-32 processors with one version of the application while still taking advantage of these instructions. One solution is to manually code a cpuid check into the application that calls different versions of a function based upon the processor the application is running on. A compiler solution, processor dispatch technology, offers language features and an automatic method for accomplishing the same thing. It allows the use of SSE2 instructions when the application is executing on the latest Intel Pentium M processors and designates alternate code paths when the application is executing on a processor that does not support SSE2. Figure 2 illustrates processor dispatch for a user-created function called Kasumi. Using processor dispatch, the compiler inserts code that dispatches each call to the version of the Kasumi function matching the processor executing the application at runtime.

Figure 2 - Processor Dispatch is a compiler tool that generates different versions of code depending on the processor version being targeted.

Unfortunately, advanced optimizations such as vectorization and processor dispatch can increase code size. For example, processor dispatch creates multiple versions of the same function. Small code size is important in embedded applications, so techniques that contain code growth when using vectorization and processor dispatch are critical.

One standard compiler technique for minimizing code size is to use options such as -O1 or -Os. These options trade off execution speed to reduce code size: under them, the compiler may decide not to perform an optimization if it would lead to an increase in code size. Advanced compilers offer other optimizations, such as interprocedural and profile-guided optimizations, that help mitigate code size increases when using vectorization and processor dispatch by determining the most profitable sections of code to optimize.

Compilers typically process one function at a time, in isolation from other functions in the program. During optimization, the compiler must often make conservative assumptions about values in the program to account for side effects that may occur in other functions, limiting the opportunity for optimization. A compiler with interprocedural optimization optimizes each function with detailed knowledge of the other functions in the application, enabling more aggressive optimizations that become safe to apply because of the enhanced information. Optimizations that interprocedural optimization enables include:

  • Interprocedural constant propagation - constant values are propagated through function calls, particularly function call arguments.
  • Arguments in registers - passing arguments in registers can reduce call/return overhead.
  • Loop-invariant code motion - increased interprocedural information enables detection of code that can be safely moved outside of loop bodies.
  • Dead code elimination - increased interprocedural information enables detection of code that can be proven unreachable.

Profile-guided optimization enables the compiler to learn from experience. It is a three-stage process: 1) the application is compiled with instrumentation added; 2) in a profile-generation phase, the instrumented application is executed and monitored; and 3) the application is recompiled, with the data collected during the training run guiding optimization. Profile-guided optimizations that influence code size include:

  • Basic block and function ordering - place frequently executed blocks and functions together to take advantage of instruction cache locality.
  • Aid inlining decisions - inline frequently executed functions so the increase in code size is paid only in areas of highest performance impact.
  • Aid vectorization decisions - vectorize high-trip-count and frequently executed loops so the increase in performance mitigates the increase in code size.

A combination of optimizations offers the greatest opportunity to achieve high performance and small code size targeted for your embedded platforms. Ideally, then, a compiler will provide:

  • Vectorization, which enables use of new instructions in the Intel Pentium M processor.
  • Processor Dispatch, which allows use of the new instructions while maintaining backwards compatibility.
  • Code size optimizations, which trade off code speed for code size, a key constraint in embedded development.
  • Interprocedural and profile-guided optimizations, which guide the compiler to optimize the most frequently-executed regions of code aggressively to gain maximum speed increase for the cost in code size.

The Intel® C++ Compiler is an example of a compiler that features all of these optimizations. Such advanced compilers, along with performance tools like event-based profiles and expert systems, will help developers create highly efficient applications for the Intel Pentium M processor. 

Max Domeika is a staff software engineer in the Software Products Division at Intel, creating software tools targeting the Intel® architecture market. He was the project lead for the C++ front end and a developer on the optimizer and the IA-32 code generator. Max currently provides technical consulting and serves as an instructor with the Intel Software College.