Speeding Security on the Intel® StrongARM

The right approach can enhance AES execution on the Intel® StrongARM, with instruction set extensions providing further improvements.

By MariaGiovanna Sami, Marco Macchetti, and Francesco Regazzoni

With the increasing use of portable and wireless devices in business and daily life, protecting sensitive information through encryption is becoming ever more crucial. ALaRI (Advanced Learning and Research Institute) has been conducting research aimed at improving the execution of security algorithms in embedded systems. Thanks to a donation from Intel, ALaRI has been able to develop several recommendations for implementing security efficiently on the Intel® StrongARM architecture.

One of ALaRI’s case studies examined the efficiency of the recently established Advanced Encryption Standard (AES), a promising algorithm that is included in many secure communication suites. The project had two aims. The primary goal was to develop AES code specifically tailored to the Intel StrongARM architecture, and thus more efficient than standard coding approaches. The second goal was to study the effect of the cache hierarchy on the performance of the algorithm.

AES is an iterated block cipher with a fixed block length of 128 bits and a key length of 128, 192, or 256 bits. The algorithm, shown in Figure 1, passes plain text through a number of round transformations to produce the cipher text. It uses 10, 12, or 14 rounds, depending on the key length.
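
The link between key length and round count can be written down in a few lines. The Python sketch below is only an illustration: the function name aes_rounds is ours, and the rule Nr = Nk + 6 (Nk being the key length in 32-bit words) is the one given in the AES specification for the 128-bit block.

```python
def aes_rounds(key_bits):
    """Return the number of AES rounds for a 128-, 192- or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192 or 256 bits long")
    key_words = key_bits // 32   # Nk in the AES specification
    return key_words + 6         # Nr = Nk + 6 for the fixed 128-bit block


if __name__ == "__main__":
    for bits in (128, 192, 256):
        print(bits, "bit key:", aes_rounds(bits), "rounds")
```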

Figure 1 - The AES algorithm uses a number of rounds, each of which includes both linear and non-linear transformations, to convert plain text to cipher text.

Each round transformation is composed of three distinct transformations, called layers. The first is the non-linear layer, SubBytes, which replaces each byte of the state with the corresponding entry of a substitution table (the S-box). The second is the linear mixing layer (ShiftRows + MixColumns), and the third is the key addition layer (AddRoundKey). The transformations are invertible, allowing the cipher text to be converted back to plain text if one has the key.
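
To make the three layers concrete, here is a minimal Python sketch of a single full round. It is a plain reference model, not the optimized StrongARM code from the study. The helper names (gf_mul, build_sbox, aes_round) are ours, the S-box is derived from its mathematical definition rather than typed in, and the state is assumed to be 16 bytes stored column by column.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then an affine map."""
    sbox = []
    for x in range(256):
        # Brute-force inverse; 0 has no inverse and maps to 0.
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def aes_round(state, round_key):
    """Apply one full AES round to a 16-byte, column-major state."""
    # Non-linear layer: SubBytes substitutes every byte through the S-box.
    s = [SBOX[b] for b in state]
    # Linear mixing layer, part 1: ShiftRows rotates row r left by r columns.
    s = [s[(i + 4 * (i % 4)) % 16] for i in range(16)]
    # Linear mixing layer, part 2: MixColumns combines the bytes of a column.
    mixed = []
    for c in range(0, 16, 4):
        a = s[c:c + 4]
        for r in range(4):
            mixed.append(gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                         ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    # Key addition layer: AddRoundKey XORs in the round key.
    return bytes(m ^ k for m, k in zip(mixed, round_key))


if __name__ == "__main__":
    # First-round values from the worked example in the AES specification.
    state = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
    round_key = bytes.fromhex("a0fafe1788542cb123a339392a6c7605")
    print(aes_round(state, round_key).hex())
    # prints a49c7ff2689f352b6b5bea43026a5049
```

The last round of the real cipher omits MixColumns, and a complete implementation also needs the key schedule. Both are left out here to keep the sketch short.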

Several AES software implementation techniques are known, and have been detailed in the original specification of the algorithm and in subsequent works. They fall into two rough categories. Compact implementations are tailored for small platforms, whereas other implementations allow developers to trade memory space and code size for higher speed. The main distinguishing factor between the categories is the amount of memory reserved for look-up tables in the first and second layers.

The case study used an implementation that contained an optimized version of the linear mixing layer. This version requires only 12 logical XORs and 4 Xtime operations per AES round, and is the fastest compact implementation of the AES mixing layer documented so far in the technical literature. The study exploited features such as cache locking and the mini-cache to keep performance at its maximum.
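
One arrangement that reaches these operation counts is sketched below in Python. It is our own reconstruction under stated assumptions, not necessarily the code used in the study. It assumes the state is held transposed, with each 32-bit word carrying one row of the state (one byte per column), so that a single "packed" Xtime doubles four bytes at once. The names xtime_packed and mix_columns_rows are hypothetical.

```python
def xtime_packed(w):
    """Multiply each of the four bytes packed in w by 2 in GF(2^8)."""
    high_bits = (w >> 7) & 0x01010101     # 1 where a byte had its top bit set
    doubled = (w & 0x7F7F7F7F) << 1       # shift each byte, no carry between bytes
    return doubled ^ (high_bits * 0x1B)   # reduce overflowing bytes by 0x1B


def mix_columns_rows(y0, y1, y2, y3):
    """MixColumns on a transposed state: word y_i holds row i of all columns.

    Uses 4 packed Xtime operations and 12 word-wide XORs for the whole state.
    """
    # Four shared pairwise sums (4 XORs).
    s01 = y0 ^ y1
    s12 = y1 ^ y2
    s23 = y2 ^ y3
    s30 = y3 ^ y0
    # Each output row needs one Xtime and two more XORs (8 XORs in total).
    z0 = xtime_packed(s01) ^ y1 ^ s23     # 2*y0 ^ 3*y1 ^ y2 ^ y3
    z1 = xtime_packed(s12) ^ y2 ^ s30     # y0 ^ 2*y1 ^ 3*y2 ^ y3
    z2 = xtime_packed(s23) ^ y3 ^ s01     # y0 ^ y1 ^ 2*y2 ^ 3*y3
    z3 = xtime_packed(s30) ^ y0 ^ s12     # 3*y0 ^ y1 ^ y2 ^ 2*y3
    return z0, z1, z2, z3


if __name__ == "__main__":
    # Four test columns (db 13 53 45), (f2 0a 22 5c), (01 x4), (c6 x4),
    # packed row by row into 32-bit words.
    rows = (0xDBF201C6, 0x130A01C6, 0x532201C6, 0x455C01C6)
    print([hex(z) for z in mix_columns_rows(*rows)])
```

The saving comes from two observations: 3·a equals 2·a XOR a, so each output row can be written as Xtime of one pairwise sum plus two further terms, and the transposed layout lets every operation act on all four columns at once.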

Tables Not Needed

Potential approaches to optimizing AES execution speed include the use of various types of tables for the first two layers. ALaRI evaluated the performance of expanded tables and carefully analyzed the trade-off. The results show that the use of word-extended substitution tables, as done in several existing implementations, is unnecessary and inefficient on Intel StrongARM processors. The architecture supports load-byte instructions, so the extended tables offer no improvement. Similarly, the results show that pre-rotated tables also fail to improve performance. The processor’s barrel shifter can be combined with data processing instructions to reduce the effective clock cycle cost of rotating data to zero.

Use of such tables, in fact, increases register pressure and the likelihood of cache misses, and therefore degrades overall performance. Instead, compact implementations have proven to be the fastest when encrypting small blocks of data (less than 32 bytes) with the Intel StrongARM microprocessor, even though there is the possibility of cache interference from other running applications. In a sense, small code footprint and high speed are not mutually exclusive in the StrongARM world, as they are in many other processors.
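
The table trade-off discussed above can be illustrated with a short Python sketch. It checks that the three extra pre-rotated tables hold nothing that a byte rotation of the first table cannot supply, and compares memory footprints. On an ARM core the rotation can be folded into the second operand of a data processing instruction (for example EOR r0, r0, r1, ROR #8), and a byte-wide S-box can be read with LDRB. The function names and the byte order (row 0 in the most significant byte) are assumptions made for illustration.

```python
def xtime(b):
    """Multiply one byte by 2 in GF(2^8) modulo the AES polynomial."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def rotr32(w, n):
    """Rotate a 32-bit word right by n bits (0 <= n < 32)."""
    return ((w >> n) | (w << (32 - n))) & 0xFFFFFFFF


def table_words(s):
    """Entries of the four combined tables for one S-box output byte s."""
    s2 = xtime(s)      # 2*s in GF(2^8)
    s3 = s2 ^ s        # 3*s in GF(2^8)
    t0 = (s2 << 24) | (s << 16) | (s << 8) | s3
    t1 = (s3 << 24) | (s2 << 16) | (s << 8) | s
    t2 = (s << 24) | (s3 << 16) | (s2 << 8) | s
    t3 = (s << 24) | (s << 16) | (s3 << 8) | s2
    return t0, t1, t2, t3


def rotations_suffice():
    """True if every pre-rotated entry is a byte rotation of the first table."""
    for s in range(256):
        t0, t1, t2, t3 = table_words(s)
        if (t1, t2, t3) != (rotr32(t0, 8), rotr32(t0, 16), rotr32(t0, 24)):
            return False
    return True


# Memory footprints of the table layouts, in bytes.
SBOX_BYTES = 256                 # byte-wide S-box, read with load-byte
WORD_EXTENDED_BYTES = 256 * 4    # same S-box stored one entry per 32-bit word
PRE_ROTATED_BYTES = 4 * 256 * 4  # four combined 32-bit tables

if __name__ == "__main__":
    print("rotations suffice:", rotations_suffice())
    print(SBOX_BYTES, WORD_EXTENDED_BYTES, PRE_ROTATED_BYTES)
```

When rotation costs nothing extra, the three additional tables only add cache pressure, which is consistent with the results reported above.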

A second case study evaluated the effect of introducing dedicated hardware modules and related instructions to the architecture of a 32-bit processor, with focus on speed enhancements. The resulting methodologies readily apply to the Intel StrongARM architecture.

The study identified the portions of typical AES code that are most expensive in terms of time, then considered the introduction of hardware modules to speed up their execution. The goal was to obtain maximum performance for the entire process. Based on the results of the first study, ALaRI focused its attention on the compact AES implementation, in which only the non-linear layer uses tables.

Simple Hardware Changes Quadruple Speed

The study produced two possible solutions: one that performs the AES algorithm at the byte level and another that operates on 32-bit words. The main modifications that enhance performance (shown in Figure 2) are:

  • Insert a rotator upstream of the ALU so that the processor can manipulate the first operand without using any additional instructions. This insertion is useful because both the data manipulation phase and the key scheduling phase require a rotation step.
  • Insert a module to perform the non-linear (SubBytes) step of the algorithm. The encryption and the key scheduling phases share this module. In the byte-level solution, the assembler format for the new instruction is “SBox Rs, N”, where the Rs parameter identifies the processor register that is affected, and the N parameter identifies the rotation amount (a number ranging from 0 to 3).
  • Insert a module that performs the S-box calculation and the MixColumns transformation in a single step. The mnemonic for the instruction is “SMix”. The module first carries out the S-box operation, then its MixColumns stage processes the result. The module’s output is four bytes that represent the different contributions that the input byte makes to the round calculation. In the byte-level solution, the instruction becomes “SMix Rs,Rd,Index”, where the Rs parameter identifies the register to read from, the Rd parameter identifies the register to write the results to, and the Index parameter selects the correct byte by acting on the rotator block. In the word-oriented solution, the instruction format simply becomes “SMixW Rd”.

Figure 2 - The addition of a few small hardware modules to implement specific algorithm steps can increase the speed of AES execution by nearly 400%.

It is worth noting that such simple modifications to the processor’s data path can greatly improve performance. The study showed an execution speed-up that asymptotically approaches 400%, compared to conventional software implementations.

Case studies and projects focused on the security of mobile embedded systems continue to be carried out at ALaRI in collaboration with Politecnico di Milano. The security of embedded systems is a topic of crucial importance at ALaRI, which is working to provide solutions to this important aspect of the modern communication society.

MariaGiovanna Sami is the Scientific Director of ALaRI, Marco Macchetti is a Ph.D. student at Politecnico di Milano, and Francesco Regazzoni is a Ph.D. student at ALaRI-Università della Svizzera Italiana. Established in 1999, ALaRI (Advanced Learning and Research Institute) promotes research and education in Embedded Systems Design. A full list of past and ongoing projects, as well as a list of publications, can be found on the ALaRI website www.alari.ch.