Analyze x86 Executables to Improve Software Quality
By Paul Anderson, GrammaTech Inc.
Advanced static-analysis tools for source code have become popular because they’ve proven themselves highly effective at improving software quality. These tools can find serious programming defects that are difficult to find by other means, such as manual inspection or testing. Such defects include resource leaks, buffer overruns, race conditions, and null-pointer dereferences. Advanced static-analysis tools can find these defects without the need for test cases. Historically, such tools have only been able to work on source code. More recently, however, there has been increasing interest in using these techniques to analyze machine code. Three factors are contributing to this trend. First, more reliance is being placed on third-party code, which is not available in source form. Second, there are technical advantages to analyzing machine code rather than source code. Finally, advances by the research community mean that such techniques are becoming feasible.
Source-Only Analyses
The disadvantage of source-only analyses is that the complete source code for an application is rarely available. Almost all applications link with third-party libraries, including operating-system libraries. A source-code analysis tool is blind to any non-source components. As a result, such tools usually either ignore these components entirely or make simple assumptions about what they do. For commonly used libraries, models are sometimes used. These are stubs of code written to approximate the important aspects of the component. This has two consequences: the approximations may not be good enough, and the analysis cannot find flaws inside those components.
Object-Code Analyses
Even in cases where the source code is available, it’s helpful to analyze the object code instead. After all, computers don’t execute source code. They execute machine code. There may be subtle yet important differences between the apparent semantics of source code and the semantics of the machine code to which it’s compiled. This is known as the What You See Is Not What You eXecute (WYSINWYX) effect [1]. Such effects arise in several ways. Source language definitions are full of ambiguities and inconsistencies. In such cases, the compiler is free to resolve these as it generates the machine code. A source analyzer also will resolve them. But there’s no guarantee that it will resolve them in the same way as the compiler. As a result, there will be a mismatch between what the code actually does and what the analysis thinks it does. Compiler optimizers take advantage of these ambiguities frequently. Thus, the semantics of the source code may even be different depending on the level of optimization used. Finally, the compiler itself may contain flaws and generate incorrect code.
The danger of this kind of effect is illustrated by a simple example found during a 2002 security review at Microsoft [2]. The relevant code was the following:
memset(password, '\0', len);
free(password);
The password variable pointed to a heap-allocated buffer containing sensitive data. The intent was sound: to minimize the lifetime of sensitive data. Before returning the buffer to the heap, the programmer therefore attempted to zero out its contents. Yet the compiler noticed that the zeroes written to the buffer were never read before the buffer was freed. It optimized the program by removing the call to memset, which meant that the sensitive data was returned unaltered to the heap. As a result, a security vulnerability was introduced that was entirely invisible in the source-code representation.
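One common way to keep such a wipe from being optimized away is to perform the stores through a volatile-qualified pointer, which compilers in practice treat as accesses they must preserve. The following is a minimal sketch under that assumption, not the fix described in the Microsoft review; secure_wipe and all_zero are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zero len bytes of buf through a volatile-qualified
   pointer, so that each store is treated as an access the optimizer
   should not discard. */
void secure_wipe(void *buf, size_t len) {
    assert(buf != NULL || len == 0);
    volatile unsigned char *p = buf;
    while (len-- > 0) {
        *p++ = 0;
    }
}

/* Hypothetical checker, used only to demonstrate the effect:
   returns 1 when every byte of buf is zero, 0 otherwise. */
int all_zero(const void *buf, size_t len) {
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

A caller would invoke secure_wipe(password, len) immediately before free(password). Where available, platform routines such as explicit_bzero, SecureZeroMemory, or the optional C11 memset_s serve the same purpose.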
The WYSINWYX effect can arise in other ways too. The order of the evaluation of arguments is a very common cause. Also, memory layout is important to consider—the location of variables in memory, on the stack, or in registers. Some security exploits depend strongly on particular layouts.
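The argument-order case can be illustrated with a small C sketch. The function names here (next_id, pair, ambiguous, pinned) are hypothetical and chosen only for this example. C does not specify the order in which the two arguments of pair are evaluated, so the compiler is free to choose either order.

```c
static int counter;

/* Each call has a side effect: it advances the shared counter. */
static int next_id(void) {
    return ++counter;
}

/* Combine two small values so that their order is visible in the result. */
static int pair(int first, int second) {
    return first * 10 + second;
}

/* Unspecified order: the compiler may call next_id() for either argument
   first, so the result is 12 with one choice and 21 with the other. */
int ambiguous(void) {
    counter = 0;
    return pair(next_id(), next_id());
}

/* Order pinned by the source: the two calls are separate statements,
   so the result is always 12. */
int pinned(void) {
    counter = 0;
    int a = next_id();
    int b = next_id();
    return pair(a, b);
}
```

A source analyzer has to guess which order the compiler picked for ambiguous; in the object code the choice has already been made.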
A source analyzer could attempt to model exactly how compilers deal with these constructs. But this is rarely possible, as this behavior isn’t documented. Or they could try to do an analysis that takes into account all possible resolutions of such ambiguities. In practice, however, this is infeasible without giving up performance and precision.
An analyzer that looks at object code suffers from none of these disadvantages. All of the ambiguities and inconsistencies have been resolved by the compiler. In addition, the analysis will consider the code that is actually going to be executed. The analyses of object code can therefore be more precise than similar source-code analyses.
Machine-Code Analysis
Many teams in both industry and academia are working on machine-code analysis techniques and have demonstrated success. Microsoft has tools for finding defects in device drivers. In addition, several researchers at the University of Wisconsin-Madison have reported methods for identifying malicious code and security vulnerabilities. Veracode offers a service for scanning machine code for security issues. With these tools, the challenge is to create an intermediate representation (IR) or model of the code that can be used to bring techniques like static analysis or model checking to bear.
Creating an IR for source code is relatively straightforward. But machine-code analysis is much harder. Source code is well structured. In addition, it is easy to identify variables, functions, types, and other high-level constructs. In contrast, machine code is potentially completely unstructured. It may have been generated from any source language by any compiler or have been written by hand. It also may have undergone optimization and been stripped of symbolic information. In a hostile environment, it may even have been obfuscated.

For some programs, it’s impossible to distinguish between code and data. Functions may not have a single entry point or even be contiguous. There’s no guarantee that any particular calling convention is uniformly respected. In addition, control structures may contain indirect jumps, a construct that’s not present in most source languages. The types of values aren’t apparent: A pointer is indistinguishable from an integer or character. Variables have been translated into memory locations and their sizes aren’t immediately available. Disassemblers like IDA Pro can help with some aspects of IR recovery. But they require manual input to help them resolve some of the more complicated constructs.
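The loss of type information can be sketched in C. The same four bytes are read below once as an integer and once as text; nothing in the bytes themselves says which reading is intended. The function names view_as_u32 and view_as_text are hypothetical, and the integer value obtained depends on the byte order of the machine.

```c
#include <stdint.h>
#include <string.h>

/* Interpret four raw bytes as a 32-bit unsigned integer.
   The result depends on the byte order of the host machine. */
uint32_t view_as_u32(const unsigned char bytes[4]) {
    uint32_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}

/* Interpret the same four raw bytes as characters.
   out must have room for five chars (four bytes plus a terminator). */
void view_as_text(const unsigned char bytes[4], char out[5]) {
    memcpy(out, bytes, 4);
    out[4] = '\0';
}
```

Given the bytes 0x41 0x42 0x43 0x44, the second function yields the text ABCD on an ASCII system, while the first yields a large integer. A machine-code analyzer sees only the bytes and must infer which interpretation the program relies on.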
Some more advanced techniques for IR recovery are the result of joint research between GrammaTech and the University of Wisconsin-Madison. The result of this partnership is CodeSurfer/x86. The IR it recovers includes a disassembly listing, the control-flow and call graphs (with indirections resolved), variable and type information, and fine-grained dependences. As well as being useful for finding defects, these representations are useful for reverse engineering. The figure shows CodeSurfer/x86 being used to inspect the behavior of the Nimda worm. Here, the call graph can be seen despite the author’s intent to obfuscate it using indirect function calls.
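To see why indirect calls hide a call graph, consider the following C sketch. It is a hypothetical illustration and is not taken from Nimda or from CodeSurfer/x86. The call in dispatch compiles to an indirect call through a table entry, so the instruction itself names no target; an analysis has to work out which addresses the table can hold before it can draw the edges from dispatch to the three handlers.

```c
#include <stddef.h>

typedef int (*handler_fn)(int);

static int add_one(int x) { return x + 1; }
static int twice(int x)   { return x * 2; }
static int negate(int x)  { return -x; }

/* Table of possible call targets. In machine code this is just an
   array of addresses sitting in a data section. */
static const handler_fn handlers[] = { add_one, twice, negate };

/* The callee is chosen at run time from the table, so the call site
   does not name add_one, twice, or negate directly. */
int dispatch(unsigned selector, int value) {
    size_t count = sizeof handlers / sizeof handlers[0];
    return handlers[selector % count](value);
}
```

Resolving the indirection here means establishing that the call in dispatch can reach exactly the three functions stored in handlers.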
It’s clear that technologies are starting to become available that will make it possible to analyze machine code for programming flaws and security vulnerabilities. Some tools are already available for limited purposes. Services are available as well. Tools will soon be offered to allow users to do this on their own code. These advances are expected to improve software reliability. They will put pressure on those who supply object-code components to audit those components for both security and quality issues.
References:
[1] Balakrishnan, G., Reps, T., Melski, D., and Teitelbaum, T., “WYSINWYX: What You See Is Not What You eXecute,” Proc. IFIP Working Conference on Verified Software: Theories, Tools, Experiments, 2005, Zurich, Switzerland.
[2] Howard, M., “Some Bad News and Some Good News,” http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dncode/html/secure10102002.asp.

Paul Anderson is VP of Engineering at GrammaTech, a spin-off of Cornell University that specializes in static analysis. He received his B.Sc. from Kings College, University of London and his Ph.D. in computer science from City University London. Paul manages GrammaTech’s engineering team and is the architect of the company’s static-analysis tools.












