Building a commercial grade static analysis presents a lot of interesting problems. Everything not forbidden is compulsory: language specifications are wonderful documents, but in reality anything the user's compiler and runtime accepts is fair game. Analysis abstractions that scale "except in pathological cases" don't scale: analyzing tens of thousands of code bases that routinely exceed millions of lines of code means that those pathological cases inevitably arise. Build a good analysis that runs overnight, and users will ask you to run it in their IDE for near-immediate feedback. A bug finding tool needs a low false positive rate, but a tool aimed at finding security vulnerabilities needs a low false negative rate. Only analyzing "source" code and only starting from main() is insufficient for understanding modern web and mobile applications: frameworks imply a different programming model with a lot of auto-magical program behavior, often including idiosyncratic configuration regimes and various template languages. We'll talk about these problems and how we tackle them.
The development of a high-quality data-flow analysis---one that is precise and scalable---is a challenging task. A concrete client analysis not only requires data-flow but, in addition, type-hierarchy, points-to, and call-graph information, all of which need to be obtained by wisely chosen and correctly parameterized algorithms. Therefore, many static analysis frameworks have been developed that provide analysis writers with generic data-flow solvers as well as those additional pieces of information. Such frameworks ease the development of an analysis by requiring only a description of the data-flow problem to be solved and a set of framework parameters. Yet, analysis writers often struggle when an analysis does not behave as expected on real-world code. It is usually not apparent what causes a failure due to the complex interplay of the several algorithms and the client analysis code within such frameworks. In this work, we present some of the insights we gained by instrumenting the LLVM-based static analysis framework PhASAR for C/C++ code and show the broad area of applications at which flexible instrumentation supports analysis and framework developers. We present five cases in which instrumentation gave us valuable insights to debug and improve both, the concrete analyses and the underlying PhASAR framework.
Different Java compilers and compiler versions, e.g., javac or ecj, produce different bytecode from the same source code. This makes it hard to trace if the bytecode of an open-source library really matches the provided source code. Moreover, it prevents one from detecting which open-source libraries have been re-compiled and rebundled into a single jar, which is a common way to distribute an application. Such rebundling is problematic because it prevents one to check if the jar file contains open-source libraries with known vulnerabilities. To cope with these problems, we propose the tool SootDiff that uses Soot's intermediate representation Jimple, in combination with code clone detection techniques, to reduce dissimilarities introduced by different compilers, and to identify clones. Our results show that SootDiff successfully identifies clones in 102 of 144 cases, whereas bytecode comparison succeeds in 58 cases only.
Most changes to large systems that have been deployed are quite small compared to the size of the entire system. While standard summary-based analyses reduce the code that is reanalysed, they, nevertheless, analyse code that is not changed. For example, a backward summary-based analysis, will examine all the callers of the changed code even if the callers themselves have not changed. In this paper we present a novel approach of having summaries of the callers (called forward summaries) that enables one to analyse only the changed code. An evaluation of this approach on two representative examples, demonstrates that the overheads associated with the generation of the forward summaries is recovered by performing just one or two incremental analyses. Thus this technique can be used at commit-time where only the changed code is available.
Today's computer systems have become increasingly heterogeneous. Data centers integrate accelerators, CPUs with heterogeneous cores and with various ISAs which exhibit different performance and power characteristics. Mobile phones, following a similar trend, switch between fast and energy-efficient cores. Process migration is an important technique to leverage such specialization and heterogeneity. In this work, we target process migration enabled OS-capable heterogeneous platforms and address how to obtain better performance by program analysis: we address the challenge of defining migration points at which the program state is the same across machines and whether these will match phase changes, changes in the program behavior. Our tool-chain employs both static and dynamic analysis to compensate for disadvantages of both techniques to reduce the analyses overhead. Six out of ten benchmarks from different benchmark suites benefit from migration and the migration cost is compensated by the performance gained from migrating.
Datalog has emerged as a powerful tool for expressing static program analyses. Program analysis researchers have built nontrivial code bases in Datalog, but tool support for working with Datalog itself has been lacking. In this paper, we introduce MetaDL, a language extension to Datalog that enables source-level Datalog program analysis within Datalog. We describe several program analyses implemented in MetaDL and report on initial experiences. Our findings show that the language is effective for real-life Datalog analysis and can simplify working with Datalog source code.