The programming languages research community often develops ideas whose worth is evaluated empirically. Compiler optimizations, static and dynamic analyses, program synthesizers, testing tools, memory management algorithms, new language features, and other research developments each depend on some empirical evidence to demonstrate their effectiveness. This reality raises some important questions. What kind of empirical evidence yields the most reliable conclusions? What are the best practices for putting together an empirical evaluation in PL research? Do PL research papers published in top venues always follow these best practices?
To answer these questions, in August of 2017 the SIGPLAN Executive Committee formed the ad hoc committee on Programming Language Research Empirical Evaluations. The committee is chaired by Steve Blackburn, and its members include Matthias Hauswirth, Emery Berger, and Michael Hicks. Shriram Krishnamurthi has acted as an external collaborator. The committee brings together expertise on empirical evaluation methodology, experience in running workshops and publishing papers on that topic, experience introducing artifact evaluation into SIGPLAN conferences, and experience chairing the PCs of major SIGPLAN conferences.
Since its formation, the committee has examined the literature to identify common forms of empirical evaluation applied to the various kinds of PL research. This examination has identified inadequacies that regularly arise, even in papers published recently in highly regarded venues, including PLDI, POPL, ASPLOS, OOPSLA, and ICFP.
The committee has organized and categorized its findings, producing a 1-page best-practices checklist.
This checklist is organized into seven categories, each with associated examples for illustration. Categories are meant to be comprehensive, applying to the breadth of possible empirical evaluations. Examples in each category highlight specific, important areas in which best practice was frequently not followed. These are meant to be useful and illustrative, but they are neither comprehensive nor applicable to every evaluation.
The goal of the checklist is to help authors produce stronger scholarship, and to help reviewers evaluate such scholarship more consistently. Importantly, the checklist is is meant to support informed judgment, not supplant it. The committee’s hope is that this checklist can put all members of the community literally on the same page.
The current checklist (as of March 23, 2018) should be viewed as beta quality. It and the FAQ below have incorporated feedback we have received since the initial checklist was released in January; thanks to those who commented! We continue to solicit feedback and suggestions for improvement. We are particularly interested in
Our goal is to help ensure that current, accepted best practices are followed. Per the Checklist Manifesto, checklists help to do exactly this. Our interest is the good practices for carrying out empirical evaluations as part of PL research. While some practices are clearly wrong, many require careful consideration: Not every example under every category in the checklist applies to every evaluation – expert judgment is required. The checklist is meant to assist expert judgment, not substitute for it. ‘Failure isn’t due to ignorance. According to best-selling author Atul Gawande, it’s because we haven’t properly applied what we already know.’ We’ve kept the list to a single page to make it easier to use and refer back to.
When best practices are not followed, there is a greater-than-necessary risk that the benefits reported by an empirical evaluation are illusory, which harms further progress and stunts industry adoption. The members of the committee have observed many recent cases in which practices in the present checklist are not followed. Our hope is that this effort will help focus the community on presenting the most appropriate evidence for a stated claim, where the form of this evidence is based on accepted norms.
There are no plans to do so, but in time, doing so may make sense.
We believe the most important use of the checklist is to assist authors in carrying out a meaningful empirical evaluation.
We also view the checklist as a way to remind reviewers of important elements of a good empirical evaluation, which they can take into account when carrying out their assessment. However, we emphasize that proper use of the checklist requires nuance. Just because a paper has every box checked doesn’t mean it should be accepted. Conversely, a paper with one or two boxes unchecked may still merit acceptance. Even whether a box is checked or not may be subject to debate. The point is to organize a reviewer’s thinking about an empirical evaluation to reduce the chances that an important aspect is overlooked. When a paper fails to check a box, it deserves some scrutiny in that category.
The committee examined a sampling of papers from the last several years of ASPLOS, ICFP, OOPSLA, PLDI, and POPL, and considered those that contained some form of empirical evaluation. We also considered past efforts examining empirical work (Gernot Heiser’s “Systems Benchmarking Crimes”, the “Pragmatic Guide to Assessing Empirical Evaluations”, and the “Evaluate Collaboratory”). Through regular discussions over several months, we identified common patterns and anti-patterns, which we grouped into the present checklist. Note that we explicitly did not intend for the checklist to be exhaustive; rather, it reflects what appears to us to be common in PL empirical evaluations.
The larger categories represent the general breadth of evaluations we saw, and the examples are intended to be helpful in being concrete, and common. For less common empirical evaluations, other examples may be relevant, even if not presented in the checklist explicitly. For instance, for work studying human factors, the Adequate Data Analysis category might involve examples focusing on the use of statistical tests to relate outcomes in a control group to those in an experimental group. More on this kind of work below.
The boxes next to each item are not intended to require a binary “yes/no” decision. In our own use of the list, we have often marked entries as partially filling a box (e.g., with a dash to indicate a “middle” value) or by coloring it in (e.g., red for egregious violation, green for pass, yellow for something in the middle).
One concern we have heard multiple times is that the example previously titled “threats to validity” in Category 1, Clearly Stated Claims is not useful. The given reason is that “threats to validity” sections in software engineering papers often mention threats of little significance while ignoring real threats. This is unfortunate, but does not eliminate the need to clearly scope claims, highlighting important limitations. For science to progress, we need to be honest about what we have achieved. Papers often make, or imply, overly strong claims. One way this is done is to ignore important limitations. But doing so discourages or undervalues subsequent work that overcomes those limitations because that progress is not appreciated. Progress comes in steps, rarely in leaps, and we need those steps to be solid and clearly defined.
Another concern is that our recommendation (under the category Principled Benchmark Choice) that standard benchmark suites are to be preferred, when available and appropriate, might lead to work that overfits to that benchmark. While this is a problem in theory, and is well known from the machine learning community, our experience is that PL work more often has the opposite problem. Papers we looked at often subset a benchmark, or cherry-picked particular programs. Doing so calls results into question generally, and makes it hard to compare related systems across papers. We make progress more clearly when we can measure it. Good benchmark suites are important, since only with them can we make generalizable progress. Developing them is something that our community should encourage.
PL research sometimes involves user studies, and these are different in character than, say, work that evaluates a new compiler optimization or test generation strategy. Because user studies are currently relatively infrequent in the papers we examined, we have not included them among the category examples. It may be that new, different examples are required for such studies, or that the present checklist will evolve to contain examples drawn from user studies. Nonetheless, the seven category items are broadly applicable and should be useful to authors of any empirical evaluation for a SIGPLAN conference.
Artifact evaluation typically occurs after reviewing a paper, to check that the claims and evidence given in the paper match reality, in the artifact. The checklist is meant to be used by reviewers while judging the paper, and by authors when carrying out their research and writing their paper.
Our manifesto is: Usage should determine content. We welcome feedback from users of the checklist to indicate how frequently they use certain checklist items or how often papers reviewed adhere to them. We also welcome feedback pointing to papers that motivate the inclusion of new items. As the community increasingly adheres to the guidelines present in the checklist, the need for their inclusion may diminish. We also welcome feedback on presentation: please share points of confusion about individual items, so we can improve descriptions or organization.