Getafix: Learning to Fix Bugs Automatically

1. Getafix: Learning to Fix Bugs Automatically Johannes Bader Andrew Scott Facebook jobader@fb.com Facebook andrewscott@fb.com Michael Pradel Satish Chandra Facebook michael@binaervarianz.de Facebook satch@fb.com Abstract Most static analyzers look for instances of common bug categories, such as potential null dereferences, incorrect us- ages of popular APIs, or misuses of particular language con- structs. We observe that fixes for a specific bug category often resemble each other: there is a pattern to them. That is, past human fixes of the same bug category may offer in- sights into how future instances of the bug category should be fixed. Given this observation, can we automate fixing bugs identified by learning from past fixes? This paper addresses the problem of automatically fixing instances of common bug categories by learning from past fixes. We assume two inputs: (1) A set of changes that fix a specific kind of bug, e.g., from the version history of a code base. These changes serve as training data to learn fix patterns from. (2) A piece of code with a static analysis warning that we want to fix. Given only these two inputs, the problem is to predict a fix that addresses the static analysis warning in a way similar or equal to what a human developer would do. By automating the fix generation and leaving to the human only the final decision of whether to apply the fix, the overall effort spent on addressing bugs pointed out by static analyzers can be greatly reduced. We focus on kinds of bugs that have non-trivial yet repet- itive fixes. On the one end of the spectrum, there are bug categories that trivially imply a specific fix. For example, for a warning that suggests a field to be final , implementing an automated fix suggestion is straightforward. Such an auto- fix can be defined by the author of that rule in the static analyzer, without knowing the specific context in which the rule is applied; indeed, some of Error Prone [Aftandilian et al. 2012] rules come with auto-fixes. On the other end of the spectrum are bugs that require complex, application-specific fixes, such as an issue with a UI tab not displaying after a specific series of interactions from a user. Here we target bug categories in between these two extremes, where finding a fix is non-trivial, yet typical fixes belong to a set of recur- ring fix patterns. For such bug categories, there often exists more than one way to fix the problem, and the “right” way to address a specific instance of the bug category depends on the context, e.g., the code surrounding the static analysis warning. Static analyzers help find bugs early by warning about re- curring bug categories. While fixing these bugs still remains a mostly manual task in practice, we observe that fixes for a specific bug category often are repetitive. This paper ad- dresses the problem of automatically fixing instances of com- mon bugs by learning from past fixes. We present Getafix, an approach that produces human-like fixes while being fast enough to suggest fixes in time proportional to the amount of time needed to obtain static analysis results in the first place. Getafix is based on a novel hierarchical clustering algo- rithm that summarizes fix patterns into a hierarchy ranging from general to specific patterns. 
Instead of a computation- ally expensive exploration of a potentially large space of candidate fixes, Getafix uses a simple yet effective ranking technique that uses the context of a code change to select the most appropriate fix for a given bug. Our evaluation applies Getafix to 1,268 bug fixes for six bug categories reported by popular static analyzers for Java, including null dereferences, incorrect API calls, and misuses of particular language constructs. The approach predicts exactly the human-written fix as the top-most suggestion between 12% and 91% of the time, depending on the bug category. The top-5 suggestions contain fixes for 526 of the 1,268 bugs. Moreover, we report on deploying the approach within Facebook, where it contributes to the reliability of software used by billions of people. To the best of our knowl- edge, Getafix is the first industrially-deployed automated bug-fixing tool that learns fix patterns from past, human- written fixes to produce human-like fixes. 1 Introduction Modern production code bases are extremely complex and are updated constantly. Static analyzers can help developers find potential issues (referred to as bugs for the rest of this paper) in their code, which is necessary to keep code quality high in these large code bases. While finding bugs early via static analysis is helpful, the problem of fixing these bugs still remains a mostly manual task in practice, hampering the adoption of static analysis tools [Christakis and Bird 2016]. 1
2. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra Extend existing if condition conjunc- tively: // ... if (key != 0 && mThread != null ) { mThread.cancel(key); } else { ThreadPool.tryCancel(key); } // ... Add an early return: void updateView() { // ... View v = ViewUtil.findView(name); if (v == null) { return; } data.send(((UpdatableView) v) .getContentMgt(), "UPDATE"); // ... } Protect call with conditional expres- sion: @Override public void onClose() { final Window w = win != null ? win.get() : null ; if (w != null && w.getProc().isActive()) { w.getProc().removeListeners(); } } Figure 1. Different real-world fixes of potential null dereference bugs. As an example of a bug category targeted in this work, consider NullPointerException s – still one of the most pre- vailing bugs in Java and other languages. If a static ana- lyzer warns about a potential null dereference, develop- ers may fix the problem in various ways. Figure 1 shows three anonymized examples of fixes of null dereference bugs, which add a conjunct to an existing if condition, replace a call with a ternary operation, and add an early return, re- spectively. While all these fixes introduce some kind of null check, the exact fix heavily depends on the already existing code. Beyond these examples, there are even more ways of fixing null dereference bugs, e.g., by adding a new if state- ment or by extending an existing if condition disjunctively. Learning all these fix patterns and deciding which one to apply to a given piece of buggy code is a non-trivial problem. Our work aims at automating bug fixing in large-scale, industrial software development. This setting leads to several interesting challenges to deal with: based on a novel hierarchical, agglomerative clustering tech- nique that produces a hierarchy of fix patterns ranging from very general to very specific fixes. Third, given a previously unseen bug to fix, Getafix finds suitable fix patterns, ranks all candidate fixes, and suggests the top-most fix(es) to the developer. As a check during the third step, Getafix validates each suggested fix against the static analyzer to ensure that the fix removes the warning. Note that the validation against the static analyzer is a one-time effort per fix, keeping the computational resources within reasonable bounds. The Getafix approach and the constraints that motivated it differ from the assumptions made by automated program repair techniques [Goues et al. 2019], the most practical of which for large systems are generate-and-validate repair systems [Goues et al. 2012; Kim et al. 2013; Le et al. 2016]. 1. Because it must produce fixes quickly, Getafix neither generates, nor validates a large number of fix candi- dates. Getafix produces a ranked list of fix candidates based entirely on past fixes it knows about and on the context in which a fix is applied. This is done without any validation; Getafix’s ranking reflects the confi- dence it has in the fix. It then offers to the developer one (or a few, configurable) top-ranked fix(es) after a validation step. Note that Getafix does not go down its list exploring additional fix candidates until a validated fix is found. 2. In contrast to generate-and-validate repair systems that use a test suite as validation filter, Getafix uses the same static analyzer as the validation filter to make sure the warning goes away. This is because at this point in the development, we cannot assume that the code is ready to pass a test suite. 3. 
Prior repair techniques rely on statistical fault localiza- tion (e.g., based on coverage of passing and failing test cases) that determines where in the code a bug should be fixed. Getafix avoids this step by exploiting the fact that static analyzers pinpoint the location of the bug. Consequently, it also avoids the need to try applying fix candidates at various potential fault locations. 4. Finally, the goal that Getafix sets for itself is not merely a fix that makes its validation pass, but is as close • To reduce the human time spent on fixing a bug, the ap- proach may propose only a small number of potential fixes, ideally only one fix. • To make this fix acceptable to developers, the sug- gested fix should be human-like: very similar to or exactly the same as a fix a human developer would implement. • To suggest fixes quickly, as well as to keep the com- puting resources required to find fixes within bounds, the approach cannot explore a large space of candidate fixes and validate each of them against a test suite or any other computationally expensive validation rou- tine. Addressing these challenges, we present Getafix, an auto- mated technique that learns recurring fix patterns for static analysis warnings and suggests fixes for future occurrences of the same bug category. Getafix produces human-like fixes, and does so fast enough (typically within 10 seconds) to offer a fix suggestion in roughly the same magnitude of time as a human developer waits for static analysis results anyway. In a nutshell, the approach consists of three main steps. First, it splits a given set of example fixes into AST-level edit steps. Second, it learns recurring fix patterns from these edit steps, 2
3. Getafix: Learning to Fix Bugs Automatically • A novel clustering technique for discovering repetitive fix patterns in a way that is more general than a pre- vious greedy approach and that preserves important context needed for applying fixes. • Simple yet effective statistical ranking technique to select the best fix(es) to suggest, which enables predict- ing human-like fixes among the top few suggestions. • Empirical evidence that the technique works well for a diverse set of bug categories taken from Error Prone and Infer. as possible to what a human would do in that situ- ation. (While a generic null check would suppress a null dereference warning, as Figure 1 shows, a human might choose a very specific kind of fix in a particular context.) By contrast, most program repair systems aim to produce a fix that is functionally correct, but may not match a human fix. Another related line of work is on learning edit patterns, including potential bug fix patterns, from version histo- ries [Brown et al. 2017; Rolim et al. 2017, 2018]. In contrast to Getafix, these techniques learn from all code changes, leav- ing the task of identifying interesting fix patterns to a human. A key insight of our work is that learning from fixes for a specific static analysis warning avoids this human effort. We evaluate Getafix in two ways. One part of our evalu- ation applies the approach to a total of 1,268 bug fixes for six kinds of warnings reported by two publicly available and widely used static analyzers for Java. The bug categories include potential null dereferences, incorrect uses of Java’s reference equality, and common API misuses. After learning fix patterns from several dozens to several hundreds of exam- ples, Getafix predicts exactly the human fix as the top-most suggestion for 12% to 91% of all fixes, depending on the bug category. In a setting where developers are willing to inspect up to five fix suggestions, the percentage of correctly pre- dicted fixes even ranges between 19% and 92%, containing fixes for 526 of the 1,268 bugs. Because these results indicate how often the predicted fix exactly matches the human fix, as opposed to producing any human-acceptable fix, these results provide a lower bound of the effectiveness of Getafix. The other part of our evaluation deploys Getafix to pro- duction at Facebook, where it now contributes to the stability of apps used by billions of people. At Facebook, Getafix cur- rently suggests fixes for bugs found by Infer [Calcagno et al. 2015] 1 , a static analysis tool that identifies issues, such as null dereferences, in Android and Java code. For example, the fixes in Figure 1 have been suggested by Getafix and accepted by developers at Facebook. We find that developers accept around 42% of all fixes suggested by Getafix, helping to save precious developer time by addressing bugs with a single click. To the best of our knowledge, Getafix is the first industrially- deployed automated bug-fixing tool that learns fix patterns from past, human-written commits, and produces human- like fixes in a short amount of time. 2 Overview Getafix consists of three main components, organized in a learning phase and a prediction phase. In the following we will describe their functionality and challenges at a high level, followed by a more detailed description in later sec- tions. Figure 2 gives an overview of the approach. During the learning phase, a set of pairs of bugs and their fixes is given as training data to Getafix. 
As training data can serve any collection of past human code changes tied to a specific signal, such as a static analysis warning, a type error, a lint message, or simply the fact that a change was suggested during human code review. Our evaluation focuses on static analysis warnings as the signal, i.e., all bugs and fixes have been detected as instances of a specific bug category by a static analyzer, e.g., as potential null dereferences. During the prediction phase, Getafix then takes previously unseen code that comes with the same signal as the training examples and produces a bug fix. The first step of the learning phase is a tree differencer, which identifies changes at the AST level. It generates con- crete edits, which are pairs of sub-ASTs of the original before and after ASTs, representing a specific change that could be replayed on different code by replacing instances of the edit’s before AST with its after AST. Based on the concrete, AST-level edits, the second step is a new way of learning fix patterns. To generalize from specific fix examples, fix patterns have “holes”, i.e., pattern variables, that may match specific subtrees, similar to the represen- tation used by Rolim et al. [2017] and Long et al. [2017]. A technical contribution is a novel hierarchical, agglomerative clustering technique that organizes fix patterns into a hierar- chical structure. The approach derives different variants of fix patterns, ranging from very specific patterns that match few concrete edits to very general fix patterns that match many concrete edits. Another contribution is to include into the fix patterns not only the code changes themselves, but also some surrounding context. This context is important to Contributions This paper makes the following contribu- tions: • Fully automated suggestion of bug fixes for static anal- ysis bug reports, computed in seconds, without re- quiring computationally expensive search over a large space of candidate bug fixes. 1 https://code.fb.com/developer-tools/open-sourcing-facebook-infer- identify-bugs-before-you-ship/ 3
4. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra Code with static analysis warning Static analyzer Examples of bugs and their fixes Version Version history history Mining & hierarchical clustering of fix patterns Tree differencer Apply fix patterns & rank fix candidates Bug fix Validation against static analyzer Data collection Learning Prediction Figure 2. Overview of Getafix. public class Worker { public void doWork() { task.makeProgress(); } public long getRuntime() { return now - start; } } ↣ • a label, e.g., "BinEx" (for binary expression), "Literal", or "Name", • a possibly empty value, e.g., + , 42 , or foo , and • child nodes, each having a location that describes their relationship to the parent node, e.g., "left" and "right" to address the left/right subexpression of a binary ex- pression. More formally, we define the set of trees as follows. public class Worker { private long getRuntime() { return now - start; } public void doWork() { if (task == null) return; task.makeProgress(); } } Figure 3. Concrete edits extracted by the tree differencer. Definition 3.1 (Tree). Tree = Label × Value × Ø (Location × Tree) k k ∈N decide which out of multiple possible fix patterns to apply to a given piece of code. After learning fix patterns, which is a once-per-bug-category effort, the prediction phase of Getafix applies the patterns to previously unseen buggy code to produce a suitable fix. Since Getafix aims at predicting fixes without a computationally expensive validation of many fix candidates, it internally ranks candidate fixes. We present a simple yet effective, sta- tistical ranking technique that uses the additional context extracted along with fix patterns. Before suggesting a fix to the developer, Getafix validates the predicted fix against the same tool that provided the sig- nal for the bug category, e.g., a static analysis, type checker, or linter. Getafix is agnostic to the signal used, so we assume a possibly computationally expensive black box component. 3 where Label, Location and Value are sets of strings. For readability, we represent ASTs using term notation rather than tuples. For example, parsing x = y + 2 results in an AST Assign(x : Name, + : BinEx(y : Name, 2 : Literal)) . Child nodes are listed in parentheses following their parent node. Parentheses are omitted if no child nodes exist. Values, if not empty, are prepended to labels using syntax borrowed from type judgments. 3.2 Given two trees, we define edits as follows. Definition 3.2 (Tree edits). Edit = Tree × Tree × P Mapping Mapping = TreeRef × TreeRef × {mod, unmod} Tree differencer Edits are triples containing the before and after ASTs, fol- lowed by a set of mappings. A mapping is a triple referencing a node of the before and a node of the after AST, as well as a flag indicating whether the pair of subtrees is part of a mod- ification (mod means the node is part a modification, unmod means that the mapped subtree is unmodified). TreeRef is the set of references uniquely identifying a node within an AST. We write edits in source code as before code ↣ after code , e.g., x = y + 2 ↣ x = 3 + y . Combining this notation with our AST notation from above, we write tree edits as before AST ↣ after AST , where we write modified subtrees bold (whenever the distinction The first step of Getafix takes a set of example bug fixes, and decomposes each fix into fine-grained edits. These ed- its provide the basic ingredients for learning and applying fix patterns in later steps. 
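Before turning to the example in Figure 3, Definition 3.1 can be made concrete with a small Java sketch. It is illustrative only (the paper does not show Getafix's internal classes); the name TreeNode, the child-location names, and the toTerm method are hypothetical, but toTerm reproduces the term notation used above.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// A minimal AST node as in Definition 3.1: a label, a possibly empty value,
// and child nodes indexed by their location relative to the parent.
final class TreeNode {
    final String label;   // e.g. "BinEx", "Literal", "Name"
    final String value;   // e.g. "+", "42", "foo"; "" if empty
    final Map<String, TreeNode> children = new LinkedHashMap<>(); // location -> child

    TreeNode(String label, String value) {
        this.label = label;
        this.value = value;
    }

    TreeNode child(String location, TreeNode node) {
        children.put(location, node);
        return this;
    }

    // Term notation: values, if not empty, are prepended to labels;
    // children are listed in parentheses after their parent.
    String toTerm() {
        String head = value.isEmpty() ? label : value + " : " + label;
        if (children.isEmpty()) return head;
        return head + children.values().stream()
                .map(TreeNode::toTerm)
                .collect(Collectors.joining(", ", "(", ")"));
    }

    public static void main(String[] args) {
        // The AST for "x = y + 2" from Section 3.1.
        TreeNode ast = new TreeNode("Assign", "")
            .child("lhs", new TreeNode("Name", "x"))
            .child("rhs", new TreeNode("BinEx", "+")
                .child("left", new TreeNode("Name", "y"))
                .child("right", new TreeNode("Literal", "2")));
        // Prints: Assign(x : Name, + : BinEx(y : Name, 2 : Literal))
        System.out.println(ast.toTerm());
    }
}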
For example, Figure 3 shows a code change and the three fine-grained edits that Getafix extracts: (i) inserting an early return if task is null (shown in green); (ii) changing public to private (shown in red); and (iii) moving the doWork method (shown in blue). 3.1 Tree Edits Trees Getafix extracts fine-grained edits based on ASTs. For our purposes, a node in an AST has: 4
5. Getafix: Learning to Fix Bugs Automatically will not emit a concrete edit rooted at the body of getRuntime as it does not contain a modification. In addition to making concrete edits rooted at mapped nodes, we also create a con- crete edit rooted at the parent of each contiguous region of modified nodes in the before and after trees. This is done to ensure that we still learn patterns from edits which occur near other changes. The approach is designed to extract too many rather than too few concrete edits. The rationale is that the clustering step of Getafix, which is explained in detail in the following, automatically prioritizes patterns with the best level of gran- ularity, since there will exist multiple similar concrete edits. For instance, when combining the concrete edits emitted for Figure 3 with further concrete edits from bug fixes for null dereferences, the insertion of an if statement is likely to appear more often than the move or update. In contrast, concrete edits containing further modifications, e.g., rooted further up in the AST, are likely to be discarded as noise, because there are no similar concrete edits to create a cluster with. is relevant) and give nodes connected by a mapping match- ing indices. For example, two concrete edits emitted for the above code change are + : BinEx 0 (y : Name 1 , 2 : Literal) ↣ + : BinEx 0 (3 : Literal, y : Name 1 ) and Assign 0 (x : Name 1 , + : BinEx 2 (y : Name 3 , 2 : Literal)) ↣ Assign 0 (x : Name 1 , + : BinEx 2 (3 : Literal, y : Name 3 )) . An alternative to tree-based reasoning about edits would be to use a line-based diffing tool. However, line-based diffing reasons about code on a more coarse-grained level, ignoring the structural information provided by ASTs. For example, given the change in Figure 3, line-based diffing would mark both methods as fully removed and inserted, whereas tree- based edits can encode the move and hence also detect the insertion within the moved method as a concrete edit. 3.3 Extracting Concrete Edits To extract edits from a given pair of ASTs, Getafix builds on the GumTree algorithm [Falleri et al. 2014], a tree-based technique to compute fine-grained edit steps that lead from one AST to another. The approach extracts four kinds of edit steps: deletion, insertion, move, and update. An unmapped node in the before or after tree is considered a deletion or insertion, respectively. A mapped pair of nodes whose parents are not mapped to each other is considered a move. A mapped pair of nodes with different values is considered an update and may be part of a move at the same time. Any pairs of nodes involved in one of the above operations is consider to be modified, whereas all other pairs of mapped nodes are considered unmodified. An entire subtree is considered unmodified if all its nodes are unmodified. For the example in Figure 3, the insertion of the if state- ment is an addition, the change from public to private is an update, and method doWork has been moved. The subtree representing the call task.makeProgress() is unmodified. In contrast, the subtree representing the block surrounding this call is modified, due to the insertion of the if statement. As demonstrated with the insertion combined with a move in Figure 3, modifications can be nested and it is unclear what the granularity of a concrete edit should be. Maybe the insertion itself is the fix, maybe the move is important. 
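Before addressing that granularity question, the classification of edit steps described above can be illustrated with a small sketch on top of a GumTree-style node mapping. The Node record and the classify method are hypothetical stand-ins rather than the actual GumTree or Getafix API, and root nodes are handled simplistically; the sketch only mirrors the rules stated in Section 3.3.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class EditClassifier {
    // Illustrative node: real differencers track node identity; a record is used
    // here only to keep the sketch short.
    record Node(String label, String value, Node parent) {}

    static List<String> classify(List<Node> before, List<Node> after,
                                 Map<Node, Node> mapping) {
        List<String> steps = new ArrayList<>();
        for (Node b : before) {
            Node a = mapping.get(b);
            if (a == null) {                        // unmapped before node: deletion
                steps.add("delete " + b.label());
                continue;
            }
            // Mapped pair whose parents are not mapped to each other: move.
            if (b.parent() != null && a.parent() != null
                    && mapping.get(b.parent()) != a.parent()) {
                steps.add("move " + b.label());
            }
            // Mapped pair with different values: update (may also be part of a move).
            if (!Objects.equals(b.value(), a.value())) {
                steps.add("update " + b.value() + " -> " + a.value());
            }
        }
        for (Node a : after) {                      // unmapped after node: insertion
            if (!mapping.containsValue(a)) steps.add("insert " + a.label());
        }
        return steps;
    }

    public static void main(String[] args) {
        // Mirrors the update in Figure 3: the modifier of getRuntime changes
        // from public to private while the method node itself stays mapped.
        Node beforeMethod = new Node("MethodDecl", "getRuntime", null);
        Node afterMethod = new Node("MethodDecl", "getRuntime", null);
        Node beforeMod = new Node("Modifier", "public", beforeMethod);
        Node afterMod = new Node("Modifier", "private", afterMethod);
        Map<Node, Node> mapping =
                Map.of(beforeMethod, afterMethod, beforeMod, afterMod);
        // Prints: [update public -> private]
        System.out.println(classify(List.of(beforeMethod, beforeMod),
                                    List.of(afterMethod, afterMod), mapping));
    }
}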
Our algorithms are language agnostic, so without domain knowledge there is no robust strategy for grouping modi- fications into concrete edits. We address this challenge by extracting an entire spectrum of concrete edits based on the modifications reported by GumTree: Any pair of mapped nodes will be turned into the roots of a new concrete edit, if that concrete edit contains at least one modification. In the example of Figure 3, we would hence extract concrete edits rooted at the body of doWork (contains the insertion), at the method declaration of doWork (contains the insertion as well), at the method declaration of getRuntime (contains the update), at the class body level (contains all modifications), and any of its ancestors for the same reason. By contrast, we 4 Learning Fix Patterns Given the set of fine-grained, tree-level edits extracted by the tree differencer described in the previous section, the next step is to learn fix patterns, i.e., recurring edit patterns observed in fixes for a specific kind of bug. Intuitively, an edit pattern can be thought of as a generalization of multiple edits. To abstract away details of concrete edits, edit patterns may have “holes”, or pattern variables, to represent parts of a tree where concrete edits differ. A key contribution of Getafix is a novel algorithm that derives a hierarchy of edit patterns, which contains edit patterns at varying levels of generality, ranging from concrete edits at the leaves to abstract edit patterns at the root. The algorithm to derive a hierarchy of edit patterns is based on a generalization operation on edit patterns, which we obtain through anti-unification [Kutsia et al. 2014], an existing method of generalizing among different symbolic expressions. We start by presenting the generalization op- eration (Section 4.1) and then show how Getafix uses this operation to guide a hierarchical clustering algorithm (Sec- tion 4.2). Finally, Section 4.3 describes how Getafix augments edit patterns with context information that helps deciding when to apply a learned edit pattern to new code. 4.1 4.1.1 Generalizing Edit Patterns via Anti-Unification Tree Patterns and Tree Edit Patterns A set of concrete trees can be abstracted into a tree pattern. Formally, we extend our definition of trees by adding holes: Definition 4.1 (Tree pattern). Let Hole = (Label ∪ {?}) × N, then the set of tree patterns is Ø Tree + = Label × Value × (Location × Tree + ) k ∪ Hole k ∈N 5
6. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra Holes may have a label α (we write h k : α ) in which case they match an arbitrary subtree that has a root with label α. If the label is omitted (we write h k : ? ) the hole matches any subtree. Holes are indexed, allowing tree patterns to express whether two holes must match identical subtrees. For example, consider the following trees t 1 , t 2 , t 3 ∈ Tree and tree patterns p 1 , p 2 , p 3 , p 4 ∈ Tree + : antiUnify( v : α (a 1 , ... , a n ) , w : β(b 1 , ... , b m ) ) =  v : α (c 1 , ... , c n ) if α = β ∧ v = w ∧ n = m       where c i = antiUnify(a i , b i ) ∀ i ∈ {1, ..., n} k : α  h if α = β ∧ (v , w ∨ n , m)     h k : ? otherwise  where n, m ∈ N, a 1 , ..., a n , b 1 , ..., b m ∈ Tree + t 1 = + : BinEx(x : Name, 42 : Literal) α, β ∈ Label, v, w ∈ Value, t 2 = + : BinEx(x : Name, y : Name) fresh hole index k ∈ N t 3 = + : BinEx(x : Name, x : Name) p 1 = + : BinEx h : Name, h : Literal 0 1 antiUnify( v : α (a 1 , ... , a n ) , h t : β ) =  antiUnify( h s : α , w : β(b 1 , ... , b m ) ) = p 2 = h 0 : BinEx p 3 = + : BinEx h 0 : Name, h 1 : ?  p 4 = + : BinEx h 0 : ?, h 0 : ? antiUnify( h s : α , h t : β ) = ( h k : α if α = β h k : ? otherwise  where n, m, s, t ∈ N, a 1 , ..., a n , b 1 , ..., b m ∈ Tree + Pattern p 1 matches only t 1 , since t 2 and t 3 do not match the label of hole h 1 . Patterns p 2 and p 3 match all three ASTs. Finally, p 4 matches only t 3 , since the pattern requires both operands of the binary expression to be identical. Similar to tree patterns, we define the set Edit + of edit patterns analogous to Edit (see Definition 3.2), but using Tree + instead of Tree. 4.1.2 α, β ∈ Label, v, w ∈ Value, fresh hole index k ∈ N Figure 4. Anti-unification of tree patterns. For conciseness, we assume the antiUnify function performs memoization, so that it reuses holes where possible, without explicitly tracking them. Generalizing Tree Patterns To learn recurring edit patterns, Getafix generalizes concrete edits into patterns through a process called anti-unification. As an intermediate step, we first present how to generalize two tree patterns into a generalization that describes both of them. The generalization is constructed such that each hole has a set of substitutions that can be used to recover the original input tree patterns. There always exists a trivial generalization, which has a single hole at the root. How- ever, that generalization is generally not very interesting and therefore anti-unification seeks to find the least general generalization that can describe the input trees, meaning the generalization retains as much information as possible. Figure 4 formally describes the anti-unification function for tree patterns. The function is defined recursively on the tree structure. It merges subtrees that have the same labels, the same values, and the same number of children into a combined subtree, and replaces them by a hole otherwise. Merging a subtree with a subtree represented by a hole yields another hole. When merging two holes, the function keeps the label of the hole whenever it matches. For example, let be the ASTs for assignment statements a = a + a; and b = b + 2; , respectively. Then antiUnify(a, b) = Assign h 0 : Name, + : BinEx h 0 : Name, h 1 : ?  Note how hole h 0 : Name can be reused since in both instances it substitutes variables a and b . While h 1 : ? also substitutes a in AST a, it does not substitute b in AST b, so a fresh hole is created. 
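Before considering the less precise generalizations discussed next, the anti-unification rules of Figure 4 can be sketched compactly, assuming the memoization mentioned in its caption (the same pair of generalized subtrees reuses the same hole). Pat and antiUnify below are illustrative names, not Getafix's implementation; running the sketch on the a = a + a; / b = b + 2; example reproduces the generalization just discussed.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class AntiUnify {
    // A tree pattern: either a labeled node with a value and children, or a hole
    // that optionally carries a label.
    record Pat(String label, String value, List<Pat> children,
               boolean isHole, int holeIndex, String holeLabel) {
        static Pat node(String label, String value, Pat... kids) {
            return new Pat(label, value, List.of(kids), false, -1, null);
        }
        static Pat hole(int index, String label) {
            return new Pat(null, null, List.of(), true, index, label);
        }
        public String toString() {
            if (isHole) return "h" + holeIndex + " : " + (holeLabel == null ? "?" : holeLabel);
            String head = value.isEmpty() ? label : value + " : " + label;
            return children.isEmpty() ? head
                    : head + children.stream().map(Pat::toString)
                             .collect(Collectors.joining(", ", "(", ")"));
        }
    }

    private final Map<List<Pat>, Pat> memo = new HashMap<>();
    private int nextHole = 0;

    Pat antiUnify(Pat a, Pat b) {
        // Same label, same value, same number of children: merge recursively.
        if (!a.isHole() && !b.isHole()
                && a.label().equals(b.label())
                && a.value().equals(b.value())
                && a.children().size() == b.children().size()) {
            List<Pat> kids = new ArrayList<>();
            for (int i = 0; i < a.children().size(); i++)
                kids.add(antiUnify(a.children().get(i), b.children().get(i)));
            return new Pat(a.label(), a.value(), kids, false, -1, null);
        }
        // Otherwise introduce a hole, keeping the label if it agrees, and reuse
        // the hole created earlier for the same pair of subtrees.
        return memo.computeIfAbsent(List.of(a, b), key -> {
            String la = labelOf(a), lb = labelOf(b);
            return Pat.hole(nextHole++, la != null && la.equals(lb) ? la : null);
        });
    }

    private static String labelOf(Pat p) { return p.isHole() ? p.holeLabel() : p.label(); }

    public static void main(String[] args) {
        // The example from Section 4.1.2: "a = a + a;" and "b = b + 2;".
        Pat a = Pat.node("Assign", "",
                Pat.node("Name", "a"),
                Pat.node("BinEx", "+", Pat.node("Name", "a"), Pat.node("Name", "a")));
        Pat b = Pat.node("Assign", "",
                Pat.node("Name", "b"),
                Pat.node("BinEx", "+", Pat.node("Name", "b"), Pat.node("Literal", "2")));
        // Prints: Assign(h0 : Name, + : BinEx(h0 : Name, h1 : ?))
        System.out.println(new AntiUnify().antiUnify(a, b));
    }
}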
Because the hole’s substitutions have mismatching labels, ? is used. Instead of the above pattern,  Assign h 0 : Name, + : BinEx h 1 : Name, h 2 : ? also gener- alizes a and b, but the information that h 0 : Name and h 1 : Name substitute identical subtrees   would be lost. Pat- tern Assign h 0 : ?, + : BinEx h 1 : ?, h 2 : ? would be an even more general generalization, dropping the information about labels even though they matched. Instead of computing these generalizations, anti-unification produces the least general generalization of the given patterns. 4.1.3 Generalizing Edit Patterns We now extend the idea of generalizing tree patterns to tree edit patterns, which yields the generalization operation at the core of Getafix’s learning algorithm. Figure 5 formally defines anti-unification of edit patterns. The basic idea is to first anti-unify the before trees and then the after trees, while using a single set of substitutions between the before a = Assign(a : Name, + : BinEx(a : Name, a : Name)) and b = Assign(b : Name, + : BinEx(b : Name, 2 : Literal)) 6
7. Getafix: Learning to Fix Bugs Automatically and after trees, to indicate that a hole in the before tree cor- responds to the same AST node in the after tree. To focus the generalized edit patterns on the modified parts of a tree, the approach uses a helper function stripUnmod to drop un- modified subtrees before anti-unifying the trees. Afterwards, Getafix populates the unmodified nodes back and maps them to each other. As an example, assume the training data contains two source edits: { f(); g(); x = 1; } ↣ { f(); x = 1; if (c) g(); } and { return; y = 2; } ↣ { y = 2; if (c) onResult(); } . The corresponding concrete edits are the following: anti-unifies this unmodified node and adds it back, giving the final edit pattern before д ′′ ↣ after д ′′ : before д ′′ = Block 10 (h 0 : ?,  Assign 11 h 2 : Name, h 3 : Num )  after д ′′ = Block 10 (Assign 11 h 2 : Name, h 3 : Num ,  If c : Name, h 1 : Call ) 4.2 Hierarchical Clustering of Edit Patterns Based on the generalization operation described above, Getafix uses a hierarchical clustering algorithm to derive a hierarchy of recurring edit patterns. The basic idea is to start with all concrete edits and to iteratively combine pairs of edits until all edits are combined into a single edit pattern. At each iter- ation, the algorithm picks a pair of edit patterns to combine, so that their anti-unification yields the least loss of concrete information. The sequence of generalization steps eventu- ally yields a hierarchy of edit patterns, where leaf nodes represent concrete edits and higher-level nodes represent increasingly generalized edit patterns. Before delving into our algorithm, consider the example hierarchy in Figure 6, which visualizes a learned hierarchy of edit patterns as a so-called dendrogram. Each pair of red and green boxes represents one edit pattern. The dendrogram is derived from four concrete edits, which are shown as edit patterns without any holes in them. The left-most edit pattern is the most general description of all four concrete edits, while all edit patterns in between represent different levels of generalization that comprise different subsets of the concrete edits. The main benefit of keeping intermediate edit patterns, instead of greedily merging all edits into a single generalization, is that Getafix can choose the most suitable pattern to apply to a new piece of code (as explained in detail in Section 5). before 1 = Block 0 (f : Call 1 , g : Call 2 , Assign 3 (x : Name 4 , 1 : Num 5 )) after 1 = Block 0 (f : Call 1 , Assign 3 (x : Name 4 , 1 : Num 5 ), If(c : Name, g : Call 2 )) before 2 = Block 6 (Return, Assign 7 (y : Name 8 , 2 : Num 9 )) after 2 = Block 6 (Assign 7 (y : Name 8 , 2 : Num 9 ), If(c : Name, onResult : Call)) We first drop the unmodified subtrees and anti-unify the result: stripUnmod(before 1 ) = Block 0 (g : Call 2 ) stripUnmod(after 1 ) = Block 0 (If(c : Name, g : Call 2 )) stripUnmod(before 2 ) = Block 6 (Return) stripUnmod(after 2 ) = Block 6 (If(c : Name, onResult : Call)) 4.2.1 before д = antiUnify(stripUnmod(before 1 ), stripUnmod(before 2 ))  = Block h 0 : ? Clustering Algorithm To obtain a hierarchy of edit patterns from a given set of con- crete edits, Getafix performs a hierarchical, agglomerative clustering algorithm. The algorithm maintains a working set W of edit patterns. Initially, this working set contains one edit pattern for each of the given concrete edits. The main loop of the algorithm repeats the following steps until W is reduced to a singleton set: 1. 
Pick a pair e 1 , e 2 of edit patterns from W . 2. Generalize the edit patterns using anti-unification, which yields e 3 = antiUnify(e 1 , e 2 ). 3. Remove e 1 and e 2 from W , and add e 3 to W instead. 4. Set e 3 as the parent of e 1 and e 2 in the hierarchy. after д = antiUnify(stripUnmod(after 1 ), stripUnmod(after 2 )) = Block(If(c : Name, h 1 : Call)) We then attempt to populate back mappings where possible. Mappings are re-established between any pair of nodes that are the result of anti-unifying mapped nodes. Here, both edits agree about the blocks mapping to each other (mappings 0 and 6): before д ′ = Block 10 (h 0 : ?) after д ′ = Block 10 (If(c : Name, h 1 : Call)) Finally, Getafix populates back unmodified nodes where possible. Both edits have an assignment statement, Assign 11 , that swaps places with the modified statement. The approach 4.2.2 Merge Order An important component of our clustering algorithm is how to pick the pair e 1 , e 2 of edit patterns to generalize. Getafix aims at minimizing the loss of concrete information at each
8. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra antiUnify(before 1 ↣ after 1 , before 2 ↣ after 2 ) = populate(before д ↣ after д , before 1 ↣ after 1 , before 2 ↣ after 2 ) where before д = antiUnify(stripUnmod(before 1 ), stripUnmod(before 2 )) after д = antiUnify(stripUnmod(after 1 ), stripUnmod(after 2 )) for before 1 ↣ after 1 , before 2 ↣ after 2 ∈ Edit + stripUnmod(v : α (a 1..i −1 , a i , a i +1..n )) = stripUnmod(v : α (a 1..i −1 , a i +1..n )) (drop unmodified children) stripUnmod(v : α (a 1 , ... , a n )) = v : α (stripUnmod(a 1 ), ... , stripUnmod(a n )) (if no more children are unmodified, recurse) for n, i ∈ N, a 1 , ..., a n ∈ Tree + , α ∈ Label, v ∈ Value populate(e д , e 1 , e 2 ) = Add back unmodified nodes and mappings Figure 5. Anti-unification of edit patterns. To reuse holes created for the before and after trees, we assume that antiUnify uses memoization across calls to the function. Definition 4.3 (Partial order of edit patterns). ⟨Edit + , antiUnify⟩ induces a partial order ⊑ on Edit + (read “more precise than”) by defining e ⊑ e ′ ⇐⇒ antiUnify(e, e ′ ) = e ′ . anti-unification step. To this end, we exploit the fact that the anti-unification operation itself gives a partial order of edit patterns, which orders patterns by their level of abstraction. Given this partial order of edit patterns, the algorithm picks a pair e 1 , e 2 that produces a minimal generalization w.r.t. the partial order. In the following, we first describe the partial order of edit patterns and then present an efficient approx- imation of it, which our Getafix implementation is based on. The partial order of edit patterns is based on a partial order of tree patterns, which builds upon the anti-unification operation: 4.2.3 Approximations for Efficiency To ensure that Getafix scales to a large number of edits, our implementation performs three approximations. First, since the partial order of edit patterns is computationally expen- sive, Getafix approximates it using the following steps, each acting as a tiebreaker for the previous step. When picking edit pairs to merge, Getafix prefers generalizations that Definition 4.2 (Partial order of tree patterns). Anti-unification represents a least upper bound operation on Tree + . The re- sulting semilattice ⟨Tree + , antiUnify⟩ has Tree as minimal elements and h k : ? (regardless of k) as its greatest ele- ment. ⟨Tree + , antiUnify⟩ hence induces a partial order ⊑ on Tree + (read “more precise than”) by defining p ⊑ p ′ ⇐⇒ antiUnify(p, p ′ ) = p ′ . The definition relies on equality of tree patterns. We con- sider two tree patterns p 1 , p 2 ∈ Tree + equal if they are struc- turally identical, modulo bijective substitution of hole indices. For example,   + : BinEx h 0 : Name, h 1 : ? = + : BinEx h 5 : Name, h 3 : ? 1. yield patterns without any unbound holes in the after part over those with such holes, 2. preserve more mappings between modified nodes, 3. have fewer holes and holes generalizing away smaller subtrees, 4. preserve more mappings between unmodified nodes with labels, 5. preserve more error context (explained in Section 4.3), and 6. preserve more mappings between unmodified nodes. Second, Getafix exploits the fact that after performing one merge, the new set of edit patterns and hence also the new set of merge candidates is largely unchanged. To reduce the number of times that the same pair of potential merges is compared, the implementation uses the nearest-neighbor- chain algorithm [Benzécri 1982]. 
Third, because the clustering algorithm is quadratic in the number of concrete edits, we reduce the number of edits con- sidered at once. Specifically, the implementation partitions all given edits by the labels of their modified nodes in the before and after trees. As edits for which the labels differ   + : BinEx h 0 : Name, h 1 : ? = + : BinEx h 1 : Name, h 0 : ?   + : BinEx h 0 : Name, h 1 : ? , + : BinEx h 2 : Name, h 2 : ? As a consequence, all single-hole patterns h k : ? are iden- tical. Finally, we define the partial order of edit patterns analo- gous to Definition 4.2: 8
9. Getafix: Learning to Fix Bugs Automatically Figure 6. Dendrogram showing concrete edits merged into more abstract edit patterns. The vertical black bars correspond to levels in the hierarchy, and the edit pattern at the top of each black bar is the pattern obtained by anti-unification of the edit pattern connected by the smaller, thin black lines. get least priority in the merge order, Getafix assigns them to separate partitions and constructs a hierarchy for each partition separately. Finally, the approach generalizes the roots of these hierarchies to form a single dendrogram. 4.3 code in addition to the changed AST nodes, the learned edit patterns also include contextual code. Such contextual code helps deciding when a pattern is applicable, and it helps ap- plying the pattern to new code. Figure 7 gives an example of an edit pattern with an unbound hole that becomes bound thanks to additional code context. In edit pattern (a), h0 is unbound in the after part, leaving it unclear what to insert as h0 when applying the pattern. In contrast, edit pattern (b) binds h0 by including unmodified code in the before part. Additional Context in Edit Patterns In addition to a bare description of modified code, the edit patterns learned by Getafix also include context informa- tion. The motivation is twofold: First, context makes edit patterns more specific, and hence, they match fewer code locations, which reduces the chance to produce invalid fixes. Second, context information can bind otherwise unbound holes, which helps applying a pattern to new code. Getafix uses two kinds of context: code context and error context. Error context. The static analysis warning addressed by Getafix may also provide additional context information. For example, the expression blamed for a NullPointerException gives a hint about how and where to apply a fix pattern. Getafix optionally considers such hints in the form of an error variable. While learning edit patterns, the approach then propagates this information, eventually obtaining patterns Code context. Because Getafix starts with an entire spec- trum of concrete edits, which include different amounts of 9
10. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra void onDestroyView() { mListView.clearListeners(); mListView = null; } ↣ h0 7→ mListView h1 7→ clearListeners void onDestroyView() { if (mListView == null) return; mListView.clearListeners(); mListView = null; } Figure 8. Example fix. (a) The hole h0 is unbound. 5 Applying and Ranking Fix Patterns Given buggy source code and a set of learned fix patterns, the final step of Getafix produces a fixed version of the source code. The main challenges of this step are (i) finding which patterns are applicable to a given piece of code (Section 5.1) and (ii) deciding which of all fix candidates produced by applying these patterns to suggest to the developer (Sec- tion 5.2). (b) Code context binds h0. 5.1 Applying Edit Patterns to Buggy Code To find applicable edit patterns, Getafix matches tree patterns to the tree representation of buggy source code. Definition 5.1 (Matching). Let p ∈ Tree + and t ∈ Tree. Then p matches t if there exists a function subst : Hole −→ Tree so that replacing every hole h in p with subst(h) yields t. Intuitively, subst captures the substitutions one has to perform to concretize p into t. (c) Error context binds h0. Figure 7. Edit pattern with an unbound hole and two ways of creating a bound version by adding context. Given a specific edit pattern p and a buggy tree t, the approach creates a fix using the following steps: 1. Find a subtree t sub in t that matches the before part of p. 2. Consistently instantiate all holes in the before part and the after part of p 3. Replace the subtree t sub with the instantiated after part. As an example, consider the code on the left of Figure 8 and one of the edit patterns mentioned earlier: 2 h0.h1(); ↣ if (h0 == null) return; h0.h1(); The before part of the edit pattern matches the buggy code by substituting h0 with mListView and h1 with clearListeners . Instantiating the after part of the pattern with these substi- tutions and replacing the subtree that matches the before part with the instantiated after part results in the code on the right of Figure 8. where a specific hole represents the error variable. Figure 7 gives an example for such error context. Pattern (c) specifies that h0 is the error variable. As a result, Getafix can apply the pattern even though h0 is not mentioned in the before part. 4.4 Comparison with Prior Work on Inferring Edit Patterns An alternative to our hierarchical clustering is a greedy clus- tering algorithm, as suggested for inferring recurring edit patterns [Rolim et al. 2018]. That approach maintains a single representation of each cluster, and this representation does not contain any context information. Even if one would add context information, the approach could add only context that is present in all of the edits in the training data, which is unlikely to learn edit patterns with the same level of context as Getafix. For example, in Figure 6, the edit at the top of the hierarchy simply says to insert a check for null before some statement h0 . This is the only pattern that would be learned if we had only a single representation of this cluster. Instead, hierarchi- cal clustering enables Getafix to observe the pattern one level down, where the early return is inserted before a method call h0.h1() , which uses the variable that could be null. Such ex- tra context facilitates predicting a human-like fix. Section 6.4 compares our approach with greedy clustering. 
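The matching and instantiation steps of Section 5.1 can be sketched as follows. The encoding of h0.h1(); as a Call node and the class and method names are illustrative simplifications rather than Getafix's actual representation; the sketch binds h0 to mListView and h1 to clearListeners exactly as in Figure 8 and then builds the after part from that substitution.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ApplyPattern {
    record Node(String label, String value, List<Node> children) {
        static Node of(String label, String value, Node... kids) {
            return new Node(label, value, List.of(kids));
        }
        static Node hole(String name) { return new Node("Hole", name, List.of()); }
        boolean isHole() { return label.equals("Hole"); }
    }

    // Definition 5.1: p matches t if replacing every hole in p by subst(hole) yields t.
    static Optional<Map<String, Node>> match(Node pattern, Node tree) {
        Map<String, Node> subst = new HashMap<>();
        return matchInto(pattern, tree, subst) ? Optional.of(subst) : Optional.empty();
    }

    private static boolean matchInto(Node p, Node t, Map<String, Node> subst) {
        if (p.isHole()) {
            Node bound = subst.putIfAbsent(p.value(), t);
            return bound == null || bound.equals(t);  // same hole, same subtree
        }
        if (!p.label().equals(t.label()) || !p.value().equals(t.value())
                || p.children().size() != t.children().size()) return false;
        for (int i = 0; i < p.children().size(); i++)
            if (!matchInto(p.children().get(i), t.children().get(i), subst)) return false;
        return true;
    }

    // Step 3: instantiate the after part of the pattern with the substitution.
    static Node instantiate(Node after, Map<String, Node> subst) {
        if (after.isHole()) return subst.get(after.value());
        return new Node(after.label(), after.value(),
                after.children().stream().map(c -> instantiate(c, subst)).toList());
    }

    public static void main(String[] args) {
        // before part of the pattern "h0.h1();": a call of method h1 on receiver h0
        Node before = Node.of("Call", "", Node.hole("h0"), Node.hole("h1"));
        // buggy code "mListView.clearListeners();"
        Node buggy = Node.of("Call", "",
                Node.of("Name", "mListView"), Node.of("Name", "clearListeners"));
        // after part "if (h0 == null) return; h0.h1();", sketched as a block
        Node after = Node.of("Block", "",
                Node.of("IfNullReturn", "", Node.hole("h0")),
                Node.of("Call", "", Node.hole("h0"), Node.hole("h1")));
        // Prints the instantiated after tree corresponding to the fixed code of Figure 8.
        match(before, buggy).ifPresent(subst ->
                System.out.println(instantiate(after, subst)));
    }
}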
5.2 Ranking Fix Candidates By applying edit patterns, Getafix obtains a set F cand of appli- cable fixes, which we call fix candidates. In practice, multiple patterns in a learned hierarchy of edit patterns may apply to a given piece of buggy code. In fact, one of the main ben- efits of learning multiple edit patterns per bug category is 2 To simplify the presentation, we will from now on represent edit patterns as pseudo-code instead of term notation. 10
11. Getafix: Learning to Fix Bugs Automatically • Location score. This score measures what fraction of bugs that were fixed with pattern p were fixed by apply- ing the pattern exactly z lines away from the warning location: |{instances of p at location z in H }| s location = |{instances of p in H }| to enable Getafix to select the most suitable pattern for a given situation. However, having multiple applicable pat- terns leads to the challenge of deciding which fix to suggest to a developer. To illustrate the problem, consider multiple learned fix patterns for null dereferences: p 1 : (empty) ↣ if (h0==null) { return; } where h0 is the error variable p 2 : h0.h1() ↣ h0!=null && h0.h1() p 3 : h0.h1(); ↣ if (h0==null) { return; } h0.h1(); The motivation for this score is that some fix patterns typically occur only at specific locations. For example, fixing a null dereference by inserting an early return often occurs just above the location where the static analyzer warns about the potential null dereference. We describe below how Getafix computes a distribu- tion of line distance for each learned pattern. • Specialization score. This score is higher for more spe- cialized patterns, i.e., patterns that are applicable at fewer code locations. |{all AST nodes}|  s specialized =  AST subtrees that match | | the before part of p Some patterns match more often than others. Patterns without unmodified nodes, such as p 1 , match often and may even match multiple times within a given buggy piece of code. For the code on the left of Figure 8, p 1 matches in un- intended places, such as after mListView.clearListeners() or even after mListView = null . In contrast, p 2 and p 3 are more specific, because they add the null check relative to existing code fragments. Choosing which of these patterns is the most appropriate is a non-trivial problem. We address the challenge of ranking fix candidates through a simple yet effective ranking algorithm that sorts fix candi- dates by their likelihood to be relevant, based on the human- written fixes that Getafix learns from. Given a bug b and a set P of fix patterns, the problem is to rank all fix candidates F cand that result from applying some p ∈ P to the code that contains b. Let a tuple (b, p, z) represent the application of pattern p to the code that contains b at source code location z. The location z is relative to the location where the static ana- lyzer reports a warning about bug b. For example, a location z = −1 means that the fix pattern is applied one line above the warning location, whereas z = 3 means the fix pattern is applied three lines below the warning. The goal of the rank-  ing is that the probability P hfix(b, p, z) | match(b, p, z) is higher for fix candidates (b, p, z) that appear higher up in the ranking, where match indicates that the pattern is ap- plicable and hfix indicates that the fix is the same a human developer would implement. Obviously, precisely computing hfix is impossible in general, because it depends on a human decision. Instead, Getafix estimates the likelihood that a fix candidate will be accepted by a human based on the set H of human fixes that the approach learns from. To estimate how likely a fix candidate matches a human fix, Getafix computes three scores: The intuition is that a more specific pattern contains more context information, and that if such a pattern matches, it is more likely to fit the given bug. 
We estimate  P hfix(b, p, z) | match(b, p, z) ∝ s prevalence ∗s location ∗s specialized , so the product of these scores is used for ranking fix candi- dates, with higher scores being ranked higher. The location score is based on the distribution of locations, relative to the warning location, where a fix pattern is typically ap- plied. Getafix knows this location for each concrete fix in the training data, and then propagates the locations while generalizing fixes into fix patterns. Specifically, we model the location as three statistical values tracked for each pattern during pattern learning. First, we track the ratio of fixes ap- plied before and fixes applied after the warning location. This ratio is useful because some patterns typically apply either before or after the warning. For concrete fixes, we initialize this ratio to 0, 1, and 0.5 if the human fix is above, below, or on the same line as the warning, respectively. Second and third, we track two geometric distributions that model the line distance above and below the warning location. For con- crete fixes, we initialize these distributions as having their expected value exactly at the distance where the concrete fix is applied. Whenever the clustering algorithm (Section 4.2) merges two edit patterns, it statistically propagates the three values from the two merged patterns to the resulting pattern. As an example for the line distance distributions, recon- sider fix pattern p 1 from above, which addresses a potential null deference by inserting an if statement that returns from the method if h0 is null. Because this pattern is typically applied before the warning location, the above/below ratio will be close to zero and the distribution for the line distance above will contain mostly small, positive integers. • Prevalence score. This score is higher for patterns that cover more concrete bug fixes in the set H of human training fixes: s prevalence = |{instances of p in H }| |H | The intuition is that a more common way of fixing a specific kind of bug is more likely to appeal to devel- opers. 11
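A minimal sketch of this ranking, assuming the three scores have already been computed for each candidate; the Candidate record and its field names are illustrative, not Getafix's API. The numbers in main are the ones that Example 5.2 below walks through by hand, so the printed order matches the conclusion drawn there.

import java.util.Comparator;
import java.util.List;

final class RankCandidates {
    // A fix candidate: a pattern applied at some line offset relative to the warning.
    record Candidate(String description, int lineOffset,
                     double prevalence,      // |instances of p in H| / |H|
                     double locationScore,   // fraction of p's fixes at this offset
                     double specialization   // |all AST nodes| / |subtrees matching p's before part|
    ) {
        double score() { return prevalence * locationScore * specialization; }
    }

    static List<Candidate> rank(List<Candidate> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(Candidate::score).reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(List.of(
                new Candidate("p1 above the buggy line", -1, 0.03, 0.5, 20),
                new Candidate("p1 below the buggy line", +1, 0.03, 0.0, 20),
                new Candidate("p2 on the buggy line", 0, 0.05, 0.95, 40),
                new Candidate("p3 on the buggy line", 0, 0.02, 0.9, 200)));
        // Prints p3 first (score ~3.6), then p2 (~1.9), then p1 above (0.3), then p1 below (0).
        ranked.forEach(c -> System.out.println(c.description() + " -> " + c.score()));
    }
}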
12. Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra Example 5.2 (Ranking among candidates). Assume that we are given the code from Figure 8 to fix and know the three fix patterns p1, p2 and p3 we motivated ranking with. Applying p1 results in three fix candidates p1 1 , p1 2 , p1 3 as mentioned earlier: The "early return" can be inserted above, between and below both statements. p2 and p3 only match on the buggy line, resulting in fix candidates p2 1 and p3 1 . • For p1 1 we assume the following scores: change is. Learning to rank from fixes for the same bug cat- egory allows Getafix to specialize to bug category-specific “features”. Second, our ranking is defined based on ASTs and line numbers, i.e., in a language-agnostic way, making it easy to apply the approach to another language. 6 We address the following research questions: • RQ1: How effective is Getafix at predicting human- like fixes? (Section 6.2) • RQ2: How effective is the ranking of fix candidates? (Section 6.3) • RQ3: How does Getafix compare to simpler variants of the approach? (Section 6.4) • RQ4: How does the amount of available training data affect the approach? (Section 6.5) • RQ5: How efficient is the approach? (Section 6.6) • RQ6: How effective is Getafix in an industrial software development setting? (Section 6.7) s prevalence = 0.03 s location = 0.5 s specialized = 20 This would mean that we observed p1 being used 3% of the time. Of those times, the statement was inserted right above the blamed line 50% of the time. Further- more the pattern matches one in twenty AST nodes. • For p1 2 and p1 3 only s location changes since the other scores exclusively depend on the pattern used: s prevalence = 0.03 s location = 0 s specialized = 20 Since insertion of an "early return" was never observed below the buggy line, s location is zero. • We assume p2 was observed 5% of the time, almost always being applied to exactly the buggy line and p2 matches one in forty AST nodes on average. Resulting scores for p2 1 : 6.1 Experimental Setup To address RQ1 to RQ5, we evaluate Getafix with pairs of before and after Java files that address six kinds of prob- lems. Table 1 lists the bug categories and the static analysis tool that detects instances of these categories. We use bugs detected by Infer 3 [Calcagno et al. 2015] and Error Prone [Af- tandilian et al. 2012], two state-of-the-art static analyzers for Java. To gather fixes of potential NullPointerExceptions, we gather fixes made by developers at Facebook as a reaction to warnings reported by Infer. For the other five bug categories, we gather fixes made by developers of open-source devel- opers by searching for commits mentioning those warnings. Since some commits also contain code changes unrelated to fixing the specific bug category, we semi-automatically split and filter commits into pairs of before and after files, so that each pair fixes exactly one instance of the bug category. Overall, the data set contains 1,268 pairs of files. Unless otherwise mentioned, all experiments are performed in a 10-fold cross-validation setup. We configure the pattern learning step to drop fix patterns representing less than 1% of the training set, to avoid polluting the prediction with very uncommon patterns. s prevalence = 0.05 s location = 0.95 s specialized = 40 • As p3 is more specialized than p1, we assume it was used for 2% of fixes, but also matches only one in 200 AST nodes. 
The specialization contains the buggy method call, so it was also applied right there 90% of the time. Resulting scores for p3 1 : s prevalence = 0.02 s location = 0.9 s specialized = 200 We conclude that Getafix would rank candidates in order p3 1 , p2 1 , p1 1 and so on due to respective overall scores 3.6, 1.9, 0.3 and 0. 5.3 Evaluation Comparison with Other Ranking Techniques The ranking step of Getafix relates to prior work on ranking fix candidates in the context of automated program repair. In principle, other ways of ranking may be plugged into Getafix, either to replace and to complement our current approach. One candidate for such an enhancement is the Prophet rank- ing [Long and Rinard 2016], which uses a set of over 3000 features of code and code changes to predict which fixes are more likely to be adequate. We did not add Prophet to our current prototype because the existing Getafix ranking works well in practice (Section 6.3) and because adapting Prophet’s over 3000 features to the Java programming lan- guage is non-trivial. Conceptually, there are two advantages of our ranking over Prophet. First, Getafix ranks candidate fixes based on human fixes for the same kind of bug, whereas Prophet relies on a generic model of how “natural” a code 6.2 Effectiveness in Finding Human-like Fixes We measure Getafix’s effectiveness in finding human-like fixes by checking whether a predicted fix exactly matches the known human fix. This comparison neglects empty lines but is otherwise precise, so even the addition of a comment or different whitespace choices are judged as a mismatch. The rationale is that we want to make sure the suggestions 3 https://code.fb.com/developer-tools/open-sourcing-facebook-infer- identify-bugs-before-you-ship/ 12
13. Getafix: Learning to Fix Bugs Automatically are acceptable to developers. This measure of success un- derapproximates the effectiveness of Getafix, because there may be other fix suggestions that a developer would have accepted than exactly the fix that the developer has applied, e.g., a semantically equivalent fix. We measure both top-1 ac- curacy, i.e., the percentage of fixes that Getafix predicts as the top-most suggestion, top-5 accuracy, i.e., the percentage of fixes that Getafix predicts as one of the first five suggestions. Table 1 shows the accuracy results for all six bug cate- gories. Depending on the bug category, Getafix suggests the correct fix as the top-most suggestion for 12% to 91% of all fixes, fixing 381 out of a total of 1,268 bugs. The top-5 accu- racy is even higher, ranging between 19% and 92%, fixing a total of 526 out of 1,268 bugs. The differences between bug categories reflect how di- verse and complex typical fixes for the different kinds of static analysis warnings are. For the bug category with high- est accuracy, BoxedPrimitiveConstructor, there exists a rela- tively small set of recurrent fix patterns, which Getafix easily learns and applies. In contrast, the problem of automatically fixing potential NullPointerExceptions is harder, as devel- opers use a wide range of fix patterns. For example, there are at least two fundamentally different ways to address a null deference warning: Either the developer recognizes that the blamed expression may indeed be null at runtime, or the developer is certain that, for reasons inaccessible to the static analyzer, the expression cannot be null. Depending on this decision, the fix will either change logic or control flow to behave reasonably in case of a null value, as, e.g., shown in Figure 1 or add a non-null assertion that convinces the static analyzer about non-nullability. Both cases provide a range of different fix patterns, which Getafix learns and applies successfully in some, but not all cases. Another factor influ- encing the accuracy of Getafix is, as for any learning-based approach, the amount of available training data, which we discuss separately in Section 6.5. 6.2.1 uses a cached instance of frequently used values. The fix in Figure 9b replaces an expression that compares two objects via reference equality with a call to equals . Getafix learns these fix patterns because fixes for these bug categories tend to be repetitive. Yet, finding an appropriate fix is non-trivial because there are different fix variants. For example, other fixes for the BoxedPrimitiveConstructor warning address dif- ferent combinations of wrapped types, e.g., floats or strings, and different types of object wrappers, e.g., Float or Boolean . Likewise, other fixes for the ReferenceEquality warning use an inequality check instead of an equality check, they might use Objects.equals() to do the equality check, or they may be part of a more complex expression. 6.2.2 Examples of Missed Fixes To better understand the limits of Getafix, Figure 10 shows two fixes that the approach does not find. Figure 10a fixes a potential null dereference by throwing an IllegalStateEx- ception in case a variable is null. While finding this addi- tional statement is well in scope Getafix, it fails to predict the custom error message that is passed as a string to the Ille- galStateException. Since the message is application-specific, learning to predict it is a hard problem. 
Table 1 shows the accuracy results for all six bug categories. Depending on the bug category, Getafix suggests the correct fix as the top-most suggestion for 12% to 91% of all fixes, fixing 381 out of a total of 1,268 bugs. The top-5 accuracy is even higher, ranging between 19% and 92%, fixing a total of 526 out of 1,268 bugs.

The differences between bug categories reflect how diverse and complex typical fixes for the different kinds of static analysis warnings are. For the bug category with the highest accuracy, BoxedPrimitiveConstructor, there exists a relatively small set of recurrent fix patterns, which Getafix easily learns and applies. In contrast, the problem of automatically fixing potential NullPointerExceptions is harder, as developers use a wide range of fix patterns. For example, there are at least two fundamentally different ways to address a null dereference warning: Either the developer recognizes that the blamed expression may indeed be null at runtime, or the developer is certain that, for reasons inaccessible to the static analyzer, the expression cannot be null. Depending on this decision, the fix will either change logic or control flow to behave reasonably in case of a null value, as, e.g., shown in Figure 1, or add a non-null assertion that convinces the static analyzer about non-nullability. Both cases provide a range of different fix patterns, which Getafix learns and applies successfully in some, but not all, cases. Another factor influencing the accuracy of Getafix is, as for any learning-based approach, the amount of available training data, which we discuss separately in Section 6.5.

6.2.1 Examples of Correctly Predicted Fixes
To illustrate the strengths and limitations of Getafix, let us consider some examples. Figure 1 from the introduction gives three examples of fixes for NullPointerExceptions that Getafix predicts as the top-most suggestion. These examples illustrate that the approach not only predicts a wide range of different strategies for fixing bugs, but also selects the most suitable strategy for the given code. For example, the first fix in Figure 1, which adds a check to an existing conditional, makes sense only when the code surrounding the bug already has a conditional. The hierarchy of fix patterns, many of which provide some code context, allows our ranking to prioritize the most suitable patterns.

Figure 9 shows correctly predicted fixes for two other kinds of bug categories. The fix in Figure 9a addresses a warning about boxing a primitive value into a fresh object, which is inefficient compared to the fixed code that implicitly uses a cached instance of frequently used values. The fix in Figure 9b replaces an expression that compares two objects via reference equality with a call to equals. Getafix learns these fix patterns because fixes for these bug categories tend to be repetitive. Yet, finding an appropriate fix is non-trivial because there are different fix variants. For example, other fixes for the BoxedPrimitiveConstructor warning address different combinations of wrapped types, e.g., floats or strings, and different types of object wrappers, e.g., Float or Boolean. Likewise, other fixes for the ReferenceEquality warning use an inequality check instead of an equality check, they might use Objects.equals() to do the equality check, or they may be part of a more complex expression.
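To illustrate why these variants matter for learning, here are hypothetical before/after pairs in the spirit of Figure 9b; the snippets are invented for illustration and are not taken from the evaluation data.

    import java.util.Objects;

    class ReferenceEqualityVariants {
        // Variant 1: plain equality check rewritten to use equals().
        static boolean sameName(String a, String b) {
            // before: return a == b;
            return a.equals(b);
        }

        // Variant 2: inequality check; the fix negates an equals() call instead.
        static boolean differentName(String a, String b) {
            // before: return a != b;
            return !a.equals(b);
        }

        // Variant 3: null-safe variant using Objects.equals(), often chosen when
        // either operand may be null.
        static boolean sameNameNullSafe(String a, String b) {
            // before: return a == b;
            return Objects.equals(a, b);
        }
    }

Each variant corresponds to a different learned pattern, which is one reason why a single, most general pattern is not enough for this bug category.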
Table 1. Accuracy of predicting exactly the human fix for different kinds of bugs.

    Bug category                Static analyzer   Examples   Top-1 accuracy   Top-5 accuracy
    NullPointerException        Infer                  804        94 (12%)        156 (19%)
    BoxedPrimitiveConstructor   Error Prone            260       236 (91%)        238 (92%)
    ClassNewInstance            Error Prone             21         6 (29%)         11 (52%)
    DefaultCharSet              Error Prone             55        28 (51%)         42 (76%)
    OperatorPrecedence          Error Prone             49         6 (12%)         30 (61%)
    ReferenceEquality           Error Prone             79        22 (28%)         53 (67%)
    Total                                            1,268       381              526

Figure 9. Correctly predicted fixes for BoxedPrimitiveConstructor and ReferenceEquality warnings.

(a) Fix for warning about boxing a primitive value into a fresh object:

    @Override public String toString() { return new Integer(key).toString(); }
    ↣
    @Override public String toString() { return Integer.valueOf(key).toString(); }

(b) Fix for warning about comparison with reference equality:

    // Ensure that the text hasn't changed.
    assert view.getText().toString() == metadata.getString()
        : "The text '" + view.getText().toString() + "' has been modified";
    ↣
    // Ensure that the text hasn't changed.
    assert view.getText().toString().equals(metadata.getString())
        : "The text '" + view.getText().toString() + "' has been modified";

Figure 10. Fixes that Getafix could not predict correctly.

(a) Fix for potential null dereference:

    // ...
    bar = x.foo();
    // ...
    ↣
    // ...
    if (x == null) throw new IllegalStateException("custom message");
    bar = x.foo();
    // ...

(b) Fix for ClassNewInstance warning:

    // ...
    return clazz.cast(in.newInstance());
    // ...
    ↣
    import java.lang.reflect.InvocationTargetException;
    // ...
    return clazz.cast(in.getConstructor().newInstance());
    // ...
    } catch (InvocationTargetException e) {
    } catch (NoSuchMethodException e) {
    // ...

6.2.2 Examples of Missed Fixes
To better understand the limits of Getafix, Figure 10 shows two fixes that the approach does not find. Figure 10a fixes a potential null dereference by throwing an IllegalStateException in case a variable is null. While finding this additional statement is well within the scope of Getafix, it fails to predict the custom error message that is passed as a string to the IllegalStateException. Since the message is application-specific, learning to predict it is a hard problem. Figure 10b fixes a warning about the newInstance API, which is discouraged because it bypasses exception checking. Fixes of this problem often have to adapt the exception handling around the actual fix location. Since these adaptations heavily depend on the existing exception handling, Getafix often fails to predict exactly the correct fix. This limitation could likely be mitigated by a larger set of training examples, which would enable Getafix to learn popular exception handling scenarios. Other reasons why Getafix misses fixes include fixes that rename variables, simply remove buggy code, or that address a deeper root cause of the problem.

6.3 Effectiveness of Ranking
An important component of Getafix is the ranking of candidate fixes, which enables the approach to decide on a small number of fix suggestions, without passing all candidates to any computationally expensive validation step. Figure 11 evaluates the effectiveness of the ranking, showing as a blue line ("our learning, our prediction") how many of all fixes Getafix predicts correctly within the k highest ranked suggestions per bug. The percentage shown on the right of each plot, with k = ∞, indicates for how many bugs the correct fix is somewhere in the entire set of fix candidates. The results show that Getafix's ranking effectively pushes the correct fix toward the top of all candidate fixes, covering a large fraction of all fixes it can find within the first few suggestions.

6.4 Ablation Study
To evaluate the effectiveness of both our hierarchical learning algorithm and our ranking technique, we compare them against baseline approaches. As a baseline for learning, we use the greedy clustering algorithm used by Revisar [Rolim et al. 2018]. Revisar first partitions concrete edits by their "d-cap", which means that all concrete edits in one partition will have ASTs that are identical for all nodes up to depth d − 1. Within each partition, Revisar then considers one edit after another, attempting to cluster them into a pattern that becomes more and more general. When the addition of an edit to a cluster would result in a pattern that contains an unbound hole, this edit instead forms a new cluster. We conservatively approximate this behavior by selecting patterns from our hierarchy of patterns that could have been found by greedy clustering if concrete edits were iterated in an ideal order. We focus our comparison on 1-cap, as it results in more general patterns than d > 1. We also drop additional context (see Section 4.4). As a baseline for prediction, we learn a hierarchy of fix patterns as described in Section 4.2, but then always pick the most common pattern of the hierarchy, instead of applying our ranking from Section 5.2. In case this pattern matches at multiple code locations, we apply it to the location closest to the reported issue. When evaluating the top-k accuracy of this baseline, we repeatedly pick the next most common pattern.
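A minimal sketch of this prediction baseline is shown below; Pattern, Location, Source, and their methods (frequency(), matchLocations(), lineDistanceTo()) are hypothetical placeholders, not the actual Getafix interfaces.

    import java.util.Comparator;
    import java.util.List;

    // Baseline predictor used in the ablation: ignore the ranking of Section 5.2
    // and simply take the most common pattern (for top-k, one would continue with
    // the next most common pattern, and so on).
    class BaselinePredictorSketch {
        interface Source {}
        interface Location { int lineDistanceTo(int reportedLine); }
        interface Pattern {
            int frequency();                             // occurrences in the training data
            List<Location> matchLocations(Source file);  // where the pattern can be applied
        }

        static Location predict(List<Pattern> hierarchy, Source file, int reportedLine) {
            Pattern mostCommon = hierarchy.stream()
                    .max(Comparator.comparingInt(Pattern::frequency))
                    .orElseThrow(IllegalStateException::new);

            // If the pattern matches several locations, apply it at the match
            // closest to the line of the reported issue.
            Location best = null;
            for (Location loc : mostCommon.matchLocations(file)) {
                if (best == null || loc.lineDistanceTo(reportedLine) < best.lineDistanceTo(reportedLine)) {
                    best = loc;
                }
            }
            return best;
        }
    }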
Figure 11. Fraction of human fixes (y axis) covered by the top-k (x axis) highest ranked fix candidates, per bug category: (a) NullPointerException, (b) BoxedPrimitiveConstructor, (c) ClassNewInstance, (d) DefaultCharSet, (e) OperatorPrecedence, (f) ReferenceEquality.

Figure 11 compares Getafix to the two baselines. For five out of six bug categories, both hierarchical learning and our multi-score ranking are clearly worthwhile. For BoxedPrimitiveConstructor, all three approaches perform roughly the same, which we attribute to the fact that this category has relatively few ways of fixing the bug. The most drastic drop in accuracy can be observed without the hierarchical clustering, and hence without the valuable inner nodes of the dendrogram. The DefaultCharSet category sees a particularly large drop in accuracy because most human fixes involve adding an import statement, but the baseline learning and prediction do not support learning and applying multiple learned patterns together.

6.5 Influence of Available Training Data
To measure the influence of the amount of available training data, we run Getafix with subsets of the 804 human fixes for null dereference bugs. Figure 12 shows the accuracy of reproducing the human fix among the top-k fix suggestions, depending on the fraction t of data used for training. We use for validation the samples not used for training, and repeat the experiment for each t such that each fix was used at least eight times for both training and validation. As expected for a learning-based approach, more training data yields higher accuracy. The reason is that Getafix learns more patterns, as well as more accurate and representative metadata for those patterns (frequency, line distance information, etc.), which improves the ranking.

Figure 12. Accuracy of top-1, top-5, and top-∞ predictions (fraction of predicted human NullPointerException fixes) depending on the fraction t of data used for training.

6.6 Efficiency
Table 2 lists the total time spent on training and prediction for each 10-fold experiment. A more realistic setup would correspond to one fold of this experiment. Breaking down the total time to individual predictions of fixes (last column), we find that Getafix takes between 1.7 and 9.2 seconds on average to predict a fix. These experiments were conducted on a single machine with 114GB of RAM and Intel Skylake processors with 24 cores running at 2.4GHz.
Table 2. Time taken by Getafix for the 10-fold experiment of training and prediction.

    Bug category                Examples   Training time   Prediction time   Single prediction
    NullPointerException             804            762s             3132s                3.9s
    BoxedPrimitiveConstructor        260            164s              442s                1.7s
    ClassNewInstance                  21             60s              195s                9.2s
    DefaultCharSet                    55             76s              240s                4.4s
    OperatorPrecedence                49             70s              138s                2.8s
    ReferenceEquality                 79            114s              563s                7.2s

6.7 Real-World Deployment: Auto-Fix Suggestions for Infer Bug Reports
Getafix is deployed at Facebook to automatically suggest fixes for null dereference bugs reported by Infer. For this deployment, we dropped all patterns that add assertions to the code or that just remove code, because these patterns produce a plethora of candidates that are hard to rank. Before displaying auto-fix suggestions, we rerun Infer to ensure that previously reported warnings have disappeared. Getafix performs this validation only for the top-ranked predicted fix, which is the only one shown to the developer (if the validation passes); showing additional fix candidates would require additional runs of Infer and, besides, may be distracting for the developer. For 84% of these validation runs, the top-most predicted fix removes the previously reported Infer warning, and Getafix hence suggests the fix to the developers.

Within three months of deployment, developers addressed around 250 null dereference warnings that we showed a fix suggestion for. Figure 13 categorizes the reactions of developers: The developers directly accepted 106 suggested fixes (42%). These fixes include many non-trivial code changes (e.g., see Figure 1), i.e., Getafix helped save precious developer time.
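The validation step described above can be sketched roughly as follows; Analyzer, Warning, Patch, and all method names are illustrative stand-ins for the real Infer integration, not its actual API.

    import java.util.List;
    import java.util.Optional;

    // Sketch of the deployment gating: only the top-ranked candidate is validated
    // by re-running the static analyzer, and it is shown only if the originally
    // reported warning has disappeared.
    class SuggestionGateSketch {
        interface Source {}
        interface Patch { Source applyTo(Source file); }
        interface Warning { boolean sameAs(Warning other); }
        interface Analyzer { List<Warning> run(Source file); }  // stand-in for Infer

        static Optional<Patch> suggest(Analyzer analyzer, Source file, Warning reported,
                                       List<Patch> rankedCandidates) {
            if (rankedCandidates.isEmpty()) {
                return Optional.empty();
            }
            Patch top = rankedCandidates.get(0);  // validate only the top-ranked fix
            Source patched = top.applyTo(file);
            boolean warningGone = analyzer.run(patched).stream()
                    .noneMatch(w -> w.sameAs(reported));
            return warningGone ? Optional.of(top) : Optional.empty();
        }
    }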
Figure 13. Reactions to about 250 null dereference warnings for which Getafix suggested a fix: auto-fix accepted (42%), known pattern (10%), assertion (10%), semantically equivalent fix (9%), buggy code removed (10%), custom fix (19%).

Remarkably, the acceptance rate of 42% is significantly higher than the top-1 accuracy during the experiments reported in Section 6.2. This difference is because developers are likely to accept more than one possible fix for a given bug, whereas Section 6.2 compares the predicted fix against a single known human fix.

For those fix suggestions that were not immediately accepted, Figure 13 categorizes them by their potential to be claimed by Getafix. In 10% of the cases, the developers eventually wrote a fix that Getafix knows, but that it did not rank highest and hence did not suggest. Better ranking could address these cases. Another 10% of the time, the developer added an assertion that prevents the static analysis warning. As indicated earlier, we have intentionally not rolled out such patterns so far, so as not to encourage developers to suppress analysis warnings. Around 9% of the time, a semantically equivalent fix to the one suggested by Getafix was written. For example, for a bug in an else block, the developer turned the block into an else if, rather than creating a new if. Learning more specialized patterns from more training data, or manual interpolation of learned patterns, will help to address such cases. In another 10% of the cases, the issue was resolved by just removing the code that contains the warning. Finally, the remaining human fixes were mostly custom and probably cannot be expressed as a recurring fix pattern.
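As a hypothetical illustration of such a semantically equivalent pair (the variable names and surrounding code are invented): the first method shows a guard added as a new if inside the else block, as Getafix might suggest it, while the second folds the same guard into an else if, as the developer wrote it.

    class EquivalentFixesSketch {
        interface Item { void render(); }

        // Suggested fix (sketch): a new if inside the existing else block.
        static void suggested(Item item, boolean useDefault) {
            if (useDefault) {
                applyDefault();
            } else {
                if (item != null) {
                    item.render();
                }
            }
        }

        // Developer's fix (sketch): the same guard expressed as an else if.
        static void humanWritten(Item item, boolean useDefault) {
            if (useDefault) {
                applyDefault();
            } else if (item != null) {
                item.render();
            }
        }

        static void applyDefault() {}
    }

Both versions behave identically but differ textually, which is why such cases show up as "semantically equivalent fix" in Figure 13 rather than as accepted suggestions.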
In addition to warnings about null dereferences, we have deployed Getafix for two other Infer warnings, Field Not Nullable and Return Not Nullable. The acceptance rate of fixes suggested by Getafix is around 60% for both warnings. Moreover, displaying an auto-fix next to a warning resulted in a 4% and 12% higher fix rate, respectively. This amounts to around 80 additionally fixed warnings (across both warning types) per month. We also observed that Return Not Nullable warnings are usually fixed around twice as fast if an auto-fix is displayed.

Overall, we conclude from our experience of deploying Getafix that automatic repair can be beneficial in practice. To the best of our knowledge, this is the first industrial deployment of an automated repair tool that automatically learns fix patterns. Two take-aways for what has made Getafix useful at Facebook are (i) to integrate auto-fixes into existing development tools, e.g., by showing them in the code-review tool, and (ii) to predict fixes fast enough to not slow down the development process, e.g., by ensuring that suggestions do not add much to the time a developer waits anyway for the static analysis reports.

7 Related Work

7.1 Automated Program Repair
Getafix has goals similar to those of existing automated program repair techniques [Goues et al. 2019], but fills a previously unoccupied spot in the design space. In contrast to generate-and-validate approaches, such as GenProg [Goues et al. 2012], we focus on learning patterns from past fixes for specific kinds of warnings. Specifically, Getafix does not attempt to find generic solutions from any sort of ingredient space, or by generically mutating the code. While Genesis [Long et al. 2017] follows a similar approach of learning transformations between before and after ASTs, its learned templates contain generators, which increase the size of the search space. SketchFix [Hua et al. 2018] also aims at validating as few fix candidates as possible, but relies on runtime information from repeated test executions. In contrast, Getafix does not use any tests. Kim et al. [2013] automatically group human-written fixes by similarity. However, inspecting these groups and creating fix patterns from them remains a manual step, whereas Getafix learns fix patterns fully automatically. History-driven program repair [Le et al. 2016] has a diffing and mining pipeline similar to Getafix, but uses it to infer abstract edit steps (like "insert statement"), which then help rank fix candidates produced by a generate-and-validate repair technique. Prophet learns a generic model of how natural a fix is and uses it for ranking fix candidates [Long and Rinard 2016]. Their approach could be plugged into Getafix, as an alternative or addition to our ranking algorithm. Section 5.3 discusses their ranking in more detail. CapGen [Wen et al. 2018] uses AST context to select mutation operators and fixing ingredients. Getafix implicitly uses a similar prioritization through its clustering and ranking strategy. Soto et al. [2016] observe that fix patterns differ across programming languages and analyze the structure of typical patterns in Java. Martinez and Monperrus [2012, 2015, 2018] also performed extensive AST analysis on Java repositories to statistically analyze code change patterns, which can guide program repair. Soto and Le Goues [2018] propose a probabilistic model-based approach for Java, which is more likely to produce higher-quality fixes. As Getafix works on the level of ASTs and learns from human fixes, it can learn language-specific fix patterns automatically. NPEfix [Cornu et al. 2015] addresses NullPointerExceptions by selecting at runtime from a set of manually defined strategies, such as skipping the statement or replacing the null value with a fresh object. In contrast, Getafix addresses arbitrary bug categories and applies fixes statically.

7.2 Mining of Edit Patterns
Refazer [Rolim et al. 2017] uses PROSE [Polozov and Gulwani 2015] and a DSL of program transformations to learn repetitive edits. Revisar [Rolim et al. 2018] influenced Getafix by also using anti-unification [Kutsia et al. 2014] to derive edit patterns from concrete edits, but based on greedy clustering and without extracting surrounding context. Another difference is that Revisar learns from arbitrary code changes, not fixes of specific bug categories, requiring manual selection of interesting patterns from hundreds of candidates. See Sections 4.4 and 6.4 for a detailed comparison. Brown et al. [2017] mine recurring bug-introducing changes as the reverse operations of commits. Their goal is to infer mutation operations, i.e., code changes that inject artificial bugs into arbitrary code, whereas Getafix aims at fixing bugs.

7.3 Machine Learning on Code
Several approaches delegate parts of or the entire fix generation task to a neural network. SSC combines a set of manually written generators of fix candidates with learned models that decide which candidate to apply [Devlin et al. 2017]. Sarfgen uses embeddings of code to formulate repair as a search for a similar reference solution [Wang et al. 2018]. Instead, Getafix aims at suggesting fixes for newly developed software that has no reference solution. DeepFix learns an end-to-end model for predicting fixes of compilation errors [Gupta et al. 2017]. Combining end-to-end learning with our idea of focusing on fixes for particular bug categories may be interesting future work. More broadly, Getafix relates to a stream of work on machine learning applied to code [Allamanis et al. 2018]. Pradel and Sen [2018] demonstrate the power of identifier names for reasoning about buggy code. A similar technique could benefit our learning and ranking phase, e.g., by using word embeddings of identifiers and literals to guide which variables to substitute holes with, or by mimicking human intuition about return values to pick in case of an early return. Work on learning to represent code edits may also provide a starting point for learning to suggest fixes [Yin et al. 2018].

8 Conclusion
Fixing incorrect code is a long-standing problem that consumes significant developer time. Fortunately, many fixes, in particular for bugs identified by static analyzers, are instances of recurring patterns. This paper presents Getafix, the first industrially deployed automated repair tool that fully automatically learns fix patterns from past, human-written commits. The approach is based on a novel hierarchical, agglomerative clustering algorithm, which summarizes a given set of fix commits into a hierarchy of fix patterns. Based on this hierarchy and a ranking technique that decides which fix pattern is most appropriate for a new occurrence of a bug category, Getafix proposes a ranked list of fixes to the developer. An evaluation with 1,268 real-world bug fixes and our experience of deploying Getafix within Facebook show that the approach accurately predicts human-like fixes for various bugs, reducing the time developers have to spend on fixing recurring kinds of bugs.

We believe that the potential of the Getafix approach is broader than fixing only static analysis bugs. As long as the bug category and bug location are known, and a training set of fixes is available, Getafix can offer fixes learned from the training set. Case in point, Getafix suggests fixes (via SapFix⁴ [Marginean et al. 2019]) for null pointer exceptions detected by Sapienz⁵, an automated testing system for mobile apps.

⁴ https://code.fb.com/developer-tools/finding-and-fixing-software-bugs-automatically-with-sapfix-and-sapienz/
⁵ https://code.fb.com/developer-tools/sapienz-intelligent-automated-software-testing-at-scale/
Acknowledgments
We thank fellow Facebook colleagues for their contributions and help (chronologically): Eric Lippert implemented a custom version of the GumTree tree differencing algorithm and investigated an edit script graph approach to learning code change patterns. Jeremy Dubreil from the Infer team worked with us to integrate Infer fix suggestions into the existing Infer bug reporting workflow. Alexandru Marginean from the Sapienz team integrated Getafix into the SapFix pipeline as one of its strategies to create fix candidates. Austin Luo applied Getafix to the Hack programming language, working on learning lint patterns from past change suggestions made during code review. Waruna Yapa applied Getafix to Objective-C, working on learning fixes for further classes of Infer warnings (for example, "Bad Pointer Comparison"). Vijayaraghavan Murali created and trained a classifier that provides an alternative approach for ranking between fix patterns.

References
Edward Aftandilian, Raluca Sauciuc, Siddharth Priya, and Sundaresan Krishnan. 2012. Building Useful Program Analysis Tools Using an Extensible Java Compiler. In 12th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2012). 14–23.
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Computing Surveys 51, 4 (2018), 81.
Jean-Paul Benzécri. 1982. Construction d'une classification ascendante hiérarchique par la recherche en chaîne des voisins réciproques. Cahiers de l'analyse des données 7, 2 (1982), 209–218. http://www.numdam.org/item/CAD_1982__7_2_209_0
David Bingham Brown, Michael Vaughn, Ben Liblit, and Thomas W. Reps. 2017. The Care and Feeding of Wild-Caught Mutants. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 511–522.
Cristiano Calcagno, Dino Distefano, Jérémy Dubreil, Dominik Gabi, Pieter Hooimeijer, Martino Luca, Peter O'Hearn, Irene Papakonstantinou, Jim Purbrick, and Dulma Rodriguez. 2015. Moving Fast with Software Verification. In NASA Formal Methods Symposium. Springer, 3–11.
Maria Christakis and Christian Bird. 2016. What Developers Want and Need from Program Analysis: An Empirical Study. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). 332–343. https://doi.org/10.1145/2970276.2970347
Benoit Cornu, Thomas Durieux, Lionel Seinturier, and Martin Monperrus. 2015. NPEfix: Automatic Runtime Repair of Null Pointer Exceptions in Java. arXiv preprint arXiv:1512.07423 (2015).
Jacob Devlin, Jonathan Uesato, Rishabh Singh, and Pushmeet Kohli. 2017. Semantic Code Repair using Neuro-Symbolic Transformation Networks. CoRR abs/1710.11054 (2017). http://arxiv.org/abs/1710.11054
Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and Accurate Source Code Differencing. In Proceedings of the International Conference on Automated Software Engineering (ASE 2014). 313–324. https://doi.org/10.1145/2642937.2642982
Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering 38 (2012), 54–72.
Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated Program Repair. Commun. ACM (2019). To appear.
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In AAAI.
Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards Practical Program Repair with On-demand Candidate Generation. In Proceedings of the 40th International Conference on Software Engineering (ICSE 2018). 12–23.
Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic Patch Generation Learned from Human-Written Patches. In Proceedings of the 2013 International Conference on Software Engineering (ICSE 2013). IEEE Press, 802–811.
Temur Kutsia, Jordi Levy, and Mateu Villaret. 2014. Anti-unification for Unranked Terms and Hedges. Journal of Automated Reasoning 52, 2 (2014), 155–190. https://doi.org/10.1007/s10817-013-9285-6
Xuan-Bach D. Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. 213–224.
Fan Long, Peter Amidon, and Martin Rinard. 2017. Automatic Inference of Code Transforms for Patch Generation. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, 727–739.
Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2016). ACM, 298–312. https://doi.org/10.1145/2837614.2837617
Alexandru Marginean, Johannes Bader, Satish Chandra, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, and Andrew Scott. 2019. SapFix: Automated End-to-End Repair at Scale. In ICSE-SEIP 2019.
Matias Martinez and Martin Monperrus. 2012. Mining Repair Actions for Guiding Automated Program Fixing. Ph.D. Dissertation. Inria.
Matias Martinez and Martin Monperrus. 2015. Mining Software Repair Models for Reasoning on the Search Space of Automated Program Fixing. Empirical Software Engineering 20, 1 (2015), 176–205. https://doi.org/10.1007/s10664-013-9282-8
Matias Martinez and Martin Monperrus. 2018. Coming: A Tool for Mining Change Pattern Instances from Git Commits. arXiv:1810.08532
Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A Framework for Inductive Program Synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, 107–126. https://doi.org/10.1145/2814270.2814310
Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-based Bug Detection. CoRR abs/1805.11683 (2018). http://arxiv.org/abs/1805.11683
Reudismam Rolim, Gustavo Soares, Loris D'Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Björn Hartmann. 2017. Learning Syntactic Program Transformations from Examples. In Proceedings of the 39th International Conference on Software Engineering (ICSE 2017). IEEE Press, 404–415. https://doi.org/10.1109/ICSE.2017.44
Reudismam Rolim, Gustavo Soares, Rohit Gheyi, and Loris D'Antoni. 2018. Learning Quick Fixes from Code Repositories. CoRR abs/1803.03806 (2018). http://arxiv.org/abs/1803.03806
M. Soto and C. Le Goues. 2018. Using a Probabilistic Model to Predict Bug Fixes. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). 221–231. https://doi.org/10.1109/SANER.2018.8330211
Mauricio Soto, Ferdian Thung, Chu-Pan Wong, Claire Le Goues, and David Lo. 2016. A Deeper Look into Bug Fixes: Patterns, Replacements, Deletions, and Additions. In Proceedings of the 13th International Conference on Mining Software Repositories (MSR 2016). ACM, 512–515. https://doi.org/10.1145/2901739.2903495
Ke Wang, Rishabh Singh, and Zhendong Su. 2018. Search, Align, and Repair: Data-driven Feedback Generation for Introductory Programming Exercises. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). 481–495.
Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-Aware Patch Generation for Better Automated Program Repair. In Proceedings of the 40th International Conference on Software Engineering (ICSE 2018). 1–11.
Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. 2018. Learning to Represent Edits. CoRR abs/1810.13337 (2018).
