Precfix: Large-Scale Patch Recommendation by Mining Defect-Patch Pairs
Xindong Zhang, Alibaba Group, zxd139922@alibaba-inc.com
Chenguang Zhu∗, University of Texas at Austin, cgzhu@utexas.edu
Yi Li, Nanyang Technological University, yi_li@ntu.edu.sg
Jianmei Guo, Alibaba Group, jianmei.gjm@alibaba-inc.com
Lihua Liu, Alibaba Group, lihua.llh@alibaba-inc.com
Haobo Gu, Alibaba Group, haobo.haobogu@alibaba-inc.com
ABSTRACT
Patch recommendation is the process of identifying errors in soft-
ware systems and suggesting suitable fixes for them. Patch recom-
mendation can significantly improve developer productivity by re-
ducing both the debugging and repairing time. Existing techniques
usually rely on complete test suites and detailed debugging reports,
which are often absent in practical industrial settings. In this paper,
we propose Precfix, a pragmatic approach targeting large-scale
industrial codebase and making recommendations based on previ-
ously observed debugging activities. Precfix collects defect-patch
pairs from development histories, performs clustering, and extracts
generic reusable patching patterns as recommendations. We conducted an experimental study on an industrial codebase with 10K
projects involving diverse defect patterns. We managed to extract
3K templates of defect-patch pairs, which have been successfully
applied to the entire codebase. Our approach is able to make rec-
ommendations within milliseconds and achieves a false positive
rate of 22% confirmed by manual review. The majority (10/12) of
the interviewed developers appreciated Precfix, which has been
rolled out to Alibaba to support various critical businesses.
KEYWORDS
Defect detection, patch generation, patch recommendation.
ACM Reference Format:
Xindong Zhang, Chenguang Zhu, Yi Li, Jianmei Guo, Lihua Liu, and Haobo
Gu. 2020. Precfix: Large-Scale Patch Recommendation by Mining Defect-
Patch Pairs. In Software Engineering in Practice (ICSE-SEIP ’20), May 23–
29, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3377813.3381356
1 INTRODUCTION
Patch recommendation is the process of identifying errors in soft-
ware systems and suggesting suitable fixes. Recent studies show
that on average 49.9% of software developers’ time has been spent
∗ Corresponding author
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ICSE-SEIP ’20, May 23–29, 2020, Seoul, Republic of Korea
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7123-0/20/05. . . $15.00
https://doi.org/10.1145/3377813.3381356
in debugging and about half of the development costs are associated
with debugging and patching [5, 11, 49]. Automated patch recom-
mendation can significantly reduce developers’ debugging efforts
and the overall development costs, improving software quality and
system reliability.
Recommending patches automatically is a challenging task, es-
pecially for large-scale industrial codebase. Many state-of-the-art
techniques from the literature make assumptions on the existence
of auxiliary development artifacts such as complete test suites and
detailed issue tracking and debugging reports, which may not be
readily available in the day-to-day development environment. To
better understand the specific challenges in applying existing tech-
niques in the development environment of Alibaba, we investigated
the development practices and the entire codebase at Alibaba, and ended up extracting a sample benchmark set which includes over 10,000
software projects, spanning over 15 million commits and 30 million
files. We have made the following observations through the study.
First, the project codebase often has insufficient documentation
and manual labeling of defects and patches is hardly possible, which
makes accurate patch mining and generation difficult. For example,
recent studies proposed approaches based on clone detection [36],
patch mining [17], information retrieval [59], and machine learn-
ing [6, 26] for fault localization and repair, which naturally require
a large amount of labeled data. On the other hand, existing methods
for automatic defect and patch identification suffer from inaccu-
racy. For instance, the SZZ algorithm [48] only achieves less than
25% accuracy on our benchmark set, which is mainly due to the
inaccurate/missing bug reports and log messages.
Second, the patch recommendation process has to be highly re-
sponsive in suggesting patches, in order to be useful in the everyday
development routine. However, many existing fault localization
and patch generation techniques require dynamic execution of
test suites. For example, the spectrum-based [1, 18] and mutation-
based [35, 40] fault localization techniques both assume strong
test cases and utilize test execution results to identify defects. The
generate-and-validate approaches [7, 22, 32] for patch generation
search for candidate patches and validate their correctness using
test suites. The problem with these techniques is that the test suites
in our benchmark set may not be complete and the test execution
often takes a significant amount of time, making responsive online
patch recommendation impossible.
Finally, many fault localization and patch generation techniques
focus on specific domains such as concurrency [30, 31], HTML con-
tent generation [46], and memory leaks [10]. Yet, our benchmark
set covers a variety of business scenarios from diverse application
domains, including e-commerce, logistics, finance, cloud comput-
ing, etc. These domain-specific techniques may work well in their
targeted cases, but they are often not generalizable to the diverse
applications from the company codebase.
The above-mentioned characteristics of the codebase and development environment make existing techniques unsuitable for direct adoption. In this paper, we propose a pragmatic patch recommen-
dation approach Precfix, with customized improvements in terms
of both the precision and efficiency when applied on large-scale
industrial codebase. First, Precfix does not rely on labeled defects
or patches, which are difficult to obtain in practice. Instead, we
automatically mine a large number of defect-patch pairs from his-
torical changes. To improve the accuracy of the mining results, we
introduce optimizations which take into account the characteristics
of typical developer behaviors in committing bug fixing changes.
We also allow developers in the loop to provide feedback on the
quality of the recommended patches, and the feedback is used to
improve the precision of future recommendations. Second, since the
defects and patches are mined from the entire company codebase
and we use generic features when processing and clustering them,
Precfix is generally applicable to all company projects written in
different languages, handling different business logic, and deployed
on different platforms. Finally, Precfix consists of an offline patch
discovery phase and an online patch recommendation phase. The
online phase is designed to be extremely responsive and can finish
recommending patches within milliseconds in practice. We found
that being scalable and efficient is extremely important for patch
recommendation tools to be integrated into day-to-day interactive
and repetitive development tasks, such as code reviewing.
Precfix has been implemented and deployed as an internal web
service in Alibaba. It is also integrated as a part of the code review
process and provides patch recommendations whenever developers
commit new changes to the codebase. We evaluated the effective-
ness and efficiency of Precfix, and demonstrate its usefulness
on a large-scale industrial benchmark with over 10,000 projects,
spanning over more than 15 million commits and 30 million files.
Contributions. We make the following contributions in this paper.
• We propose Precfix— a semi-automated patch recommendation
tool for large scale industrial codebase.
• Precfix implements customized optimizations in defect-patch
pair mining and clustering, which help improve the accuracy of
patch recommendation over existing techniques.
• Precfix managed to extract 3K defect templates from 10K projects.
Our approach is able to make recommendations within millisec-
onds and achieves a false positive rate of 22% confirmed by man-
ual review.
• We conducted a small-scale user study and the majority (10/12)
of the interviewed developers appreciated Precfix, which has
been rolled out to Alibaba to support various critical businesses.
2 RELATED WORK
The techniques presented in this paper intersect with different areas
of research. In this section, we compare Precfix with fault local-
ization, automated patch generation, and patch recommendation.
2.1 Fault Localization
Fault localization [41, 51, 54, 60] is the activity of identifying the
locations of faults in a program. Many different types of fault lo-
calization techniques have been proposed. Spectrum-based fault localization [1, 18] utilizes test coverage information to pinpoint faulty program entities using ranking or statistical techniques. For example, Tarantula [18] uses a ranking metric of the same name to calculate the suspiciousness values of statements, which are computed according to the frequency of the statements in passing and failing test cases. Mutation-based
fault localization [35, 40] mutates a program and runs its test cases,
using the test results to locate faults. For example, Metallaxis [40]
generates a set of mutants for each statement, assigns each mutant
a suspiciousness score, and aggregates the scores to yield the sus-
piciousness of the statement. Other fault localization techniques
identify locations of faults in some alternative ways, including
dynamic program dependency analysis [2, 45], stack trace analy-
sis [53, 55], conditional expressions mutation analysis [58], infor-
mation retrieval [59], and version history analysis [23, 44].
2.2 Automated Patch Generation
Automated patch generation [11, 34, 50, 57] aims to automatically
repair software systems by producing a fix that can be validated be-
fore it is fully accepted into the system. Automated patch generation
techniques can be divided into two main categories: generate-and-
validate approaches and semantics-driven approaches. Generate-
and-validate approaches [7, 22, 32, 43, 52] iteratively execute two
activities: patch generation, which produces candidate patches for the bug by making atomic changes or applying bug fix templates;
patch validation, which checks the correctness of the generated
solutions by running test cases. For example, GenProg [52] uses ge-
netic programming to guide the generation and validation activities.
At every iteration, it randomly determines the location to apply an
atomic change, according to a probability distribution that matches
the suspiciousness of the statements computed with fault localiza-
tion algorithms. Every candidate solution is validated by running the available test suite. It defines a fitness function that measures the
fitness of each program variant based on the number of passing and
failing test cases. Semantics-driven approaches [21, 30, 38] formally
encode the problem, either as a formula whose solutions correspond
to the possible fixes, or as an analytical procedure whose outcome
is a fix. A solution found by such approaches is correct by construc-
tion, thus no validation is needed. For example, SemFix [38] uses
fault localization to identify the statement that should be changed,
then tries to synthesize a fix by modifying a branch predicate or
changing the right hand side of an assignment.
2.3 Patch Recommendation
Patch recommendation [3, 16, 20, 24, 36] suggests a few candidate
changes which may repair a given fault. In some cases, the recom-
mended patches are perfect fixes, while in other cases some efforts
are required from the developers to produce the final fix. Although
these techniques do not guarantee a working repair, their results
are still useful in assisting developers in deriving the patch. A num-
ber of patch recommendation techniques have been proposed so
far. For example, Getafix [3] from Facebook learns recurring fix
patterns for static analysis warnings and suggests fixes for future
occurrences of the same bug category. It first splits a given set
of example fixes into AST-level edits, then it learns recurring fix
patterns from these edits based on a clustering technique which
produces a hierarchy of fix patterns. Finally, given a bug under fix,
it finds suitable fix patterns, ranks candidate fixes, and suggests the
top-most fixes to developers. As another example, CLEVER [36]
from Ubisoft aims to intercept risky commits before they reach the
central repository. It first builds a metric-based model to assess the risk level of incoming commits, then it uses clone detection to
compare code blocks in risky commits with some known historical
fault-introducing commits.
All these aforementioned techniques depend on existing patches
or already-known bug patterns. In contrast, we do not assume the availability of debugging reports, and we extract templates of defect-patch pairs through data mining. Open-source datasets such as Defects4J [19] contain labeled defects and the corresponding patches, which have been examined and analyzed by many researchers. Yet, recent studies [56] indicate that many state-of-the-art patch generation techniques have the problem of over-fitting to specific benchmark sets. Therefore, many of the existing techniques can-
not be directly applied on the industrial codebase, which is quite
different from the open-source dataset in many ways.
3 PRELIMINARY STUDY
To better understand the codebase at Alibaba and the challenges in
applying existing techniques in the industrial development envi-
ronment, we conducted a preliminary study of the usage scenarios
of patch recommendation techniques within the company and em-
pirically analyzed the characteristics of the company codebase.
3.1 Challenges for Existing Techniques
Through manual inspection, interviews with developers, and em-
pirical studies, we identified three key challenges for existing fault
localization and automated patch generation techniques to be suc-
cessfully applied on our benchmark.
Insufficient Labeled Data. A lot of fault localization and auto-
mated patch generation techniques require labeled defect and patch
samples to be able to extract patterns of typical bug fixes. Yet, this
is a substantial obstacle in our case, since there exist very few
labeled defects or patches in the company codebase. Moreover, due
to the widespread legacy code in the codebase, a large number of
software projects only have partial debugging reports and very
limited test cases. The commit messages may be succinct and do
not follow any standard template either. Therefore, it is challenging
to label defects and the associated fixes manually, given the size and
complexity of the codebase. The business logic and bug fix patterns
of the internal company codebase are quite different from that of
open source projects [42, 47]. Thus, we decide not to directly use
the labeled data from open-source projects.
High Responsiveness Standard. The application scenario of patch
recommendation in Alibaba is highly interactive. Patch recommen-
dation needs to be run whenever new commits are submitted by
developers for code review. The recommended patches are then
checked by developers, who may decide to incorporate the sugges-
tions into the commits. On average, a developer submits three to
four commits per week, and both the submitter and reviewer expect prompt patch recommendations to avoid delays during the review
process. Therefore, the response time for patch recommendation
is supposed to be reasonably low in order to be integrated into
the development routine. This renders some automated patch gen-
eration approaches inappropriate, since they need to repeatedly
compile and execute tests for each identified defect.
Generalizability Requirement. Our benchmark set consists of
more than 10K projects supporting more than 100 software appli-
cations. These applications cover a variety of domains including
e-commerce, finance, cloud computing, and artificial intelligence,
many of which are used by millions of users on a daily basis. The
patch recommendation techniques should be generalizable to cover
all different projects and defect types.
3.2 Challenges in Defect-Patch Identification
There exist techniques [27–29, 48] that automatically identify
changes of certain kinds, e.g., bugs and fixes, from commit histories.
These techniques, such as the SZZ algorithm [48], can be useful for
labeling defect-patch pairs in the absence of high-quality labeled
data. The high-level idea is to first locate bug-fixing commits based
on keywords in commit messages and debugging reports. For ex-
ample, a commit is considered a bug-fixing commit if it appears
in a bug report or its commit message contains keywords such as
“bug” and “fix”. For each identified bug-fixing commit, one can trace
each line of changed code back in history to locate the bug-inducing
commits using version control information such as git-blame [12].
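As an illustration, the keyword-matching step can be approximated by a short routine like the following sketch (our own simplification, not the SZZ implementation); the keyword list follows the one given later in this section, with the Chinese translations omitted:

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch: flag a commit as a candidate bug-fixing commit if its message
// contains any fix-related keyword (case-insensitive).
public class BugFixCommitMatcher {
    private static final List<String> KEYWORDS =
            Arrays.asList("fix", "bug", "repair", "wrong", "fail", "problem");

    public static boolean isCandidateBugFixingCommit(String commitMessage) {
        String msg = commitMessage.toLowerCase(Locale.ROOT);
        return KEYWORDS.stream().anyMatch(msg::contains);
    }
}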
Various optimizations have been introduced in the SZZ algo-
rithm to reduce the false positives. For example, the timestamp of a
candidate bug-inducing commit is compared with the time when
the bug is reported to rule out unreasonable results. Yet, we found
that even with these optimizations, the SZZ algorithm still does not
perform well on our benchmark (25% true positive rate). To figure
out the reason why the SZZ algorithm does not perform well on
our codebase, we quantitatively analyzed the dataset and identified
two stages in the algorithm where imprecision can be introduced:
(1) the identification of bug-fixing commits, and (2) the location of
bug-inducing commits via back-tracing in history.
Imprecision in Locating Bug-Fixing Commits. The first reason
causing imprecision is that the keyword matching approach is
not always reliable. For example, through manual inspection, we
discovered that a commit with the commit message containing
the keyword “fix”, does not change any program logic and only
modifies the label of an Android UI button. Even if a commit does fix a bug, it can also make irrelevant changes to other parts of the code.
Fig. 1 shows the distribution of the number of changed files in each
commit matched by at least one keyword. The keywords we used
include “fix”, “bug”, “repair”, “wrong”, “fail”, “problem”, and their
corresponding Chinese translations. About 60% of the commits
change more than one file, among which 40% touch over two files.
Many such commits are multi-purpose, causing imprecision in
locating bug-fixing commits.

Figure 1: Distribution of the number of changed files per keyword-matched commit (F/C). F/C = 1 accounts for 41.3%, F/C = 2 for 19.6%, F/C = 3 for 9.8%, F/C = 4 for 6.6%, and F/C ≥ 5 for 22.7%.

Figure 2: Distributions of the number of bug-inducing commits (BC).
Imprecision in Locating Bug-Inducing Commits. Additional
imprecision can be introduced when there are multiple bug-inducing
commits in the history. When back-tracing in history, it is possible
to end up with more than one commit, each partially contributing to the defect. We randomly selected four projects from the dataset. Fig. 2 plots the distribution of the number of bug-inducing commits identified by the SZZ algorithm for each bug-fixing commit. For example, for project P1, among all the defects, 78% are induced by
one commit, 14% are induced by two commits, 4% are induced by
three commits, and 3% are induced by more than three commits.
We found that the commit-level back-tracing often introduces too many irrelevant changes, for reasons similar to those discussed before.
4 OUR APPROACH
Fig. 3 overviews the workflow of Precfix. Precfix consists of an
offline patch discovery component and an online patch recommen-
dation component. The patch discovery component first extracts
potential defect-patch pairs from commits in the version controlled
history, clusters defect-patch pairs based on their similarity, and
finally extracts generic patch templates to be stored in a database.
The patch recommendation component recommends patch candi-
dates to developers and collects their feedback. It removes patches
rejected by developers, and includes manually crafted patch tem-
plates submitted by developers to improve the patch database.
4.1 Patch Discovery
The offline patch discovery component performs three steps to
generate patch templates: extracting defect-patch pairs, clustering
defect-patch pairs, and collecting generic patch templates.
Figure 3: Overview of the Precfix workflow.

4.1.1 Extracting Defect-Patch Pairs. The first step is to extract a large number of defect-patch pairs from the codebase. We make
two improvements to the SZZ algorithm, to adapt to the needs of
the industrial codebase.
Constraining the Number of Changed Files. The first adjustment is to set a threshold on the number of files modified in a bug-fixing commit, filtering out any commit that exceeds the threshold. In this way, we could reduce the false positives that are caused by multi-purpose commits (cf. Sect. 3). We heuristically chose five as the threshold, which helps reduce false positives without discarding too many candidate commits (22.7% as indicated in Fig. 1).
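A minimal sketch of this filter is shown below (our own illustration; the list of changed file paths is assumed to come from the commit metadata):

import java.util.List;

// Sketch: keep a keyword-matched commit as a bug-fixing candidate only if it
// touches at most five files; larger commits are likely multi-purpose.
public class ChangedFileFilter {
    private static final int MAX_CHANGED_FILES = 5;

    public static boolean passesFileCountFilter(List<String> changedFilePaths) {
        return changedFilePaths.size() <= MAX_CHANGED_FILES;
    }
}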
Identifying Bug-Inducing Code Snippets. The second improve-
ment aims to improve the precision of identifying bug-inducing
changes. As studied in Sect. 3, commit-level back-tracing in his-
tory tends to be imprecise, especially when there are many multi-
purpose commits contributing to the defect.
Therefore, instead of tracing commits back in time and identify-
ing bug-inducing commits, we directly identify bug-inducing code
snippets. In fact, the differences before and after the introduction of
a bug-fixing commit already contain information about both the de-
fects and the patches. Now we describe the improved method-level
defect-patch pair extraction in more details.
Method-Level Defect-Patch Pair Extraction. To further reduce
false positives, we constrain the defects and patches as the removed
and inserted lines in a bug-fixing commit, respectively. We confine
the scope of defects and patches to be within a single method. Of
course, these constraints may rule out some genuine defect-patch
pairs, e.g., patches only removing old lines or inserting new lines.
Our current bug-fix model balances completeness and accuracy, attempting to be simple but applicable to the most com-
mon cases. We randomly sampled 90 bug fixes from the benchmark
set and confirmed that 68 (76%) of them contain both removed
and inserted lines, which fall into our simplified bug model. We
found in practice that the compromises made in the recall (24%)
help significantly improve the precision.
Given a bug-fixing commit and the source code of the project,
the workflow of extracting defect-patch pairs is the following.
(1) Check out the last-updated snapshot before the bug-fixing com-
mit and record inserted and removed lines in the commit for
later parsing.
(2) Extract inserted lines in the commit, which are patch snippets.
(3) Extract removed lines in the commit, which are defect snippets.
(4) Associate each defect snippet with the corresponding patch
snippet, forming defect-patch pairs.
(5) Filter out the lines of defect snippets that are not in the same
method scope as any patch lines, making defect-patch pairs
more cohesive.
(6) Filter out assertion lines, comments, and logging lines from the
defect-patch pairs as they are generally irrelevant to bug fixes.
At this point, we have obtained a set of defect-patch pairs, on
which we perform the remaining steps of the patch discovery.
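The pairing and filtering steps above can be sketched as follows (our own simplification for illustration; the DiffLine type and its fields are hypothetical, and a real implementation would work on parsed diff hunks and method scopes):

import java.util.ArrayList;
import java.util.List;

public class DefectPatchExtractor {

    // A single removed or inserted line of a bug-fixing commit, together with
    // the name of the method that encloses it (null if outside any method).
    public static class DiffLine {
        final String text;
        final String enclosingMethod;
        DiffLine(String text, String enclosingMethod) {
            this.text = text;
            this.enclosingMethod = enclosingMethod;
        }
    }

    public static class DefectPatchPair {
        final List<String> defect = new ArrayList<>(); // removed lines
        final List<String> patch = new ArrayList<>();  // inserted lines
    }

    // Lines that are generally irrelevant to bug fixes (step 6 above).
    static boolean isNoise(String line) {
        String t = line.trim();
        return t.isEmpty() || t.startsWith("//") || t.startsWith("assert")
                || t.contains("log.") || t.contains("LOG.");
    }

    // Pair removed (defect) and inserted (patch) lines, keeping only defect
    // lines that share a method scope with at least one patch line (step 5).
    public static DefectPatchPair extract(List<DiffLine> removed, List<DiffLine> inserted) {
        DefectPatchPair pair = new DefectPatchPair();
        List<String> patchMethods = new ArrayList<>();
        for (DiffLine ins : inserted) {
            if (!isNoise(ins.text)) {
                pair.patch.add(ins.text);
                patchMethods.add(ins.enclosingMethod);
            }
        }
        for (DiffLine rem : removed) {
            if (!isNoise(rem.text) && patchMethods.contains(rem.enclosingMethod)) {
                pair.defect.add(rem.text);
            }
        }
        return pair;
    }
}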
4.1.2 Clustering Defect-Patch Pairs. To obtain common defect patterns in the codebase, we group all the extracted defect-patch
pairs into a set of clusters.
We use density-based spatial clustering of applications with
noise (DBSCAN) [9] as our clustering algorithm. DBSCAN is a well-
established clustering method without the need to determine the
number of clusters (k) in advance. Given a set of points in some
vector space, DBSCAN groups points that are closely packed to-
gether (nearby neighbors), marking as outliers those points that lie
in low-density regions. We choose DBSCAN instead of other clus-
tering algorithms, such as KNN [8] and k-means, because in our case, the number k could not be known in advance. A well-known
limitation of the vanilla DBSCAN algorithm is that it is computa-
tionally expensive when coping with large-scale datasets [37], since it considers all possible comparisons between data points. In
our application scenario, most of the comparisons are unnecessary,
as many code snippets are very loosely related. Therefore, we make
three customized improvements to DBSCAN to mitigate this issue.
Utilizing SimHash-KDTree Reducers. The first customization
is to use SimHash-KDTree reducers to avoid the comparison of some
irrelevant data points. SimHash [33] is an algorithm that generates a
low-dimensional vector signature for any high-dimensional vector.
The generated low-dimensional vector has the same expressiveness
as the original high-dimensional vector, but is shorter in length,
thus enables faster comparison. During preprocessing, we use the
SimHash algorithm to map the contents of code snippets to 16-bit
hashcode sequences. Then we run the KDTree algorithm [4] to
group all the sequences that have a Hamming distance less than
4 into the same reducer. During the clustering, we only compute
similarity of code snippets in the same reducer.
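The following sketch illustrates the mechanics with a 16-bit SimHash signature and Hamming-distance grouping (our own simplification: it hashes tokens with String.hashCode and groups signatures by a linear scan rather than a KD-tree, so it only conveys the idea, not the scalable implementation):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SimHashReducers {

    // 16-bit SimHash over whitespace-separated tokens of a code snippet.
    static int simHash16(String snippet) {
        int[] votes = new int[16];
        for (String token : snippet.split("\\s+")) {
            int h = token.hashCode() & 0xFFFF; // keep 16 bits
            for (int bit = 0; bit < 16; bit++) {
                votes[bit] += ((h >> bit) & 1) == 1 ? 1 : -1;
            }
        }
        int signature = 0;
        for (int bit = 0; bit < 16; bit++) {
            if (votes[bit] > 0) signature |= (1 << bit);
        }
        return signature;
    }

    static int hamming(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    // Place snippets whose signatures differ in fewer than 4 bits into the
    // same reducer; only snippets within a reducer are compared later.
    static List<List<String>> toReducers(List<String> snippets) {
        List<List<String>> reducers = new ArrayList<>();
        List<Integer> reducerSignatures = new ArrayList<>();
        for (String snippet : snippets) {
            int signature = simHash16(snippet);
            boolean placed = false;
            for (int i = 0; i < reducerSignatures.size(); i++) {
                if (hamming(signature, reducerSignatures.get(i)) < 4) {
                    reducers.get(i).add(snippet);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                reducers.add(new ArrayList<>(Collections.singletonList(snippet)));
                reducerSignatures.add(signature);
            }
        }
        return reducers;
    }
}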
Exploiting API Sequence Information. The second improve-
ment is to exploit the information from API call sequences to avoid
irrelevant comparisons. The intuition behind this is that if two
code snippets contain two completely different API call sequences,
then they obviously do not belong to the same patch pattern, and thus should not be compared during the clustering. Therefore, we
parse the ASTs of defect-patch pairs and extract API call sequences
for each code snippet in advance. During clustering, even if two
code snippets are in the same reducer, as long as their API call
sequences do not match, we skip comparing them. We name this filtering process APISeq.
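A rough sketch of APISeq is given below (our own illustration: a regular expression stands in for the AST-based extraction described above):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApiSeqFilter {
    // Matches ".methodName(" occurrences as a crude stand-in for API calls.
    private static final Pattern CALL = Pattern.compile("\\.([A-Za-z_][A-Za-z0-9_]*)\\s*\\(");

    static Set<String> extractApiCalls(String snippet) {
        Set<String> apis = new LinkedHashSet<>();
        Matcher m = CALL.matcher(snippet);
        while (m.find()) {
            apis.add(m.group(1));
        }
        return apis;
    }

    // Compare two snippets only if their API call sets overlap.
    static boolean shouldCompare(String snippetA, String snippetB) {
        Set<String> shared = extractApiCalls(snippetA);
        shared.retainAll(extractApiCalls(snippetB));
        return !shared.isEmpty();
    }
}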
Normalizing Code Snippets. We also improve the clustering ac-
curacy of DBSCAN by normalizing code snippets. Some common
coding patterns have multiple equivalent ways of expression. These
ways, although semantically equivalent, are syntactically different, which decreases the accuracy of clustering. For instance, a string value could be expressed as either a single string literal or the concatenation of several string literals. When we compare code snippets, we want to consider these equivalent expressions as the same. Therefore, we normalize the code snippets to convert these semantically equivalent snippets into a syntactically identical format. Our normalization rules are as follows: (1) merge consecutive calls of the same API into one API call sequence, concatenated by the dot (".") operator; (2) change ".append(string)" to the concatenation of strings; and (3) merge the concatenation of strings into a single string.

Algorithm 1: Similarity computation of defect-patch pairs.
input : ⟨d1, p1⟩, ⟨d2, p2⟩ – two defect-patch pairs; R – the set of reducers obtained from SimHash-KDTree; w – the weight of the Jaccard similarity coefficient;
output : sim – the similarity score of the two input defect-patch pairs;
1  if there is no r ∈ R such that ⟨d1, p1⟩ ∈ r and ⟨d2, p2⟩ ∈ r then
2      return 0;
3  seq1 ← ExtractAPISeq(⟨d1, p1⟩);
4  seq2 ← ExtractAPISeq(⟨d2, p2⟩);
5  if seq1 ∩ seq2 = ∅ then
6      return 0;
7  simd ← ComputeSimilarityScore(d1, d2, w);
8  simp ← ComputeSimilarityScore(p1, p2, w);
9  sim ← (simd + simp) ÷ 2;
10 return sim;
11 Function ComputeSimilarityScore(s1, s2, w):
12     return w × ComputeJac(s1, s2) + (1 − w) × ComputeLev(s1, s2);
DBSCAN computes the distance between two defect-patch pairs
based on their similarity. The computation of similarity is described
in Algorithm 1. Given two defect-patch pairs, the algorithm first
checks whether they are in the same reducer. The reducers are ob-
tained using the customized hashing algorithm (SimHash + KDTree).
If they are not in the same reducer, then the similarity computation
is skipped (Line 2). Otherwise, the algorithm invokes APISeq, an additional filter based on the API sequences used, on each defect-patch pair to extract API sequences (Lines 3-4). If there is no intersec-
tion between the two API sequences, the algorithm also skips the
similarity computation (Line 6).
The remaining defect-patch pairs after the filtering proceed to
the similarity computation. The similarity function we used in
Algorithm 1 is a weighted sum of two similarity metrics: the Jac-
card similarity coefficient [39] and the Levenshtein distance [25]
(Lines 7-9). We choose these two similarity metrics because the Jac-
card similarity coefficient is able to capture the token overlapping
rate while the Levenshtein distance has the ability to capture the
ordering information of the token list.
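The weighted combination used by ComputeSimilarityScore can be sketched as follows (our own illustration over token lists; the weight of 0.9 for the Jaccard coefficient follows the heuristic reported in Sec. 5.5):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SnippetSimilarity {

    // Token-overlap rate, ignoring ordering.
    static double jaccard(List<String> a, List<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(new HashSet<>(b));
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    // Levenshtein distance over token sequences, normalized to a similarity.
    static double levenshteinSimilarity(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        int maxLen = Math.max(a.size(), b.size());
        return maxLen == 0 ? 1.0 : 1.0 - (double) d[a.size()][b.size()] / maxLen;
    }

    // sim = w * Jaccard + (1 - w) * Levenshtein similarity.
    static double similarity(List<String> a, List<String> b, double w) {
        return w * jaccard(a, b) + (1 - w) * levenshteinSimilarity(a, b);
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("if", "(", "sb", ".", "length", "(", ")", ">", "1", ")");
        List<String> s2 = Arrays.asList("if", "(", "builder", ".", "length", "(", ")", ">", "1", ")");
        System.out.println(similarity(s1, s2, 0.9));
    }
}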
4.1.3 Collecting Generic Patch Templates. After the clustering step
finishes, for each pattern cluster, Precfix tries to extract a generic
patch template, which summarizes the common pattern of defect-
patch pairs in that cluster. Each template encodes a pattern of defect
and its corresponding patch.
Patch1:
sb.append(token);
if (sb.length() > 1) {
    sb.deleteCharAt(sb.length() - 1);
}

Patch2:
if (builder.length() > 1) {
    builder.deleteCharAt(builder.length() - 1);
}

Generic Patch Template:
if (@Para{sb, builder}.length() > 1) {
    @Para{sb, builder}.deleteCharAt(@Para{sb, builder}.length() - 1);
}

Figure 4: The process of extracting templates.
The goal of template generalization is to make the patch gener-
ally applicable in different contexts and understandable to develop-
ers. Although all the defect-patch pairs from the same cluster are
similar, they differ from each other in some context-specific way,
e.g., variable names. We abstract away all context-specific contents
and only preserve the generic templates.
The first step of template generalization is to convert code lines into a list of tokens, where each token is either a symbol or a parameter. Symbols are context-independent; together they form the common patterns among all the defect-patch pairs in a cluster. Parameters are context-specific and represent abstractions of variable names.
When a template is applied at a specific context, its parameters are
supposed to be replaced by the actual variable names.
The second step is to extract templates from the token lists. We
use the recursive longest common substring (RLCS) algorithm [15]
to perform matching between token lists. RLCS recursively calls
longest common substring matching until all the tokens are matched.
The highlighted parts in Fig. 4 are the matched tokens after the
RLCS process (Step 1). We adjust the vanilla RLCS so that different
parameter names are not matched while the other symbols are.
Finally, we collect all the parameter names that are matched in the same cluster into a list and store them in our self-defined parameter format. Step 2 in Fig. 4 shows that different parameter names are recognized, collected, and stored in the standard format: @Para{parameter name list}.
Finally, for each defect-patch pair, we change the parameter
names in the patch snippet to match the parameter names in the
corresponding defect snippet. After this step, Precfix constructs a
patch template database, which contains all the extracted templates.
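As an illustration of how a stored template is later instantiated, the following sketch (our own simplification; the binding map and its keys are hypothetical) replaces @Para{...} placeholders with the concrete variable names found at the match site:

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateInstantiator {
    // Matches placeholders of the form @Para{parameter name list}.
    private static final Pattern PARA = Pattern.compile("@Para\\{([^}]*)\\}");

    // 'bindings' maps a parameter name list (e.g. "sb, builder") to the
    // concrete variable name used in the code under review.
    static String instantiate(String template, Map<String, String> bindings) {
        Matcher m = PARA.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String concrete = bindings.getOrDefault(m.group(1), m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(concrete));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "if (@Para{sb, builder}.length() > 1) {\n"
                + "    @Para{sb, builder}.deleteCharAt(@Para{sb, builder}.length() - 1);\n"
                + "}";
        System.out.println(instantiate(template, Map.of("sb, builder", "builder")));
    }
}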
4.2 Patch Recommendation
The online patch recommendation component is triggered when-
ever developers commit new code changes. During code review,
Precfix matches each newly committed code snippet against the template database. If a code snippet matches a template, Precfix replaces the generic parameter placeholders in the template with the actual parameter names used in the specific contexts. If multiple templates are matched, they are ranked based
on the frequency of the corresponding defect-patch pairs before
clustering. By default, Precfix recommends the most popular patch
candidate. The patch candidate is then sent to the code submitters
and reviewers for feedback. We collect feedback from developers to
validate the patch candidates and improve the template database of
Precfix. More specifically, the defects and the corresponding rec-
ommended patches are presented to the developers. As developers
inspect each recommended patch, we also collect their reactions
on the patch, i.e., accept or reject. If patches from a certain template
are frequently rejected, we will manually inspect the template and
decide whether to remove it.
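A simple sketch of this ranking and feedback-driven filtering is shown below (our own illustration; the supportCount and rejectionCount fields and the rejection threshold are hypothetical bookkeeping, whereas in Precfix frequently rejected templates are inspected manually before removal):

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TemplateRanker {

    public static class Template {
        final String patch;
        final int supportCount;   // defect-patch pairs behind this template
        final int rejectionCount; // developer rejections collected as feedback
        Template(String patch, int supportCount, int rejectionCount) {
            this.patch = patch;
            this.supportCount = supportCount;
            this.rejectionCount = rejectionCount;
        }
    }

    // Drop templates flagged by too many rejections, then rank the remaining
    // ones by how many defect-patch pairs supported them before clustering.
    static List<Template> rank(List<Template> matched, int rejectionThreshold) {
        return matched.stream()
                .filter(t -> t.rejectionCount < rejectionThreshold)
                .sorted(Comparator.comparingInt((Template t) -> t.supportCount).reversed())
                .collect(Collectors.toList());
    }
}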
During the patch validation process, in addition to feedback on
the recommended patches, Precfix also accepts patch templates
created by developers and is able to integrate them into Precfix’s
template database. We encourage developers to make custom ad-
justments to the existing patches or build their own patch templates
based on their experience. We believe this contribution mechanism
can enrich the template database in the long run.

(a) Inline view of recommended patch. (b) Detailed view of defect-patch pairs and the template suggestion form.
Figure 5: Precfix user interface for patch recommendation.
Fig. 5 shows the Precfix user interface for patch recommen-
dation. Developers interact with it during code reviews. Fig. 5(a)
shows the inline view of an identified defect and the correspond-
ing patch recommended. Fig. 5(b) shows the detailed view of the
defect-patch pairs (left) and the developer template suggestion
form (right). The left side shows the list of defect-patch pairs from
which the recommended template was extracted. The details of
each defect-patch pair can be expanded, viewed, and voted for or
against. Specifically, the two records shown in Fig. 5(b) are from
the files “.../Rec***dler.java” and “.../Trip***Impl.java”,
respectively. The right side is a form allowing developers to devise
their own patches and submit to the database.
5 IMPLEMENTATION AND EVALUATION
In this section, we describe the implementation details of Precfix
and evaluate its effectiveness and efficiency on our benchmark set.
Precfix is implemented on top of the cloud-based data process-
ing platform, MaxCompute [14], developed by Alibaba. The commit
history data is preprocessed and stored in data tables on the Alibaba
Cloud [13] storage first. The defect-patch pair extraction is imple-
mented as a set of SQL scripts and user-defined functions (about
900 LOC). The clustering of defect-patch pairs is highly parallelized
(taking about 1 KLOC Java code) and handled by the MapReduce
engine of MaxCompute. The online patch recommendation com-
ponent is integrated with the internal code review process and is
triggered whenever a new commit is submitted.
The goal of our empirical evaluation is to answer the following
research questions. RQ1: How effective is Precfix in locating de-
fects and discovering patches? RQ2: How efficient is Precfix in
recommending patches? RQ3: How much do our improvements
of DBSCAN increase the efficiency of clustering? RQ4: What kind
of patches does Precfix recommend? RQ5: What are the users’
opinions on the usability of Precfix?
We run the offline patch discovery component of Precfix on the
randomly extracted sample dataset described in Sec. 3. The sample
dataset includes 10K projects, 15M commits, and 30M files. On
this dataset, Precfix extracted 3K bug fix templates, forming the
template database. During the clustering stage, 4,098 MB memory
was allocated for each reducer.
5.1 Effectiveness
We use two metrics for measuring the effectiveness of Precfix:
False Positive Rate (FPR) and Extractability Score (ES).
FPR measures the quality of patches, which is calculated as:

FPR = (N_total − N_correct) / N_total    (1)

where N_total denotes the total number of bug fix templates extracted by Precfix, and N_correct denotes the number of correct templates extracted by Precfix. A low FPR indicates high quality of patches.
The ES score measures the effectiveness of Precfix in extracting templates from clusters, and is calculated as:

ES = N_templates / N_clusters    (2)

where N_clusters denotes the number of clusters obtained in the clustering step of patch discovery, and N_templates denotes the number of bug fix templates extracted from the clusters. A high ES score
indicates high effectiveness in extracting templates. In the ideal
case, Precfix is supposed to extract one template from every clus-
ter. However, in practice, there could be some clusters of which the
defect-patch pairs are too complex to summarize. Thus, Precfix
sometimes could not achieve 100% ES score.
We randomly sampled 170 templates extracted by Precfix in
our experiment and manually inspected them. To reduce potential
bias in our inspection, we applied the following rule of thumb in
determining the correctness of a template: (1) if a template is too
generic, e.g., modification to a “for-loop” bound, it is often the result
of unintended clustering of similarly structured but semantically
unrelated code snippets; and (2) if a template is too specific to a
certain context, such as replacing an "if" condition using some project-specific APIs, it cannot be easily migrated to other application contexts. We consider the templates falling into either category
incorrect. According to these criteria, the FPR obtained from the
inspection is 22%. Note that in our application scenario, this number
is supposed to be gradually reduced as developers provide feedback
on the discovered patches and contribute their own patches.
To have a direct comparison with the SZZ algorithm, we ran our
implementation of SZZ on the same sample dataset (10K projects)
and compared it with Precfix. Since SZZ and Precfix have different objectives, an end-to-end comparison is not possible. Instead,
we compared the phase of locating defects which is common for
both techniques. As introduced in Sec. 3.2, SZZ locates defects in
two steps: finding bug-fix commit with keyword matching and
identifying the bug-inducing commits with back-tracing. We also
manually examined 170 bug-fix commits identified by SZZ, and
only reported incorrect ones which are easy to confirm. Examples
of such false positives include comments and log file modifications,
import statement insertions and deletions, style changes, error mes-
sage modifications, etc. The FPR of SZZ is 63% (107 out of 170)
according to the inspection. In contrast, Precfix locates defects
with the extracted templates. Most of the aforementioned incorrect
bug-fixes are caused by keyword matching, which could be effec-
tively reduced in the clustering stage. As a result, Precfix achieves
better precision in identifying defects (FPR 22%).
We also randomly sampled and inspected 100 clusters obtained by Precfix and the templates extracted from these clusters. The inspection confirms that Precfix successfully extracted templates from 93 clusters. In other words, the ES score obtained by Precfix
on the sample dataset is 93%.
Answer to RQ 1
Overall, Precfix achieves a false positive rate of 22% in
patch discovery and an extractability score of 93%.
5.2 Efficiency
As invoking patch recommendation on any code snippet only takes several milliseconds, we mainly evaluate the efficiency of Precfix for patch discovery. We measure the time taken by each step on the sample dataset: extracting defect-patch pairs, clustering defect-patch pairs, and extracting templates.

Figure 6: Execution time of each phase of patch discovery.

Fig. 6 shows the time consumed in each step. In summary, the defect-patch pair extraction step takes 22 min; the clustering is the most time-consuming step, taking 270 min; and the template extraction step takes 5 min.
Answer to RQ 2
Of the steps of patch discovery, extracting defect-patch pairs, clustering defect-patch pairs, and extracting templates consume 22 min, 270 min, and 5 min, respectively.

To answer RQ3, we investigated how much our improvements to the vanilla DBSCAN improve the efficiency. We compared the execution time of our default SimHash-KDTree + APISeq DBSCAN clustering with the raw DBSCAN clustering and the SimHash-KDTree + DBSCAN clustering. The result of the comparison is shown in Fig. 7. As is shown, the execution time of the default SimHash-KDTree + APISeq DBSCAN is 4.5 hours, which saves 95% of the time compared with the raw DBSCAN. The execution time of SimHash-KDTree DBSCAN is 16.5 hours, which saves 75% compared with the raw DBSCAN. This result confirms that both SimHash-KDTree and APISeq improve the efficiency of clustering.

Figure 7: Execution time of different clustering approaches.

Answer to RQ 3
Our two improvements to DBSCAN, SimHash-KDTree and APISeq, improve the efficiency of clustering by 75% and 20%, respectively.

5.3 Patch Categories
We randomly sampled 50 templates from the template database and manually inspected each one. Based on the inspection results, we categorized these templates into four categories: API modification, validation check, API wrapping, and others. Fig. 8 shows the distribution of templates of each category. There are 20 templates assigned to the API modification category, accounting for 40% of the total number; 13 templates (26%) are assigned to the validation check category; 7 templates (14%) are assigned to the API wrapping category; 10 templates (20%) are assigned to the "others" category.

Figure 8: Distribution of patches in each category.

API Modification. Templates in this category fix the defect by modifying the call of an API, including: (1) updating the caller of an API, (2) adding, updating, or deleting the parameters of an API, and (3) updating the return type of an API. Fig. 9 is a template of API modification. It deletes a parameter of the buildReqParams() method, and adds two new parameters.

  multipleSource.setParams(
      MultiSourceConvertUtil.buildReqParams(
-         itemSku.getItemId().getValue(),
+         itemSku.getConfigId(),
          itemSku.getSkuId(),
          itemSku.getSellerId(),
+         itemSku.getGpuId()},
          multipleSource.getPageSize(),
          multipleSource.getPageIndex()));

Figure 9: An example of API modification.

Validation Check. Templates in this category add or modify validation checks. Such checks include: (1) null pointers, i.e., adding "if(xxx==null) return;" above defect snippets, (2) boundary conditions, i.e., adding or modifying conditions in conditional expressions such as "if" or "while" expressions, and (3) access permissions, i.e., replacing the variable assignment with a ternary operation such as "a = b == null ? 0 : 1". Fig. 10 is a template of validation check. It adds a condition checking that accountStatus is not null and its length is greater than zero.

+ if (accountStatus != null && accountStatus.length > 0) {
      query.addCondition(new In(STR, accountStatus));
+ }

Figure 10: An example of validation check.

API Wrapping. Templates in this category fix the defect by wrapping logic-independent code snippets to form an API (mostly a utility API), which improves code quality and facilitates software maintenance, preventing the introduction of defects due to overly complex code. Fig. 11 is a template of API wrapping. It packages the original code snippets into getUrl() and requestAndCheckThenFillResult(), forming two new APIs.

- String ip = host.getIp();
- String url = "http://".concat(ip).concat(flowHost);
- HttpResponse<String> response = httpRequest.asString();
- ErrorUtil.checkError(httpRequest, response, TREND, start);
- bodys.add(response.getBody());
+ String url = UrlUtil.getUrl(headHost, flowHost);
+ UrlUtil.requestAndCheckThenFillResult(httpRequest, bodys, TREND, start);

Figure 11: An example of API wrapping.

Other. Templates in this category fix the defect in some special ways, which usually depend on the context. For example, Fig. 12 shows a template fixing a defect by changing the name of a class.

- String cur = CurrencyConvertUtil.getAvailableCurrency(currencyCode);
+ String cur = MoneyConvertUtil.getAvailableCurrency(currencyCode);

Figure 12: An example of other fixes.
Answer to RQ 4
89.8% of the patches discovered by Precfix are in one of
the three categories: API modification, validation check,
and API wrapping.
5.4 User Study
With the template database, we ran patch recommendation of Prec-
fix on the whole internal codebase 1 , from which Precfix identified
30K defects from 7K projects in total and provided corresponding
patches. Precfix has been deployed in Alibaba for about one year
so far. Every week, it recommends about 400 patches to developers
on average, and receives about two to three false positive reports.
To answer RQ5, we randomly sampled 50 defects identified by
Precfix, and sent the recommended patches to the submitters
of the defective code snippets for their opinions. In the end, 12
developers responded to our requests, all of whom are maintainers
of the corresponding repositories. We interviewed these developers,
presented them the defects and patches with two questions:
Q1: “Is the patch suggested by Precfix valid? If so,
would you like to apply the patch?”
Q2: “Would you like to use Precfix during code review?”
For Q1, 10 of 12 developers answered “yes”. They confirmed that
the defects found by Precfix are true positives, and agreed to fix them using the suggested patch. For the other two developers who rejected the patches, one developer reported that the recommended patch provided by Precfix works for the local testing configuration but is not valid for the global configuration, as it induces problems when running together with other components. She commented that the patch was not appropriate for all possible contexts. In the other negative case, the developer had already used a null pointer check inside the called API, so she commented that it is not necessary to have another null pointer check outside the call as Precfix suggested. Overall, these comments from developers confirm the value of Precfix. In special cases, developers could judge the validity of the recommended patch and decide whether to adopt it.
For Q2, all of the 12 developers answered "yes". They agreed that Precfix would improve the effectiveness of code review, and
would like to see Precfix adopted on their projects. As an example,
one interviewed developer said “I often struggle finding defects
when reviewing code. It would be really helpful when having a
patch recommendation technique to assist me.”
Answer to RQ 5
The majority (10/12) of the interviewed developers acknowledged the value of the patches provided by Precfix, and all of them would like to see Precfix adopted in practice.

5.5 Discussion
In the experiment, we have tried multiple combinations of the weights of the Jaccard similarity coefficient and the Levenshtein distance. We noticed that although the Levenshtein distance captures token ordering information, it is highly sensitive to changes in ordering. As a result, we heuristically set the weight of the Jaccard similarity coefficient to 0.9 and the weight of the Levenshtein distance to 0.1.

5.6 Threats to Validity
Our experiments are subject to common threats to validity.
External. Subjects used in the sample dataset may not be representative of the entire internal codebase. We manually checked the selected subjects and confirmed that they cover more than 100 diverse application domains, which could help mitigate this threat. We only consider Java programs in our codebase as the current implementation of Precfix only handles Java, but the idea of Precfix is essentially independent of languages.
Internal. We rely on manual inspection to determine the correctness of bug-fix templates and the recommended patches, which may be subjective. To mitigate this threat, we established common criteria before the inspection, as described in Sec. 5.1, and then two authors inspected each case independently. A patch is determined to be incorrect only if they reached an agreement on it.

6 CONCLUSION
In this paper, we present Precfix, a patch recommendation technique designed for large-scale industrial codebases. For patch discovery, Precfix extracts defect-patch pairs from version controlled histories, clusters these pairs with common patterns, and extracts bug fix templates from clusters. For patch recommendation, Precfix relies on the feedback from developers during code review. Moreover, Precfix integrates custom patches created by developers to improve its template database. We ran Precfix's patch generation on a subset (10K projects) of the internal codebase at Alibaba and extracted 3K patches. Our manual inspection confirms that Precfix has a low false positive rate (22%) while achieving a high extractability score (93%). In addition, our user study on the usability of Precfix's patch recommendation functionality shows that the majority (10/12) of the interviewed developers appreciated Precfix and would like to see it widely adopted.
The current implementation of Precfix still has some incompleteness in terms of the types of defects and patches handled. In the future, we would like to lift these restrictions while maintaining the high accuracy of patch recommendation. We would also like to apply machine learning-based approaches to automate the patch template improvement as much as possible.

ACKNOWLEDGMENTS
This article is partially supported by the Singapore Ministry of Education Academic Research Fund Tier 1 (No. 2018-T1-002-069), National Natural Science Foundation of China (No. 61772200), and Alibaba Group.

1 Due to company policy, we could not report the total number of projects.

REFERENCES
[1] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the Accuracy of
Spectrum-Based Fault Localization. In Testing: Academic and Industrial Conference
Practice and Research Techniques-MUTATION. 89–98.
[2] Hiralal Agrawal, Joseph R Horgan, Saul London, and W Eric Wong. 1995. Fault Lo-
calization Using Execution Slices and Dataflow Tests. In International Symposium
on Software Reliability Engineering. 143–151.
[3] Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix:
Learning to Fix Bugs Automatically. arXiv:1902.06111 [cs.SE]
[4] Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associa-
tive Searching. Communications of the ACM 18, 9 (1975), 509–517.
[5] Tom Britton, Lisa Jeng, Graham Carver, Paul Cheak, and Tomer Katzenellenbogen.
2013. Reversible Debugging Software. Technical Report. Judge Business School,
University of Cambridge.
[6] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys
Poshyvanyk, and Martin Monperrus. 2018. SequenceR: Sequence-to-Sequence
Learning for End-to-End Program Repair. arXiv:1901.01808 [cs.SE]
[7] Valentin Dallmeier, Andreas Zeller, and Bertrand Meyer. 2009. Generating Fixes
From Object Behavior Anomalies. In International Conference on Automated
Software Engineering. 550–554.
[8] Belur V Dasarathy. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification
Techniques. IEEE Computer Society Tutorial (1991).
[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A Density-
Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
In International Conference on Knowledge Discovery and Data Mining. 226–231.
[10] Qing Gao, Yingfei Xiong, Yaqing Mi, Lu Zhang, Weikun Yang, Zhaoping Zhou,
Bing Xie, and Hong Mei. 2015. Safe Memory-Leak Fixing For C Programs. In
International Conference on Software Engineering. 459–470.
[11] Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software
Repair: A Survey. Transactions on Software Engineering 45, 1 (2019), 34–67.
[12] Git 2019. git-blame: Show What Revision and Author Last Modified Each Line of
a File. https://git-scm.com/docs/git-blame.
[13] Alibaba Group. 2019. Alibaba Cloud. https://www.alibabacloud.com.
[14] Alibaba Group. 2019. MaxCompute: Conduct Large-Scale Data Warehousing
with MaxCompute. https://www.alibabacloud.com/product/maxcompute.
[15] Daniel S. Hirschberg. 1977. Algorithms for the Longest Common Subsequence
Problem. Journal of the ACM 24, 4 (1977), 664–675.
[16] Dennis Jeffrey, Min Feng, Neelam Gupta, and Rajiv Gupta. 2009. BugFix: A
Learning-Based Tool to Assist Developers in Fixing Bugs. In International Con-
ference on Program Comprehension. 70–79.
[17] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen.
2018. Shaping Program Repair Space with Existing Patches and Similar Code. In
International Symposium on Software Testing and Analysis. 298–309.
[18] James A Jones and Mary Jean Harrold. 2005. Empirical Evaluation of the Tarantula
Automatic Fault-Localization Technique. In International Conference on Auto-
mated Software Engineering. 273–282.
[19] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database
of Existing Faults to Enable Controlled Testing Studies for Java Programs. In
International Symposium on Software Testing and Analysis. 437–440.
[20] Shalini Kaleeswaran, Varun Tulsian, Aditya Kanade, and Alessandro Orso. 2014.
Minthint: Automated Synthesis of Repair Hints. In International Conference on
Software Engineering. 266–276.
[21] Yalin Ke, Kathryn T Stolee, Claire Le Goues, and Yuriy Brun. 2015. Repairing
Programs with Semantic Code Search. In International Conference on Automated
Software Engineering. 295–306.
[22] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Auto-
matic Patch Generation Learned from Human-Written Patches. In International
Conference on Software Engineering. 802–811.
[23] Sunghun Kim, Thomas Zimmermann, E James Whitehead Jr, and Andreas Zeller.
2007. Predicting Faults from Cached History. In International Conference on
Software Engineering. 489–498.
[24] Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Martin Monperrus,
Jacques Klein, and Yves Le Traon. 2019. iFixR: Bug Report Driven Program Repair.
In Symposium on Foundations of Software Engineering. 314–325.
[25] Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions,
Insertions, and Reversals. Soviet physics doklady 10, 8 (1966), 707–710.
[26] Jian Li, Pinjia He, Jieming Zhu, and Michael R. Lyu. 2017. Software Defect
Prediction via Convolutional Neural Network. In International Conference on
Software Quality, Reliability and Security. 318–328.
[27] Yi Li, Chenguang Zhu, Milos Gligoric, Julia Rubin, and Marsha Chechik. 2019.
Precise Semantic History Slicing Through Dynamic Delta Refinement. Automated
Software Engineering 26, 4 (Dec 2019), 757–793.
[28] Yi Li, Chenguang Zhu, Julia Rubin, and Marsha Chechik. 2017. FHistorian:
Locating Features in Version Histories. In Proceedings of the 21st International
Systems and Software Product Line Conference - Volume A (Sevilla, Spain). ACM,
New York, NY, USA, 49–58.
[29] Yi Li, Chenguang Zhu, Julia Rubin, and Marsha Chechik. 2017. Semantic Slicing
of Software Version Histories. Transactions on Software Engineering 44, 2 (2017),
182–201.
[30] Haopeng Liu, Yuxi Chen, and Shan Lu. 2016. Understanding and Generating
High Quality Patches for Concurrency Bugs. In Symposium on Foundations of
Software Engineering. 715–726.
[31] Peng Liu, Omer Tripp, and Charles Zhang. 2014. Grail: Context-Aware Fixing
of Concurrency Bugs. In Symposium on Foundations of Software Engineering.
318–329.
[32] Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning
Correct Code. In Symposium on Principles of Programming Languages. 298–312.
[33] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting
Near-Duplicates for Web Crawling. In International Conference on World Wide
Web. 141–150.
[34] Martin Monperrus. 2018. Automatic Software Repair: A Bibliography. ACM
Computing Surveys 51, 1 (2018), 17:1–17:24.
[35] Seokhyeon Moon, Yunho Kim, Moonzoo Kim, and Shin Yoo. 2014. Ask the Mu-
tants: Mutating Faulty Programs for Fault Localization. In International Conference
on Software Testing, Verification and Validation. 153–162.
[36] Mathieu Nayrolles and Abdelwahab Hamou-Lhadj. 2018. CLEVER: Combining
Code Metrics with Clone Detection for Just-In-Time Fault Prevention and Reso-
lution in Large Industrial Projects. In International Conference on Mining Software
Repositories. 153–164.
[37] Helmut Neukirchen. 2016. Survey and Performance Evaluation of DBSCAN Spa-
tial Clustering Implementations for Big Data and High-Performance Computing
Paradigms. Technical Report. Engineering Research Institute, University of Ice-
land.
[38] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chan-
dra. 2013. Semfix: Program Repair via Semantic Analysis. In International Con-
ference on Software Engineering. 772–781.
[39] Suphakit Niwattanakul, Jatsada Singthongchai, Ekkachai Naenudorn, and Su-
pachanun Wanapu. 2013. Using of Jaccard Coefficient for Keywords Similarity.
In International Multiconference of Engineers and Computer Scientists. 380–384.
[40] Mike Papadakis and Yves Le Traon. 2015. Metallaxis-FL: Mutation-Based Fault
Localization. Software Testing, Verification and Reliability 25, 5-7 (2015), 605–628.
[41] Spencer Pearson, Jose Campos, Rene Just, Gordon Fraser, Rui Abreu, Michael D.
Ernst, Deric Pang, and Benjamin Keller. 2017. Evaluating and Improving Fault
Localization. In International Conference on Software Engineering. 609–620.
[42] Vidyasagar Potdar and Elizabeth Chang. 2004. Open Source and Closed Source
Software Development Methodologies. In International Conference on Software
Engineering. 105–109.
[43] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An Analysis of
Patch Plausibility and Correctness for Generate-And-Validate Patch Generation
Systems. In International Symposium on Software Testing and Analysis. 24–36.
[44] Foyzur Rahman, Daryl Posnett, Abram Hindle, Earl Barr, and Premkumar De-
vanbu. 2011. Bugcache for Inspections: Hit or Miss?. In Symposium on Foundations
of Software Engineering. 322–331.
[45] Manos Renieres and Steven P Reiss. 2003. Fault Localization with Nearest Neigh-
bor Queries. In International Conference on Automated Software Engineering.
30–39.
[46] Hesam Samimi, Max Schafer, Shay Artzi, Todd Millstein, Frank Tip, and Laurie
Hendren. 2012. Automated Repair of HTML Generation Errors in PHP Appli-
cations using String Constraint Solving. In International Conference on Software
Engineering. 277–287.
[47] Guido Schryen. 2009. A Comprehensive and Comparative Analysis of the Patch-
ing Behavior of Open Source and Closed Source Software Vendors. In International
Conference on IT Security Incident Management and IT Forensics. 153–168.
[48] Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller. 2005. When do
Changes Induce Fixes?. In International Conference on Mining Software Reposito-
ries. 1–5.
[49] Undo Software. 2014. Increasing Software Development Productivity with Re-
versible Debugging.
[50] Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu. 2014. Automatically Gener-
ated Patches as Debugging Aids: A Human Study. In Symposium on Foundations
of Software Engineering. 64–74.
[51] Shaowei Wang, David Lo, Lingxiao Jiang, Lucia, and Hoong Chuin Lau. 2011.
Search-Based Fault Localization. In International Conference on Automated Soft-
ware Engineering. 556–559.
[52] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009.
Automatically Finding Patches Using Genetic Programming. In International
Conference on Software Engineering. 364–374.
[53] Chu-Pan Wong, Yingfei Xiong, Hongyu Zhang, Dan Hao, Lu Zhang, and Hong
Mei. 2014. Boosting Bug-Report-Oriented Fault Localization with Segmentation
and Stack-Trace Analysis. In International Conference on Software Maintenance.
181–190.
[54] W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A
Survey on Software Fault Localization. Transactions on Software Engineering 42,
8 (2016), 707–740.
[55] Rongxin Wu, Hongyu Zhang, Shing-Chi Cheung, and Sunghun Kim. 2014.
CrashLocator: Locating Crashing Faults Based on Crash Stacks. In International
Symposium on Software Testing and Analysis. 204–214.
[56] He Ye, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2019. A
Comprehensive Study of Automatic Program Repair on the Quixbugs Benchmark.
In 2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF). IEEE, 1–10.
[57] Jooyong Yi, Shin Hwei Tan, Sergey Mechtaev, Marcel Böhme, and Abhik Roy-
choudhury. 2018. A Correlation Study between Automated Program Repair and
Test-Suite Metrics. In International Conference on Software Engineering. 24.
[58] Xiangyu Zhang, Neelam Gupta, and Rajiv Gupta. 2006. Locating faults through
automated predicate switching. In Proceedings of the 28th international conference
on Software engineering. ACM, 272–281.
[59] Jian Zhou, Hongyu Zhang, and David Lo. 2012. Where Should the Bugs be
Fixed? More Accurate Information Retrieval-Based Bug Localization Based on
Bug Reports. In International Conference on Software Engineering. 14–24.
[60] Daming Zou, Jingjing Liang, Yingfei Xiong, Michael D Ernst, and Lu Zhang.
2019. An Empirical Study of Fault Localization Families and Their Combinations.
Transactions on Software Engineering (2019), To appear.