Modern Code Review: A Case Study at Google

Modern Code Review: A Case Study at Google

Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko
Google, Inc.
{supertri,emso,lukechurch,sipkom}@google.com

Alberto Bacchelli
University of Zurich
bacchelli@ifi.uzh.ch

ABSTRACT

Employing lightweight, tool-based code review of code changes (aka modern code review) has become the norm for a wide variety of open-source and industrial systems. In this paper, we make an exploratory investigation of modern code review at Google. Google introduced code review early on and evolved it over the years; our study sheds light on why Google introduced this practice and analyzes its current status, after the process has been refined through decades of code changes and millions of code reviews. By means of 12 interviews, a survey with 44 respondents, and the analysis of review logs for 9 million reviewed changes, we investigate motivations behind code review at Google, current practices, and developers' satisfaction and challenges.

CCS CONCEPTS
• Software and its engineering → Software maintenance tools;

ACM Reference format:
Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko and Alberto Bacchelli. 2018. Modern Code Review: A Case Study at Google. In Proceedings of 40th International Conference on Software Engineering: Software Engineering in Practice Track, Gothenburg, Sweden, May 27-June 3, 2018 (ICSE-SEIP '18), 10 pages. DOI: 10.1145/3183519.3183525

1 INTRODUCTION

Peer code review, a manual inspection of source code by developers other than the author, is recognized as a valuable tool for improving the quality of software projects [2, 3]. In 1976, Fagan formalized a highly structured process for code reviewing—code inspections [16]. Over the years, researchers provided evidence on the benefits of code inspection, especially for defect finding, but the cumbersome, time-consuming, and synchronous nature of this approach hindered its universal adoption in practice [37]. Nowadays, most organizations adopt more lightweight code review practices to limit the inefficiencies of inspections [33]. Modern code review is (1) informal (in contrast to Fagan-style), (2) tool-based [32], (3) asynchronous, and (4) focused on reviewing code changes.

An open research challenge is understanding which practices represent valuable and effective methods of review in this novel context. Rigby and Bird quantitatively analyzed code review data from software projects spanning varying domains as well as organizations and found five strongly convergent aspects [33], which they conjectured can be prescriptive to other projects. The analysis of Rigby and Bird is based on the value of a broad perspective (that analyzes multiple projects from different contexts). For the development of an empirical body of knowledge, championed by Basili [7], it is essential to also consider a focused and longitudinal perspective that analyzes a single case.
This paper expands on work by Rigby and Bird to focus on the review practices and characteristics established at Google, i.e., a company with a multi-decade history of code review and a high volume of daily reviews to learn from. This paper can be (1) prescriptive to practitioners performing code review and (2) compelling for researchers who want to understand and support this novel process.

Code review has been a required part of software development at Google since very early on in the company's history; because it was introduced so early on, it has become a core part of Google culture. The process and tooling for code review at Google have been iteratively refined for more than a decade and are applied by more than 25,000 developers making more than 20,000 source code changes each workday, in dozens of offices around the world [30].

We conduct our analysis in the form of an exploratory investigation focusing on three aspects of code review, in line with and expanding on the work by Rigby and Bird [33]: (1) The motivations driving code review, (2) the current practices, and (3) the perception of developers on code review, focusing on challenges encountered with a specific review (breakdowns in the review process) and satisfaction. Our research method combines input from multiple data sources: 12 semi-structured interviews with Google developers, an internal survey sent to engineers who recently sent changes to review with 44 responses, and log data from Google's code review tool pertaining to 9 million reviews over two years.

We find that the process at Google is markedly lighter weight than in other contexts, based on a single reviewer, quick iterations, small changes, and a tight integration with the code review tool. Breakdowns still exist, however, due to the complexity of the interactions that occur around code review. Nevertheless, developers consider this process valuable, confirm that it works well at scale, and conduct it for several reasons that also depend on the relationship between author and reviewers. Finally, we find evidence on the use of the code review tool beyond collaborative review and corroboration for the importance of code review as an educational tool.
2 BACKGROUND AND RELATED WORK

We describe the review processes investigated in the literature, then we detail convergent code review practices across these processes [33].

2.1 Code Review Processes and Contexts

Code Inspections. Software inspections are one of the first formalized processes for code review. This highly structured process involves planning, overview, preparation, inspection meeting, reworking, and follow-up [16]. The goal of code inspections is to find defects during a synchronized inspection meeting, with authors and reviewers sitting in the same room to examine code changes. Kollanus and Koskinen compiled the most recent literature survey on code inspection research [25]. They found that the vast majority of studies on code inspections are empirical in nature. There is a consensus about the overall value of code inspection as a defect finding technique and the value of reading techniques to engage the inspectors. Overall, research on code inspection has declined since 2005, in line with the spread of the internet and the growth of asynchronous code review processes.

Asynchronous review via email. Until the late 2000s, most large OSS projects adopted a form of remote, asynchronous reviews, relying on patches sent to communication channels such as mailing lists and issue tracking systems. In this setting, project members evaluate contributed patches and ask for modifications through these channels. When a patch is deemed of high enough quality, core developers commit it to the codebase. Trusted committers may have a commit-then-review process instead of doing pre-commit reviews [33]. Rigby et al. were among the first to do extensive work in this setting; they found that this type of review "has little in common [with code inspections] beyond a belief that peers will effectively find software defects" [34]. Kononenko et al. analyzed the same setting and found that review response time and acceptance are related to social factors, such as reviewer load and change author experience [26], which were not present in code inspections.

Tool-based review. To bring structure to the process of reviewing patches, several tools emerged in OSS and industrial settings. These tools support the logistics of the review process: (1) The author of a patch submits it to the code review tool, (2) the reviewers can see the diff of the proposed code change and (3) can start a threaded discussion on specific lines with the author and other reviewers, and then (4) the author can propose modifications to address reviewers' comments. This feedback cycle continues until everybody is satisfied or the patch is discarded. Different projects adapted their tools to support their process. Microsoft uses CodeFlow, which tracks the state of each person (author or reviewer) and where they are in the process (signed off, waiting, reviewing); CodeFlow does not prevent authors from submitting changes without approval [33] and supports chats in comment threads [4].
Google's Chromium project (along with several other OSS projects) relies on the externally-available Gerrit [17]; in Chromium, changes are only merged into the master branch after explicit approval from reviewers and automated verification that the change does not break the build [12]. In Gerrit, unassigned reviewers can also make comments. VMware developed the open-source ReviewBoard, which integrates static analysis into the review process; this integration relies on change authors manually requesting analysis and has been shown to improve code review quality [5]. Facebook's code review system, Phabricator [29], allows reviewers to "take over" a change and commit it themselves and provides hooks for automatic static analysis or continuous build/test integration.

In the context of tool-based reviews, researchers have investigated the relationship between code change acceptance or response time and features of the changed code and authors [41], as well as the agreement among reviewers [22]. Qualitative investigations have also been conducted to define what constitutes a good code review according to industrial [11] and OSS developers [26].

Pull-based development model. In the GitHub pull request process [18], a developer wanting to make a change forks an existing git repository and then makes changes in their fork. After a pull request has been sent out, it appears in the list of pull requests for the project in question, visible to anyone who can see the project. Gousios et al. qualitatively investigated work practices and challenges of pull-request integrators [21] and contributors [20], finding analogies to other tool-based code reviews.

2.2 Convergent Practices in Code Review

Rigby and Bird presented the first and most significant work that tries to identify convergent practices across several code review processes and contexts [33]. They considered OSS projects that use email-based reviews, OSS projects that use Gerrit, an AMD project that uses a basic code review tool, and Microsoft with CodeFlow. They analyzed the process and the data available from these projects to describe several angles, such as iterative development, reviewer selection, and review discussions. They identified five practices of modern code review to which all the considered projects converged (Table 1). We will refer to these practices using their id, e.g., CP1. Substantially, they found an agreement in terms of a quick, lightweight process (CP1, CP2, CP3) with few people involved (CP4) who conduct group problem solving (CP5).

Table 1: Convergent review practices [33].

    id    Convergent Practice
    CP1   Contemporary peer review follows a lightweight, flexible process
    CP2   Reviews happen early (before a change is committed), quickly, and frequently
    CP3   Change sizes are small
    CP4   Two reviewers find an optimal number of defects
    CP5   Review has changed from a defect finding activity to a group problem solving activity
3 METHODOLOGY

This section describes our research questions and settings; it also outlines our research method and its limitations.

3.1 Research Questions

The overall goal of this research is to investigate modern code review at Google, which is a process that involves thousands of developers and that has been refined over more than a decade. To this aim, we conduct an exploratory investigation that we structure around three main research questions.

RQ1: What are the motivations for code review at Google? Rigby and Bird found the motivations to be one of the converging traits of modern code review (CP5). Here we study what motivations and expectations drive code reviews at Google. In particular, we consider both the historical reasons for introducing modern code review (since Google is one of the first companies that used modern code review) and the current expectations.

RQ2: What is the practice of code review at Google? The other four convergent practices found by Rigby and Bird regard how the process itself is executed, in terms of flow (CP1), speed and frequency (CP2), size of the analyzed changes (CP3), and number of reviewers (CP4). We analyze these aspects at Google to investigate whether the same findings hold for a company that has a longer history of code review, an explicit culture, and larger volume of reviews compared to those analyzed in previous studies [4, 33].

RQ3: How do Google developers perceive code review? Finally, in our last research question, we are interested in understanding how Google developers perceive modern code review as implemented in their company. This exploration is needed to better understand practices (since perceptions drive behavior [39]) and to guide future research. We focus on two aspects: breakdowns of the review process developers experience during specific reviews and whether developers are satisfied with review despite these challenges.

3.2 Research Setting

We briefly describe our research setting for context on our methodology. See Section 5.1 for a comprehensive description of Google's code review process and tooling.

Most software development at Google occurs in a monolithic source repository, accessed via an internal version control system [30]. Since code review is required at Google, every commit to Google's source control system goes first through code review using Critique: an internally developed, centralized, web-based code review tool. The development workflow in Google's monolithic repository, including the code review process, is very uniform. As with other tools described in Section 2, Critique allows reviewers to see a diff of the proposed code change and start a threaded discussion on specific lines with the author and other reviewers. Critique offers extensive logging functionalities; all developer interactions with the tool are captured (including opening the tool, viewing a diff, making comments, and approving changes).

3.3 Research Method

To answer our research questions, we follow a mixed qualitative and quantitative approach [13], which combines data from several sources: semi-structured interviews with employees involved in software development at Google, logs from the code review tool, and a survey to other employees.
We use the interviews as a tool to collect data on the diversity (as opposed to frequencies [23]) of the motivations for conducting code review (RQ1) and to elicit developers' perceptions of code review and its challenges (RQ3). We use Critique logs to quantify and describe the current review practices (RQ2). Finally, we use the survey to confirm the diverse motivations for code review that emerged from the interviews (RQ1) and elicit the developers' satisfaction with the process.

Interviews. We conducted a series of face-to-face semi-structured [13] interviews with selected Google employees, each taking approximately 1 hour. The initial pool of possible participants was selected using snowball sampling, starting with developers known to the paper authors. From this pool, participants were selected to ensure a spread of teams, technical areas, job roles, length of time within the company, and role in the code review process. The interview script (in appendix [1]) included questions about perceived motivations for code review, a recently reviewed/authored change, and best/worst review experiences. Before each interview we reviewed the participant's review history and located a change to discuss in the interview; we selected these changes based on the number of interactions, the number of people involved in the conversation, and whether there were comments that seemed surprising. In an observation part of the interview, the participant was asked to think aloud while reviewing the pending change and to provide some explicit information, such as the entry point for starting the review. The interviews continued until saturation [19] was achieved and interviews were bringing up broadly similar concepts. Overall, we conducted 12 interviews with staff who had been working at Google from 1 month to 10 years (median 5 years), in Software Engineering and Site Reliability Engineering. They included technical leads, managers, and individual contributors. Each interview involved three to four people: the participant and 2-3 interviewers (two of whom are authors of this paper). Interviews were live-transcribed by one of the interviewers while another one asked questions.

Open Coding on Interview Data. To identify the broad themes emerging from the interview data we performed an open coding pass [27]. Interview transcripts were discussed by two authors to establish common themes, then converted into a coding scheme. An additional author then performed a closed coding of the notes of the discussions to validate the themes. We iterated this process over one of the interviews until we had agreement on the scheme. We also tracked in what context (relationship between reviewer and author) these themes were mentioned. The combination of the design of the questions and the analysis process means that we can discuss stable themes in the results, but cannot meaningfully discuss relative frequencies of occurrence [23].
Analysis of Review Data. We analyzed quantitative data about the code review process by using the logs produced by Critique. We mainly focus on metrics related to convergent practices found by Rigby and Bird [33]. To allow for comparison, we do not consider changes without any reviewers, as we are interested in changes that have gone through an explicit code review process. We consider a "reviewer" to be any user who approves a code change, regardless of whether they were explicitly asked for a review by the change author. We use a name-based heuristic to filter out changes made by automated processes. We focus exclusively on changes that occur in the main codebase at Google. We also exclude changes not yet committed at the time of study and those for which our diff tool reports a delta of zero source lines changed, e.g., a change that only modifies binary files. On an average workday at Google, about 20,000 changes are committed that meet the filter criteria described above. Our final dataset includes the approximately 9 million changes created by more than 25,000 authors and reviewers from January 2014 until July 2016 that meet these criteria, and about 13 million comments collected from all changes between September 2014 and July 2016.
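To make these selection criteria concrete, the following minimal Python sketch shows how such a filter over change records could look; the record fields and the robot-name heuristic are illustrative assumptions, not Critique's actual schema or pipeline:

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical change record; the field names are illustrative, not Critique's schema.
    @dataclass
    class Change:
        author: str
        approvers: List[str] = field(default_factory=list)  # users who approved ("reviewers")
        committed: bool = False
        in_main_codebase: bool = True
        source_lines_changed: int = 0  # delta reported by the diff tool

    # Assumed name-based heuristic for automated authors.
    ROBOT_NAME_MARKERS = ("-robot", "-bot", "automation")

    def is_robot(user: str) -> bool:
        return any(marker in user for marker in ROBOT_NAME_MARKERS)

    def keep_for_analysis(change: Change) -> bool:
        """Apply the stated criteria: committed, in the main codebase, non-empty
        source delta, not robot-authored, and approved by at least one reviewer."""
        return (
            change.committed
            and change.in_main_codebase
            and change.source_lines_changed > 0
            and not is_robot(change.author)
            and len(change.approvers) > 0
        )

    changes = [Change("alice", approvers=["bob"], committed=True, source_lines_changed=24)]
    dataset = [c for c in changes if keep_for_analysis(c)]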
Survey. We created an online questionnaire that we sent to 98 engineers who recently submitted a code change. The code change had already been reviewed, so we customized the questionnaire to ask respondents about how they perceived the code review for their specific recent change; this strategy allowed us to mitigate recall bias [35], yet collect comprehensive data. The survey consisted of three Likert scale questions on the value of the received review, one multiple choice on the effects of the review on their change (based on the expectations that emerged from the interviews) with an optional 'other' response, and one open-ended question eliciting the respondent's opinion on the received review, the code review tool, and/or the process in general. We received 44 valid responses to our survey (45% response rate, which is considered high for software engineering research [31]).

3.4 Threats to Validity and Limitations

We describe threats to validity and limitations of the results of our work, as posed by our research method, and the actions that we took to mitigate them.

Internal validity - Credibility. Concerning the quantitative analysis of review data, we use heuristics to filter out robot-authored changes from our quantitative analysis, but these heuristics may allow some robot-authored changes; we mitigated this as we only include robot-authored changes that have a human reviewer. Concerning the qualitative investigation, we used open coding to analyze the interviewees' answers. This coding could have been influenced by the experience and motivations of the authors conducting it, although we tried to mitigate this bias by involving multiple coders. The employees that decided to participate in our interviews and the survey freely decided to do so, thus introducing the risk of self-selection bias. For this reason, results may have been different for developers who would not choose to participate; to try to mitigate this issue, we combine information from both interviews and survey. Moreover, we used a snowball sampling method to identify engineers to interview; this carries a risk of sampling bias. Although we attempted to mitigate this risk by interviewing developers with a range of job roles and responsibilities, there may be other factors the developers we interviewed share that would not apply across the company. To mitigate moderator acceptance bias, the researchers involved in the qualitative data collection were not part of the Critique team. Social desirability bias may have influenced the answers to align more favorably to Google culture; however, at Google people are encouraged to criticize and improve broken workflows when discovered, thus reducing this bias. Finally, we did not interview research scientists or developers that interact with specialist reviewers (such as for security reviews), thus our results are biased towards general developers.

Generalizability - Transferability. We designed our study with the stated aim of understanding modern code review within a specific company. For this reason, our results may not generalize to other contexts; rather, we are interested in the diversity of the practices and breakdowns that are still occurring after years and millions of reviews of refinement. Given the similarity of the underlying code review mechanism across several companies and OSS projects, it is reasonable to think that, should a review process reach the same level of maturity and use comparable tooling, developers would have similar experiences.

4 RESULTS: MOTIVATIONS

In our first research question, we seek to understand motivations and expectations of developers when conducting code review at Google, starting by investigating what led to the introduction of this process in the first place.

4.1 How it All Started

Code review at Google was introduced early on by one of the first employees; the first author of this paper interviewed this employee (referred to as E in the following) to better understand the initial motivation of code review and its evolution. E explained that the main impetus behind the introduction of code review was to force developers to write code that other developers could understand; this was deemed important since code must act as a teacher for future developers. E stated that the introduction of code review at Google signaled the transition from a research codebase, which is optimized towards quick prototyping, to a production codebase, where it is critical to think about future engineers reading source code. Code review was also perceived as capable of ensuring that more than one person would be familiar with each piece of code, thus increasing the chances of knowledge staying within the company. E reiterated the point that, although it is great if reviewers find bugs, the foremost reason for introducing code review at Google was to improve code understandability and
Code Review Flow. The flow of a review is tightly coupled with Critique and works as follows:

1. Creating: Authors start modifying, adding, or deleting some code; once ready, they create a change.
2. Previewing: Authors then use Critique to view the diff of the change and view the results of automatic code analyzers (e.g., from Tricorder [36]). When they are ready, authors mail the change out to one or more reviewers.
3. Commenting: Reviewers view the diff in the web UI, drafting comments as they go. Program analysis results, if present, are also visible to reviewers. Unresolved comments represent action items for the change author to definitely address. Resolved comments include optional or informational comments that may not require any action by a change author.
4. Addressing Feedback: The author now addresses comments, either by updating the change or by replying to comments. When a change has been updated, the author uploads a new snapshot. The author and reviewers can look at diffs between any pairs of snapshots to see what changed.
5. Approving: Once all comments have been addressed, the reviewers approve the change and mark it 'LGTM' (Looks Good To Me). To eventually commit a change, a developer typically must have approval from at least one reviewer. Usually, only one reviewer is required to satisfy the aforementioned requirements of ownership and readability.

We attempt to quantify how "lightweight" reviews are (CP1). We measure how much back-and-forth there is in a review by examining how many times a change author mails out a set of comments that resolve previously unresolved comments. We make the assumption that one iteration corresponds to one instance of the author resolving some comments; zero iterations means that the author can commit immediately. We find that over 80% of all changes involve at most one iteration of resolving comments.

Suggesting Reviewers. To identify the best person to review a change, Critique relies on a tool that analyzes the change and suggests possible reviewers. This tool identifies the smallest set of reviewers needed to fulfill the review requirements for all files in a change. Note that often only one reviewer is required, since changes are often authored by someone with ownership and/or readability for the files in question. This tool prioritizes reviewers that have recently edited and/or reviewed the included files. New team members are explicitly added as reviewers since they have not yet built up reviewing/editing history. Unassigned reviewers can also make comments on (and can potentially approve) changes. Tool support for finding reviewers is typically only necessary for changes to files beyond those for a particular team. Within a team, developers know who to send changes to. For changes that could be sent to anyone on the team, many teams use a system that assigns reviews sent to the team email address to configured team members in a round-robin manner, taking into account review load and vacations.
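As a rough illustration of the selection logic described above (a small set of reviewers covering the per-file requirements, preferring people with recent activity on the touched files), here is a greedy sketch in Python. The data model and scoring are assumptions for illustration, not Critique's actual implementation, and a greedy cover only approximates the smallest set:

    from typing import Dict, List, Set

    def suggest_reviewers(
        eligible_approvers: Dict[str, Set[str]],  # file -> users who satisfy its review requirements
        recent_activity: Dict[str, float],        # user -> recency score for edits/reviews of these files
    ) -> List[str]:
        """Greedily pick reviewers until every file is covered, preferring users who
        cover the most uncovered files and, as a tie-breaker, have recent activity."""
        uncovered = set(eligible_approvers)
        chosen: List[str] = []
        while uncovered:
            candidates = {u for f in uncovered for u in eligible_approvers[f]}
            if not candidates:
                break  # no user can cover the remaining files
            best = max(
                candidates,
                key=lambda u: (
                    sum(1 for f in uncovered if u in eligible_approvers[f]),
                    recent_activity.get(u, 0.0),
                ),
            )
            chosen.append(best)
            uncovered = {f for f in uncovered if best not in eligible_approvers[f]}
        return chosen

    # Often one owner covers every file in the change, so a single reviewer suffices:
    print(suggest_reviewers(
        {"app/foo.cc": {"ann", "bo"}, "app/bar.cc": {"ann"}},
        {"ann": 0.9, "bo": 0.4},
    ))  # -> ['ann']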
Code Analysis Results. Critique shows code analysis results as comments along with human ones (although in different colors). Analyzers (or reviewers) can provide suggested edits, which can be both proposed and applied to the change via Critique. To vet changes before they are committed, development at Google also includes pre-commit hooks: checks where a failure requires an explicit override by the developer to enable a commit. Pre-commit checks include things like basic automated style checking and running automated test suites related to a change. The results of all pre-commit checks are visible in the code review tool. Typically, pre-commit checks are automatically triggered. These checks are configurable such that teams can enforce project-specific invariants and automatically add email lists to changes to raise awareness and transparency. In addition to pre-commit results, Critique displays the result of a variety of automated code analyses through Tricorder [36] that may not block committing a change. Analysis results encompass simple style checks, more complicated compiler-based analysis passes, and also project-specific checks. Currently, Tricorder includes 110 analyzers, 5 of which are plugin systems for hundreds of additional checks, in total analyzing more than 30 languages.

Finding 3. The Google code review process is aligned with the convergent practice of being lightweight and flexible. In contrast to other studied systems, however, ownership and readability are explicit and play a key role. The review tool includes reviewer recommendation and code analysis results.

5.2 Quantifying The Review Process

We replicate the quantitative analysis that Rigby and Bird conducted that led to the discovery of CP2-CP4, so as to compare these practices to the traits that Google has converged to.

Review frequency and speed. Rigby and Bird found that fast-paced, iterative development also applies to modern code review: In their projects, developers work in remarkably consistent short intervals. To find this, they analyzed review frequency and speed. At Google, in terms of frequency, we find that the median developer authors about 3 changes a week, and 80 percent of authors make fewer than 7 changes a week. Similarly, the median for changes reviewed by developers per week is 4, and 80 percent of reviewers review fewer than 10 changes a week. In terms of speed, we find that developers have to wait for initial feedback on their change a median time of under an hour for small changes and about 5 hours for very large changes. The overall (all code sizes) median latency for the entire review process is under 4 hours. This is significantly lower than the median time to approval reported by Rigby and Bird [33], which is 17.5 hours for AMD, 15.7 hours for Chrome OS, and 14.7, 19.8, and 18.9 hours for the three Microsoft projects. Another study found the median time to approval at Microsoft to be 24 hours [14].
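For illustration, the latency numbers above reduce to medians over per-change timestamps; a minimal sketch, with assumed timestamp fields, could look like this:

    from statistics import median
    from typing import List, Tuple

    # Hypothetical per-change timestamps, in hours: (mailed_out, first_reviewer_feedback, approved).
    ChangeTimes = Tuple[float, float, float]

    def review_latencies(changes: List[ChangeTimes]) -> Tuple[float, float]:
        """Median hours until first reviewer feedback, and median hours for the whole review."""
        first_feedback = median(first - mailed for mailed, first, _ in changes)
        overall = median(approved - mailed for mailed, _, approved in changes)
        return first_feedback, overall

    print(review_latencies([(0.0, 0.5, 2.0), (0.0, 1.0, 3.5), (0.0, 0.8, 4.0)]))  # -> (0.8, 3.5)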
Review size. Rigby and Bird argued that quick review time could only be enabled by smaller changes to review and subsequently analyzed review sizes. At Google, over 35% of the changes under consideration modify only a single file and about 90% modify fewer than 10 files. Over 10% of changes modify only a single line of code, and the median number of lines modified is 24. The median change size is significantly lower than reported by Rigby and Bird for companies such as AMD (44 lines), Lucent (263 lines), and Bing, Office, and SQL Server at Microsoft (somewhere between those boundaries), but in line with change sizes in open source projects [33].

Number of reviewers and comments. The optimal number of reviewers has been controversial among researchers, even in the deeply investigated code inspections [37]. Rigby and Bird investigated whether the considered projects converged to a similar number of involved reviewers. They found this number to be two, remarkably regardless of whether the reviewers were explicitly invited (such as in Microsoft projects, who invited a median of up to 4 reviewers) or whether the change was openly broadcasted for review [33]. At Google, by contrast, fewer than 25% of changes have more than one reviewer (30% of changes have comments by more than one commenter, meaning that about 5% of changes have additional comments from someone who did not act as an approver for the change), and over 99% have at most five reviewers, with a median reviewer count of 1. Larger changes tend to have more reviewers on average. However, even very large changes on average require fewer than two reviewers.

Rigby and Bird also found a "minimal increase in the number of comments about the change when more [than 2] reviewers were active" [33], and concluded that two reviewers find an optimal number of defects. At Google, the situation is different: A greater number of reviewers results in a greater average number of comments on a change. Moreover, the average number of comments per change grows with the number of lines changed, reaching a peak of 12.5 comments per change for changes of about 1250 lines. Changes larger than this often contain auto-generated code or large deletions, resulting in a lower average number of comments.

Finding 4. Code review at Google has converged to a process with markedly quicker reviews and smaller changes, compared to the other projects previously investigated. Moreover, one reviewer is often deemed as sufficient, compared to two in the other projects.

6 RESULTS: DEVELOPERS' PERCEPTIONS

Our last research question investigates challenges and satisfaction of developers with code review at Google.

6.1 Code Review Breakdowns at Google

Previous studies investigated challenges in the review process overall [4, 26] and provided compelling evidence, also corroborated by our experience as engineers, that understanding code to review is a major hurdle. To broaden our empirical body of knowledge, we focus here on challenges encountered in specific reviews ("breakdowns"), such as delays or disagreements.

Five main themes emerged from the analysis of our interview data. The first four themes regard breakdowns concerning aspects of the process:

Distance: Interviewees perceive distance in code review from two perspectives: geographical (i.e., the physical distance between the author and reviewers) and organizational (e.g., between different teams or different roles). Both these types of distance are perceived as the cause of delays in the review process or as leading to misunderstandings.
Social interactions: Interviewees perceive the communication within code review as possibly leading to problems from two aspects: tone and power. Tone refers to the fact that sometimes authors are sensitive to review comments; sentiment analysis on comments has provided evidence that comments with negative tone are less likely to be useful [11]. Power refers to using the code review process to induce another person to change their behavior, for example, by dragging out reviews or withholding approvals. Issues with tone or power in reviews can make developers uncomfortable or frustrated with the review process.

Review subject: The interviews referenced disagreements as to whether code review was the most suitable context for reviewing certain aspects, particularly design reviews. This resulted in mismatched expectations (e.g., some teams wanted most of the design to be complete before the first review, others wanted to discuss design in the review), which could cause friction among participants and within the process.

Context: Interviewees allowed us to see that misunderstandings can arise based on not knowing what gave rise to the change; for example, whether the rationale for a change was an urgent fix to a production problem or a "nice to have" improvement. The resulting mismatch in expectations can result in delays or frustration.

The last theme regarded the tool itself:

Customization: Some teams have different requirements for code review, e.g., concerning how many reviewers are required. This is a technical breakdown, since arbitrary customizations are not always supported in Critique, and may cause misunderstandings around these policies. Based on feedback, Critique recently released a new feature that allows change authors to require that all reviewers sign off.

6.2 Satisfaction and Time Invested

To understand the significance of the identified concerns, we used part of the survey to investigate whether code review is overall perceived as valuable. We find (Table 2) that code review is broadly seen as being valuable and efficient within Google: all respondents agreed with the statement that code review is valuable. This sentiment is echoed by internal satisfaction surveys we conducted on Critique: 97% of developers are satisfied with it. Within the context of a specific change, the sentiment is more varied. The least satisfied replies were associated with changes that were very small (1 word or 2 lines) or with changes which were necessary to achieve some other goal, e.g., triggering a process from a change in the source code.
Table 2: User satisfaction survey results (counts per answer on a five-point scale).

    "For this change, the review process was a good use of my time":
        Strongly disagree 2 | 4 | 14 | 11 | Strongly agree 13
    "Overall, I find code review at Google valuable":
        Strongly disagree 0 | 0 | 0 | 14 | Strongly agree 30
    "For this change, the amount of feedback was":
        Too little 2 | 2 | 34 | 5 | Too much 0

However, the majority of respondents felt that the amount of feedback for their change was appropriate. 8 respondents described the comments as not being helpful; of these, 3 provided more detail, stating that the changes under review were small configuration changes for which code review had less impact. Only 2 respondents said the comments had found a bug.

To contextualize the answers in terms of satisfaction, we also investigate the time developers spend reviewing code. To accurately quantify the reviewer time spent, we tracked developer interactions with Critique (e.g., opened a tab, viewed a diff, commented, approved a change) as well as other tools to estimate how long developers spend reviewing code per week. We group sequences of developer interactions into blocks of time, considering a "review session" to be a sequence of interactions related to the same uncommitted change, by a developer other than the change author, with no more than 10 minutes between each successive interaction. We sum up the total number of hours spent across all review sessions in the five weeks that started in October 2016 and then calculate the average per user per week, filtering out users for whom we do not have data for all 5 weeks. We find that developers spend an average of 3.2 hours a week (median 2.6 hours a week) reviewing changes. This is low compared to the 6.4 hours/week of self-reported time for OSS projects [10].
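The session grouping described above is essentially idle-timeout sessionization over interaction logs. The following small Python sketch illustrates the idea; the log record fields are assumptions, and it omits details such as restricting to uncommitted changes and the per-week averaging:

    from collections import defaultdict
    from typing import Dict, List, NamedTuple, Tuple

    IDLE_GAP_SECONDS = 10 * 60  # a gap of more than 10 minutes starts a new session

    class Interaction(NamedTuple):  # hypothetical log record
        user: str
        change_id: str
        change_author: str
        timestamp: float  # seconds since epoch

    def review_hours_per_user(events: List[Interaction]) -> Dict[str, float]:
        """Group each (reviewer, change) interaction stream into sessions separated by
        >10-minute gaps and sum session durations per user, in hours. Single-interaction
        sessions contribute zero time in this simplification."""
        streams: Dict[Tuple[str, str], List[float]] = defaultdict(list)
        for e in sorted(events, key=lambda e: e.timestamp):
            if e.user != e.change_author:  # only time spent reviewing someone else's change
                streams[(e.user, e.change_id)].append(e.timestamp)
        hours: Dict[str, float] = defaultdict(float)
        for (user, _), times in streams.items():
            session_start = prev = times[0]
            for t in times[1:]:
                if t - prev > IDLE_GAP_SECONDS:  # close the current session
                    hours[user] += (prev - session_start) / 3600.0
                    session_start = t
                prev = t
            hours[user] += (prev - session_start) / 3600.0
        return dict(hours)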
Finding 5. Despite years of refinement, code review at Google still faces breakdowns. These are mostly linked to the complexity of the interactions that occur around the reviews. Yet, code review is strongly considered a valuable process by developers, who spend around 3 hours a week reviewing.

7 DISCUSSION

We discuss the themes that have emerged from this investigation, which can inspire practitioners in the setting up of their code review process and researchers in future investigations.

7.1 A Truly Lightweight Process

Modern code review was born as a lighter weight alternative to the cumbersome code inspections [4]; indeed, Rigby and Bird confirmed this trait (CP1) in their investigation across systems [33]. At Google, code review has converged to an even more lightweight process, which developers find both valuable and a good use of their time.

Median review times at Google are much shorter than in other projects [14, 33, 34]. We postulate that these differences are due to the culture of Google on code review (strict reviewing standards and expectations around quick turnaround times for review). Moreover, there is a significant difference with reviewer counts (one at Google vs. two in most other projects); we posit that having one reviewer helps make reviews fast and lightweight. Both low review times and reviewer counts may result from code review being a required part of the developer workflow; they can also result from small changes. The median change size in OSS projects varies from 11 to 32 lines changed [34], depending on the project. At companies, this change size is typically larger [33], sometimes as high as 263 lines. We find that change size at Google more closely matches OSS: most changes are small.

The size distribution of changes is an important factor in the quality of the code review process. Previous studies have found that the number of useful comments decreases [11, 14] and the review latency increases [8, 24] as the size of the change increases. Size also influences developers' perception of the code review process; a survey of Mozilla contributors found that developers feel that size-related factors have the greatest effect on review latency [26]. A correlation between change size and review quality is acknowledged by Google, and developers are strongly encouraged to make small, incremental changes (with the exception of large deletions and automated refactoring). These findings and our study support the value of reviewing small changes and the need for research and tools to help developers create such small, self-contained code changes for review [6, 15].

7.2 Software Engineering Research in Practice

Some of the practices that are part of code review at Google are aligned with practices proposed in software engineering research. For example, a study of code ownership at Microsoft found that changes made by minor contributors should be received with more scrutiny to improve code quality [9]; we found that this concept is enforced at Google through the requirement of an approval from an owner. Also, previous research has shown that typically one reviewer for a change will take on the task of checking whether code matches conventions [4]; readability makes this process more explicit. In the following, we focus on the features of Critique that make it a 'next generation code review tool' [26].
Reviewer recommendation. Researchers found that reviewers with prior knowledge of the code under review give more useful comments [4, 11], thus tools can add support for reviewer selection [11, 26]. We have seen that reviewer recommendation is tool-supported, prioritizing those who recently edited/reviewed the files under review. This corroborates recent research that frequent reviewers make large contributions to the evolution of modules and should be included along with frequent editors [40]. Detecting the right reviewer does not seem problematic in practice at Google; in fact, the model of recommendation implemented is straightforward, since it can programmatically identify owners. This is in contrast with other proposed tools that identify reviewers who have reviewed files with similar names [43] or take into account such features as the number of comments included in reviews [42]. At Google, there is also a focus on dealing with reviewers' workload and temporary absences (in line with a study at Microsoft [4]).

Static Analysis integration. A qualitative study of 88 Mozilla developers [26] found that static analysis integration was the most commonly-requested feature for code review. Automated analyses allow reviewers to focus on the understandability and maintainability of changes, instead of getting distracted by trivial comments (e.g., about formatting). Our investigation at Google showed us the practical implications of having static analysis integration in a code review tool. Critique integrates feedback channels for analysis writers [36]: Reviewers have the option to click "Please fix" on an analysis-generated comment as a signal that the author should fix the issue, and either authors or reviewers can click "Not useful" in order to flag an analysis result that is not helpful in the review process. Analyzers with high "Not useful" click rates are fixed or disabled. We found that this feedback loop is critical for maintaining developer trust in the analysis results.

A Review Tool Beyond Collaborative Review. Finally, we found strong evidence that Critique's uses extend beyond reviewing code. Change authors use Critique to examine diffs and browse analysis tool results. In some cases, code review is part of the development process of a change: a reviewer may send out an unfinished change in order to decide how to finish the implementation. Moreover, developers also use Critique to examine the history of submitted changes long after those changes have been approved; this is aligned with what Sutherland and Venolia envisioned as a beneficial use of code review data for development [38]. Future work can investigate these unexpected and potentially impactful non-review uses of code review tools.

7.3 Knowledge spreading

Knowledge transfer is a theme that emerged in the work by Rigby and Bird [33]. In an attempt to measure knowledge transfer due to code review, they built off of prior work that measured expertise in terms of the number of files changed [28], by measuring the number of distinct files changed, reviewed, and the union of those two sets. They find that developers know about more files due to code review.

At Google, knowledge transfer is part of the educational motivation for code review. We attempted to quantify this effect by looking at comments and files edited/reviewed. As developers build experience working at Google, the average number of comments on their changes decreases (Figure 2).

[Figure 2: Reviewer comments vs. author's tenure at Google. Series: comments per change and comments per 100 LoC; x-axis: tenure at Google (years).]
Developers at Google who have started within the past year typically have more than twice as many comments per change. Prior work found that authors considered comments with questions from the reviewer as not useful, and the number of not useful comments decreases with experience [11]. We postulate that this decrease in commenting is a result of reviewers needing to ask fewer questions as they build familiarity with the codebase, and corroborates the hypothesis that the educational aspect of code review may pay off over time. Also, we can see that the number of distinct files edited and reviewed by engineers at Google, and the union of those two sets, increase with seniority (Figure 3), and the total number of files seen is clearly larger than the number of files edited. For this graph, we bucket our developers by how long they have been at the company (in 3-month increments) and then compute the number of files they have edited and reviewed. It would be interesting in future work to better understand how reviewing files impacts developer fluency [44].

[Figure 3: The number of distinct files seen (edited or reviewed, or both) by a full-time employee over time. Series: median number of files edited, files reviewed, and files edited or reviewed; x-axis: tenure at Google (months).]
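A rough sketch of this measurement, in the spirit of the files-edited/reviewed metric from prior work [28, 33]; the log format and the way tenure is attached to each developer are assumptions for illustration:

    from collections import defaultdict
    from statistics import median
    from typing import Dict, Iterable, List, Set, Tuple

    # Hypothetical rows: (developer, tenure_months, file, action), with action in {"edit", "review"}
    # and tenure_months the developer's tenure at the time of the snapshot.
    Row = Tuple[str, int, str, str]

    def files_seen_by_tenure(rows: Iterable[Row]) -> Dict[int, Tuple[float, float, float]]:
        """For each 3-month tenure bucket, return the median number of distinct files
        a developer has edited, reviewed, and edited-or-reviewed ("seen")."""
        edited: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
        reviewed: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
        for dev, months, path, action in rows:
            key = (dev, (months // 3) * 3)
            (edited if action == "edit" else reviewed)[key].add(path)
        per_bucket: Dict[int, List[Tuple[int, int, int]]] = defaultdict(list)
        for key in set(edited) | set(reviewed):
            e, r = edited.get(key, set()), reviewed.get(key, set())
            per_bucket[key[1]].append((len(e), len(r), len(e | r)))
        return {
            bucket: tuple(median(v[i] for v in values) for i in range(3))
            for bucket, values in per_bucket.items()
        }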
8 CONCLUSION

Our study found code review is an important aspect of the development workflow at Google. Developers in all roles see it as providing multiple benefits and a context where developers can teach each other about the codebase, maintain the integrity of their teams' codebases, and build, establish, and evolve norms that ensure readability and consistency of the codebase. Developers reported they were happy with the requirement to review code. The majority of changes are small, have one reviewer, and have no comments other than the authorization to commit. During the week, 70% of changes are committed less than 24 hours after they are mailed out for an initial review. These characteristics make code review at Google lighter weight than in the other projects adopting a similar process. Moreover, we found that Google includes several research ideas in its practice, making the practical implications of current research trends visible.

ACKNOWLEDGMENTS

The authors would like to thank Elizabeth Kemp for her assistance with the qualitative study, Craig Silverstein for context, and past and present members of the Critique team for feedback, including in particular Ben Rohlfs, Ilham Kurnia, and Kate Fitzpatrick. A. Bacchelli gratefully acknowledges the support of the Swiss National Science Foundation through the SNF Project No. PP00P2_170529.

REFERENCES

[1] 2017. Appendix to this paper. https://goo.gl/dP4ekg. (2017).
[2] A.F. Ackerman, L.S. Buchwald, and F.H. Lewski. 1989. Software inspections: An effective verification process. IEEE Software 6, 3 (1989), 31–36.
[3] A.F. Ackerman, P.J. Fowler, and R.G. Ebenau. 1984. Software inspections and the industrial production of software. In Symposium on Software validation: inspection-testing-verification-alternatives.
[4] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In ICSE.
[5] Vipin Balachandran. 2013. Reducing Human Effort and Improving Quality in Peer Code Reviews Using Automatic Static Analysis and Reviewer Recommendation. In ICSE.
[6] Mike Barnett, Christian Bird, João Brunet, and Shuvendu K Lahiri. 2015. Helping developers help themselves: Automatic decomposition of code review changesets. In ICSE.
[7] V.R. Basili, F. Shull, and F. Lanubile. 1999. Building knowledge through families of experiments. IEEE Trans. on Software Eng. 25, 4 (1999), 456–473.
[8] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W Godfrey. 2013. The influence of non-technical factors on code review. In WCRE.
[9] Christian Bird, Nachiappan Nagappan, Brendan Murphy, Harald Gall, and Premkumar Devanbu. 2011. Don't touch my code!: examining the effects of ownership on software quality. In FSE.
[10] Amiangshu Bosu and Jeffrey C Carver. 2013. Impact of peer code review on peer impression formation: A survey. In ESEM.
[11] Amiangshu Bosu, Michaela Greiler, and Christian Bird. 2015. Characteristics of useful code reviews: an empirical study at Microsoft. In MSR.
[12] chromium 2016. Chromium developer guidelines. https://www.chromium.org/developers/contributing-code. (August 2016).
[13] J.W. Creswell. 2009. Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). Sage Publications.
[14] Jacek Czerwonka, Michaela Greiler, and Jack Tilford. 2015. Code Reviews Do Not Find Bugs: How the Current Code Review Best Practice Slows Us Down. In ICSE (SEIP).
[15] Martín Dias, Alberto Bacchelli, Georgios Gousios, Damien Cassou, and Stéphane Ducasse. 2015. Untangling fine-grained code changes. In SANER.
[16] M.E. Fagan. 1976. Design and code inspections to reduce errors in program development. IBM Systems Journal 15, 3 (1976), 182–211.
[17] gerrit 2016. https://www.gerritcodereview.com/. (August 2016).
[18] githubpull 2016. GitHub pull request process. https://help.github.com/articles/using-pull-requests/. (2016).
[19] Barney G Glaser and Anselm L Strauss. 2009. The discovery of grounded theory: Strategies for qualitative research. Transaction Books.
[20] Georgios Gousios, Margaret-Anne Storey, and Alberto Bacchelli. 2016. Work Practices and Challenges in Pull-Based Development: The Contributor's Perspective. In ICSE.
[21] Georgios Gousios, Andy Zaidman, Margaret-Anne Storey, and Arie van Deursen. 2015. Work Practices and Challenges in Pull-Based Development: The Integrator's Perspective. In ICSE.
[22] Toshiki Hirao, Akinori Ihara, Yuki Ueda, Passakorn Phannachitta, and Ken-ichi Matsumoto. 2016. The Impact of a Low Level of Agreement Among Reviewers in a Code Review Process. In IFIP International Conference on Open Source Systems.
[23] Harrie Jansen. 2010. The logic of qualitative survey research and its position in the field of social research methods. In Forum Qualitative Sozialforschung, Vol. 11.
[24] Yujuan Jiang, Bram Adams, and Daniel M German. 2013. Will my patch make it? And how fast?: Case study on the Linux kernel. In MSR.
[25] Sami Kollanus and Jussi Koskinen. 2009. Survey of software inspection research. The Open Software Engineering Journal 3, 1 (2009), 15–34.
[26] Oleksii Kononenko, Olga Baysal, and Michael W Godfrey. 2016. Code review quality: How developers see it. In ICSE.
[27] Matthew B Miles and A Michael Huberman. 1994. Qualitative data analysis: An expanded sourcebook. Sage.
[28] Audris Mockus and James D Herbsleb. 2002. Expertise browser: a quantitative approach to identifying expertise. In ICSE.
[29] phabricator 2016. Meet Phabricator, The Witty Code Review Tool Built Inside Facebook. https://techcrunch.com/2011/08/07/oh-what-noble-scribe-hath-penned-these-words/. (August 2016).
[30] Rachel Potvin and Josh Levenburg. 2016. Why Google Stores Billions of Lines of Code in a Single Repository. Commun. ACM (2016).
[31] Teade Punter, Marcus Ciolkowski, Bernd Freimut, and Isabel John. 2003. Conducting on-line surveys in software engineering. In Empirical Software Engineering.
[32] P. Rigby, B. Cleary, F. Painchaud, M.A. Storey, and D. German. 2012. Open Source Peer Review–Lessons and Recommendations for Closed Source. IEEE Software (2012).
[33] Peter C Rigby and Christian Bird. 2013. Convergent software peer review practices. In FSE.
[34] Peter C. Rigby, Daniel M. German, Laura Cowen, and Margaret-Anne Storey. 2014. Peer Review on Open-Source Software Projects: Parameters, Statistical Models, and Theory. IEEE Transactions on Software Engineering 23, 4, Article 35 (2014).
[35] David L Sackett. 1979. Bias in analytic research. Journal of Chronic Diseases 32, 1-2 (1979), 51–63.
[36] Caitlin Sadowski, Jeffrey van Gogh, Ciera Jaspan, Emma Söderberg, and Collin Winter. 2015. Tricorder: building a program analysis ecosystem. In ICSE.
[37] F. Shull and C. Seaman. 2008. Inspecting the History of Inspections: An Example of Evidence-Based Technology Diffusion. IEEE Software 25, 1 (2008), 88–90.
[38] Andrew Sutherland and Gina Venolia. 2009. Can peer code reviews be exploited for later information needs? In ICSE.
[39] William Isaac Thomas and Dorothy Swaine Thomas. 1928. The methodology of behavior study. The child in America: Behavior problems and programs (1928), 553–576.
[40] Patanamon Thongtanunam, Shane McIntosh, Ahmed E Hassan, and Hajimu Iida. 2016. Revisiting code ownership and its relationship with software quality in the scope of modern code review. In ICSE.
[41] Patanamon Thongtanunam, Shane McIntosh, Ahmed E Hassan, and Hajimu Iida. 2017. Review participation in modern code review. Empirical Software Engineering 22, 2 (2017), 768–817.
[42] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In SANER.
[43] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2016. Automatically recommending peer reviewers in modern code review. IEEE Transactions on Software Engineering 42, 6 (2016), 530–543.
[44] Minghui Zhou and Audris Mockus. 2010. Developer Fluency: Achieving True Mastery in Software Projects. In FSE.
