Original link: http://google-engtools.blogspot.com/...at-google.html

What's the problem?


Here at Google, we have thousands of engineers working on our code base every day. In fact, as previously noted,
50% of the Google code base changes every month. That’s a lot of code
and a lot of people. In order to ensure that our code base stays
healthy, Google primarily employs unit testing and code review for all
new check-ins. When a piece of code is ready for submission, not only
should all the current tests pass, but new tests should also be written
for any new functionality. Once the tests are green, the code reviewer
swoops in to make sure that the code is doing what it is supposed to,
and stamps the legendary “LGTM” (Looks Good To Me) on the submission,
and the code can be checked in.


However, Googlers work every day on increasingly complex problems,
providing the features and availability that our users depend on. Some
of these problems are necessarily difficult to grapple with, leading to
code that is unavoidably difficult. Sometimes, that code works very
well, and is deployed without incident. Other times, the code creates
issues again and again, as developers try to wrestle with the problem.
For the sake of this article, we'll call this second class of code “hot
spots”. Perhaps a hot spot is resistant to unit testing, or maybe a very
specific set of conditions can lead the code to fail. Usually, our
diligent, experienced, and fearless code reviewers are able to spot any
issues and resolve them. That said, we're all human, and sneaky bugs are
still able to creep in. We found that it can be difficult to realize
when someone is changing a hot spot versus generally harmless code.
Additionally, as Google's code base and teams increase in size, it
becomes less likely that the submitter and reviewer will even be aware
that they're changing a hot spot.


In order to help identify these hot spots and warn developers, we looked at bug prediction.
Bug prediction uses machine-learning and statistical analysis to try to
guess whether a piece of code is potentially buggy or not, usually
within some confidence range. Source-based metrics that could be used
for prediction include the number of lines of code, the number of
dependencies, and whether those dependencies are cyclic. These can work well,
but these metrics are going to flag our necessarily difficult, but
otherwise innocuous code, as well as our hot spots. We're only worried
about our hot spots, so how do we only find them? Well, we actually have
a great, authoritative record of where code has been requiring fixes:
our bug tracker and our source control commit log! The research (for
example, FixCache) indicates that predicting bugs from the source
history works very well, so we decided to deploy it at Google.


How it works

In the literature, Rahman et al. found that a very cheap algorithm
actually performs almost as well as some very expensive bug-prediction
algorithms. They found that simply ranking files by the number of times
they've been changed with a bug-fixing commit (i.e. a commit which
fixes a bug) will find the hot spots in a code base.
Simple! This matches our intuition: if a file keeps requiring bug-fixes,
it must be a hot spot because developers are clearly struggling with
it.
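
A minimal sketch of this ranking, for illustration only. The function name and the input shape (each bug-fixing commit given as a list of file paths) are assumptions made for the example, not part of the tool described in this post.

```python
from collections import Counter


def rank_by_bug_fixes(bug_fixing_commits):
    """Rank files by how many bug-fixing commits touched them.

    `bug_fixing_commits` is a list of commits, each given as the list of
    file paths it changed (a hypothetical input shape).
    """
    counts = Counter()
    for files in bug_fixing_commits:
        # set() so a file listed twice in one commit is counted once.
        counts.update(set(files))
    # Highest count first; ties broken by path so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up history: three bug-fixing commits.
commits = [
    ["net/socket.cc", "net/socket.h"],
    ["net/socket.cc"],
    ["ui/button.cc", "net/socket.cc"],
]
ranking = rank_by_bug_fixes(commits)
```

Here `net/socket.cc` ends up at the top of `ranking` because all three bug-fixing commits touched it.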


Aside from the speed of execution, this algorithm is also very attractive as
it's easy to communicate to others: files are flagged if they have
attracted a large number of bug-fixing commits, no more and no less.
Some bug prediction algorithms use a large number of metrics and perform
many calculations before they output a result, but how do we know it's
not a false positive? We don't! Once developers start feeling a tool is
lying to them, they'll quickly stop using it. With the Rahman algorithm,
whether a developer agrees with the prediction or not is up for debate,
but no one can argue with the actual number it outputs.


We implemented the Rahman algorithm by creating a program that hooks into
our source control system and pulls out all the changes which had a
bug attached to them. It looks at each bug number, and verifies with the
bug-tracking database that it was really a bug, and filters out
everything else, such as feature requests. It then looks at all the
files that appeared in these changes, and filters out those that have
been deleted and are no longer at HEAD. For each file, the number of
bug-fixing changes it's been in is calculated, and we output the files
which were ranked in the top 10%.
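
The filtering steps above could be sketched roughly as follows. Everything named here is hypothetical: `changes` stands in for the source control query, `bug_types` for the bug-tracking database lookup, and `files_at_head` for the set of files that still exist. The sketch also assumes the top 10% is taken over files with at least one bug-fixing change, which the text does not specify.

```python
from collections import Counter
from math import ceil


def hot_spots(changes, bug_types, files_at_head, top_fraction=0.10):
    """Return the top-ranked files by number of bug-fixing changes.

    changes:       list of (bug_id or None, [file paths]) tuples
    bug_types:     dict mapping bug_id -> issue type, e.g. "bug" or "feature"
    files_at_head: set of paths that still exist at HEAD
    """
    counts = Counter()
    for bug_id, paths in changes:
        # Skip changes with no issue attached, or whose issue is not a bug.
        if bug_types.get(bug_id) != "bug":
            continue
        for path in set(paths):
            # Ignore files that have since been deleted.
            if path in files_at_head:
                counts[path] += 1
    ranked = sorted(counts, key=lambda p: (-counts[p], p))
    # Keep at least one file whenever anything was ranked.
    cutoff = ceil(len(ranked) * top_fraction)
    return ranked[:cutoff]


# Made-up data for illustration.
changes = [
    (101, ["a.py", "b.py"]),
    (102, ["a.py"]),            # attached issue is a feature request
    (103, ["a.py", "old.py"]),  # old.py was later deleted
    (None, ["c.py"]),           # no issue attached
    (104, ["c.py"]),
]
bug_types = {101: "bug", 102: "feature", 103: "bug", 104: "bug"}
files_at_head = {"a.py", "b.py", "c.py"}

flagged = hot_spots(changes, bug_types, files_at_head)
everything = hot_spots(changes, bug_types, files_at_head, top_fraction=1.0)
```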


We showed the output to the development teams (you know, just to make sure). The response?

"Hey guys, this list looks great, but there's a couple of files that used to
be a problem, but we fixed them, so they shouldn't be on here now."


It turns out that while the Rahman algorithm shows us where hot spots are,
it doesn't adapt to changes readily. If a development team manages to
nail down a hot spot and get it fixed, it'll still appear in the list
because of all the bug-fixing commits it created in the past.


What we needed was a way of prioritizing newer bug-fixing commits, and
downgrading the value of old ones, so fixed files begin to fall down the
list.


After some trial and error, we decided to score each file by weighting each
bug-fixing commit by how old it is. As a commit gets older, its
influence tends towards 0.
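
The equation itself does not appear in this copy of the text. The published version of the post gives a logistic weighting, reproduced here from memory of that source and indexed from 1 to n so that the sum has exactly n terms. Treat the constant 12 as the published tuning choice, not something derived here:

```latex
\mathrm{Score} = \sum_{i=1}^{n} \frac{1}{1 + e^{-12 t_i + 12}}
```

With this form, a commit made right now (t_i = 1) contributes 1/2, and a commit from the very start of the code base (t_i = 0) contributes about 6 × 10^-6.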


In the scoring equation, n is the number of bug-fixing commits, and t_i is the timestamp of the i-th bug-fixing commit.

The timestamp used in the equation is normalized from 0 to 1, where 0
is the earliest point in the code base, and 1 is now (where now is when
the algorithm was run). Note that the score changes over time with this
algorithm due to the moving normalization; it's not meant to provide
some objective score, only provide a means of comparison between one
file and another at any one point in time.
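
A small sketch of the normalization and scoring. It assumes the logistic weight 1 / (1 + e^(-12t + 12)) from the published version of this post; the function names and the example timestamps are made up for illustration.

```python
import math


def normalize(timestamp, first_commit, now):
    """Map a raw timestamp onto [0, 1]: 0 = earliest commit, 1 = now."""
    return (timestamp - first_commit) / (now - first_commit)


def hot_spot_score(fix_timestamps, first_commit, now):
    """Sum a time-decayed weight over a file's bug-fixing commits.

    The weight 1 / (1 + exp(-12 t + 12)) is an assumption taken from the
    published formula; t is the normalized timestamp of one commit.
    """
    return sum(
        1.0 / (1.0 + math.exp(-12.0 * normalize(t, first_commit, now) + 12.0))
        for t in fix_timestamps
    )


# Code base started at time 0; the algorithm is run at time 1000.
first_commit, now = 0, 1000

# Three old fixes versus a single very recent one.
old_file_score = hot_spot_score([100, 200, 300], first_commit, now)
recent_file_score = hot_spot_score([990], first_commit, now)
```

In this example the file with one recent fix outscores the file with three old fixes. Because `now` moves every time the algorithm runs, the same file's score will differ between runs, so scores are only comparable within a single run.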


Some of you might wonder why we don't factor in the number of commits in the
algorithm: a file that changes often as it's being developed will get
more bug-fixing commits. Wouldn't it be fairer to look at the ratio of
non-fixing commits to bug-fixing commits? Having trialled this, we found
the results unsatisfying. Code churn has previously been pointed at as
an indicator of the presence of defects (particularly by
Nagappan and Ball), so employing a ratio removes that useful signal.

If we plot our equation, it looks like this:

[Figure: plot of the scoring equation]

Scoring files this way means that as commits get older, they are
worth less and less. The drop-off happens quickly in order to really
push up those newer bug-fixing changes and devalue the older ones. Files
that don't get many bug-fixing commits for a while will end up falling
out of the top 10%.



How we're using it

When a file is predicted to be a hot spot, we place a warning in our code
review system on that file. Whenever a reviewer logs in to review that
code, the warning will appear, which hopefully will encourage them to
spend some more time reviewing the code, or hand off the review to
someone more experienced if need be.



Conclusion

Bug prediction is not an objective measure by any means: the attentive
amongst you will see it's another tool that we can provide Googlers with
in order to help them gain insight into their code. We hope that by
highlighting code hot spots, we'll help to stop tricky bugs making their
way into the code base. We'll be monitoring how developers are engaging
with these reviews in the months to come.



- Chris Lewis and Rong Ou