Monday, December 28, 2009

On commit messages

In the last few weeks, I've had a surprising number of discussions about commit messages. Many of them were with developers new to a project, trying to get them started. So here's a list of things you should do when committing, and why you should do them. Hint: the Linux kernel mailing list gets it right, so go there to learn.

Any software project is a collaborative project. It has at least two developers, the original developer and the original developer a few weeks or months later when the train of thought has long left the station. This later self needs to reestablish the context of a particular piece of code each time a new bug occurs or a new feature needs to be implemented.

Re-establishing the context of a piece of code is wasteful. We can't avoid it completely, so our efforts should go into keeping it as small as possible. Commit messages can do exactly that and, as a result, a commit message shows whether a developer is a good collaborator.

A good commit message should answer three questions about a patch:

  • Why is it necessary? It may fix a bug, it may add a feature, it may improve performance, reliability, stability, or just be a change for the sake of correctness.

  • How does it address the issue? For short obvious patches this part can be omitted, but it should be a high level description of what the approach was.

  • What effects does the patch have? (In addition to the obvious ones, this may include benchmarks, side effects, etc.)



These three questions establish the context for the actual code changes, put reviewers and others into the frame of mind to look at the diff and check if the approach chosen was correct. A good commit message also helps maintainers to decide if a given patch is suitable for stable branches or inclusion in a distribution.

A patch without these questions answered is mostly useless. The burden for such a patch is on each and every reviewer to find out what the patch does and how it fixes a given issue. Given a large number of reviewers and a sufficiently complex patch, this means many man-hours get wasted just because the original developer did not write a good commit message. Worse, if the maintainers of the project enforce SCM discipline, they will reject the patch and the developer needs to spend time again to rewrite the patch, reviewers spend time reviewing it again, etc. The time wasted quickly multiplies, and given that a commit message only takes a few minutes to write, it is simply not economically viable to omit it or write it badly.

Consider this a hint for proprietary software companies too - not having decent SCM discipline costs money!


How to do it better


There's no strict definition of the ideal commit message, but some general rules have emerged.
A commit should contain exactly one logical change. A logical change includes adding a new feature, fixing a specific bug, etc. If it's not possible to describe the high level change in a few words, it is most likely too complex for a single commit. The diff itself should be as concise as reasonably possible and it's almost always better to err on the side of too many patches than too few. As a rule of thumb, given only the commit message, another developer should be able to implement the same patch in a reasonable amount of time.

If you're using git, get familiar with "git add -p" (or -i) to split up changes into logical commits.

The git commit format


If you're submitting patches with git, the format is mostly standardised. It starts with a short one-line summary of the change (the maximum length of the line differs between projects, it's usually somewhere between 50 and 78 characters). This is the line that'll be seen most often, so make it count. Many git tools are in one way or another optimised for this format. After that one-line summary comes an empty line, then multiple paragraphs explaining the patch in detail (if needed). Don't describe the code, describe the intent and the approach. And keep the log in the present tense.
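
As a rough sketch (mine, not something any git tool ships), the format above could be checked with a few lines of Python. The function name check_commit_message and the 78-character default are assumptions for illustration, and the example messages are invented.

```python
def check_commit_message(message, max_summary=78):
    """Return a list of format problems in a commit message (empty if none)."""
    problems = []
    lines = message.rstrip("\n").split("\n")
    summary = lines[0]

    # The first line is the one-line summary that most tools display.
    if not summary.strip():
        problems.append("summary line is empty")
    if len(summary) > max_summary:
        problems.append(f"summary line is longer than {max_summary} characters")

    # An empty line must separate the summary from the detailed paragraphs.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line is not empty")

    return problems


# Hypothetical message in the summary / empty line / body layout.
good = (
    "Reset button state when a device is disabled\n"
    "\n"
    "Buttons held down while the device is disabled never get a release\n"
    "event. Release all buttons on disable so clients do not see them\n"
    "stuck after the device is enabled again.\n"
)

print(check_commit_message(good))                        # []
print(check_commit_message("misc fixes\nand cleanups"))  # ['second line is not empty']
```

This only checks the layout. Whether the message actually answers the why, how and what is still up to the author and the reviewers.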

Learn to love the log


I have used CVS (and SVN to a lesser extent) in the past and log was a tool that was hardly ever used. Mostly because it was pretty useless, both the tool and the information available. These days I look at git logs more often than at code. The git log tool is vastly superior to CVS log and the commit discipline in the projects I'm working on now is a lot better. I grep git logs more often than code files and I use git blame all the time to figure out why a particular piece of code looks the way it does. It's certainly saving me a lot of time and effort. It's come to the point where the most annoying X server bugs are the ones where the git history stops at the original import from XFree86. If you're not using your SCM's log tool yet, I recommend getting more familiar with it.

How not to do it


There's a bunch of common sins that are committed (yay, a pun!) regularly.


  • SCM is not a backup system! My personal pet hate. Developers who use it as such tend to do end-of-day commits, checking in everything at the end of the day. The result is useless, a random diff across the code with changes that are impossible to understand by anyone including the original author once a few months have passed. (On this note: universities, please stop teaching this crap).

  • Per-file commit. More often than not a logical change affects more than one file and it should not be split up into two commits.

  • Lazy commit messages, any commit labelled as "misc fixes and cleanups" or similar. I've seen my fair share of those on non-FOSS projects and they always come back to bite you. Impossible to find when a bug was introduced, hard to bisect and makes it harder for anyone else to keep track of what's happening in the project.

  • Two changes in one patch. Something like "Fixed bug 2345 and renamed all foo to bar". Unless bug 2345 required the renaming, the changes should be split up into multiple patches. Others may have to take one of those changes and apply it to a stable branch but not the other one. Picking bad patches apart into useful chunks is one of the most time-consuming and frustrating things I've done since it doesn't actually add any value to the project.

  • Whitespace changes together with code changes. Needle in a haystack is a fun game, but not when you're looking at patches. It's a great way to introduce bugs, though, because almost no-one will spot the bug hidden in hundreds of lines that got reindented for fun and profit.

  • The ever-so-lovely code drops. Patches with hundreds of lines of code to dump a new feature into the code while at the same time rewriting half the existing infrastructure to support this feature. As a result, those hundreds of lines of code need to be reviewed every time a bug is discovered that is somehow related to that area of code.
    It's easier and less time consuming to first rework the infrastructure one piece at a time, then plug the new feature on top. As a side-effect, if a project relies on code dumps too often it's discouraging outside developers. Would you like to contribute to a project where the time spent filtering the signal from the noise outweighs the actual contribution to the code?

  • Unrelated whitespace changes in patches. A reviewer needs to get the big picture of a patch into their brain. Whitespace-only hunks just confuse: a reviewer has to look extra hard to check if there's a real change or whether it can be ignored. That's not so bad for empty lines added or removed, but it's really bad for indentation changes.
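
For the two whitespace points above, here is a rough Python sketch of how a script could tell whether a change touches nothing but whitespace. The function names are hypothetical and the check is deliberately conservative. It is not meant to mirror git's own whitespace options.

```python
def normalise(text):
    """Collapse runs of whitespace in each line and drop blank lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def is_whitespace_only_change(old, new):
    """True if old and new differ only in indentation, spacing or blank lines."""
    return old != new and normalise(old) == normalise(new)


# Reindented code plus an added empty line: nothing real changed.
print(is_whitespace_only_change("if (x)\n  y();\n", "if (x)\n\ty();\n\n"))  # True

# A real change, however small, is not whitespace-only.
print(is_whitespace_only_change("a = 1\n", "a = 2\n"))  # False
```

A change that passes this check belongs in its own commit, if it needs to be made at all.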



There are plenty of excuses for the above, with the favourite one being "but it works!". It may work, but code is not a static thing. In a few weeks' time, that code may have moved, been rewritten, may be called in a different manner or may have a bug in it. At the same time, the original developer may have moved on and no-one knows why the code is that way. In the worst case, everyone is afraid of touching it because nobody knows how it actually works.

Another common excuse is the "but I'm the only one working on it". Not true, any software project is a collaborative project (see above). Assuming that there's no-one else is simply short-sighted. In FOSS projects specifically we rely on outside contributors, be it testers, developers, triagers, users, etc. The harder it becomes for them to join, the more likely the project will fail.

Another, less common excuse these days is that the SCM used is too slow. Distributed SCMs avoid this issue, saving time and by inference money.

15 comments:

Unknown said...

Good patch viewers (gitk) have an option to hide whitespace changes :)
This helps in cases where the reindentation was an important part of the patch, too (like putting 100 lines of code in a if () {...} block).

spbnick said...

Thanks! This is really useful both to me and as a guide to be shown to other developers. Good reasoning too :)

Unknown said...

Hi,

This is really great post. I need to save it somewhere for my future collaborators ;)

Regards,
Michal

uma mahesh varma said...

Very nice post. I will make myself to comment the message of each functionality from now.

Thank You,
Uma.

Anonymous said...

Good stuff, I've seen commit messages like "Added Stuff" from really over worked engineers. lol

Mike Sherov said...

Great post, except the thing about whitespace changes. I see way too many code drops. "Practices of an Agile Developer" says: "13. Keep you project releasable at all times" and "17. Develop in increments."

Wm Bentley said...

Very helpful. Thanks so much for formulating these suggestions. Agree about separating white space changes into separate, smaller commits. They're a logical unit of change, too.

Anonymous said...

When somebody says "but I'm the only one working on it" gently remind him that three months down the road he won't remember the details of why he made such a change as he does right now, without a usable commit log message.

Oliver McFadden said...

Thanks, very good post. I've recommended this to a few people already - and I'm sure there will be more still - and mentioned it in my post about Subversion (and why no one should be using it anymore.)

0andriy said...

@Stefan, if you put 100 lines code under if () {} block you are most likely doing wrong thing.

j.eng said...

@Stefan: not just gitk. -w can be used wherever a diff is generated. `git log -p -w -M -M` is one of the real tools one should know.

Unknown said...

Re "SCM is not a backup system"... it can be, but keep that crap on your own personal branch and only push the final polished version.

fyngyrz said...

Unfortunately, whereas a change may occur over several files, as well as documentation, this...

git add -u

...which is the syntax for uploading those things with changes, puts the *same* commit message on everything on Github, instead of providing a means to supply the appropriate material(s.)

The end result is that the top line, which is all one sees on Github without a click, is generally *not* a meaningful line for at least one of the changed files where there is more than one.

So it's not always about the procedure of the author. Sometimes the tools are flawed.

As a friend mentioned, it'd be much better if, on Github, they simply put "last changed on X date" in there, rather than a header line that is, in many cases, destined to be irrelevant.

I like to make a list of changes to all things and put it in there. But it doesn't show without a click, so the whole overview-thing... not very functional.

Unknown said...

Hi,

one thing that kills me is the use of abbreviations.
By all means decipher _all_ of them at least _once_ in your text.
When you are new to Agile (or whatever),
then those are hard to decipher,
and sometimes search in google can mislead you.
Going to the point: what is SCM ?
First guess (from google) is Software Configuration Management.
Is this correct ?

Ng Kah Fai said...

Appreciated this