Archive for February, 2009

Two steps forward, one step back

Once upon a time, there was RCS, and then CVS. They tracked normal edits to a set of text files reasonbly well, and coupled with telnet or ssh, even made it relatively straightforward for a trusted group of collaborators to share their changes with each other. Some people used other proprietary tools (Perforce, Visual Source Safe, etc.) but they tended to be either a) expensive b) really, really lousy or c) both. Among the open source crowd, at least, CVS dominated the version control space for many years.

Then came Subversion. It improved on many of the failings of CVS — notably, Windows support was dramatically better, repositories could be shared over HTTP, and many operations that just didn’t work in CVS (renames, binary diffs, etc.) performed reasonably well out of the box. To this day, Subversion is a reasonable choice for many projects, especially given the advanced level of support for it in IDEs, graphical repository browsers, and the like.

Much of the reason for that diversity of useful tooling built atop Subversion, of course, is that it was written in C, and built with an eye towards allowing high-level languages to use bindings into the same runtime libraries upon which the ’svn’ command itself relied. In fact, Python, Perl, Java, and Ruby are all supported by the core Subversion maintainers, and additional bindings using those same underlying libraries are available for a number of other languages.

Enter the distributed version control systems: Git, Mercurial, Bazaar, Darcs, and their ilk. The basic workflow they offer is in some ways more like RCS than it is Subversion: each developer works locally against their own copy of a repository, and they share their work via patch files and periodic synchronization. (This is of course a gross over-simplifaction, as all of them offer much more sophisicated change-tracking under the hood than RCS did, but the user-visible behavior is still reminiscent.) However, their ability to maintain change history across many developers and systems without forcing everyone to eventually squash their work down into a single source tree makes a number of new modes of project management possible, or at least much easier than before.

All of the above DVCS systems potentially offer a huge gain of productivity for many developers, since you can easily experiment with changes locally, selectively share only those modifications which you wish to, and continue working without being connected to the central repository. (This is especially significant for those whose employers maintain draconian firewall rules and disallow off-site access to their source control.)

Unfortunately, none of the popular DVCS systems have anything resembling the level of cross-language API support that Subversion does. Mercurial and Bazaar are both implemented in Python, making access from other Python code quite fast, and that from any other language painfully slow. Git is implemented in C, but without a supported and documented core library of functions designed to be used to facilitate access from other languages. Darcs is written in Haskell, which means only crazy mathematicians and CS majors have any ability or interest in using it. (I’m kidding here, but the point remains that Haskell isn’t exactly the most useful substrate for scripting language bindings.)

The fallout from all of this is that we’re left using wrapper libraries which fork out to the command-line tools for each DVCS. Such wrappers have a number of problems: the performance sucks, the internal APIs are usually only as robust as the set of regular expressions you write to parse the output of the commands, and almost no work is shared between the various wrapper implementations.

Don’t get me wrong: as a simple version control tool, I’ve found Git in particular (and distributed version control in general) to be a big step up from the old centralized-repository model. However, the very eighties-esque fork-and-regexp-scrape model for IPC — coupled with the lack of an obvious “best of breed” leader in the DVCS space — means that I (along with anyone else trying to support DVCS in a general-purpose way) end up doing a lot of low-level grunt work when we could be building real value for users.

Even something as simple as a standard dump format for a common subset of the information available from the popular DVCS types would be a start. I do know that, for the time being, I’m stuck supporting a bunch of very brittle code which relies on the various idiosyncratic console output formats of each version-control system.

Playing prognosticator, I would even go so far as to suggest that the first DVCS system to provide supported, documented interfaces in a number of popular programming languages could climb to the top of the dogpile that exists currently and emerge as a clear standard.