2009 Roll-up

Since Jan. 1 2009, I have:

  • Changed jobs – and not just jobs, but entire industry sectors, switching from services for software developers to publishing and good-old-fashioned e-commerce.
  • Bought a house (my first!) with my lovely girlfriend Hannah
  • Bought a car (not my first, but the first in ~5 years) and driven to Montana, Idaho, and Washington for bike rides and other general fun
  • Joined the Citizen Campaign Commission, in order to help oversee Portland’s publicly-funded elections

It’s not an earth-shattering list, but there’s at least some sign of positive movement there, and I’m optimistic about what 2010 should have in store.

Code Reading

We do full code reviews at my shop — no code goes into production without at least two pairs of eyes on every line of the change.

As I switch between projects, I find myself willing to be absolutely ruthless in my code reviews when reading other people’s Python code. When I’m looking at PHP or Ruby source, I expect a certain amount of license to be taken with indentation, naming, and even encapsulation. Python? No dice. If you’re gonna use the language, use it right.

Use a less-than-descriptive variable name? Rejected.

Call an internal implementation method (ex.: _do_stuff()) outside the class that defined it (even in a unit test)? Rejected.

Let a line go over 120 characters in width? Rejected.

Forget to provide a useful comment for the test used in an if: block? Rejected.

I probably should be just as strict with PHP, Ruby, etc., but the culture doesn’t seem to be as forgiving of hard-and-fast style guidelines in those communities.
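The more mechanical rules above (line width, private method calls) are the sort of thing a script can catch before a human reviewer ever sees the patch. Here is a rough sketch of that idea; the function name and the regular expression are my own illustration, not part of any real linting tool, and the private-call check is only a heuristic.

```python
import re

MAX_WIDTH = 120  # the width limit mentioned above

# Heuristic: an attribute call like obj._do_stuff( where obj is not self or cls.
# Dunder methods such as obj.__init__( are deliberately not matched.
PRIVATE_CALL = re.compile(r"\b(?!self\b|cls\b)\w+\._[A-Za-z]\w*\(")


def review(source):
    """Return a list of (line_number, complaint) tuples for some Python source."""
    complaints = []
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_WIDTH:
            complaints.append((number, "line exceeds %d characters" % MAX_WIDTH))
        if PRIVATE_CALL.search(line):
            complaints.append((number, "call to a private method from outside its class"))
    return complaints
```

Vague variable names and missing comments still need a human reader, so this only covers the easy half of the list.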

Django for Rails devs

I’ve recently made the transition from full-time Rails development to a mix of technologies including, in large part, Django. Since I was a Python guy before I ever started using Ruby, the transition has mostly been an easy one, but there are some fairly significant differences in design and philosophy between the two frameworks that are worth thinking about if you’re picking a platform for a new project. Given that most of the comparisons out there on the Intertubes seem to be woefully out of date (the first page of Google results is mostly populated by articles that are 3-4 years old), I thought I might toss out some of my own highly-subjective observations for anyone else trying to evaluate both stacks.

Similarities

Generally, the two projects are more alike than they are different, at least from the POV of a working web developer. Either one will give you a nice boost in productivity when building non-trivial applications where time-to-market trumps hard performance, runtime platform, or office politics. Much as Ruby and Python offer a similar competitive advantage to the teams using them, neither Rails nor Django will leave you struggling to keep up with other agile web development teams (or conversely, easily coasting past them).

Both Django and Rails have fairly powerful object-relational layers baked-in, and have good support for popular open source databases, including MySQL, PostgreSQL, and SQLite3. They both offer flexible URL mapping/routing tools, fairly easy-to-learn standard template formats, and (at least for Rails >= 2.3) the ability to insert middleware into the fast path of your request/response cycle, either to manipulate the request data, or to short-circuit loading in cases where you don’t need the whole framework for a particular client request.

In addition, both frameworks benefit from an active, supportive community that will help you get up to speed and answer questions. The online documentation for each project is fairly extensive, although I personally feel that the Django folks have done a better job of pulling the 80% of the docs that most developers need when they’re getting started into one place, with a consistent style and voice.

Django wins

There are a couple of big ones: form classes and idiomatic use of Python modules.

Form classes

This is, without a doubt, my favorite feature of Django that simply has no real equivalent in Rails. Basically, a Django form lets you de-couple the HTML editing UI for a model from its storage and business logic. The big win here is due to the fact that form validation != domain model validation. Depending on your application, you want to allow users to populate forms with more or less information than would be stored in a single model class instance, and validate those forms using their own internal validation logic, rather than delegating all validation decisions to the model class.

As a case in point, consider a simple comment form. If you add a CAPTCHA to the form, you could make your Django form class perform checking of that field, and display errors in CAPTCHA solving alongside those affecting the other comment fields, without forcing the comment model to even be aware that its views relied on such protections. Furthermore, once that form class was implemented, you could re-use it in any number of views without duplicating the display or validation logic. This is simply a better way to build views than the Rails model of helpers and shared controller methods, and I would encourage the Rails community to find some way to provide more support for structured views, especially in the areas of form handling.
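To make the comment-plus-CAPTCHA case concrete, here is a framework-free sketch of the separation I am describing. It is plain Python, not Django's actual forms API; the class names, the `expected_captcha` argument, and the error messages are all made up for illustration, though the `is_valid()` and `errors` names loosely mirror what Django's form classes provide.

```python
class Comment:
    """Domain model: knows nothing about CAPTCHAs or HTML forms."""

    def __init__(self, author, body):
        if not body.strip():
            raise ValueError("comment body is required")
        self.author = author
        self.body = body


class CommentForm:
    """Form object: validates user input, including fields the model never stores."""

    def __init__(self, data, expected_captcha):
        self.data = data
        self.expected_captcha = expected_captcha
        self.errors = {}

    def is_valid(self):
        self.errors = {}
        if not self.data.get("author", "").strip():
            self.errors["author"] = "Please tell us your name."
        if not self.data.get("body", "").strip():
            self.errors["body"] = "Please write a comment."
        # The CAPTCHA check lives only on the form; Comment is unaware of it.
        if self.data.get("captcha") != self.expected_captcha:
            self.errors["captcha"] = "CAPTCHA answer was incorrect."
        return not self.errors

    def save(self):
        if not self.is_valid():
            raise ValueError("cannot save an invalid form")
        return Comment(self.data["author"], self.data["body"])
```

Any number of views could build a `CommentForm` from submitted data, show `errors` next to the relevant fields, and call `save()` only when the form is valid, all without the model changing.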

Python modules

Ruby and Python both support modules as a namespace construct. However, much as Ruby classes default to having private instance variables and public methods, the Ruby module type is largely opaque except when used as a mixin to a class implementation. Few Ruby libraries would be written to use instances of the Module type directly; rather, they would expect either an instance of some class, or a hash mapping symbols to objects. In Python, however, modules are “open” by default, with any defined names visible both inside and outside the module — just as Python objects default to being public bags of attributes.

This leads to one very natural means to connecting applications and components in a Django project: passing modules (or module names) as first-class constructs. Want to “mount” all the URLs in a pluggable Django app into your project? Use include('myapp.urls'). Want to override the content model classes used in a CMS workflow app? Parameterize the application with an optional model namespace module, and look up the needed model classes in that module at runtime. Dependency injection, “duck typed” polymorphism, etc., can all happen at the module level, and the entire Django framework (and well-written reusable applications that use it) capitalize quite effectively on this capability.
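The "parameterize the application with a module" trick needs very little machinery. The sketch below is only an illustration of the pattern, with hypothetical helper names (`resolve_module`, `get_model`); it is not code from Django or from any particular reusable app.

```python
import importlib
import types


def resolve_module(module_or_name):
    """Accept either a module object or a dotted module name."""
    if isinstance(module_or_name, types.ModuleType):
        return module_or_name
    return importlib.import_module(module_or_name)


def get_model(models_module, class_name):
    """Look up a model class by name in whatever module the caller supplied."""
    module = resolve_module(models_module)
    return getattr(module, class_name)
```

A workflow app written this way could default to its own models module and let a project pass in a different one (or just its dotted name) to swap the content classes, with no subclassing or registry required.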

Rails wins

There are two major areas where Rails stomps Django: testing tools and database schema migrations.

Testing tools

This one should be obvious to anyone who compares recent conference talk or blog post titles from the Rails and Django worlds. For those who don’t want to click through: RailsConf had 3-4x as many testing-focused sessions, and it was mentioned in many if not most other talks. The highly-scientific Google Fight also shows a much higher amount of online discussion of Rails testing, adding further anecdotal evidence to support the argument that the Rails community is at least talking about testing a lot more than their Django counterparts.

Rails developers, as a community, have been thoroughly bitten by the testing bug, and are always on the lookout for better ways to write more copious and useful collections of tests for their applications. This has led to the development of great tools like RSpec, Shoulda, WebRat, and Cucumber for authoring tests, along with a huge supporting cast of libraries, reporting tools, and howto guides to make testing as easy as possible for Rails developers. Django has TestCase and TestClient, with a smattering of support from other Python tools like Windmill and python-nose to speed things along.

I recognize that most of the Ruby modules I linked to above are not part of the Rails core, and that there are lots of similarities in the features available for testing in either Ruby or Python. The difference is that most Rails developers I’ve talked to use the full breadth of testing tools available to them, and extol their use to others, while the Django community takes a much more lackadaisical attitude about testing outside of the Django core. (Even the bundled “contrib” apps in Django often have weak test coverage: as an example, there are zero included unit tests or doctests for django.contrib.admin in my local Django trunk checkout at r11578.)

Schema migrations

ActiveRecord migrations are not the solution for all possible database changes in real-world applications, but they cover most cases in a consistent, easy-to-learn way. Once they learn a handful of migration library methods, Rails developers can happily write clear, lightweight database manipulation routines that allow their application database to evolve as requirements change without having to resort to non-portable low-level SQL queries. This is a Good Thing ™, and worth emulating in other frameworks.

South is an entirely-reasonable implementation of a very similar model. The Django core team should adopt it as part of the platform, or implement their own simplified version. This is an obvious case where the perfect is the enemy of the good.
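For anyone who has not used either tool, the core idea is small: keep an ordered list of schema changes, record which ones a database has already seen, and apply the rest. The sketch below shows that idea with Python's sqlite3 module. It is neither the ActiveRecord nor the South API; the table names and the `migrate` function are hypothetical, and it uses raw SQL for brevity where the real libraries wrap these statements in portable helper methods.

```python
import sqlite3

# Each migration is a (version, SQL) pair applied in ascending order.
MIGRATIONS = [
    (1, "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)"),
    (2, "ALTER TABLE comments ADD COLUMN author TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply any migrations newer than the version recorded in the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, statement in sorted(migrations):
        if version > current:
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()
    return conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
```

Running `migrate` a second time against the same database does nothing, which is what makes it safe to run on every deploy.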

Conclusions

Generally, I’m pretty happy about working with Django instead of Rails these days. Whereas I spent days struggling with obscure classloading issues trying to trivially extend the Rails framework, I’ve been able to make use of the module-driven pluggability of Django to swap out cache backends, template libraries, and entire domain model namespaces in my Django apps without much more than a brief foray into the source for an external library or two. Modulo the testing issues I raised above, and the lack of a really good equivalent to ActiveMerchant, I would call myself a fairly satisfied Django user, at least until the next big thing comes along.

There are some less technical reasons to consider Django over Rails, as well. First, who doesn’t love any typically male-dominated developer community which adopts such a ridiculous mascot? Also, I attended both RailsConf and DjangoCon this year, and have to say that I enjoyed the latter quite a bit more. The difference in location for the two conferences may have altered my perceptions a bit, but I personally had more fun at DjangoCon. I also didn’t overhear anyone describe themselves as a “rockstar” there, which was just fine with me.

For all our sakes

I’ve switched jobs twice in the last 12 months. It’s certainly not unheard of in my trade to bounce around a bit, and it’s not the first time I’ve had the experience. It has, however, reminded me of many of the unique challenges associated with trying to quickly get up to speed with a body of existing code, and especially those idioms and misfeatures which most complicate the ramp-up process.

Since most of what I work on these days is web application code, the issues below will be focused there, but most of the basic concepts should hold true for most any type of programming.

So, here are my top three recommendations for anyone who expects other people to have to eventually read or maintain their webapp code:

Logic/template separation

Web developers need to be willing to switch between 4-5 languages from moment to moment: HTML, CSS, JS, SQL, and a general-purpose language for business logic. That being said, for the sake of all who will read your code after you write it (including your future self), avoid interleaving languages arbitrarily within a single logical block of code (method, source file, or module).

I still routinely see markup, Javascript, and Python/PHP/Ruby code mixed in the same source file, usually with one language nested inside a loop defined in another, emitting yet another syntax for consumption by the browser.

If you’re generating dynamic Javascript, create a JSON array which can be iterated over by plain, static JS code, rather than interpolating values directly into JS method calls. Similarly, when producing HTML, minimize the logic in the “template” sections of code. If you’re interleaving database queries, ‘foreach’ loops, and emission of <tr> and <li> elements, it will be nigh-impossible to change the business logic being used without also breaking the layout, and vice versa.
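As a rough sketch of the JSON approach, the server side can emit nothing but data and leave every line of actual Javascript in a static file. The function name and the "chart-data" element id below are hypothetical; only the json module calls are real.

```python
import json


def render_chart_data(points):
    """Emit a <script> block holding only data; static JS elsewhere reads it."""
    payload = json.dumps(points)
    # Escape "</" so a string value cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/json" id="chart-data">%s</script>' % payload
```

The static JS then parses the text content of that element and loops over the array, so changing the chart logic never means touching the Python that produced the numbers.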

Commented code blocks

If you’re using version control, there should be no need to leave large blocks of code commented out or disabled. (If you’re not using version control, stop reading this immediately and go buy a book on Subversion, Git, or Mercurial. Come back when you have your version control workflow established.) Commit messages, revision diffs, branches, and supplementary documentation (such as a team wiki, another tool in the “must have” category) should provide a sufficient amount of sideband communication about proposed or unfinished code.

If you leave large amounts of inactive code lying around, on the other hand, you’re encouraging bit-rot and cargo-cult design. Your commented-out code will not have test coverage, or be kept up to date with internal API and schema changes.

None of this applies to example code — usage tips provided in comment blocks for documentation purposes are handy, as long as they’re kept up to date. I object specifically to operational code which is disabled en masse rather than removed.

Conventional coding

Even after settling on a programming language and framework, most teams still have a lot of leeway in terms of how to structure their code. Naming conventions, whitespace, inline documentation, and module layout are usually left largely up to you. However, there are some major benefits you can realize simply by imposing some basic rules for consistent coding style across your entire project.

You should start with basic syntactic conventions: 4-space tabstops, braces in K&R style, class names in StudlyCaps, etc. At some point, you may want to make a full cleanup pass across your codebase that does nothing but enforce these standards to avoid polluting your working patches with simple readability cleanups.

From there you can move on to more semantically-meaningful rules: no mutable global variables, JavaDoc/PHPDoc/Python ReST Docstrings for all public API entry points, use bind params instead of string interpolation when building SQL queries, etc.
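The bind-params rule is the easiest of these to show. A minimal sketch with Python's sqlite3 module follows; the `users` table and the function name are assumptions made up for the example.

```python
import sqlite3


def find_user_ids(conn, name):
    """Look up users by name using a bound parameter instead of string interpolation."""
    # The "?" placeholder lets the driver handle quoting; the value is never
    # spliced into the SQL text.
    rows = conn.execute("SELECT id FROM users WHERE name = ? ORDER BY id", (name,))
    return [row[0] for row in rows]
```

A name like o'brien, or a deliberate injection attempt, is passed through as an ordinary value rather than being parsed as SQL.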

The payoff

As you move through the code making these changes, you’ll find certain regular and repeated patterns emerge as the line noise of differing style and sloppy coding evaporates. Now you’re ready to start refactoring. Lift duplicated code into utility functions, bundle those functions into classes with shared state or domain knowledge, and then arrange those classes into useful packages.

Most of the responsibility for these tasks falls squarely on the shoulders of the development team. However, management has a critical role to play as well: when your developers begin grumbling about unmaintainable code, before giving them leave to start over or abandon existing working applications, press gently to see if a bit of housekeeping like what I’ve outlined above might let them work a little longer with the current implementation.

There are any number of reasons why an old implementation can and should be abandoned — obsolete technology, dramatically changed requirements, heavy turnover — but “ugly” or “messy” code shouldn’t be sufficient justification on its own.

Open Source Bridge presentation

In case anyone stumbles here looking for the notes and examples from my Open Source Bridge talk, here they are:

osbridge_2009.zip

Note: this is a ~30MB download, since it contains (amongst other things) a full copy of JRuby 1.3.1 and the ActiveMQ runtime. The actual presentation and example code are very light.

You can also just view the talk slides, though they aren’t terribly informative without the code.

Update: video is available on blip.tv now. My apologies for the long delay while everyone downloaded the demo archive in minutes 3:00-8:15 or so.

Better late than never

I was reading a brief but interesting post surveying the current state of the art in security as programming language features, and realized that a lot of the links overlapped with the material from the paper I wrote for my security theory course a while back. Rather than re-post all of those as a blog entry, I thought I should probably just put a link to the finished PDF.

Given that this was a school paper, I hope that folks will forgive the somewhat stilted grammar and obviously-academic format. If you get nothing else from it, though, the bibliography may at least be of interest.

Never do today what you can put off ’til tomorrow

In many ways, this is a golden age for web developers: we have a bunch of good, high-level frameworks for writing apps in highly-productive dynamic languages and a solid corpus of best practices for testing, service API design, and data serialization. We don’t have to deal with dog-slow CGI scripts, complicated J2EE stacks, or proprietary ColdFusion code that only runs atop expensive application servers.

Unfortunately, all is not wine and roses (or scotch and bacon, or whatever). The major dynamic webapp frameworks push you by convention into doing the bulk of your application work synchronously in the request-processing loop, rather than asynchronously in a background thread. All of the accumulated wisdom about building responsive graphical user interfaces gets thrown out and re-discovered by each framework’s user community, resulting in a multitude of solutions for the basic problem of pushing work into a queue and dealing with it later.

As the fine folks at Twitter so famously discovered, synchronous processing puts a hard upper limit on how much (and how quickly) you can scale an application. Even at the much more modest loads my current project at work receives, there are quite a few performance problems that can’t be solved by simply throwing more stuff in memcached and hoping for the best.

Some folks are starting to catch on, and bake asynchronous processing into their frameworks by default, but the solutions tend to either be limited to very particular deployment and application models, or esoteric in the extreme. Meanwhile, desktop application authors continue to politely chuckle at all of our bumbling, and old-skool enterprise developers look at our hackish background-worker implementations and (rightly) consider them to be toys compared to the classic “big boy” message queueing solutions, or even the newer open source alternatives.

The next generation of web application frameworks should be designed around the idea that work is done asynchronously by default, with a fallback to synchronous jobs only in cases where a user needs to see the result immediately. Since applications also need to scale across a potentially large and heterogeneous set of CPUs and servers, those delayed jobs also may not be running in the same memory space as the web application itself. That means machine and language-agnostic serialization, fast network IPC, and callback and event-driven programming.
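The smallest version of "push work into a queue and deal with it later" looks something like the sketch below. It is an in-process toy built on the standard library, with a hypothetical `JobQueue` class; it does none of the cross-machine serialization or network IPC described above, and it is exactly the kind of background worker a real message queue would replace.

```python
import queue
import threading


class JobQueue:
    """Run submitted callables on a single background worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, func, *args, on_done=None):
        """Queue func(*args) and return immediately; on_done receives the result."""
        self._jobs.put((func, args, on_done))

    def wait(self):
        """Block until every queued job has finished (useful in tests)."""
        self._jobs.join()

    def _run(self):
        while True:
            func, args, on_done = self._jobs.get()
            try:
                result = func(*args)
                if on_done is not None:
                    on_done(result)
            except Exception:
                # A real queue would log or retry; this sketch just moves on.
                pass
            finally:
                self._jobs.task_done()
```

A request handler would call `submit()` and return a response right away, while the callback takes care of whatever needs to happen once the slow work completes.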

Developers who grok these concepts now will have a leg up on the competition when building tomorrow’s crop of web applications.

Two steps forward, one step back

Once upon a time, there was RCS, and then CVS. They tracked normal edits to a set of text files reasonably well, and coupled with telnet or ssh, even made it relatively straightforward for a trusted group of collaborators to share their changes with each other. Some people used other proprietary tools (Perforce, Visual Source Safe, etc.) but they tended to be either a) expensive b) really, really lousy or c) both. Among the open source crowd, at least, CVS dominated the version control space for many years.

Then came Subversion. It improved on many of the failings of CVS — notably, Windows support was dramatically better, repositories could be shared over HTTP, and many operations that just didn’t work in CVS (renames, binary diffs, etc.) performed reasonably well out of the box. To this day, Subversion is a reasonable choice for many projects, especially given the advanced level of support for it in IDEs, graphical repository browsers, and the like.

Much of the reason for that diversity of useful tooling built atop Subversion, of course, is that it was written in C, and built with an eye towards allowing high-level languages to use bindings into the same runtime libraries upon which the ’svn’ command itself relied. In fact, Python, Perl, Java, and Ruby are all supported by the core Subversion maintainers, and additional bindings using those same underlying libraries are available for a number of other languages.

Enter the distributed version control systems: Git, Mercurial, Bazaar, Darcs, and their ilk. The basic workflow they offer is in some ways more like RCS than it is Subversion: each developer works locally against their own copy of a repository, and they share their work via patch files and periodic synchronization. (This is of course a gross over-simplification, as all of them offer much more sophisticated change-tracking under the hood than RCS did, but the user-visible behavior is still reminiscent.) However, their ability to maintain change history across many developers and systems without forcing everyone to eventually squash their work down into a single source tree makes a number of new modes of project management possible, or at least much easier than before.

All of the above DVCS systems potentially offer a huge gain of productivity for many developers, since you can easily experiment with changes locally, selectively share only those modifications which you wish to, and continue working without being connected to the central repository. (This is especially significant for those whose employers maintain draconian firewall rules and disallow off-site access to their source control.)

Unfortunately, none of the popular DVCS systems have anything resembling the level of cross-language API support that Subversion does. Mercurial and Bazaar are both implemented in Python, making access from other Python code quite fast, and that from any other language painfully slow. Git is implemented in C, but without a supported and documented core library of functions designed to be used to facilitate access from other languages. Darcs is written in Haskell, which means only crazy mathematicians and CS majors have any ability or interest in using it. (I’m kidding here, but the point remains that Haskell isn’t exactly the most useful substrate for scripting language bindings.)

The fallout from all of this is that we’re left using wrapper libraries which fork out to the command-line tools for each DVCS. Such wrappers have a number of problems: the performance sucks, the internal APIs are usually only as robust as the set of regular expressions you write to parse the output of the commands, and almost no work is shared between the various wrapper implementations.
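For anyone who has not had the pleasure, this is roughly what such a wrapper looks like. It is an illustrative sketch, not code from any real wrapper library: the function names are made up, and the pattern assumes the default `git log` output, where each entry starts with a line of the form "commit <hash>".

```python
import re
import subprocess

# Matches lines such as "commit 1a2b3c..." in default `git log` output.
COMMIT_LINE = re.compile(r"^commit ([0-9a-f]{7,40})$", re.MULTILINE)


def parse_commit_ids(log_text):
    """Scrape commit hashes out of console output from `git log`."""
    return COMMIT_LINE.findall(log_text)


def commit_ids(repo_path):
    """Fork out to the git binary and scrape what it prints."""
    output = subprocess.run(
        ["git", "log"],
        cwd=repo_path,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_commit_ids(output)
```

The brittleness is easy to trigger: if a user's configuration adds a decoration such as "(HEAD -> master)" after the hash, the pattern silently stops matching those lines, and nothing fails until the results come up short.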

Don’t get me wrong: as a simple version control tool, I’ve found Git in particular (and distributed version control in general) to be a big step up from the old centralized-repository model. However, the very eighties-esque fork-and-regexp-scrape model for IPC — coupled with the lack of an obvious “best of breed” leader in the DVCS space — means that I (along with anyone else trying to support DVCS in a general-purpose way) end up doing a lot of low-level grunt work when we could be building real value for users.

Even something as simple as a standard dump format for a common subset of the information available from the popular DVCS types would be a start. I do know that, for the time being, I’m stuck supporting a bunch of very brittle code which relies on the various idiosyncratic console output formats of each version-control system.

Playing prognosticator, I would even go so far as to suggest that the first DVCS system to provide supported, documented interfaces in a number of popular programming languages could climb to the top of the dogpile that exists currently and emerge as a clear standard.

Inauguration playlist

We had a little family dance party in the street (no, seriously, we did — pictures forthcoming) after re-watching all the coverage of the inauguration tonight.

Our playlist:

  1. The Payback — James Brown
  2. Song 2 — Blur
  3. Gone Daddy Gone — Gnarls Barkley
  4. The Yeah Yeah Yeah Song — The Flaming Lips
  5. The Golden Path (Ewan Pearson Extended Vocal) — The Chemical Brothers Featuring The Flaming Lips
  6. Paper Planes — M.I.A.
  7. All My Friends — LCD Soundsystem
  8. My People — The Presets
  9. Fear Not Of Man — Mos Def
  10. Work On You — MSTRKRFT
  11. Rawnald Gregory Erickson the Second — Starfucker
  12. My Favorite Things — Outkast

The theme is obvious, but we all enjoyed the hell out of it.

Daily git-svn

My team at Sun uses Subversion to host our “authoritative” source repository for Project Kenai. However, since most work is done on the trunk, many of us find it more convenient to work locally with Git, using the git svn subcommand heavily to keep ourselves up-to-date without interfering with others’ work.

When I first started using this combo, I had some early trouble keeping my local Git repository from getting horribly b0rked whenever there were edits made to the same files I had been working on locally. Having used CVS and Subversion for so long, I initially assumed such conflicts (and the manual merge steps they entailed) were simply part of the equation, even when working with a proper DVCS. However, by applying a little more discipline to my use of local branches, I’ve been able to basically eliminate manual merges, except in cases where the exact same line has literally been edited by multiple people.

My first, most critical discovery was to never use the fetch command. Instead, use rebase. Second, never pull the latest changes from Subversion into a working feature branch; instead, switch to your master branch, create a new branch for merging (I usually call mine “svn-merge”), and do your rebase there. After the rebase has finished, merge in your feature branch changes, and then use dcommit to push your changes upstream.

As an example, here are the commands I would use to check out a new local Git clone of the main Subversion repository, work on a single change, and then push it back into SVN:

viper:Work$ git svn clone https://example.com/svn/repo/trunk -r500:HEAD repo
# ... lots of Git output here ...
viper:Work$ cd repo
viper:repo$ git checkout -b issue-123
# ... hack, hack, hack...
viper:repo$ git commit -m "fix for issue #123"
viper:repo$ git checkout master
viper:repo$ git checkout -b svn-merge
viper:repo$ git svn rebase
# ... watch results for conflicts ...
viper:repo$ git merge --squash issue-123
viper:repo$ git commit -m "ISSUE-123: fixed"
viper:repo$ git svn dcommit -e
# ... $EDITOR launches, allows you to write useful commit message for svn ...
viper:repo$ git checkout master
viper:repo$ git merge svn-merge

This may seem like a lot of extra branch switching, localized commits, etc., but the end result has been worth it (for me, at least). If you follow this process, you can be relatively certain that your master branch will only ever mirror changes that have been made in Subversion.

Ensuring that the master branch is always “clean” (i.e., has no conflicting commits) with regard to the shared svn tree makes it easy to switch temporarily to another feature branch if you have an urgent bugfix or simple change to make, while your bigger changes happily sit on another feature branch waiting to be pushed.

Updated Mar. 4, 2009: Changed merge to use --squash option, so that many local Git commits can be combined into a single upstream revision.