Tag Archive for 'code'

Two steps forward, one step back

Once upon a time, there was RCS, and then CVS. They tracked normal edits to a set of text files reasonably well, and coupled with telnet or ssh, even made it relatively straightforward for a trusted group of collaborators to share their changes with each other. Some people used other, proprietary tools (Perforce, Visual SourceSafe, etc.) but they tended to be either a) expensive, b) really, really lousy, or c) both. Among the open source crowd, at least, CVS dominated the version control space for many years.

Then came Subversion. It improved on many of the failings of CVS — notably, Windows support was dramatically better, repositories could be shared over HTTP, and many operations that just didn’t work in CVS (renames, binary diffs, etc.) performed reasonably well out of the box. To this day, Subversion is a reasonable choice for many projects, especially given the advanced level of support for it in IDEs, graphical repository browsers, and the like.

Much of the reason for that diversity of useful tooling built atop Subversion, of course, is that it was written in C, and built with an eye towards allowing high-level languages to use bindings into the same runtime libraries upon which the ‘svn’ command itself relied. In fact, Python, Perl, Java, and Ruby are all supported by the core Subversion maintainers, and additional bindings using those same underlying libraries are available for a number of other languages.

Enter the distributed version control systems: Git, Mercurial, Bazaar, Darcs, and their ilk. The basic workflow they offer is in some ways more like RCS than it is like Subversion: each developer works locally against their own copy of a repository, and they share their work via patch files and periodic synchronization. (This is of course a gross over-simplification, as all of them offer much more sophisticated change-tracking under the hood than RCS did, but the user-visible behavior is still reminiscent.) However, their ability to maintain change history across many developers and systems without forcing everyone to eventually squash their work down into a single source tree makes a number of new modes of project management possible, or at least much easier than before.

All of the above DVCS systems potentially offer a huge gain of productivity for many developers, since you can easily experiment with changes locally, selectively share only those modifications which you wish to, and continue working without being connected to the central repository. (This is especially significant for those whose employers maintain draconian firewall rules and disallow off-site access to their source control.)

Unfortunately, none of the popular DVCS systems have anything resembling the level of cross-language API support that Subversion does. Mercurial and Bazaar are both implemented in Python, making access from other Python code quite fast, and that from any other language painfully slow. Git is implemented in C, but without a supported and documented core library of functions designed to be used to facilitate access from other languages. Darcs is written in Haskell, which means only crazy mathematicians and CS majors have any ability or interest in using it. (I’m kidding here, but the point remains that Haskell isn’t exactly the most useful substrate for scripting language bindings.)

The fallout from all of this is that we’re left using wrapper libraries which fork out to the command-line tools for each DVCS. Such wrappers have a number of problems: the performance sucks, the internal APIs are usually only as robust as the set of regular expressions you write to parse the output of the commands, and almost no work is shared between the various wrapper implementations.
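To make the brittleness concrete, here is a minimal sketch of the fork-and-scrape pattern in Ruby. The names LogEntry, parse_git_log, and git_log are made up for illustration, and the regular expressions assume the default English layout of git log output, which is exactly the kind of assumption that breaks when a tool changes its console format.

```ruby
# Hypothetical wrapper: shells out to `git log` and scrapes its default output.
LogEntry = Struct.new(:sha, :author, :message)

# Parse the default `git log` text layout into LogEntry structs.
def parse_git_log(text)
  entries = []
  current = nil
  text.each_line do |line|
    case line
    when /\Acommit ([0-9a-f]{7,40})/
      # A new commit header starts a new entry.
      current = LogEntry.new($1, nil, String.new)
      entries << current
    when /\AAuthor:\s+(.*?)\s+</
      current.author = $1 if current
    when /\A {4}(.*)$/
      # Message lines are indented by four spaces in the default format.
      current.message << $1 << "\n" if current
    end
  end
  entries.each { |entry| entry.message.strip! }
  entries
end

# Fork the command-line tool (one process per call) and scrape the result.
def git_log(repo_dir)
  parse_git_log(IO.popen(["git", "log"], chdir: repo_dir, &:read))
end
```

Every other DVCS needs its own version of parse_git_log with a different set of patterns, which is why so little of this work can be shared.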

Don’t get me wrong: as a simple version control tool, I’ve found Git in particular (and distributed version control in general) to be a big step up from the old centralized-repository model. However, the very eighties-esque fork-and-regexp-scrape model for IPC — coupled with the lack of an obvious “best of breed” leader in the DVCS space — means that I (along with anyone else trying to support DVCS in a general-purpose way) end up doing a lot of low-level grunt work when we could be building real value for users.

Even something as simple as a standard dump format for a common subset of the information available from the popular DVCS types would be a start. I do know that, for the time being, I’m stuck supporting a bunch of very brittle code which relies on the various idiosyncratic console output formats of each version-control system.

Playing prognosticator, I would even go so far as to suggest that the first DVCS system to provide supported, documented interfaces in a number of popular programming languages could climb to the top of the dogpile that exists currently and emerge as a clear standard.

Daily git-svn

My team at Sun uses Subversion to host our “authoritative” source repository for Project Kenai. However, since most work is done on the trunk, many of us find it more convenient to work locally with Git, using the git svn subcommand heavily to keep ourselves up-to-date without interfering with others’ work.

When I first started using this combo, I had some early trouble keeping my local Git repository from getting horribly b0rked whenever there were edits made to the same files I had been working on locally. Having used CVS and Subversion for so long, I initially assumed such conflicts (and the manual merge steps they entailed) were simply part of the equation, even when working with a proper DVCS. However, by applying a little more discipline to my use of local branches, I’ve been able to basically eliminate manual merges, except in cases where the exact same line has literally been edited by multiple people.

My first, most critical discovery was to never use the fetch command. Instead, use rebase. Second, never pull the latest changes from Subversion into a working feature branch; instead, switch to your master branch, create a new branch for merging (I usually call mine “svn-merge”), and do your rebase there. After the rebase has finished, merge in your feature branch changes, and then use dcommit to push your changes upstream.

As an example, here are the commands I would use to check out a new local Git clone of the main Subversion repository, work on a single issue, and then push it back into SVN:

viper:Work$ git svn clone https://example.com/svn/repo/trunk -r500:HEAD repo
# ... lots of Git output here ...
viper:Work$ cd repo
viper:repo$ git checkout -b issue-123
# ... hack, hack, hack...
viper:repo$ git commit -m "fix for issue #123"
viper:repo$ git checkout master
viper:repo$ git checkout -b svn-merge
viper:repo$ git svn rebase
# ... watch results for conflicts ...
viper:repo$ git merge --squash issue-123
viper:repo$ git commit -m "ISSUE-123: fixed"
viper:repo$ git svn dcommit -e
# ... $EDITOR launches, allows you to write useful commit message for svn ...
viper:repo$ git checkout master
viper:repo$ git merge svn-merge

This may seem like a lot of extra branch switching, localized commits, etc., but the end result has been worth it (for me, at least). If you follow this process, you can be relatively certain that your master branch will only ever mirror changes that have been made in Subversion.

Ensuring that the master branch is always “clean” (i.e., has no conflicting commits) with regard to the shared svn tree makes it easy to switch temporarily to another feature branch if you have an urgent bugfix or simple change to make, while your bigger changes happily sit on another feature branch waiting to be pushed.
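For anyone who wants to script the sequence, here is a rough Ruby sketch that builds the same list of commands used in the example. The helper names svn_push_commands and run_all are hypothetical, “svn-merge” is just my own branch-naming habit, and I have left off the -e flag on dcommit so the script can run without launching an editor. It also assumes no svn-merge branch is left over from an earlier run.

```ruby
# Hypothetical helper: returns the git commands (as argv arrays) that push
# one feature branch to Subversion by way of a throwaway merge branch.
def svn_push_commands(feature_branch, message, merge_branch = "svn-merge")
  [
    %w[git checkout master],
    ["git", "checkout", "-b", merge_branch],
    %w[git svn rebase],
    ["git", "merge", "--squash", feature_branch],
    ["git", "commit", "-m", message],
    %w[git svn dcommit],
    %w[git checkout master],
    ["git", "merge", merge_branch],
  ]
end

# Run each command in order and stop at the first failure, so a conflicted
# rebase or merge is never followed by a dcommit.
def run_all(commands)
  commands.each do |cmd|
    system(*cmd) or abort("failed: #{cmd.join(' ')}")
  end
end

# Usage (inside a git-svn clone):
#   run_all(svn_push_commands("issue-123", "ISSUE-123: fixed"))
```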

Updated Mar. 4, 2009: Changed merge to use --squash option, so that many local Git commits can be combined into a single upstream revision.

Killfile 2.0

There’s been a persistent blog-wank-fest making the rounds over the last few weeks about the state of the Ruby community: whether it’s become more or less “fun”, “creative”, etc. I’m not going to reward any of the participants with a link, but I do offer the following balm if you, like me, are a bit sick of hearing about it:

%w(open-uri rss resolv uri rubygems hpricot).each {|lib| require lib }

blog_url = ARGV.shift

blog_hdoc = Hpricot.parse(open(blog_url))
rss_links = blog_hdoc / :head / 'link[@type="application/rss+xml"]'

feed_rss = RSS::Parser.parse(rss_links.first['href'])

rants = feed_rss.items.select {|i| t = i.title; t =~ /rant/i && t =~ /ruby/i }

if rants.empty?
  puts "Okay, you get a pass."
else
  puts "Bad blogger! No biscuit!"
  hostname = URI.parse(blog_url).host
  ip_addr = Resolv.getaddress(hostname)
  `sudo route add -host #{ip_addr} gw 127.0.0.1`
end

Document replication: CouchDB vs. DVCS

My friend (and CouchDB committer) Chris just posted an excellent overview of the application-hosting potential of CouchDB on his blog. My first response was: okay, you’ve convinced me. Post-election, I’m porting the minimal Sinatra app backing Misfict to CouchDB, since it’s really just a minimal JSON storage engine at its core.

My second reaction was to find it a bit funny to see E4X making an appearance in this day and age; as with most XML-centric tech, I had assumed that the coming of JSON and YAML had more or less killed it, at least amongst the web-dev early adopters. I guess it just goes to show that everything old is new again, especially in the fast-moving world of web development tools.

Regardless, perhaps the most compelling picture Chris paints in his post is the idea of capitalizing on the off-line replication features of CouchDB to allow groups of people to separately work on a collection of documents, then merge their changes together at some point in the future. He leans heavily on a classroom metaphor, but I think the real potential may be more in the area of groupware and collaborative editing. Knowledge workers have been looking for the “holy grail” tool which combines the power of Word’s “track changes” with mixed on- and off-line authoring for a long time, and I think we’re finally building the infrastructure that will make that class of application relatively easy to build.

Looking over the CouchDB documentation, though, I still think there’s one major piece missing from their replication and conflict-resolution story: automatic merging of non-conflicting edits. Unlike a DVCS like Git, CouchDB still doesn’t (AFAIK) allow multiple contributors to edit different elements of a single document, and then commit those changes, without manually replaying edits from other contributors.

Since JSON is much more structured than raw text (which Git and other DVCS systems deal with handily enough), it seems tractable to examine potentially conflicting updates and to see if they’re isolated to different child nodes of the JSON document. Furthermore, given the degree to which CouchDB has already embraced the map/reduce model, I think you should be able to distill the conflict-resolution algorithm down to two steps: generate a “diff” in the map step, which just notes the original document ID and the changed attribute/subtree elements, and then a “reduce” which attempts to create a new document by applying those changes to the original document.
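As a rough illustration of that idea, here is a sketch in plain Ruby. It is not CouchDB’s actual map/reduce API; the function names diff_doc and merge_docs are invented for this example, and it only compares top-level keys, where a real version would recurse into subtrees.

```ruby
require 'json'

# "Map" step: note which top-level keys a revision changed relative to the
# common ancestor. Removed keys are recorded with the marker :deleted.
def diff_doc(base, revision)
  (base.keys | revision.keys).each_with_object({}) do |key, changes|
    next if base[key] == revision[key]
    changes[key] = revision.key?(key) ? revision[key] : :deleted
  end
end

# "Reduce" step: apply every diff to the ancestor. Returns [merged, conflicts].
# A key is a conflict when two revisions changed it to different values; the
# earliest revision's value is kept in that case.
def merge_docs(base, *revisions)
  merged = base.dup
  applied = {}
  conflicts = []
  revisions.map { |rev| diff_doc(base, rev) }.each do |changes|
    changes.each do |key, value|
      if applied.key?(key) && applied[key] != value
        conflicts << key
        next
      end
      applied[key] = value
      if value == :deleted
        merged.delete(key)
      else
        merged[key] = value
      end
    end
  end
  [merged, conflicts.uniq]
end

base  = JSON.parse('{"title": "Draft", "body": "Hello", "tags": ["a"]}')
alice = base.merge("title" => "Final")      # edits one attribute
bob   = base.merge("tags" => ["a", "b"])    # edits a different attribute
merged, conflicts = merge_docs(base, alice, bob)
puts JSON.generate(merged)
puts "conflicts: #{conflicts.inspect}"
```

Because Alice and Bob touched different attributes, both edits land in the merged document and the conflict list stays empty; only edits to the same attribute would need a human.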

Regardless, I think it’s an interesting time to be involved in web development. The idea that you could grab just a subset of a larger data store, work with it both on- and off-line, then share your changes with a group of colleagues is a powerful one, and I applaud anyone (like Chris and the rest of the CouchDB team) working to make it possible.

Big changes

So, I’ve been sitting on this for a while now, but finally get to make a wider announcement, now that the “i”s have been dotted and the “t”s have been crossed:

I’m leaving Reed College in a few weeks, and starting a new job at Sun Microsystems to work on Project Kenai. It’s a big change for me — I’ve been hiding out in academia for almost four years now, so switching back into the commercial world is both exciting and scary.

Kenai is a fascinating project, which I hope to talk about a lot more in the near future. I can say already that it’s one of the more ambitious JRuby on Rails projects out there, and that I’m excited to see what we can do with the full Sun hardware + open source software stack underpinning a high-volume Rails site. In addition, I’m going to get the chance to work more closely on UI and interaction design, which is an area in which I look forward to expanding and updating my skills.

Reed has been a great place to work, and I can’t say enough good things about everyone else in the IT organization here. That being said, I’m really psyched about getting to focus almost entirely on writing code and implementing features, and working in a small, distributed group within the larger Sun umbrella.

New toy: misfict

Being home alone with a head cold doesn’t leave one with a lot of excuses not to knock off a quick project. I had been mulling over the idea of building a version of the classic “storytime” party game as a webapp for a long time, and since I also wanted to spend a little more time working with jQuery’s AJAX and JSON support, it seemed reasonable to tackle both at the same time.

So, without further ado, I present misfict, the micro-serial-fiction engine. The process is simple: read the last line someone else wrote, then post your own idea for the next sentence in the story. Eventually, we should end up with a lovely stream-of-consciousness story co-authored by anyone who cares to drop a few words into the bucket.

I may build in some sort of cap for the number of sentences before a story is finished, or periodically declare a “chapter break”, but for the time being, the story will keep going as long as anyone is writing.

PS. any perceived relation between the release of this project and the upcoming start of NaNoWriMo is strictly a coincidence.

PPS. If you’re interested, all the code is available on GitHub.

PPPS. (last one, I promise) I got the misfict.com domain, so the link above has been corrected to point there. Also, there’s an RSS feed. Now, go write something.

There are security holes, and security holes…

I was reviewing a Perl CGI script a co-worker sent to me for troubleshooting last week, and came across this little gem (excerpted but not changed in any meaningful way):

use CGI;
use LWP;

my $ua = LWP::UserAgent->new;
my $req = new CGI;

my $res_id = $req->param('rid');
my $img = $req->param('img');
my $url = "http://somehost/cgi-bin/fetch.cgi?id=$res_id";

$ua->get($url, ':content_file' => $img);

open FH, $img;
unlink $img;

print $req->header(-type=>'application/octet-stream');

while (<FH>) {
        print $_;
}
close FH;

How horribly bad is this script? Well, it allows no less than the deletion/overwriting of any file writable by the web server user. While that won’t allow injection of shellcode under most configurations, it would allow an attacker to delete logfiles, insert malicious replacements to files in upload directories, and generally mess with your system in all kinds of ways.

Even better, it completely misuses the Content-Type HTTP header to force a download rather than an inline view, when the semantically appropriate way to make a download dialog box appear on the client is Content-Disposition: attachment.

There are doubtless millions of lines of code like this out there in the world, and (at least in Perl-land) almost all of them could be caught with the simple addition of the -T (“taint check”) flag to the #!/usr/bin/perl line at the top of the script.
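For comparison, here is a small Ruby sketch of the two fixes discussed above: never let a request parameter choose a filesystem path, and use Content-Disposition to trigger the download. The names safe_download_name and download_headers are hypothetical, and the whitelist of allowed characters is an assumption you would adjust for your own application.

```ruby
# Reduce an untrusted filename to a harmless display name. The result is only
# ever sent back to the browser; it is never used to open or delete a file.
def safe_download_name(untrusted)
  # Treat backslashes as path separators too, then drop any directory part.
  name = File.basename(untrusted.to_s.gsub("\\", "/"))
  # Replace anything outside a conservative whitelist, strip leading dots.
  name = name.gsub(/[^A-Za-z0-9._-]/, "_").sub(/\A\.+/, "")
  name.empty? ? "download" : name
end

# Headers that ask the client to save the response rather than display it,
# while still reporting the real media type.
def download_headers(untrusted_name, content_type)
  {
    "Content-Type"        => content_type,
    "Content-Disposition" => %(attachment; filename="#{safe_download_name(untrusted_name)}"),
  }
end
```

The fetched bytes themselves would go into a server-chosen temporary file (Ruby’s Tempfile, for instance), so the client-supplied name never reaches the filesystem at all.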

Erlang warts

After more than a year of complaining about the syntax, I’m forcing myself to finally sit down and learn some Erlang. Between CouchDB, ejabberd, and all the other interesting projects people are implementing in Erlang, I would be remiss as a systems engineer to not at least pick up the basics.

Unfortunately, I’m still chafing a bit at a number of little annoyances:

  • The REPL is basically crippled since you can’t define functions. Being forced to think in terms of compilation units (rather than simple expressions) pisses me off.
  • Why oh why do I need to explicitly list the module name in my file header if I’m also bound by the restriction that filenames and module names have to be the same? The old Java package/file path ties were always a big annoyance when I was stuck in that environment.
  • For a functional language, there’s an awful lot of syntactic vinegar for basic operations like map and fold. I appreciate having a concise syntax for lambdas, but writing fun my_function/2 smells a bit.
  • Records (as syntactic sugar for tuples) are a poor substitute for a real type system. Both tutorial and real-world Erlang code I’ve seen is basically full of tagged tuples, which means you get the verbosity of a strongly-typed language without any of the ability of real type checking to catch errors at compilation time.

I want to stick with it long enough to find the real gems underneath all this noise. I mean, if I can sit through extended sessions reading and writing Perl, I should be able to find something to love about Erlang. Furthermore, most of the complaints above don’t even apply to mainstream languages (C and Java don’t have a REPL or lambdas, and Ruby and Perl don’t have anything resembling a traditional compiler), so it’s not as though the alternatives are miraculously better.

I definitely think that learning a new language should make you feel a little bit uncomfortable. Unfortunately, right now Erlang leaves me feeling uncomfortable in all the wrong ways: I understand everything that’s going on with the language, and just don’t like it.

I’m going to keep plugging away for at least a little bit longer, though. Next up: reading the source to ejabberd to (hopefully) get a sense for idiomatic language use in a context where its unique features (lightweight concurrency + distributed computing) are a real advantage.

Mixins without monkeypatching

Anyone who’s done much Ruby metaprogramming (or even just skimmed the source code to Rails) should be intimately familiar with the following idiom:

klass = Foo::Bar
# ...
klass.extend(Some::Module)

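This is a programmatic mixin: it adds the methods defined in Some::Module to the class object Foo::Bar (as class-level methods) for the duration of the current Ruby process. The change is visible to every caller of Foo::Bar, whether or not they live in a scope known to the caller of extend; the sibling idiom klass.send(:include, Some::Module) does the same for all instances of Foo::Bar.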

It may not look like it, but this is yet another case of monkeypatching. The only reason you don’t see it being denounced up and down the ‘tubes like, say, overriding method_missing on NilClass is that it’s usually confined to an application or framework-specific class, like ActiveRecord::Base.

Now, I’m all for the judicious use of powerful language constructs like open classes, but they can be a problem for large-scale projects, or those where collaboration between team members is less-than-perfect. The global and persistent scope of a class-level #extend call like the above can cause unexpected side-effects, too.

As an alternative, I humbly propose the use of instance-scoped extensions. In cases where not all instances of a class may need the additional functionality provided by the mixin, try calling #extend on just that instance. It keeps your namespace clean, doesn’t introduce as many potential pitfalls for code in other scopes, and is reversible: just nil your current reference to the object, and re-create it without the mixin.

Here’s an example, cribbed from some refactoring I’m doing of a web service implementation:

def change_password
  token = AuthToken.find(params[:token])
  active_user = Account.find(token.identity)
  password = params[:password]

  raise ArgumentError, "passwords did not match" unless password == params[:confirm_password]

  if token.has_privilege?(:password_reset)
    active_user.extend(PrincipalManagement)
    active_user.change_password(password)
  end

  render :xml => active_user.to_xml
end

One reason this is especially handy for me is that I can re-use the same core model classes (like Account) in different applications, only some of which may have the privileges necessary to change passwords, delete accounts, etc.

By limiting the mixin to a single model instance within the scope of a single request, I also protect myself from coding errors that might expose dangerous mutators to untrusted callers. Any attempt to call change_password (or a similarly-restricted method) from outside a scope which explicitly included the mixin will raise a NoMethodError, without any additional access-control checks on my part.
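To see the scoping in isolation, here is a stripped-down sketch you can paste into irb. The Account class and PrincipalManagement module below are simplified stand-ins written for this example, not the real model code.

```ruby
# Simplified stand-in for the real model class (an assumption for this demo).
class Account
  attr_reader :password

  def initialize(password)
    @password = password
  end
end

# Simplified stand-in for the mixin that carries the dangerous mutator.
module PrincipalManagement
  def change_password(new_password)
    @password = new_password
  end
end

trusted   = Account.new("old")
untrusted = Account.new("old")

# Only this one object gains the mutator; the Account class is untouched.
trusted.extend(PrincipalManagement)
trusted.change_password("new")

begin
  untrusted.change_password("hacked")
rescue NoMethodError => e
  puts "blocked: #{e.class}"
end
```

The second object never sees change_password, and neither does any Account created later in the same process.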

Happy thoughts

I find it hard to be optimistic on Mondays. Sometimes, I just need a reminder that everything is going to be okay.

So, I put the following code snippet at the end of my ~/.bashrc file:

ruby -rdate -e'puts "#{(Date.new(2009, 1, 20) - Date.today).to_i} days left until Bush leaves office."'

Now I smile every time I open a terminal on my work machine.

Update: Ten minutes of late-afternoon golfing later, I’m left with this little beauty:

ruby -e'T=Time;puts "%s days, %s hours, %s minutes, and %s seconds left."%[86400,3600,60].inject([(T.local(2009,1,20)-T.now).to_i]){|a,v|p=a.pop;a+=[p/v,p%v]}'
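For anyone squinting at that one-liner, here is an ungolfed sketch of the same arithmetic. The method name countdown_parts is made up for this example; the logic is just repeated division, peeling off days, then hours, then minutes, and leaving the seconds as the final remainder.

```ruby
# Split a number of seconds into [days, hours, minutes, seconds] by dividing
# out the largest unit first and carrying the remainder down to the next one.
def countdown_parts(total_seconds)
  [86_400, 3_600, 60].inject([total_seconds]) do |parts, unit|
    remainder = parts.last
    # Keep the units already computed, then append quotient and new remainder.
    parts[0...-1] + remainder.divmod(unit)
  end
end

days, hours, minutes, seconds = countdown_parts(90_061)
puts "#{days} days, #{hours} hours, #{minutes} minutes, and #{seconds} seconds left."
```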