Tag Archive for 'long'

Two steps forward, one step back

Once upon a time, there was RCS, and then CVS. They tracked normal edits to a set of text files reasonably well, and coupled with telnet or ssh, even made it relatively straightforward for a trusted group of collaborators to share their changes with each other. Some people used other proprietary tools (Perforce, Visual SourceSafe, etc.) but they tended to be either a) expensive b) really, really lousy or c) both. Among the open source crowd, at least, CVS dominated the version control space for many years.

Then came Subversion. It improved on many of the failings of CVS — notably, Windows support was dramatically better, repositories could be shared over HTTP, and many operations that just didn’t work in CVS (renames, binary diffs, etc.) performed reasonably well out of the box. To this day, Subversion is a reasonable choice for many projects, especially given the advanced level of support for it in IDEs, graphical repository browsers, and the like.

Much of the reason for that diversity of useful tooling built atop Subversion, of course, is that it was written in C, and built with an eye towards allowing high-level languages to use bindings into the same runtime libraries upon which the ’svn’ command itself relied. In fact, Python, Perl, Java, and Ruby are all supported by the core Subversion maintainers, and additional bindings using those same underlying libraries are available for a number of other languages.

Enter the distributed version control systems: Git, Mercurial, Bazaar, Darcs, and their ilk. The basic workflow they offer is in some ways more like RCS than it is Subversion: each developer works locally against their own copy of a repository, and they share their work via patch files and periodic synchronization. (This is of course a gross oversimplification, as all of them offer much more sophisticated change-tracking under the hood than RCS did, but the user-visible behavior is still reminiscent.) However, their ability to maintain change history across many developers and systems without forcing everyone to eventually squash their work down into a single source tree makes a number of new modes of project management possible, or at least much easier than before.

All of the above DVCS systems potentially offer a huge gain of productivity for many developers, since you can easily experiment with changes locally, selectively share only those modifications which you wish to, and continue working without being connected to the central repository. (This is especially significant for those whose employers maintain draconian firewall rules and disallow off-site access to their source control.)

Unfortunately, none of the popular DVCS systems have anything resembling the level of cross-language API support that Subversion does. Mercurial and Bazaar are both implemented in Python, making access from other Python code quite fast, and that from any other language painfully slow. Git is implemented in C, but without a supported and documented core library of functions designed to be used to facilitate access from other languages. Darcs is written in Haskell, which means only crazy mathematicians and CS majors have any ability or interest in using it. (I’m kidding here, but the point remains that Haskell isn’t exactly the most useful substrate for scripting language bindings.)

The fallout from all of this is that we’re left using wrapper libraries which fork out to the command-line tools for each DVCS. Such wrappers have a number of problems: the performance sucks, the internal APIs are usually only as robust as the set of regular expressions you write to parse the output of the commands, and almost no work is shared between the various wrapper implementations.
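
To make that brittleness concrete, here is a minimal sketch of the kind of wrapper I mean. The method names (parse_oneline_log, recent_commits) are hypothetical, and the sketch assumes a git executable on the PATH whose `git log --oneline` output looks like "<abbreviated sha> <subject>" on each line:

```ruby
require 'open3'

# Matches one line of `git log --oneline` output: an abbreviated hex SHA,
# whitespace, then the commit subject. If the output format ever changes,
# lines simply stop matching and are dropped without any error.
ONELINE_PATTERN = /\A([0-9a-f]{4,40})\s+(.*)\z/

# Hypothetical helper: turns raw console output into an array of hashes.
def parse_oneline_log(output)
  output.each_line.map do |line|
    match = ONELINE_PATTERN.match(line.chomp)
    match && { :sha => match[1], :subject => match[2] }
  end.compact
end

# Hypothetical helper: forks a git process (assumed to be on the PATH)
# inside repo_path and scrapes what it prints.
def recent_commits(repo_path, limit = 5)
  output, status = Open3.capture2('git', 'log', '--oneline', '-n', limit.to_s,
                                  :chdir => repo_path)
  raise "git log failed in #{repo_path}" unless status.success?
  parse_oneline_log(output)
end
```

Every call pays for a process fork, and the only "API contract" is the regular expression at the top. A Mercurial or Bazaar wrapper would need its own pattern and its own parsing code, which is exactly the duplicated effort described above.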

Don’t get me wrong: as a simple version control tool, I’ve found Git in particular (and distributed version control in general) to be a big step up from the old centralized-repository model. However, the very eighties-esque fork-and-regexp-scrape model for IPC — coupled with the lack of an obvious “best of breed” leader in the DVCS space — means that I (along with anyone else trying to support DVCS in a general-purpose way) end up doing a lot of low-level grunt work when we could be building real value for users.

Even something as simple as a standard dump format for a common subset of the information available from the popular DVCS types would be a start. I do know that, for the time being, I’m stuck supporting a bunch of very brittle code which relies on the various idiosyncratic console output formats of each version-control system.

Playing prognosticator, I would even go so far as to suggest that the first DVCS system to provide supported, documented interfaces in a number of popular programming languages could climb to the top of the dogpile that exists currently and emerge as a clear standard.

Distributed (useful) data

I think that the infrastructure for the next generation of interesting network-aware applications is being built right now, and that it’s going to change the way people think about data, collaboration, and connectivity.

What are these projects? Simple: the distributed storage and merging tools. The most successful example is probably Git, but a bunch of other interesting related projects (CloudRCS, Google Gears, Prophet, etc.) are all exploring a similar space. Basically, the proposition is that, given the right kind of atomic update operations, data can be modified on a bunch of different nodes, and merges can happen asynchronously (if at all) when time, connectivity, or workflow permits it.
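
As a rough illustration of what an asynchronous merge can look like, here is a minimal three-way merge sketch for flat records. It is not how Git or any of the projects above actually work; the three_way_merge method is hypothetical, records are assumed to be plain Hashes, and a missing key is assumed to mean "deleted":

```ruby
# Hypothetical three-way merge for flat records stored as Hashes.
#   base   - the last version both nodes agreed on
#   ours   - the record as edited locally
#   theirs - the record as edited on the other node
# Returns [merged, conflicts], where conflicts lists the keys that both
# sides changed in different ways.
def three_way_merge(base, ours, theirs)
  merged = {}
  conflicts = []
  (base.keys | ours.keys | theirs.keys).each do |key|
    b, o, t = base[key], ours[key], theirs[key]
    if o == t        # both sides agree, or neither touched the field
      value = o
    elsif o == b     # only the other node changed it
      value = t
    elsif t == b     # only the local node changed it
      value = o
    else             # both changed it differently
      conflicts << key
      value = o      # keep the local value until someone resolves it
    end
    # A nil value stands for a deleted (or never-present) key.
    merged[key] = value unless value.nil?
  end
  [merged, conflicts]
end
```

The important part is the shared base version: without it, there is no way to tell "you changed this field" apart from "I changed it", and every difference looks like a conflict.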

To explain why I think these tools are important, and where I think the opportunity exists for the next crop of interesting applications, let me offer examples from two different conferences I’ve attended in recent months.

First, an example of where distributed authoring tools succeeded: at RubyFringe, almost every presentation which featured code included a GitHub link and invitation to fork the project and push back changes. In fact, one of the presenters introduced Gist, a tool designed to allow even throw-away “pasties” of code to be version-controlled and shared using Git as a backend. 

Anyone interested in checking out a copy of the code, making some tweaks, and pushing them back to the original author just needed a copy of Git. They didn’t have to do the work online, and didn’t even have to use GitHub if they didn’t want to, and their work would still be version-tracked and shareable with other developers.

In contrast, while at the NITLE Moodle User Meeting in Tacoma back in June, I struggled to share an outline for a presentation I made with my co-presenter. The hotel where the conference was happening had effectively useless ‘net access, so Google Docs and other online services were out of the question. We eventually just had to sit down next to each other the night before the talk in front of my laptop and edit the outline as plaintext, and then copy the file as a whole to his machine for conversion to Keynote slides.

Now we have a simple text file containing the outline, and a PDF generated from the Keynote slides, but no way to show the evolution of the presentation over the days of the conference, or to track changes for future adaptations of the material. Furthermore, if one of us wants to make changes and share with the other, we’re stuck with email attachments (or the glorified Web 2.0 FTP-clones like Dropbox) to keep up to date.

Programmers have figured out the value of distributed authoring, and understand the necessity for sane merging and conflict-resolution practices from long, painful experience. Knowledge workers in other fields have been using the “track changes” features in Word for just as long, but haven’t yet made the leap to fully-shared authoring. Even when they do, it tends to look more like a Wiki than an offline, sync-able tool like Git.

There’s a huge potential in this space for applications other than source code and other plaintext formats. Tabular data like spreadsheets, outline editors and presentation tools, and PIM and calendaring systems could all benefit from asynchronous, distributed authoring and merging. There’s some work being done at the OS level (iSync/SyncServices, Live Mesh, etc.), but I think that the really interesting stuff is going to happen out at the edge, amongst the startups and small app developers.

Working

I was listening to OPB on the radio last week, and their morning call-in show Think Out Loud was hosting a discussion on teenagers in the workforce. They had an employment economist who worked for the state of Oregon, an 18-year-old just starting in the workforce, and a variety of parents and teens contribute to the discussion, but I was struck by a weird sort of defeatist tone beneath the entire conversation.

The parents, in particular, seemed to have basically given up on their kids being able to find work, due to the classic “you need experience to get experience” Catch-22. I personally found myself desperately wanting to yell back at them to stop whining, and start helping their kids out with something other than rides to the mall.

Going to school absolutely does not prepare kids for the workforce. Nothing short of work prepares people for work. There is an endemic assumption right now that the only reasonable course for a young person is through the K-12 system, then straight on to college, perhaps with a brief detour into a volunteering stint along the way to help pad the college application. This track produces exactly the kind of clueless, over-privileged 22-year-old that Baby Boomer managers love to complain about.

Admittedly, my own perspective on this is a bit skewed, compared even to a lot of my coworkers and friends. I started working in the summers when I was 14, and year-round by the time I was 16. My first job was running a summer reading program at the local county library branch, and I locked it in before the interview even started by being the only applicant to show up wearing a tie, resumé in hand. I continued working for the year I was in college, and have been supporting myself (and other people as well, occasionally) since I was 18.

Not every one of those jobs has been pleasant — packing boxes for shipping at a Mailboxes, Etc. during the Christmas season, for example, or making salads at a pizza shop — but I’ve learned something useful from each and every customer, project, and boss. 

Of course, no parent would want to listen to the advice of a college dropout who has no kids of his own, right? Obviously, college and a pure white-collar background are the only real path to success.

Randomized primary keys in ActiveRecord

One of the easiest means of cracking a naively-implemented Rails application (or any RDBMS-backed application with auto-incremented primary keys) is to simply alter site URLs and request parameters with the assumption that their IDs will be sequential.

For example, if a user sees a URL like the following in their location bar:

http://example.com/post/2705/comments

They could easily replace the number 2705 with, say, 2706, and see comments for that post. Of course, you’ve already added explicit permission checks to every controller method to ensure such access will never happen, but we all know how easy it is to forget to audit each and every method for ID-traversal attempts.

Furthermore, using simple sequential IDs exposes other potentially-useful information to would-be attackers. They can see, for example, how many users, data objects, etc., are in active use on your site, and perhaps even reverse-engineer more of your database schema in order to plan SQL-injection or XSS attacks.

Actually using randomized keys does mean overriding the default ActiveRecord behavior, and may be tough to accomplish simply using ActiveRecord migrations. If your database is of a reasonable size, though, or you’re just getting started, it’s easy enough to do. First, you’ll need to decide on a random ID-generation algorithm.

The best way to generate “random” keys isn’t actually random at all. If you want to ensure that no one can guess your keys, you’ll want to use a full-strength cryptographic hash. (See my FOSCON slides from last year for more info on that topic.)

If your primary keys are going to be exposed in your URLs at all, though, there are advantages to keeping them simple. Strong hashes generate long, difficult-to-type identifiers, so you’ll want to either slice them to get only a substring, or use a weaker algorithm which generates less randomness.
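
The hash-and-slice approach might look something like the sketch below. The short_hashed_id helper is hypothetical, and the choice of SHA-256 over bytes from SecureRandom is my assumption for illustration; any strong hash from Ruby's standard library would do:

```ruby
require 'digest/sha2'
require 'securerandom'

# Hypothetical helper: hashes 32 fresh random bytes with SHA-256, then keeps
# only the first `length` hex characters of the 64-character digest.
# Each hex character kept carries 4 bits, so 8 characters is 32 bits.
def short_hashed_id(length = 8)
  digest = Digest::SHA256.hexdigest(SecureRandom.random_bytes(32))
  digest[0, length].upcase
end
```

Slicing trades guessability for typeability: every character dropped removes 4 bits from the search space. If the input is already cryptographically random, as it is here, the hash step adds nothing beyond what SecureRandom.hex(4) would give you directly; it earns its keep when the input is something more predictable, such as a timestamp combined with a server-side secret.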

Regardless, you’ll need a helper method to generate the IDs, as well as some support in your models and migrations to glue it all together. Here’s a simple example, based on an “account invitation” implementation from an application I’m working on now:

# app/models/invite.rb
class Invite < ActiveRecord::Base
  include GenIdHelper

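  # Declare the callback with the class macro; overriding before_create
  # as an instance method only works on old Rails releases
  before_create :assign_random_id

  private

  def assign_random_id
    self.id = gen_id
  end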
end

# db/migrate/001_create_invites.rb
class CreateInvites < ActiveRecord::Migration
  def self.up
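    # :id => false keeps create_table from adding its default integer id column
    create_table :invites, :id => false do |t|
      t.column :id, :string, :limit => 8, :null => false
      t.column :email, :string
      # ...
      t.timestamps
    end
    # Enforce uniqueness, since the table has no database-level primary key
    add_index :invites, :id, :unique => true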
  end

  def self.down
    drop_table :invites
  end
end

# lib/gen_id_helper.rb (named to match the module so Rails can autoload it)
module GenIdHelper
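  # Generates a pseudo-random string in hex format (0..9+A..F)
  # which contains chunks*16 bits of randomness. Kernel#rand is not
  # cryptographically secure, so treat these IDs as hard to guess
  # rather than as secrets.
  def gen_id(chunks = 2)
    Array.new(chunks) { "%04X" % rand(2**16) }.join
  end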
end

The end result will be that newly-created instances of the Invite class will have an ID generated by gen_id, and will be much harder for anyone to guess by simply replacing URL components or POST values.
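
One thing to keep in mind with IDs this short is that two records will eventually draw the same value. The sketch below estimates how likely that is using the standard birthday approximation; the collision_probability method is hypothetical and assumes the IDs are uniformly distributed:

```ruby
# Approximate probability that at least two of `count` uniformly random
# IDs collide, using the birthday approximation 1 - exp(-n(n-1) / 2N),
# where N = 2**bits is the number of possible IDs.
def collision_probability(count, bits = 32)
  space = 2.0**bits
  1.0 - Math.exp(-count * (count - 1) / (2.0 * space))
end
```

With the default 32-bit IDs, the chance of at least one collision is on the order of 1% by the time you have 10,000 records, and well over half by 100,000. For a table that might grow that large, either call gen_id with more chunks or be prepared to catch the duplicate and generate a fresh ID.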

These random IDs can also serve as another form of identification for users. If a user has received the random ID through some trusted channel (say, an SMS message or snail mail), it can be treated as a secondary authentication mechanism when creating a new account, performing a password reset, or proving their real-world identity.