Tag Archive for 'code'

IMAP ‘append’ example

# imap-append-example.rb
# requires TMail library (http://i.loveruby.net/en/prog/tmail.html)

require 'net/imap'
require 'tmail'

IMAP_HOST = 'imap.example.com'
IMAP_USER = 'user'
IMAP_PASSWORD = 'password'
IMAP_MAILBOX = 'INBOX.scratch'

# Connect and authenticate
imap = Net::IMAP.new(IMAP_HOST)
imap.login(IMAP_USER, IMAP_PASSWORD)

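# Create the target folder if needed. An empty reference name makes
# LIST match the mailbox name exactly rather than as a prefix pattern.
existing = imap.list('', IMAP_MAILBOX)
imap.create(IMAP_MAILBOX) if existing.nil? || existing.empty?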

# Ensure the message timestamp will be consistent
created = Time.now

# Create our email object
msg = TMail::Mail.new
msg.from = 'ruby@example.com'
msg.to = 'me@example.com'
msg.date = created
msg.subject = 'Inserted by Ruby'

# Finally, append the message to the mailbox. The line-ending
# argument is needed to make sure the TMail::Mail#to_s method
# doesn't generate invalid, "\n"-terminated lines.
# Also, you can remove the :Seen symbol from the flag list to
# make the message appear unread in other IMAP clients.
imap.append(IMAP_MAILBOX, msg.to_s("\r\n"), [:Seen], created)

# Close the session cleanly
imap.logout
imap.disconnect
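
If TMail isn't available, the raw message can also be assembled by hand. The following is only a minimal sketch using the standard library; build_message is a hypothetical helper name (not part of Net::IMAP or TMail), and it handles nothing beyond plain ASCII headers and a text body. Its main job is the same as the line-ending argument above: making sure every line ends in "\r\n".

```ruby
require 'time'

# Hypothetical helper (not part of Net::IMAP or TMail): builds a minimal
# RFC 2822 message and normalizes every line ending to CRLF.
def build_message(from:, to:, subject:, date:, body: '')
  headers = [
    "From: #{from}",
    "To: #{to}",
    "Date: #{date.rfc2822}",
    "Subject: #{subject}"
  ]
  raw = headers.join("\n") + "\n\n" + body
  # Collapse any existing CRLF to LF first so we never emit "\r\r\n"
  raw.gsub("\r\n", "\n").gsub("\n", "\r\n")
end

created = Time.now
raw = build_message(from: 'ruby@example.com', to: 'me@example.com',
                    subject: 'Inserted by Ruby', date: created,
                    body: "Hello from Ruby\n")
```

The resulting string could then be passed in place of msg.to_s("\r\n"), as in imap.append(IMAP_MAILBOX, raw, [:Seen], created).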

Randomized primary keys in ActiveRecord

One of the easiest means of cracking a naively-implemented Rails application (or any RDBMS-backed application with auto-incremented primary keys) is to simply alter site URLs and request parameters with the assumption that their IDs will be sequential.

For example, if a user sees a URL like the following in their location bar:

http://example.com/post/2705/comments

They could easily replace the number 2705 with, say, 2706, and see comments for that post. Of course, you’ve already added explicit permission checks to every controller method to ensure such access will never happen, but we all know how easy it is to forget to audit each and every method for ID-traversal attempts.

Furthermore, using simple sequential IDs exposes other potentially-useful information to would-be attackers. They can see, for example, how many users, data objects, etc., are in active use on your site, and perhaps even reverse-engineer more of your database schema in order to plan SQL-injection or XSS attacks.

Actually using randomized keys does mean overriding the default ActiveRecord behavior, and may be tough to accomplish simply using ActiveRecord migrations. If your database is of a reasonable size, though, or you’re just getting started, it’s easy enough to do. First, you’ll need to decide on a random ID-generation algorithm.

The best way to generate “random” keys isn’t actually random at all. If you want to ensure that no one can guess your keys, you’ll want to use a full-strength cryptographic hash. (See my FOSCON slides from last year for more info on that topic.)

If your primary keys are going to be exposed in your URLs at all, though, there are advantages to keeping them simple. Strong hashes generate long, difficult-to-type identifiers, so you’ll want to either slice them to get only a substring, or use a weaker algorithm which generates less randomness.
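
As a rough illustration of the slicing approach, here is a minimal sketch that assumes Ruby's standard digest and securerandom libraries. The name short_hash_id is hypothetical, and SHA-256 over random seed bytes is just one possible choice of hash and input. Each hex character kept contributes 4 bits, so an 8-character slice leaves a 32-bit space to guess from.

```ruby
require 'digest'
require 'securerandom'

# Hypothetical helper: hashes some random seed material with SHA-256 and
# keeps only the first `length` hex characters (4 bits each).
def short_hash_id(length = 8)
  Digest::SHA256.hexdigest(SecureRandom.random_bytes(32))[0, length].upcase
end

id = short_hash_id          # 8 hex characters, 32 bits
longer = short_hash_id(16)  # 16 hex characters, 64 bits
```

Shorter slices are easier to type but also easier to guess, so the length is a trade-off to pick per application.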

Regardless, you’ll need a helper method to generate the IDs, as well as some support in your models and migrations to glue it all together. Here’s a simple example, based on an “account invitation” implementation from an application I’m working on now:

# app/models/invite.rb
class Invite < ActiveRecord::Base
  include GenIdHelper

  def before_create
    self.id = gen_id
  end
end

# db/migrate/001_create_invites.rb
class CreateInvites < ActiveRecord::Migration
  def self.up
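    # :id => false suppresses the default integer primary key so that
    # the string column below can serve as the ID
    create_table :invites, :id => false do |t|
      t.column :id, :string, :limit => 8, :null => false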
      t.column :email, :string
      # ...
      t.timestamps
    end
  end

  def self.down
    drop_table :invites
  end
end

# lib/gen-id.rb
module GenIdHelper
  # Generates a pseudo-random string in hex format (0..9+A..F)
  # which contains chunks*16 bits of randomness.
  def gen_id(chunks=2)
    Array.new(chunks) { "%04X" % rand(2**16) }.join
  end
end

The end result will be that newly-created instances of the Invite class will have an ID generated by gen_id, and will be much harder for anyone to guess by simply replacing URL components or POST values.
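
Since the IDs are drawn at random (from a 32-bit space with the default of two chunks), two records can occasionally draw the same value. One way to guard against that is to retry until an unused ID turns up. The sketch below runs outside Rails and rests on assumptions: IdRegistry and next_id are hypothetical names, and the Set stands in for whatever uniqueness check (a query or a unique index) a real application would use.

```ruby
require 'set'

module GenIdHelper
  # Same idea as gen_id above: chunks * 16 bits, upper-case hex
  def gen_id(chunks = 2)
    Array.new(chunks) { format('%04X', rand(2**16)) }.join
  end
end

# Hypothetical stand-in for a database table; a real app would query
# the model (or rely on a unique index) instead of a Set.
class IdRegistry
  include GenIdHelper

  def initialize
    @taken = Set.new
  end

  # Keep drawing IDs until one is unused, then record and return it.
  def next_id(max_attempts = 10)
    max_attempts.times do
      candidate = gen_id
      # Set#add? returns nil when the value is already present
      return candidate if @taken.add?(candidate)
    end
    raise 'could not find a free ID'
  end
end

registry = IdRegistry.new
ids = Array.new(1000) { registry.next_id }
```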

These random IDs can also be considered another form of identification for users. If a user has received the random ID through some trusted channel (say, an SMS message or snail mail), it can be treated as a secondary authentication mechanism when creating a new account, performing a password reset, or proving their real-world identity.

Ruby is my anti-drug

The following is an actual line from my shell transcript today:

sudo gem install crack

Hee-larious.

Programming Collective Intelligence

Programming Collective Intelligence, 1st Ed. (2007)
Author: Toby Segaran, Published by O’Reilly Media Inc.

From the “long-overdue” dept., here’s my review of Programming Collective Intelligence, which I received for free through the O’Reilly user group distribution channel. Opinions are entirely my own, and probably wrong.

I have a weird bias when it comes to reading technical books. I’m a self-taught programmer, but I’ve taken coursework in the formal mathematical basis for computation, and I’ve always enjoyed seeing an elegant proof or clever algebraic formulation of a hairy coding problem.

Given that perspective, my impressions of Programming Collective Intelligence were mixed. If approached as a “cookbook,” with recipes for analysis applied to the data available from popular social networking sites and web services, you can see real results quickly, without having to understand the underlying mathematics.

Unfortunately, that also means you miss out on a lot of the potential “Eureka!” moments that can come from thinking a little more deeply about an algorithm. This is especially apparent in the last couple of chapters, when active, rich areas of CS research like support vector machines and genetic programming are covered in a few dozen pages each, which often doesn’t offer space to do much more than tease the reader with the potential power a technique offers.

I think there’s a wide gap between the “blog article” and “peer-reviewed journal” levels of formalism, and I admire Segaran’s effort to span that divide and bring some of the fruits of AI research into the pragmatic domain for us 9-to-5ers. On the other hand, I don’t think I walked away with any great new perspective on machine learning, so much as I saw some cool examples of how the author had applied it.

Ideally, I think this book, or another like it, should cover fewer real-world services or problems, and apply more of its algorithms and techniques to each data set. Every page spent describing how to interface with a particular web service (or worse yet, scrape a single site’s HTML structure) screamed “planned obsolescence” to me — APIs change, and sites that are popular today may be dead and gone in a few years, but the analytical tools being discussed are far more timeless.

Overall, I’d give the book a B-, at least for my own uses. If the Chapter 12 and Appendix B content were moved into the main body of the text, and the random dating site scraping techniques dropped or themselves demoted to appendices, I would quickly raise it to a B+ or A-.

I would still recommend the book as-is to anyone who was more interested in getting results tomorrow than gaining deep understanding of the problem space, as well as those with less Internet application programming background, for whom the concrete code examples would have more value.

ActiveLdap, SASL, GSSAPI, and pain

I just wasted a day and a half banging my head against this problem, and while I suspect there are only about a half-dozen other sites on the planet likely to encounter it, I wanted to write down the solution. So, just in case anyone else has been furiously Googling for some combination of “ActiveLdap SASL GSSAPI bind connection error”, here’s one possible solution.

We use MIT Kerberos, OpenLDAP, and Cyrus SASL to provide single sign-on across our network services at Reed. I’ve been working for a while to port legacy Perl-based systems over to Ruby, using ActiveLdap for much of that work, but have hit a snag that limits its usefulness in many cases: GSSAPI-mech SASL binds fail if there is no corresponding LDAP directory entry for the principal under which you’re doing the bind.

Let me explain a bit of background:

OpenLDAP ACLs allow you to specify something like the following:

authz-regexp
uid=(.*),cn=gssapi,cn=auth

This basically says, “for any principal that our authentication backend accepts, treat them as a valid LDAP entity with the DN ‘uid=[principal-name],cn=gssapi,cn=auth’”. This is incredibly handy when coupled with Kerberos keytab files, as it lets us get the same basic benefit as certificate-based authentication without maintaining a certificate authority. (As an aside, we do actually maintain a small CA setup, but since the CA data lives on an external drive in a locked cabinet, rolling a new cert is a huge pain compared to creating a new keytab from kadmin.)

This is all well and good, and lets us set up background processes which have keytab-based privileges to selectively read and edit protected attributes on our directory, without putting passwords into configuration files or source code. It also means we don’t have to create a full-fledged directory entry for every little background service which may need to authenticate.

Unfortunately, ActiveLdap is unable to determine that it has successfully bound to the directory if the SASL GSSAPI principal it uses to bind doesn’t have a corresponding entry. It just loops over the reconnect method until the configured number of attempts is reached, then falls back to an anonymous bind.

If all of that was gibberish to you, rest easy; like I said, our particular mixture of infrastructure and directory-maintenance practices is pretty rare, especially outside academia, so you’ll probably never have to worry about any of this.

Testing string similarity using Ruby and Zlib

I’ve been thinking about ways to do fuzzy string matching lately, and decided to test out an idea based on some bioinformatics papers that went around a while ago talking about using the gzip algorithm to test for common substrings in datasets.

I mocked up a little Ruby class which basically just does the following:

  1. Given a hash mapping IDs to strings, builds an internal dictionary which maps those IDs to the gzip-compressed version of said string
  2. When passed a new string, appends it to each of the original strings and gzips the result
  3. Returns a list of the IDs for which the increase in size of the compressed version (as a ratio to the new string’s total size) was less than some delta parameter

This actually works surprisingly well on the few simple datasets I’ve tried, with the following caveat:

Since it basically relies on word-level matching, and doesn’t collapse spaces in the strings initially being indexed, some visually similar strings (such as “free geek” and “freegeek”) may not match as you might expect. Specifically, if the concatenated version of the string is the one indexed, and the search is performed on the “split” version, you aren’t guaranteed a match.
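
One possible mitigation, which is an assumption here and not something the class in this post does, is to strip all whitespace before compressing, so that the “split” and concatenated spellings reduce to the same bytes. The helper names compress_normalized and delta are hypothetical; this is a sketch of the idea, not a tuned implementation.

```ruby
require 'zlib'

# Hypothetical variant of the compress step: removes all whitespace
# before deflating, so "free geek" and "freegeek" compress identically.
def compress_normalized(term)
  Zlib::Deflate.deflate(term.downcase.gsub(/\s+/, ''))
end

# Growth in compressed size when `term` is appended to `indexed`,
# as a ratio of the whitespace-stripped term's length.
def delta(indexed, term)
  base = compress_normalized(indexed).size
  combined = compress_normalized(indexed + ' ' + term).size
  (combined - base) / term.gsub(/\s+/, '').size.to_f
end

indexed = 'FreeGeek 1731 SE 10th Avenue, Portland, OR 97214'
split = delta(indexed, 'free geek')
joined = delta(indexed, 'freegeek')
```

With this normalization both spellings produce the same score against the indexed string. The cost is that word boundaries are lost entirely, which may create false matches across adjacent words.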

Overall, I think the most value would come from using this as one in a set of search algorithms; it’s good as a naive matcher and can handle *any* common substrings (think binary data, source code, etc.) but misses out on any awareness of internal structure or natural language.

require 'zlib'

# this is optimized to store a relatively smaller subset of common index
# terms against which new strings should be tested
class ZlibDistanceCalc
  def initialize
    @c_hash = Hash.new
  end

  def index(key, term)
    @c_hash[key] = [term, compress(term)]
  end

  def test(term)
    coeff = term.size.to_f

    results = {}

    @c_hash.each do |k,v|
      orig_str, orig_cmp = v

      delta = (compress(orig_str + term).size - orig_cmp.size) / coeff
      results[k] = delta
    end

    return results
  end

  def search(term, delta=0.5)
    # Hash#select returns a Hash on Ruby 1.9 and later
    test_results = test(term).select {|k,v| v <= delta }

    if block_given?
      test_results.each {|k,v| yield k, v }
    else
      return test_results
    end
  end

  def compress(term)
    Zlib::Deflate.deflate(term.strip.downcase)
  end
end

For those who read the dataset and wonder about the selection of test strings: I’ve been looking at ways to help with de-duping records for the Calagator project, and I suspect that fuzzy string matching is part of the overall solution.

# venue_data.yaml
:cubespace: "CubeSpace 622 SE Grand Ave., Portland OR 97214"
:freegeek: "FreeGeek 1731 SE 10th Avenue, Portland, OR 97214"
:convention: "Oregon Convention Center, 777 NE MLK, Jr. Blvd., Portland, OR 97232"
:armory: "Gerding Theater at the Armory, 128 NW Eleventh Ave, Portland OR 97209"
:aboutus: "AboutUs, 107 SE Washington St, Suite 520, Portland, OR 97214"
:luckylab_se: "Lucky Lab Brew Pub, 915 SE Hawthorne Blvd., Portland, OR 97214"

Here are some example searches at work:

lennon@firefly:~/Ruby/machine-learning$ irb
>> require 'zlib_dist_calc.rb'
=> true
>> calc = ZlibDistanceCalc.new
=> #<ZlibDistanceCalc:0x2b0baf4c1108 @c_hash={}>
>> require 'yaml'; data = YAML.load(File.read('venue_data.yaml'))
=> ...
>> data.each {|k,v| calc.index(k, v) }
=> ...
>> calc.search 'free geek portland'
=> {:luckylab_se=>0.5, :freegeek=>0.444444444444444, :convention=>0.5,
    :armory=>0.444444444444444, :aboutus=>0.5}
>> calc.search 'freegeek portland'
=> {:freegeek=>0.294117647058824, :armory=>0.411764705882353}
>> calc.search 'luckylab'
=> {}
>> calc.search 'lucky lab'
=> {:luckylab_se=>0.444444444444444}
>> calc.search 'conventioncenter'
=> {:convention=>0.3125}
>> calc.search 'convention center'
=> {:convention=>0.176470588235294}
>>