I’ve been thinking about ways to do fuzzy string matching lately, and decided to test out an idea based on some bioinfomatics papers that went around a while ago talking about using the gzip algorithm to test for common substrings in datasets.
I mocked up a little Ruby class which basically just does the following:
- Given a hash mapping IDs to strings, builds an internal dictionary which maps those IDs to the gzip-compressed version of said string
- When passed a new string, appends it to each of the original strings and gzips the result
- Returns a list of the IDs for which the increase in size of the compressed version (as a ratio to the new string’s total size) was less than some delta parameter
This actually works suprisingly well on a few simple datasets I’ve tried, with the following caveats:
Since it basically relies on word-level matching, and doesn’t collapse spaces in the strings initially being indexed, some visually similar strings (such as “free geek” and “freegeek”) may not match as you might expect. Specifically, if the concatenated version of the string is the one indexed, and the search is performed on the “split” version, you aren’t guaranteed a match.
Overall, I think the most value would come from using this as one in a set of search algorithms; it’s good as a naive matcher and can handle *any* common substrings (think binary data, source code, etc.) but misses out on any awareness of internal structure or natural language.
require 'zlib'
# this is optimized to store a relatively smaller subset of common index
# terms against which new strings should be tested
class ZlibDistanceCalc
def initialize
@c_hash = Hash.new
end
def index(key, term)
@c_hash[key] = [term, compress(term)]
end
def test(term)
coeff = term.size.to_f
results = {}
@c_hash.each do |k,v|
orig_str, orig_cmp = v
delta = (compress(orig_str + term).size - orig_cmp.size) / coeff
results[k] = delta
end
return results
end
def search(term, delta=0.5)
test_results = Hash[*test(term).select {|k,v| v <= delta }.flatten]
if block_given?
test_results.each {|k,v| yield k, v }
else
return test_results
end
end
def compress(term)
Zlib::Deflate.deflate(term.strip.downcase)
end
end
For those who read the dataset and wonder about the selection of test strings: I’ve been looking at ways to help with de-duping records for the Calagator project, and I suspect that fuzzy string matching is part of the overall solution.
# venue_data.yaml :cubespace: "CubeSpace 622 SE Grand Ave., Portland OR 97214" :freegeek: "FreeGeek 1731 SE 10th Avenue, Portland, OR 97214" :convention: "Oregon Convention Center, 777 NE MLK, Jr. Blvd., Portland, OR 97232" :armory: "Gerding Theater at the Armory, 128 NW Eleventh Ave, Portland OR 97209" :aboutus: "AboutUs, 107 SE Washington St, Suite 520, Portland, OR 97214" :luckylab_se: "Lucky Lab Brew Pub, 915 SE Hawthorne Blvd., Portland, OR 97214"
Here are some example searches at work:
>> lennon@firefly:~/Ruby/machine-learning$ irb
>> require 'zlib_dist_calc.rb'
=> true
>> calc = ZlibDistanceCalc.new
=> #<zlibdistancecalc:0x2b0baf4c1108 @c_hash="{}">
>> require 'yaml'; data = YAML.load(File.read('venue_data.yaml'))
=> ...
>> data.each {|k,v| calc.index(k, v) }
=> ...
>> calc.search 'free geek portland'
=> {:luckylab_se=>0.5, :freegeek=>0.444444444444444, :convention=>0.5,
:armory=>0.444444444444444, :aboutus=>0.5}
>> calc.search 'freegeek portland'
=> {:freegeek=>0.294117647058824, :armory=>0.411764705882353}
>> calc.search 'luckylab'
=> {}
>> calc.search 'lucky lab'
=> {:luckylab_se=>0.444444444444444}
>> calc.search 'conventioncenter'
=> {:convention=>0.3125}
>> calc.search 'convention center'
=> {:convention=>0.176470588235294}
>>




A similar idea on E2.
Vruba: Wow, I hope JerboaKolinowski from E2 doesn’t find my little write-up; I’d be worried about my safety.
For the record, I’m entirely aware of methods like Naive Bayes, Bloom filters, and even more sophisticated methods like neural nets. This was just a fun little experiment.
Nah, the scary writeup was the one underneath mine. I’m a bit more relaxed. See the critique in my writeup of the one underneath
How did you end up deduping Calagator? Did you implement this distance algorithm? Did you use a different algorithm?
Then, do you just delete dupes? Or did you need to flag dupes somehow, and let users who submitted the duped data know? Or, are you simply just preventing dupe data from getting put in by showing the user an error during submission?
I really enjoy seeing other developer’s perspectives on their implementation strategies! Your zlib distance algorithm seems interesting.
Thanks,
Kevin