Finding the longest common prefix of an array of strings in Ruby

Here’s the deal. I have an array of path names, and I want to find the longest path prefix that is common to all strings (path names) in the array. Here’s how it’s done:

files = ['/foo/bar/xyzzy_one', '/foo/bar/xyzzy_two', 
  '/foo/bar/xyzzy_three/four']

This above is our assumed base array, consisting entirely of absolute paths. If you have an array of relative paths, stick the results of running File.expand_path on each of them in another array.

lcps = [] 

First we set up the array to hold our guesses for longest common prefix.

files.each do |fi|

Then we start to iterate through all names.

  fr = []
  (0..fi.length).each { |bi| fr << fi[bi..bi] }

What’s that? We’re looping through all characters in the string and adding them to an array. This is available as String#each_char in recent Ruby versions, but not in the baseline Ruby version I’m working in. Using my loop is still as bad as String#each_byte with multi-byte input in naive versions of Ruby, but will work like String#each_char in new, Unicode-aware Ruby versions.

Also note that we’re passing a range to the string. Passing a range instead of a number means “give me a string with this character”, not “give me the ASCII value of this character”. More multi-byte awareness.

  files.each do |fix|
    frx = []
    (0..fix.length).each { |bi| frx << fix[bi..bi] }

Déjà vu. We’re looping through every member of files, our array, on two levels - to make comparisons from every string in the array to every other string in the array. There is an opportunity for optimization here because we will actually be comparing every string to every other string twice, but this is left as an exercise to the reader.

    x = 0
    frx.each do |ch|

Now we’re looping through the characters inside of the inner string, and keeping a tab on the index inside the string.

      if (fr[x] != ch)
        break
      end
      x += 1

This is where the magic happens. As long as every character matches the other character from the ‘outer’ string, x is incremented, but when it’s not, we jump out of the loop. We have our longest common prefix candidate.

    end
    lcps << x
  end
end

Here we add the candidate to our array. At the point after the loops have all been ended, we have in lcps an array of possible candidates. Our longest common prefix is not the longest of the candidates, as this is the longest we could match one single string to another single string - the length of the longest string in the array. No, the longest common prefix is the shortest of our candidates; let’s extract that.

lcpl = lcps.sort[0]
lcp = files[0][0...lcp]

Now we’re all fine and dandy. Up until now, this has been a generic walkthrough, but if you’re specifically dealing with pathnames as I am, there is another final hazard to look out for.

Given my initially postulated files array, our longest common prefix turns out to be “/foo/bar/xyzzy_”. If you want a directory/folder name, this is not what we had in mind. Let’s use File.dirname to remove this final bit.

if (lcp[-1..-1] != "/")
  lcp = File.dirname(lcp)
end

The wrapping if clause makes sure that if we actually got a directory as a prefix (say: “/foo/bar/”), we wouldn’t call File.dirname on that, because that would yield “/foo”.

Now we’re done. This is probably not the most optimal solution, but it was not one I had seen written up before. Array#abbrev might have speeded up the process by cutting down on the effort spent dealing with the parts of the strings that are different in each string, since the Abbrev module provides short unambiguous parts from the strings - several rungs closer on the ladder than the full strings that we’re currently processing.

Additionally, you should spend some thought on multibyte filenames before you use this if you haven’t already - the current Ruby is completely clueless about Unicode.

Do you have a better way of doing this? Tell in the comments.

Update: For those of you who view this as a programming language pissing contest, Peter Hosey shows how Python has a built-in function for this in the os.path part of the os library, its counterpart to Ruby’s Pathname class. For those of you who are actually interested in the algorithm - say, since your language and its standard libraries does not have a function for this - well, Peter Hosey shows his own implementation here too.

Comments [+]

  1. Use Array#abbrev and a simple inject to find the longest word:


    require 'abbrev'
    arr.abbrev.keys.inject{|memo, word| memo.length > word.length ? memo : word}

    By Rein Henrichs · 2007.01.07 01:42

  2. Or alternatively use sort_by:

    require 'abbrev'
    files.abbrev.keys.sort_by{|word| -word.size}.first

    By Rein Henrichs · 2007.01.07 01:47

  3. Upon review of the play, it was found that Array#abbrev unly returns unambiguous abbreviations and will not, in fact, work. At all. Nothing to see here. Move along.

    By Rein Henrichs · 2007.01.07 01:59

  4. Like you said, it won’t work because abbrev just returns unambiguous abbreviations. IE:

    require 'abbrev'
    ['/foo/abc', '/foo/abd'].abbrev.keys.sort_by{|word| -word.size}.first

    returns ‘/foo/abd’, not ‘/foo/ab’, which is the longest common prefix.

    Starting with what abbrev gives us should shorten time on the main loop, though.

    By Jesper · 2007.01.07 02:09

  5. I’m not much of a Ruby programmer, so maybe I’m misunderstanding something, but I don’t see how abbrev helps. From the documentation, it looks like it finds unambiguous, e.g. not common, prefixes. Here’s one way to do it in Python:


    files = [u"/foo/bar/xyzzy_one",
             u"/foo/bar/xyzzy_two",
             u"/foo/bar/xyzzy_three/four"]
    first = files[0]
    i = 0
    while all(i < len(f) and f[i] == first[i] for f in files):
        i += 1
    lcp = first[:i]
    if lcp[-1] != u"/":
        from os.path import dirname
        lcp = dirname(lcp)

    By Michael Tsai · 2007.01.07 02:12

  6. The “unambiguous” point has been driven home by now, but it still stands that the Abbrev module gives us less processing than making character arrays out of strings, assuming that Abbrev itself is efficiently implemented. Even if it’s inefficient, we’re spending less code in the same place to do a seemingly trivial thing, and that makes a lot of difference in a small script.

    By Jesper · 2007.01.07 02:40

  7. My Python responses:

    Finding the longest common prefix of an array of strings in Python

    Finding the longest common prefix of an array of strings in Python, part 2

    By Peter Hosey · 2007.01.07 06:45

Leave a comment

Your e-mail address is never shown. If you type a line break in the comment, it will show up as a line break (naturally). The following HTML is allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

(required)

(required)


Please note: Your comment will not show up at once. Unless you're spamming or being abusive, you have nothing to worry about. (Read the full policy.)