there is a great difference

For some reason it occurred to me tonight to find the longest common substring in various classical novels.

There’s a bunch of interesting algorithmics involved here. The naive O(n3) approach is slower than a slow snail on a slow day. Dynamic programming gets you to an easy-to-understand O(n2) solution, but that’s still impractically slow for story-length strings. Things start getting really clever with Ukkonen’s algorithm and now we’re into practical runtime territory. You could get a constant-factor speedup by working on sequences of intern’d words rather than chars, but the above algorithm runs in under a minute as-is so I didn’t bother.

I downloaded and normalised some books from Project Gutenberg and ran my longest-common-substring on them.

  • Emma vs Pride and Prejudice: “when the ladies returned to the drawing room”
  • Dracula vs Frankenstein: “between two and three in the morning”
  • Dracula vs Pride and Prejudice: “but there was much to be talked of”
  • Dracula vs. Fiat Money Inflation in France: “the difficulties and dangers of”
  • Treasure Island vs Dracula: “threw himself on his knees and held”
  • Treasure Island vs Jekyll & Hyde: “and to make assurance doubly sure”
  • US Constitution vs Dracula: “to the best of my ability”

.. and last, but most self-referentially not least …

  • Emma vs Frankenstein: “there is a great difference”.