Previous posts: Part 1 / Part 2
Code: Search.py on GitHub
In my last posts, I showed that the Damerau-Levenshtein distance is a great metric for fuzzy string searching. However, the previous examples were primarily aimed at searching for single words within a list of single words. In practice, it is unrealistic to assume that the search term will always be a single word, nor should we restrict ourselves to searching a database of single words.
For example, we may want to conduct the following search:
S: quick fox jumps
Should the list of searchable text contain it, we could reasonably expect the following result:
T: The quick brown fox jumps over the lazy dog.
The current algorithm has reasonable performance in this area:
Searching for term 'excavate topsoil' in a list of 39 words.
Excavate by hand / topsoil [1]
Excavate by machine / topsoil [1]
Load surplus excavated [...] temporary spoil, average distance [1]
However, performance completely falls apart if the order of the words in S differs from the order of the words in T. We can see that the results have barely anything to do with the search term:
Searching for term 'topsoil excavate' in a list of 39 words.
Load and deposit in temporary spoil heaps, average distance [3]
Break up debris and arisings, [...] aside for re-use [4]
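The failure is easy to reproduce. Below is a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance — a stand-in for whatever Search.py actually uses, not a copy of it — which shows how the whole-string distance blows up the moment two words swap places:

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment)
    distance. A sketch; the real implementation lives in Search.py."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(osa_distance("excavate topsoil", "excavate topsoil"))  # 0
print(osa_distance("topsoil excavate", "excavate topsoil"))  # large
```

Treated as one long string, the reordered term is "far" from the target even though it contains exactly the same words.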
To ensure that our results are reliable regardless of word order, we must treat a sentence as a collection of words, rather than as one very long word that happens to contain spaces.
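One possible shape for that idea — a sketch under my own assumptions, not necessarily what Search.py will end up doing — is to split both the search term and the target into words, score each search word against its closest match in the target, and sum the results. The `word_distance` helper here is plain Levenshtein, standing in for the Damerau-Levenshtein function used elsewhere in this series:

```python
def word_distance(a, b):
    """Plain Levenshtein distance between two words (a stand-in for
    the Damerau-Levenshtein function used elsewhere in this series)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]


def sentence_distance(term, target):
    """Hypothetical helper: match each word of the search term to its
    closest word in the target, so word order no longer matters."""
    return sum(min(word_distance(w, t) for t in target.lower().split())
               for w in term.lower().split())


# Both orderings now score identically against the same target:
target = "Excavate by hand / topsoil"
print(sentence_distance("excavate topsoil", target))  # 0
print(sentence_distance("topsoil excavate", target))  # 0
```

This is only a first cut — for instance, it happily matches the same target word twice — but it illustrates the direction.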
To be continued.