Today's Question:  What's your opinion about Alibaba mooncake incident?        GIVE A SHOUT

Technical Article => Software =>  Software design

Fuzzy search algorithm in Sublime Text

  sonic0002      2013-10-14 22:49:38      3,613    2    0

Fuzzy string searching is the technique of finding strings that match a pattern approximately (rather than exactly. In many editors, we can easily find the Find menu item to do exact matching. While in Sublime Text and some other editors, we can use fuzzy string searching as well if we just want to have an approximate match.

There is some algorithm implemented in Sublime Text to implement achieve this. First the search query is split into characters, join them with a regex wildcard, and then run the regex on your set of strings.

stringsToSearch.forEach(function(string) {

    if (RegExp(string.split('').join('.*?')).test(string)) {

        pick(string);

    }

});

If you disable backtracking, a search across n strings with query length k should run in O(nk) time (perhaps depending on the regex engine).

That doesn't help with highlighting the matches, but for that you can:

  • observe that regex matching is deterministic, so regex-replace on the result set to insert some formatting;
  • or write a very simple regex parser to do custom stuff. Since the only operators are the . and *, you can do it with a DFA.

DEMO:

function fuzzyMatch(searchSet, query) {
    var tokens = query.toLowerCase().split(''),
        matches = [];

    searchSet.forEach(function(string) {
        var tokenIndex = 0,
            stringIndex = 0,
            matchWithHighlights = '',
            matchedPositions = [];

        string = string.toLowerCase();

        while (stringIndex < string.length) {
            if (string[stringIndex] === tokens[tokenIndex]) {
                matchWithHighlights += highlight(string[stringIndex]);
                matchedPositions.push(stringIndex);
                tokenIndex++;

                if (tokenIndex >= tokens.length) {
                    matches.push({
                        match: string,
                        highlighted: matchWithHighlights + string.slice(
                            stringIndex + 1
                        ),

                        // Maybe use this to weight matches?
                        // More consecutive numbers = higher score?
                        positions: matchedPositions
                    });
                    break;
                }
            }else {
                matchWithHighlights += string[stringIndex];
            }
            stringIndex++;
        }
    });

    return matches;
}

function highlight(string) {
    return '<span class="highlight">' + string + '</span>';
}

 

SUBLIME TEXT FUZZY SEARCH

  SAVE AS PDF   MARK AS READ   MARK AS IMPORTANT

Share on Facebook  Share on Twitter  Share on Google+  Share on Weibo  Share on Reddit  Share on Digg  Share on Tumblr    Delicious

  RELATED


  2 COMMENTS


Tobia [Reply]@ 2016-07-26 11:36:09

Inserting .* or .*? between every character is a very bad idea, because the complexity increases exponentially with the length of the search string, both for successful matches and for mismatches. A 2-character search string gives O(n), because there's one .*; a 3-character search string gives O(n^2); and so on. In general the complexity is O(n^k) where n is the length of the subject string and k that of the search string, aka. exponential on the length of the search string.

You say "If you disable backtracking," but don't give any information on how to do so in JavaScript (and I don't think it's possible.)

Here is an alternative implementation with O(n) complexity in the case of successful matches and minimal backtracking in case of mismatches. I think this is the best possible result, short of having anti-backtracking features exposed by the regexp library, or pre-computing some kind of optimized search tree (which would be a RAM vs. CPU optimization.)

// turn 'ABC' into 'A[^B]*B[^C]*C'
var regex = string.split('').map(c => '[^'+c+']*'+c).join('').substring(5)

Whether this is faster in practice than your DFA remains to be seen (and would be a nice benchmark to see) because a purpose-built DFA for this problem could be O(n) even for mismatches (I haven't checked whether yours is) but this Regex-based solution has several orders of magnitude less computing overhead, being written in heavily optimized C.

Tobia [Reply]@ 2016-07-26 11:37:16

<br>? really?

let me try again:

// turn 'ABC' into 'A[^B]*B[^C]*C'
var regex = string.split('').map(c => '[^'+c+']*'+c).join('').substring(5)

  WRITE ARTICLE

When called to fix a production issue

By sonic0002