μFuzzy

A tiny, efficient fuzzy search that doesn't suck

Search Fuzzy search

leeoniya/uFuzzy @leeoniya/ufuzzy

README

▒ μFuzzy

A tiny, efficient fuzzy search that doesn't suck.

This is my fuzzy 🐈. There are many like it, but this one is mine.¹

Overview

uFuzzy is a fuzzy search library designed to match a relatively short search phrase (needle) against a large list of short-to-medium phrases (haystack).

It might be best described as a more forgiving String.indexOf().

Common use cases are list filtering, auto-complete/suggest, and title/name/description/filename/function searches.

In its default configuration, each uFuzzy match must contain all alpha-numeric characters from the needle in the same sequence, so is likely a poor fit for applications like spellcheck or fulltext/document search.

However, its speed leaves ample headroom to match out-of-order terms by combining results from all permutations of the needle.

When held _just right_, it can efficiently match against multiple object properties, too.

Features

- Junk-free, high quality results that are _dataset-independent_. No need to fine-tune indexing options or boosting params to attain some arbitrary relevance score cut-off.

- Precise fuzziness control that follows straightforward rules, without returning unexpected matches.

- Sorting you can reason about and customize using a simple Array.sort() which gets access to each match's stats/counters. There's no composite, black box "score" to understand.

- Concise set of options that don't interact in mysterious ways to drastically alter combined behavior.

- Fast with low resource usage - there's no index to build, so startup is below 1ms with near-zero memory overhead. Searching a three-term phrase in a 162,000 phrase dataset takes 12ms with in-order terms or 50ms with out-of-order terms.

- Micro, with zero dependencies - currently ~4KB min

Demos

NOTE: The testdata.json file is a diverse 162,000 string/phrase dataset 4MB in size, so first load may be slow due to network transfer.

Try refreshing once it's been cached by your browser.

First, uFuzzy in isolation to demonstrate its performance.

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&search=super%20ma

Now the same comparison page, booted with fuzzysort, QuickScore, and Fuse.js:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=super%20ma

Here is the full library list but with a reduced dataset (just hearthstone_750, urls_and_titles_600) to avoid crashing your browser:

https://leeoniya.github.io/uFuzzy/demos/compare.html?lists=hearthstone_750,urls_and_titles_600&search=moo

Questions?

Answers:

- https://news.ycombinator.com/item?id=33035580

- https://old.reddit.com/r/javascript/comments/xtrszc/ufuzzyjs_a_tiny_efficient_fuzzy_search_that/

Else: https://github.com/leeoniya/uFuzzy/issues

Installation

Node

```
npm i @leeoniya/ufuzzy
```

``` js
const uFuzzy = require('@leeoniya/ufuzzy');
```

Browser

``` js
<script src="./dist/uFuzzy.iife.min.js"></script>
```

Example

``` js
let haystack = [
'puzzle',
'Super Awesome Thing (now with stuff!)',
'FileName.js',
'/feeding/the/catPic.jpg',
];
let needle = 'feed cat';
let opts = {};
let uf = new uFuzzy(opts);
// pre-filter
let idxs = uf.filter(haystack, needle);
// sort/rank only when <= 1,000 items
if (idxs.length <= 1e3) {
let info = uf.info(idxs, haystack, needle);
// order is a double-indirection array (a re-order of the passed-in idxs)
// this allows corresponding info to be grabbed directly by idx, if needed
let order = uf.sort(info, haystack, needle);
// render post-filtered & ordered matches
for (let i = 0; i < order.length; i++) {
// using info.idx here instead of idxs because uf.info() may have
// further reduced the initial idxs based on prefix/suffix rules
console.log(haystack[info.idx[order[i]]]);
}
}
else {
// render pre-filtered but unordered matches
for (let i = 0; i < idxs.length; i++) {
console.log(haystack[idxs[i]]);
}
}
```

Match Highlighting

Get your ordered matches first:

``` js
let haystack = [
'foo',
'bar',
'cowbaz',
];
let needle = 'ba';
let u = new uFuzzy();
let idxs = u.filter(haystack, needle);
let info = u.info(idxs, haystack, needle);
let order = u.sort(info, haystack, needle);
```

Basic innerHTML highlighter (``-wrapped ranges):

``` js
let innerHTML = '';
for (let i = 0; i < order.length; i++) {
let infoIdx = order[i];
innerHTML += uFuzzy.highlight(
haystack[info.idx[infoIdx]],
info.ranges[infoIdx],
) + '<br>';
}
console.log(innerHTML);
```

innerHTML highlighter with custom marking function (``-wrapped ranges):

``` js
let innerHTML = '';

const mark = (part, matched) => matched ? '<b>' + part + '</b>' : part;

for (let i = 0; i < order.length; i++) {
  let infoIdx = order[i];

  innerHTML += uFuzzy.highlight(
    haystack[info.idx[infoIdx]],
    info.ranges[infoIdx],

    mark,
  ) + '<br>';
}

console.log(innerHTML);
```

DOM/JSX element highlighter with custom marking and append functions:

``` js
let domElems = [];

const mark = (part, matched) => {
  let el = matched ? document.createElement('mark') : document.createElement('span');
  el.textContent = part;
  return el;
};

const append = (accum, part) => { accum.push(part); };

for (let i = 0; i < order.length; i++) {
  let infoIdx = order[i];

  let matchEl = document.createElement('div');

  let parts = uFuzzy.highlight(
    haystack[info.idx[infoIdx]],
    info.ranges[infoIdx],

    mark,
    [],
    append,
  );

  matchEl.append(...parts);

  domElems.push(matchEl);
}

document.getElementById('matches').append(...domElems);
```

How It Works

uFuzzy has two operational modes which differ in matching strategy:

- intraMode: 0 (default) requires all alpha-numeric characters in each search term to exist in the same sequence in all matches. For example, when searching for "cat", this mode is capable of matching the strings below. What is _actually_ matched will depend on additonal fuzziness settings.
  - cat
  - coat
  - scratch
  - cantina
  - tractors are late
- intraMode: 1 allows for a single error in each term of the search phrase, where an error is one of: substitution (replacement), transposition (swap), insertion (addition), or deletion (omission). The search strings with errors below can return matches containing "example". What is _actually_ matched will depend on additonal fuzziness settings. In contrast to the previous mode, searching for "example" will never match "extra maple".
  - example - exact
  - examplle - single insertion (addition)
  - exemple - single substitution (replacement)
  - exmaple - single transposition (swap)
  - exmple - single deletion (omission)
  - xamp - partial
  - xmap - partial with transposition

There are 3 phases to a search:

1. Filter filters the full haystack with a fast RegExp compiled from your needle without doing any extra ops. It returns an array of matched indices in original order.
2. Info collects more detailed stats about the filtered matches, such as start offsets, fuzz level, prefix/suffix counters, etc. It also gathers substring match positions for range highlighting. Finally, it filters out any matches that don't conform to the desired prefix/suffix rules. To do all this it re-compiles the needle into two more-expensive RegExps that can partition each match. Therefore, it should be run on a reduced subset of the haystack, usually returned by the Filter phase. The uFuzzy demo is gated at <= 1,000 filtered items, before moving ahead with this phase.
3. Sort does an Array.sort() to determine final result order, utilizing the info object returned from the previous phase. A custom sort function can be provided via a uFuzzy option: {sort: (info, haystack, needle) => idxsOrder}.

API

A liberally-commented 100 LoC uFuzzy.d.ts file.

Options

Options with an inter prefix apply to allowances _in between_ search terms, while those with an intra prefix apply to allowances _within_ each search term.

Option Description Default Examples

intraMode How term matching should be performed 0 0 MultiInsert
1 SingleError

See How It Works

intraIns Max number of extra chars allowed
between each char within a term 0 Searching "cat"...
0 can match: cat, scat, catch, vacate
1 also matches: cart, chapter, outcast

interIns Max number of extra chars allowed between terms Infinity Searching "where is"...
Infinity can match: where is, where have blah wisdom
5 cannot match: where have blah wisdom

intraSub
intraTrn
intraDel
For intraMode: 1 only,
                Error types to tolerate within terms
0 0 No
1 Yes

intraChars Partial regexp for allowed insert
chars between each char within a term [a-z\d] [a-z\d] matches only alpha-numeric (case-insensitive)
[\w-] would match alpha-numeric, undercore, and hyphen

intraFilt Callback for excluding results based on term & match (term, match, index) => true
                Do your own thing, maybe...
- Length diff threshold
- Levenshtein distance
- Term offset or content

interChars Partial regexp for allowed chars between terms . . matches all chars
[^a-z\d] would only match whitespace and punctuation

interLft Determines allowable term left boundary 0 Searching "mania"...
0 any - anywhere: romanian
1 loose - whitespace, punctuation, alpha-num, case-change transitions: TrackMania, maniac
2 strict - whitespace, punctuation: maniacally

interRgt Determines allowable term right boundary 0 Searching "mania"...
0 any - anywhere: romanian
1 loose - whitespace, punctuation, alpha-num, case-change transitions: ManiaStar
2 strict - whitespace, punctuation: mania_foo

sort Custom result sorting function (info, haystack, needle) => idxsOrder Default: Search sort, prioritizes full term matches and char density
Demo: Typeahead sort, prioritizes start offset and match length

A biased appraisal of similar work

This assessment is extremely narrow and, of course, biased towards my use cases, text corpus, and my complete expertise in operating my own library.
It is highly probable that I'm not taking full advantage of some feature in other libraries that may significantly improve outcomes along some axis;
I welcome improvement PRs from anyone with deeper library knowledge than afforded by my hasty 10min skim over any "Basic usage" example and README doc.

Search quality

Can-of-worms #1.

Before we discuss performance let's talk about search quality, because speed is irrelevant when your results are a strange medly of "Oh yeah!" and "WTF?".

Search quality is very subjective.
What constitutes a good top match in a "typeahead / auto-suggest" case can be a poor match in a "search / find-all" scenario.
Some solutions optimize for the latter, some for the former.
It's common to find knobs that skew the results in either direction, but these are often by-feel and imperfect, being little more than a proxy to producing a single, composite match "score".

Let's take a look at some matches produced by the most popular fuzzy search library, Fuse.js and some others for which match highlighting is implemented in the demo.

Searching for the partial term "twili", we see these results appearing above numerous obvious "twilight" results:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=twili

- twirling
- The total number of received alerts that were invalid.
- Tom Clancy's Ghost Recon Wildlands - ASIA Pre-order Standard Uplay Activation
- theHunter™: Call of the Wild - Bearclaw Lite CB-60

Not only are these poor matches in isolation, but they actually rank higher than literal substrings.

Finishing the search term to "twilight", _still_ scores bizzare results higher:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=twilight

- Magic: The Gathering - Duels of the Planeswalkers Wings of Light Unlock
- The Wild Eight

Some engines do better with partial prefix matches, at the expense of higher startup/indexing cost:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,FlexSearch,match-sorter,MiniSearch&search=twili

Here, match-sorter returns 1,384 results, but only the first 40 are relevant. How do we know where the cut-off is?



Performance

Can-of-worms #2.

All benchmarks suck, but this one might suck more than others.

- I've tried to follow any "best performance" advice when I could find it in each library's docs, but it's a certainty that some stones were left unturned when implementing ~20 different search engines.
- Despite my best efforts, result quality is still extremely variable between libraries, and even between search terms. In some cases, results are very poor but the library is very fast; in other cases, the results are better, but the library is quite slow. What use is extreme speed when the search quality is sub-par? This is a subjective, nuanced topic that will surely affect how you interpret these numbers. I consider uFuzzy's search quality second-to-none, so my view of most faster libraries is typically one of quality trade-offs I'm happy not to have made. I encourage you to evaluate the results for all benched search phrases manually to decide this for yourself.
- Many fulltext & document-search libraries compared here are designed to work best with exact terms rather than partial matches (which this benchmark is skewed towards).

Still, something is better than a hand-wavy YMMV/do-it-yourself dismissal and certainly better than nothing.

Benchmark

- Each benchmark can be run by changing the libs parameter to the desired library name: https://leeoniya.github.io/uFuzzy/demos/compare.html?bench&libs=uFuzzy
- Results output is suppressed in bench mode to avoid benchmarking the DOM.
- Measurements are taken in the Performance secrion of Chrome's DevTools by recording several reloads of the bench page, with forced garbage collection in between. The middle/typical run is used to collect numbers.
- The search corpus is 162,000 words and phrases, loaded from a 4MB testdata.json.
- The benchmark types and then deletes, character-by-character (every 20ms) the following search terms, triggering a search for each keypress: test, chest, super ma, mania, puzz, prom rem stor, twil.

To evaluate the results for each library, or to compare several, simply visit the same page with more libs and without bench: https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,fuzzysort,QuickScore,Fuse&search=super%20ma.

There are several metrics evaluated:

- Init time - how long it takes to load the library and build any required index to perform searching.
- Bench runtime - how long it takes to execute all searches.
- Memory required - peak JS heap size used during the bench as well as how much is still retained after a forced garbage collection at the end.
- GC cost - how much time is needed to collect garbage at the end (main thread jank)



Lib Stars Size (min) Init Search
(x 86) Heap (peak) Retained GC

uFuzzy (try) ★ 1.5k 4KB 0.3ms 1030ms 26.6MB 8MB 30ms

uFuzzy (try)
                (external prefix caching)
460ms 27.5MB 8MB 30ms

uFuzzy (try)
                (outOfOrder, fuzzier)
1275ms 26.6MB 8MB 30ms

uFuzzy (try)
                (outOfOrder, fuzzier, SingleError)
1350ms 27.5MB 8MB 30ms

-------
Fuse.js (try) ★ 14.8k 23.5KB 40ms 35600ms 226MB 14.5MB 30ms

FlexSearch (Light) (try) ★ 8.9k 5.9KB 3600ms 145ms 673MB 316MB 450ms

Lunr.js (try) ★ 8.2k 29.4KB 2500ms 1430ms 379MB 121MB 200ms

Lyra (try) ★ 3.4k 30KB 4000ms 790ms 199MB 89MB 200ms

match-sorter (try) ★ 3.1k 7.3KB 0.03ms 10000ms 39MB 8MB 30ms

fuzzysort (try) ★ 3k 5.5KB 60ms 1850ms 174MB 84MB 70ms

Wade (try) ★ 3k 4KB 1000ms 460ms 438MB 42MB 100ms

fuzzysearch (try) ★ 2.6k 0.2KB 0.1ms 1000ms 28MB 8MB 20ms

js-search (try) ★ 2k 17.1KB 9400ms 1580ms 1760MB 734MB 1400ms

Elasticlunr.js (try) ★ 1.9k 18.1KB 1600ms 1800ms 227MB 70MB 130ms

MiniSearch (try) ★ 1.5k 22.4KB 650ms 2300ms 428MB 64MB 150ms

Fuzzyset (try) ★ 1.3k 2.8KB 4000ms 1140ms 628MB 238MB 600ms

search-index (try) ★ 1.3k

sifter.js (try) ★ 1.1k 7.5KB 3ms 1140ms 40MB 11.3MB 30ms

fuzzy (try) ★ 801 1.4KB 0.05ms 6000ms 41MB 8MB 30ms

fzf-for-js (try) ★ 538 15.4KB 75ms 6700ms 353MB 190MB 160ms

LiquidMetal (try) ★ 285 4.2KB (crash)

fast-fuzzy (try) ★ 270 13.8KB 850ms 10300ms 555MB 165MB 150ms

ItemJS (try) ★ 260

FuzzySearch (try) ★ 184 3.5KB 17ms 10000ms 51MB 11.2MB 30ms

FuzzySearch2 (try) ★ 173 19.4KB 120ms 6000ms 113MB 41MB 30ms

QuickScore (try) ★ 131 9.1KB 40ms 7900ms 172MB 12.8MB 30ms

fzy (try) ★ 115

fuzzyMatch (try) ★ 0 1KB 0.05ms 2500ms 90MB 8MB 30ms

Option	Description	Default	Examples
`intraMode`	How term matching should be performed	`0`	`0` MultiInsert `1` SingleError See How It Works
`intraIns`	Max number of extra chars allowed between each char within a term	`0`	Searching "cat"... `0` can match: cat, scat, catch, vacate `1` also matches: cart, chapter, outcast
`interIns`	Max number of extra chars allowed between terms	`Infinity`	Searching "where is"... `Infinity` can match: where is, where have blah wisdom `5` cannot match: where have blah wisdom
`intraSub` `intraTrn` `intraDel`	For `intraMode: 1` only, Error types to tolerate within terms	`0`	`0` No `1` Yes
`intraChars`	Partial regexp for allowed insert chars between each char within a term	`[a-z\d]`	`[a-z\d]` matches only alpha-numeric (case-insensitive) `[\w-]` would match alpha-numeric, undercore, and hyphen
`intraFilt`	Callback for excluding results based on term & match	`(term, match, index) => true`	Do your own thing, maybe... - Length diff threshold - Levenshtein distance - Term offset or content
`interChars`	Partial regexp for allowed chars between terms	`.`	`.` matches all chars `[^a-z\d]` would only match whitespace and punctuation
`interLft`	Determines allowable term left boundary	`0`	Searching "mania"... `0` any - anywhere: romanian `1` loose - whitespace, punctuation, alpha-num, case-change transitions: TrackMania, maniac `2` strict - whitespace, punctuation: maniacally
`interRgt`	Determines allowable term right boundary	`0`	Searching "mania"... `0` any - anywhere: romanian `1` loose - whitespace, punctuation, alpha-num, case-change transitions: ManiaStar `2` strict - whitespace, punctuation: mania_foo
`sort`	Custom result sorting function	`(info, haystack, needle) => idxsOrder`	Default: Search sort, prioritizes full term matches and char density Demo: Typeahead sort, prioritizes start offset and match length

Lib	Stars	Size (min)	Init	Search (x 86)	Heap (peak)	Retained	GC
uFuzzy (try)	★ 1.5k	4KB	0.3ms	1030ms	26.6MB	8MB	30ms
uFuzzy (try) (external prefix caching)				460ms	27.5MB	8MB	30ms
uFuzzy (try) (outOfOrder, fuzzier)				1275ms	26.6MB	8MB	30ms
uFuzzy (try) (outOfOrder, fuzzier, SingleError)				1350ms	27.5MB	8MB	30ms
-------
Fuse.js (try)	★ 14.8k	23.5KB	40ms	35600ms	226MB	14.5MB	30ms
FlexSearch (Light) (try)	★ 8.9k	5.9KB	3600ms	145ms	673MB	316MB	450ms
Lunr.js (try)	★ 8.2k	29.4KB	2500ms	1430ms	379MB	121MB	200ms
Lyra (try)	★ 3.4k	30KB	4000ms	790ms	199MB	89MB	200ms
match-sorter (try)	★ 3.1k	7.3KB	0.03ms	10000ms	39MB	8MB	30ms
fuzzysort (try)	★ 3k	5.5KB	60ms	1850ms	174MB	84MB	70ms
Wade (try)	★ 3k	4KB	1000ms	460ms	438MB	42MB	100ms
fuzzysearch (try)	★ 2.6k	0.2KB	0.1ms	1000ms	28MB	8MB	20ms
js-search (try)	★ 2k	17.1KB	9400ms	1580ms	1760MB	734MB	1400ms
Elasticlunr.js (try)	★ 1.9k	18.1KB	1600ms	1800ms	227MB	70MB	130ms
MiniSearch (try)	★ 1.5k	22.4KB	650ms	2300ms	428MB	64MB	150ms
Fuzzyset (try)	★ 1.3k	2.8KB	4000ms	1140ms	628MB	238MB	600ms
search-index (try)	★ 1.3k
sifter.js (try)	★ 1.1k	7.5KB	3ms	1140ms	40MB	11.3MB	30ms
fuzzy (try)	★ 801	1.4KB	0.05ms	6000ms	41MB	8MB	30ms
fzf-for-js (try)	★ 538	15.4KB	75ms	6700ms	353MB	190MB	160ms
LiquidMetal (try)	★ 285	4.2KB	(crash)
fast-fuzzy (try)	★ 270	13.8KB	850ms	10300ms	555MB	165MB	150ms
ItemJS (try)	★ 260
FuzzySearch (try)	★ 184	3.5KB	17ms	10000ms	51MB	11.2MB	30ms
FuzzySearch2 (try)	★ 173	19.4KB	120ms	6000ms	113MB	41MB	30ms
QuickScore (try)	★ 131	9.1KB	40ms	7900ms	172MB	12.8MB	30ms
fzy (try)	★ 115
fuzzyMatch (try)	★ 0	1KB	0.05ms	2500ms	90MB	8MB	30ms

μFuzzy

README

▒ μFuzzy

Overview

Features

Demos

Questions?

Installation

Node

Browser

Example

Match Highlighting

How It Works

API

Options

A biased appraisal of similar work

Search quality

Performance

Benchmark

关注公众号