
Change #8190

Category None
Changed by Mike Rylander <mrylanderohnoyoudont@gmail.com>
Changed at Fri 05 Nov 2021 12:04:45
Repository git://git.evergreen-ils.org/Evergreen.git
Project Evergreen
Branch master
Revision 32c880ddc7a1209e5239dbd07b6b1531fb74b1b0

Comments

LP#1947173: Speed up the symspell part of ingest
For certain data, and certain data set sizes, merging the suggestion
arrays used by the symspell algorithm is noticeably expensive.  This is
the case for suggestion arrays containing many thousands of entries.
These suggestion sets are not only slow, but generally not useful.  We
avoid the creation of such overly long suggestion sets using several
word filters that take advantage of our knowledge of the incoming data
to optimize for what is useful in a bibliographic context.  The
mechanisms employed by this patch are:

- Omit suggestions whose length is longer than the max prefix key length
  when the prefix key length is less than or equal to the maximum prefix
  key length minus the maximum edit distance.
- Omit words that contain a run of 5 or more digits. This will drop most
  identifiers from the dictionary while still allowing suggestions to
  happen for year values.
- Omit empty keys from the dictionary.  This should have been the case
  already but is now enforced directly.
- Add a small speedup to evergreen.text_array_merge_unique() by making
  it assume that arrays passed to it do not have null values, which we
  intentionally avoid, and against which we protect in other ways in
  this commit.
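
As a rough illustration only, the two simplest filters above (the digit-run
rule and the empty-key rule) could be sketched as follows.  This is not the
code from the patch, which changes the database-side ingest functions; the
names keep_word and DIGIT_RUN are hypothetical, and the sketch assumes
"digits" means ASCII 0-9.

```python
import re

# Hypothetical pattern: a run of 5 or more ASCII digits anywhere in the word.
DIGIT_RUN = re.compile(r"[0-9]{5,}")


def keep_word(word: str) -> bool:
    """Return True if a word should be added to the dictionary.

    Illustrative only: mirrors the digit-run and empty-key rules
    described in the commit message, not the actual implementation.
    """
    if not word:
        # Empty keys never enter the dictionary.
        return False
    if DIGIT_RUN.search(word):
        # Long digit runs are most likely identifiers, not search terms.
        return False
    return True


words = ["tolkien", "1999", "9780131103627", "", "ocm12345678"]
kept = [w for w in words if keep_word(w)]
```

Under these assumptions a four-digit year such as "1999" is kept, while
identifier-like strings with longer digit runs are dropped.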

Besides improving reingest speed, this patch will also make the
search.symspell_dictionary table significantly smaller.

Signed-off-by: Mike Rylander <mrylander@gmail.com>
Signed-off-by: Galen Charlton <gmc@equinoxOLI.org>

Changed files