Change #8190
Category | None |
Changed by | Mike Rylander <mrylander@gmail.com> |
Changed at | Fri 05 Nov 2021 12:04:45 |
Repository | git://git.evergreen-ils.org/Evergreen.git |
Project | Evergreen |
Branch | master |
Revision | 32c880ddc7a1209e5239dbd07b6b1531fb74b1b0 |
Comments
LP#1947173: Speed up the symspell part of ingest

For certain data, and certain data set sizes, merging the suggestion arrays used by the symspell algorithm is noticeably expensive. This is the case for suggestion arrays containing many thousands of entries. These suggestion sets are not only slow to merge, but generally not useful.

We avoid the creation of such overly long suggestion sets using several word filters that take advantage of our knowledge of the incoming data to optimize for what is useful in a bibliographic context.

The mechanisms employed by this patch are:

- Omit suggestions whose length is longer than the max prefix key length when the prefix key length is less than or equal to the maximum prefix key length minus the maximum edit distance.
- Omit words that contain a run of 5 or more digits. This will drop most identifiers from the dictionary while still allowing suggestions to happen for year values.
- Omit empty keys from the dictionary. This should have been the case already but is now enforced directly.
- Add a small speedup to evergreen.text_array_merge_unique() by making it assume that arrays passed to it do not have null values, which we intentionally avoid, and against which we protect in other ways in this commit.

Besides improving reingest speed, the patch will also make the search.symspell_dictionary table significantly smaller.

Signed-off-by: Mike Rylander <mrylander@gmail.com>
Signed-off-by: Galen Charlton <gmc@equinoxOLI.org>
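
As a rough illustration of the digit-run and empty-key filters and of a null-free array merge described in the comments above, here is a minimal Python sketch. It is not the patch's actual SQL or Perl; the names keep_word and merge_unique are hypothetical, and the ordering behavior of the merge is an assumption made for the example.

```python
import re

# Assumption: a "run of 5 or more digits" means five or more consecutive ASCII digits.
LONG_DIGIT_RUN = re.compile(r"[0-9]{5,}")


def keep_word(word):
    """Hypothetical filter: True if the word may enter the dictionary."""
    if not word:
        # Empty keys are never stored.
        return False
    if LONG_DIGIT_RUN.search(word):
        # Long digit runs are most likely identifiers, not search terms.
        return False
    return True


def merge_unique(left, right):
    """Hypothetical merge: concatenate and drop duplicates.

    Assumes neither list contains None, mirroring the no-null
    assumption described for evergreen.text_array_merge_unique().
    Keeping first-seen order is an assumption of this sketch.
    """
    return list(dict.fromkeys([*left, *right]))
```

Under these assumptions a four-digit year such as "1984" passes the filter, while a 13-digit identifier is dropped before it can inflate any suggestion array.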
Changed files
- Open-ILS/src/sql/Pg/300.schema.staged_search.sql
- Open-ILS/src/sql/Pg/upgrade/XXXX.schema.symspell-speed-ingest.sql
- Open-ILS/src/support-scripts/symspell-sideload.pl