简体   繁体   中英

A faster way to find out if any word in a list is a substring in a longer word

What I have

  • A string s of length m (where m > 3)
  • A huge list L of words (length > 2)

What I want to know

  • Is any word in L a substring of s

Currently i have L split up into files, one file each for each word-length 16.txt , 15.txt ... 04.txt
I then iterate over these files from n.txt --> 04.txt and basically do

cat n.txt | while read w; do if [[ $s =~ $w ]] ; then echo $w; fi; done

It's painfully slow there has to be a better way to do this.

Additional info:

  • L contains about 200k words and is fairly static so I don't mind a complicated time consuming setup if it means greater speed
  • There are several different such lists L but only one of them needs to be searched at a time.

I'm pretty language agnostic (se tags) but pseudo code is also fine

You can use grep without splitting the list:

grep -oFf list.txt <<< "$s"
  • -o will only output the matching substrings
  • -F will interpret the lines in list.txt as fixed strings, not regular expressions
  • -f will tell grep what file to use as the source of the patterns to match
  • <<< in bash takes the following word and makes it input to the preceding command

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM