
Regex match between two arrays of strings

I have two arrays

sentences_ary = ['This is foo', 'bob is cool'] 

words_ary = ['foo', 'lol', 'something']

I want to check whether any element of sentences_ary matches any word from words_ary.

I'm able to check for one word, but could not do it with words_ary.

#This is working
['This is foo', 'bob is cool'].any? { |s| s.match(/foo/)} 

But I can't make it work against an array of words. I always get true from this:

# This is not working    
['This is foo', 'bob is cool'].any? { |s| ['foo', 'lol', 'something'].any? { |w| w.match(/s/) } }

I'm using this in an if condition.

You could use Regexp.union and Enumerable#grep:

sentences_ary.grep(Regexp.union(words_ary))
#=> ["This is foo"]
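The attempt in the question always returns true because it matches each word against /s/, and 'something' contains an "s"; the sentence is never tested. Since the result is needed in an if condition, here is a minimal sketch of two boolean forms. The helper names are made up for illustration, and match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: true if any sentence contains any of the words
# as a substring, using a single pattern built by Regexp.union.
def any_sentence_matches?(sentences, words)
  pattern = Regexp.union(words) # e.g. /foo|lol|something/
  sentences.any? { |s| s.match?(pattern) }
end

# Hypothetical helper: the nested form from the question, with the
# sentence (s) tested against each word (w) instead of against /s/.
def any_sentence_includes?(sentences, words)
  sentences.any? { |s| words.any? { |w| s.include?(w) } }
end

sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

if any_sentence_matches?(sentences_ary, words_ary)
  puts 'at least one sentence matched'
end
```

Both forms do plain substring matching, so 'foolish' would also count as a hit for 'foo'.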

The regexp_trie gem can improve on this; it builds the alternation from a trie, which helps when many words share prefixes:

require 'regexp_trie'

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']

words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:foo|lol|something))\b/i

sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

You have to be careful how you construct your regex pattern, otherwise you can get false-positive hits. That can be a difficult bug to track down.

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']
words_regex = /\b(?:#{ Regexp.union(words_ary).source })\b/ # => /\b(?:foo|lol|something)\b/
sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

The generated /\b(?:foo|lol|something)\b/ pattern anchors the alternation at word boundaries, so it finds whole words, not just sub-strings.
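To see the false positive that the word boundaries prevent, here is a small sketch comparing the two patterns side by side. The method names are illustrative only.

```ruby
# Hypothetical helper: plain alternation, matches the words anywhere,
# including inside longer words.
def substring_matches(sentences, words)
  sentences.grep(Regexp.union(words))
end

# Hypothetical helper: the same alternation wrapped in \b...\b so only
# whole words match.
def whole_word_matches(sentences, words)
  sentences.grep(/\b(?:#{Regexp.union(words).source})\b/)
end

sentences = ['This is foo', 'This is foolish', 'bob is cool']
words = ['foo', 'lol', 'something']

substring_matches(sentences, words)  # => ["This is foo", "This is foolish"]
whole_word_matches(sentences, words) # => ["This is foo"]
```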

Also, notice the use of source. This is very important because its absence can lead to a hard-to-locate bug. Compare these two regexps:

/#{ Regexp.union(words_ary).source }/ # => /foo|lol|something/
/#{ Regexp.union(words_ary) }/        # => /(?-mix:foo|lol|something)/

Notice how the second one has the embedded flags (?-mix:...). They set the flags for the enclosed pattern, independently of the surrounding pattern. The inner pattern can therefore behave differently from the surrounding one, for example by ignoring a flag set on the outer regexp, and silently add or drop matches you don't expect.
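A concrete case of the flag problem, as a sketch: an outer /i flag is cancelled by the embedded (?-mix:...) group when source is left out. The method names are invented for the example.

```ruby
# Hypothetical helper: interpolates only the pattern text, so the outer
# /i flag applies to the whole alternation.
def pattern_with_source(words)
  /#{Regexp.union(words).source}/i
end

# Hypothetical helper: interpolates the Regexp itself, which embeds
# (?-mix:...) and switches case-insensitivity back off inside the group.
def pattern_without_source(words)
  /#{Regexp.union(words)}/i
end

words = ['foo', 'lol']

pattern_with_source(words)    # => /foo|lol/i
pattern_without_source(words) # => /(?-mix:foo|lol)/i

pattern_with_source(words).match?('This is FOO')    # => true
pattern_without_source(words).match?('This is FOO') # => false
```

Both patterns look case-insensitive at a glance because of the trailing i, which is what makes the second one hard to spot in review.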

Even the Regexp.union documentation shows this situation but doesn't mention why it can be bad:

Regexp.union(/dogs/, /cats/i)        #=> /(?-mix:dogs)|(?i-mx:cats)/

Notice that in this case the two patterns carry different flags. On our team we use union often, but I always look closely at how it's being used during peer reviews. I got bitten by this once, and it was tough to figure out what was wrong, so I'm very sensitive to it. Although union accepts patterns, as in the example, I recommend passing an array of words or the pattern as a string instead, to keep those pesky flags from sneaking in. There's a time and place for them, but knowing about this lets us control when they get used.
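One way to follow that advice, sketched under the assumption that a single flag should apply to every word: pass plain strings to union, take the source, and set the flag once on the outer pattern. The method name is hypothetical.

```ruby
# Hypothetical helper: builds one case-insensitive pattern from plain
# strings. Regexp.union escapes regexp metacharacters in strings, and
# the flag is applied once, on the outer Regexp.
def case_insensitive_union(words)
  Regexp.new(Regexp.union(words).source, Regexp::IGNORECASE)
end

case_insensitive_union(['dogs', 'cats']) # => /dogs|cats/i

case_insensitive_union(['dogs', 'cats']).match?('CATS and DOGS') # => true
case_insensitive_union(['a.b']).match?('axb')                    # => false, the dot is escaped
```

Because no Regexp objects are passed in, there are no per-pattern flags to get embedded in the result.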

Read through the Regexp documentation multiple times, as there's a lot to learn and it will be overwhelming the first several passes through it.


Another way:

def good_sentences(sentences_ary, words_ary)
  sentences_ary.select do |s|
    (s.downcase.gsub(/[^a-z\s]/,'').split & words_ary).any?
  end
end

For the example:

sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For a case where the letter case differs:

words_ary = ['this', 'lol', 'something']
good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For a punctuation case:

sentences_ary = ['This is Foo!', 'bob is very "cool" indeed!']
words_ary = ['foo', 'lol', 'cool']
good_sentences(sentences_ary, words_ary)
  #=> ["This is Foo!", "bob is very \"cool\" indeed!"]
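Note that good_sentences only lower-cases and strips the sentences, so it relies on words_ary already being lower-case and free of punctuation. A possible variant that normalizes both sides is sketched below; the method names are assumptions, and like the original it keeps only ASCII letters.

```ruby
# Hypothetical helper: lower-case the text, drop everything except
# ASCII letters and whitespace, and split into words.
def normalize_words(text)
  text.downcase.gsub(/[^a-z\s]/, '').split
end

# Hypothetical variant of good_sentences that normalizes the word list
# the same way as the sentences before intersecting.
def good_sentences_normalized(sentences, words)
  wanted = words.flat_map { |w| normalize_words(w) }
  sentences.select { |s| (normalize_words(s) & wanted).any? }
end

good_sentences_normalized(['This is Foo!', 'bob is cool'], ['FOO', 'lol'])
  #=> ["This is Foo!"]
```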
