
Regex match between two arrays of strings

I have two arrays:

sentences_ary = ['This is foo', 'bob is cool'] 

words_ary = ['foo', 'lol', 'something']

I want to check whether any element of sentences_ary matches any word from words_ary.

I'm able to check for one word, but could not do it with words_ary.

#This is working
['This is foo', 'bob is cool'].any? { |s| s.match(/foo/)} 

But I'm not able to make it work with an array of words as the regex. I'm always getting true from this:

# This is not working    
['This is foo', 'bob is cool'].any? { |s| ['foo', 'lol', 'something'].any? { |w| w.match(/s/) } }

I'm using this in an if condition.

You could use Regexp.union and Enumerable#grep:

sentences_ary.grep(Regexp.union(words_ary))
#=> ["This is foo"]
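For the true/false check the question asks for, the same union can be handed to any?. The sketch below also shows why the original nested block always returned true: w.match(/s/) tests each word against the literal letter "s" instead of testing the sentence against the word. The method names are hypothetical, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# The question's attempt: /s/ is the letter "s", not the block variable s,
# so this is true as soon as any word contains an "s" ("something" does).
def broken_check(sentences, words)
  sentences.any? { |s| words.any? { |w| w.match(/s/) } }
end

# Nested version with the roles the right way round:
# does the sentence contain the word as a substring?
def nested_check(sentences, words)
  sentences.any? { |s| words.any? { |w| s.include?(w) } }
end

# Same substring check with one regex built from the whole word list.
def union_check(sentences, words)
  words_regex = Regexp.union(words) # /foo|lol|something/ for the sample words
  sentences.any? { |s| s.match?(words_regex) }
end

words_ary = ['foo', 'lol', 'something']

puts broken_check(['bob is cool'], words_ary)                # true (the bug)
puts nested_check(['bob is cool'], words_ary)                # false
puts union_check(['bob is cool'], words_ary)                 # false
puts union_check(['This is foo', 'bob is cool'], words_ary)  # true
```

Both corrected versions match substrings, so "foolish" would still count as a hit for "foo".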

RegexpTrie improves this:

require 'regexp_trie'

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']

words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:foo|lol|something))\b/i

sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

You have to be careful how you construct your regex pattern, otherwise you can get false-positive hits. That can be a difficult bug to track down.

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']
words_regex = /\b(?:#{ Regexp.union(words_ary).source })\b/ # => /\b(?:foo|lol|something)\b/
sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

The generated pattern /\b(?:foo|lol|something)\b/ looks for word boundaries, so it finds whole words, not just substrings.

Also, notice the use of source. This is very important because its absence can lead to a bug that is very hard to locate. Compare these two regexps:
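A quick way to see the difference the word boundaries make is to run the same strings against a bare union and a bounded one. This is only a sketch; word_regex is a hypothetical helper name, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: wraps the union of the words in \b ... \b.
def word_regex(words)
  /\b(?:#{Regexp.union(words).source})\b/
end

words_ary = ['foo', 'lol', 'something']

plain   = Regexp.union(words_ary) # /foo|lol|something/
bounded = word_regex(words_ary)   # /\b(?:foo|lol|something)\b/

puts 'This is foolish'.match?(plain)   # true  (substring hit inside "foolish")
puts 'This is foolish'.match?(bounded) # false (no whole word matches)
puts 'foo bar'.match?(bounded)         # true
```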

/#{ Regexp.union(words_ary).source }/ # => /foo|lol|something/
/#{ Regexp.union(words_ary) }/        # => /(?-mix:foo|lol|something)/

Notice how the second one has the embedded flags (?-mix:...). They change the flags for the enclosed pattern, inside the surrounding pattern. The inner pattern can then behave differently from the surrounding one, which silently produces results you don't expect.

Even the Regexp.union documentation shows the situation but doesn't mention why it can be bad:

Regexp.union(/dogs/, /cats/i)        #=> /(?-mix:dogs)|(?i-mx:cats)/

Notice that in this case the two patterns have different flags. On our team we use union often, but I'm always careful to check how it's being used during peer reviews. I got bitten by this once, and it was tough figuring out what was wrong, so I am very sensitive to it. Though union accepts patterns, as in the example, I recommend not passing them and instead using an array of words or the pattern as a string, to avoid those pesky flags sneaking in. There's a time and place for them, but knowing about this allows us to control when they get used.

Read through the Regexp documentation multiple times, as there's a lot to learn and it will be overwhelming the first several passes through it.

And, for extra credit, read:

Another way:
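As a concrete illustration of how the embedded flags can bite, here is a minimal sketch in which an outer /i flag is cancelled by the (?-mix:...) group that appears when source is left out. The two builder method names are hypothetical.

```ruby
# Interpolates only the pattern text, so the outer /i applies to the words.
def build_with_source(words)
  /\b(?:#{Regexp.union(words).source})\b/i
end

# Interpolates the Regexp itself, which brings its own (?-mix:...) flags along.
def build_without_source(words)
  /\b(?:#{Regexp.union(words)})\b/i
end

words_ary = ['foo', 'lol', 'something']

with_source    = build_with_source(words_ary)
without_source = build_without_source(words_ary)

puts with_source.inspect    # /\b(?:foo|lol|something)\b/i
puts without_source.inspect # /\b(?:(?-mix:foo|lol|something))\b/i

puts 'This is FOO'.match?(with_source)    # true
puts 'This is FOO'.match?(without_source) # false, the inner -i switches /i off
```

The second regex looks case-insensitive at a glance, which is what makes this kind of mismatch hard to spot.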

def good_sentences(sentences_ary, words_ary)
  sentences_ary.select do |s|
    (s.downcase.gsub(/[^a-z\s]/,'').split & words_ary).any?
  end
end

For the example:

sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For a case where the letter case differs:

words_ary = ['this', 'lol', 'something']
good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For a punctuation case:

sentences_ary = ['This is Foo!', 'bob is very "cool" indeed!']
words_ary = ['foo', 'lol', 'cool']
good_sentences(sentences_ary, words_ary)
  #=> ["This is Foo!", "bob is very \"cool\" indeed!"]
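One thing to keep in mind with this approach is that only the sentences are downcased, so the entries in words_ary have to be lowercase already. A possible variant that normalizes both sides is sketched below; good_sentences_ci is a hypothetical name, not part of the answer above.

```ruby
# Variant of good_sentences that also downcases the word list,
# so words_ary may contain mixed-case entries.
def good_sentences_ci(sentences_ary, words_ary)
  wanted = words_ary.map(&:downcase)
  sentences_ary.select do |s|
    # Lowercase, drop everything except a-z and whitespace, split into words,
    # then keep the sentence if it shares at least one word with the list.
    (s.downcase.gsub(/[^a-z\s]/, '').split & wanted).any?
  end
end

p good_sentences_ci(['This is foo', 'bob is cool'], ['Foo', 'LOL'])
#=> ["This is foo"]
```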
