Splitting string in Ruby on list of words using regex

I'm trying to split a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I write the regular expression literal directly; however, it doesn't work when I build the pattern as a string and interpolate it into the regex passed to split.

In practice, I want to read an external file of stop words and use it to split my sentences, so I need to construct the pattern from that file rather than hard-code it. I also notice that 'pp' and 'puts' print the pattern very differently, and I'm not sure why. I'm using Ruby 2.0 and Notepad++ on Windows.

 require 'pp'
 str = "The force be with you."     
 pp str.split(/(?:\bthe\b|\bwith\b)/i)
 => ["", " force be ", " you."]
 pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?)
 => ["force be", "you."] 

The final array above is my desired result. However, the following version, which builds the pattern from an array, does not work:

 require 'pp'
 stop_array = ["the", "with"]
 str = "The force be with you." 
 pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")"
 puts pattern
 => (?thwit)
 puts str.split(/#{pattern}/i)
 => The force be with you.
 pp pattern
 => "(?:\bthe\b|\bwith\b)"
 pp str.split(/#{pattern}/i)
 => ["The force be with you."]

Update: Based on the answers below, I modified my original script and wrapped the logic in a method that splits the string.

 require 'pp'

 class String
   def splitstop(stopwords=[])
     stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
     split(stopwords_regex).collect(&:strip).reject(&:empty?)
   end
 end

 stop_array = ["the", "with", "over"]

 pp "The force be with you.".splitstop stop_array
 => ["force be", "you."]
 pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array
 => ["quick brown fox jumps", "lazy dog."]
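For the external-file case mentioned in the question, the same idea could be sketched as follows. This assumes a plain-text file with one stop word per line. The helper names `load_stop_words` and `split_with_stop_file` are hypothetical, and a temporary file stands in for the real stop-word file.

```ruby
require 'tempfile'

# Hypothetical helper: read one stop word per line, skipping blank lines.
def load_stop_words(path)
  File.readlines(path).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: same logic as splitstop above, without reopening String.
def split_with_stop_file(text, path)
  stop_words = load_stop_words(path)
  regex = /\b(?:#{Regexp.union(stop_words).source})\b/i
  text.split(regex).map(&:strip).reject(&:empty?)
end

# A temporary file stands in for the real stop-word file (an assumption).
Tempfile.create("stopwords") do |f|
  f.write("the\nwith\nover\n")
  f.flush
  p split_with_stop_file("The force be with you.", f.path)
  # => ["force be", "you."]
end
```

Because `Regexp.union` escapes each string, stop words containing regex metacharacters should not need special handling here.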

I'd do it this way:

str = "The force be with you."     
stop_array = %w[the with]
stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it's important to watch out for the actual pattern that is generated:

/(?:#{ Regexp.union(stop_array) })/i
=> /(?:(?-mix:the|with))/i

The embedded (?-mix: turns off the case-insensitive flag inside the group, so the outer /i no longer applies to the stop words and the pattern can match the wrong things. Instead, interpolate just the pattern source, without the flags:

/(?:#{ Regexp.union(stop_array).source })/i
=> /(?:the|with)/i
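The effect on the split is easy to check. The sketch below contrasts the two interpolation styles on the string from the question; the helper names `union_with_flags` and `union_source_only` are hypothetical.

```ruby
# Hypothetical helpers contrasting the two interpolation styles.
def union_with_flags(words)
  # Interpolating the Regexp itself inserts its to_s form, "(?-mix:...)".
  /(?:#{Regexp.union(words)})/i
end

def union_source_only(words)
  # Interpolating .source inserts only the bare alternation.
  /(?:#{Regexp.union(words).source})/i
end

str = "The force be with you."
stop_array = %w[the with]

p union_with_flags(stop_array).source   # => "(?:(?-mix:the|with))"
p union_source_only(stop_array).source  # => "(?:the|with)"

# "The" is missed in the first case because the inner group is case-sensitive.
p str.split(union_with_flags(stop_array))   # => ["The force be ", " you."]
p str.split(union_source_only(stop_array))  # => ["", " force be ", " you."]
```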

Here's why pattern = "(?:\bthe\b|\bwith\b)" doesn't work:

/#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i

In a double-quoted string, Ruby treats "\b" as a backspace character, not as a word boundary. Instead, escape the backslash:

pattern = "(?:\\bthe\\b|\\bwith\\b)"
/#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i
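A quick way to see the difference is to inspect the bytes of each string. This is only a sketch, and the helper names `backspace_pattern` and `boundary_pattern` are hypothetical.

```ruby
# In a double-quoted string, "\b" is one backspace character (byte 8),
# while "\\b" is a backslash followed by the letter b.
p "\b".bytes   # => [8]
p "\\b".bytes  # => [92, 98]
p '\b'.bytes   # => [92, 98] (single quotes leave the backslash alone)

# Hypothetical helper: the broken version, built with literal backspaces.
def backspace_pattern(word)
  Regexp.new("\b#{word}\b", Regexp::IGNORECASE)
end

# Hypothetical helper: the working version, built with word boundaries.
def boundary_pattern(word)
  Regexp.new("\\b#{word}\\b", Regexp::IGNORECASE)
end

p backspace_pattern("the") =~ "The force"  # => nil
p boundary_pattern("the") =~ "The force"   # => 0
```

This is also why puts showed a mangled pattern in the question: it writes the raw backspace characters to the terminal, whereas pp prints the string's inspect form, which renders each backspace as the escape sequence \b.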

You have to escape the backslashes:

"\\b#{i}\\b" 

i.e.,

pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")"

And a minor simplification, which factors the word boundaries out of the alternation:

pattern = "\\b(?:" + stop_array.join("|") + ")\\b"

Then:

str.split(/#{pattern}/i) # => ["", " force be ", " you."]
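One caveat with joining raw strings: a stop word that contains a regex metacharacter would be interpreted as pattern syntax. A hedged sketch of the same construction with `Regexp.escape` applied to each word follows; the helper name `stop_word_pattern` and the sample word "a.m" are assumptions for illustration.

```ruby
# Hypothetical helper: build the pattern from strings, escaping each word.
def stop_word_pattern(words)
  body = words.map { |w| Regexp.escape(w) }.join("|")
  Regexp.new("\\b(?:" + body + ")\\b", Regexp::IGNORECASE)
end

p "The force be with you.".split(stop_word_pattern(%w[the with]))
# => ["", " force be ", " you."]

# Without escaping, the dot in "a.m" matches any character.
p(/\b(?:a.m)\b/i =~ "arm")              # => 0
p(stop_word_pattern(["a.m"]) =~ "arm")  # => nil
p(stop_word_pattern(["a.m"]) =~ "a.m")  # => 0
```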

If your stop list is short, I think this is the right approach:

stop_array = ["the", "with"]
re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i})

"The force be with you.".split(re) # => ["", "force be", "you."]
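Because each alternative already absorbs the surrounding whitespace, no strip step is needed; only the empty strings remain to be filtered out. A minimal sketch wrapping this in a method (the name `split_on_stop_words` is hypothetical):

```ruby
# Hypothetical helper: one case-insensitive regex per stop word,
# each swallowing the whitespace around the word.
def split_on_stop_words(text, stop_words)
  re = Regexp.union(stop_words.map { |w| /\s*\b#{Regexp.escape(w)}\b\s*/i })
  # Empty strings appear when a stop word starts the text or when
  # two stop words are adjacent, so drop them.
  text.split(re).reject(&:empty?)
end

p split_on_stop_words("The force be with you.", %w[the with])
# => ["force be", "you."]
p split_on_stop_words("The quick brown fox jumps over the lazy dog.", %w[the with over])
# => ["quick brown fox jumps", "lazy dog."]
```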

require 'pp'

s = "the force be with you."
stop_words = %w|the with is|
# dynamically create a case-insensitive regexp
regexp = Regexp.new stop_words.join('|'), true
result = []
while (match = regexp.match(s))
  # keep the text before the stop word, skipping empty pieces
  result << match.pre_match unless match.pre_match.empty?
  s = match.post_match
end
# the last unmatched content, if any
result << s
# compact! and strip! return nil when nothing changes, so use non-bang methods
result = result.map(&:strip).reject(&:empty?)

pp result
=> ["force be", "you."]
