Splitting a string in Ruby using a regular expression built from a Ruby list

Question

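I am trying to split a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I define the regex pattern directly; however, it does not work when I try to define the pattern by evaluating it inside the split call itself.

In practice, I want to read an external file of stop words and use it to split my sentences. So I would like to construct the pattern from the external file rather than specify it directly. I have also noticed that I get very different behavior when I use 'pp' versus 'puts', and I don't know why. I am using Ruby 2.0 and Notepad++ on Windows.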

 require 'pp'
 str = "The force be with you."     
 pp str.split(/(?:\bthe\b|\bwith\b)/i)
 => ["", " force be ", " you."]
 pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?)
 => ["force be", "you."]

The last array above is the result I want. However, this does not work:

 require 'pp'
 stop_array = ["the", "with"]
 str = "The force be with you." 
 pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")"
 puts pattern
 => (?thwit)
 puts str.split(/#{pattern}/i)
 => The force be with you.
 pp pattern
 => "(?:\bthe\b|\bwith\b)"
 pp str.split(/#{pattern}/i)
 => ["The force be with you."]

Update: using the comments below, I revised my original script. I also created a method that splits the string.

 require 'pp'

 class String
   def splitstop(stopwords=[])
     stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
     split(stopwords_regex).collect(&:strip).reject(&:empty?)
   end
 end

 stop_array = ["the", "with", "over"]

 pp "The force be with you.".splitstop stop_array
 => ["force be", "you."]
 pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array
 => ["quick brown fox jumps", "lazy dog."]
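The question mentions loading the stop words from an external file. A minimal sketch of that step follows, assuming a file with one stop word per line. The `stopwords_regex` helper and the temporary file standing in for the real stop-word file are hypothetical and only there to make the example self-contained.

```ruby
require 'tempfile'

# Hypothetical helper: build one case-insensitive regex from a list of stop words.
def stopwords_regex(words)
  /\b(?:#{Regexp.union(words).source})\b/i
end

# Stand-in for the external stop-word file (assumed format: one word per line).
file = Tempfile.new('stopwords')
file.write("the\nwith\n\nover\n")
file.close

# Read the words, trimming whitespace and line endings and skipping blank lines.
stop_words = File.readlines(file.path).map(&:strip).reject(&:empty?)
file.unlink

regex = stopwords_regex(stop_words)
phrases = "The quick brown fox jumps over the lazy dog."
            .split(regex).map(&:strip).reject(&:empty?)
p phrases
```

Because `strip` also removes a trailing carriage return, the same reading step should cope with files saved with Windows line endings.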

Answer 1

I'd do it this way:

str = "The force be with you."     
stop_array = %w[the with]
stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it is important to pay attention to the actual pattern that is generated:

/(?:#{ Regexp.union(stop_array) })/i
=> /(?:(?-mix:the|with))/i

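The embedded (?-mix: turns off the case-insensitive flag inside the pattern, which can break the pattern and cause it to match the wrong things. Instead, you have to tell the engine to return only the pattern source, without the flags: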

/(?:#{ Regexp.union(stop_array).source })/i
=> /(?:the|with)/i
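To make the difference concrete, here is a small sketch comparing the two interpolations against a capitalized word. The variable names `with_flags` and `source_only` are illustrative only.

```ruby
stop_array = %w[the with]

# Interpolating the Regexp object embeds its own flags, which override the outer /i.
with_flags = /(?:#{Regexp.union(stop_array)})/i
# Interpolating only the source lets the outer /i apply to the whole pattern.
source_only = /(?:#{Regexp.union(stop_array).source})/i

p with_flags.source    # "(?:(?-mix:the|with))"
p source_only.source   # "(?:the|with)"

p with_flags =~ "The"   # nil, because case-insensitivity is switched off inside the group
p source_only =~ "The"  # 0
```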

This is why pattern = "(?:\bthe\b|\bwith\b)" does not work:

/#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i

Inside a double-quoted string, Ruby treats "\b" as a backspace character. Use this instead:

pattern = "(?:\\bthe\\b|\\bwith\\b)"
/#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i
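The escaping difference can be checked directly by looking at the bytes each literal produces. The sketch below is illustrative; the variable names are not from the original posts. It likely also accounts for the 'pp' versus 'puts' difference in the question: puts writes the raw backspace characters, which a typical terminal renders by moving the cursor back so that the next character overwrites the previous one (hence "(?thwit)"), whereas pp prints the inspected string with the escapes visible.

```ruby
double_quoted = "\b"    # escape sequence: a single backspace character (byte 8)
escaped       = "\\b"   # a backslash followed by the letter b
single_quoted = '\b'    # single quotes leave the backslash untouched

p double_quoted.bytes   # [8]
p escaped.bytes         # [92, 98]
p single_quoted.bytes   # [92, 98]

# Only the two-character form acts as a word boundary once interpolated into a regex.
p "the force" =~ /#{escaped}force#{escaped}/              # 4
p "the force" =~ /#{double_quoted}force#{double_quoted}/  # nil (looks for literal backspaces)
```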

Answer 2

You have to escape the backslashes:

"\\b#{i}\\b"

i.e.

pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")"

And with a slight improvement/simplification:

pattern = "\\b(?:" + stop_array.join("|") + ")\\b"

Then:

str.split(/#{pattern}/i) # => ["", " force be ", " you."]

If your stop list is short, I think this is the right approach.

Answer 3

stop_array = ["the", "with"]
re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i})

"The force be with you.".split(re) # =>
[
  "",
  "force be",
  "you."
]
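The Regexp.escape call above matters when a stop word contains regex metacharacters. A short sketch, using a made-up stop word "a.b" purely for illustration:

```ruby
word = "a.b"   # hypothetical stop word containing a regex metacharacter

unescaped = /#{word}/                  # "." matches any character
escaped   = /#{Regexp.escape(word)}/   # "." matches only a literal dot

p Regexp.escape(word)           # "a\\.b"
p unescaped =~ "axb"            # 0
p escaped =~ "axb"              # nil
p escaped =~ "a.b"              # 0

# Regexp.union escapes plain strings in the same way.
p Regexp.union([word]).source   # "a\\.b"
```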

Answer 4

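require 'pp'

s = "the force be with you."
stop_words = %w|the with is|
# dynamically create a case-insensitive regexp;
# \b keeps a stop word such as "is" from matching inside "this"
regexp = Regexp.new("\\b(?:#{Regexp.union(stop_words).source})\\b", Regexp::IGNORECASE)
result = []
while (match = regexp.match(s))
  # keep the text before the match, unless there is none
  result << match.pre_match unless match.pre_match.empty?
  s = match.post_match
end
# the last unmatched content, if any
result << s
# strip whitespace and drop pieces that end up empty
result = result.map(&:strip).reject(&:empty?)

pp result
=> ["force be", "you."]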

4 answers

Answer 1
3 accepted 2013-06-12 07:32:02

Answer 2
0 2013-06-12 06:59:16

Answer 3
0 2013-06-12 07:06:10

Answer 4
0 2013-06-12 10:21:03
