
Splitting string in Ruby on list of words using regex

I'm trying to split a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I write the regular expression literal directly; however, it doesn't work when I build the pattern as a string and interpolate it into the regex passed to split.

In practice, I want to read an external file of stop words and use it to split my sentences, so I need to construct the pattern from that file rather than hard-code it. I also notice that 'pp' and 'puts' show very different output for the same values, and I'm not sure why. I'm using Ruby 2.0 and Notepad++ on Windows.

 require 'pp'
 str = "The force be with you."     
 pp str.split(/(?:\bthe\b|\bwith\b)/i)
 => ["", " force be ", " you."]
 pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?)
 => ["force be", "you."] 

The final array above is my desired result. However, the following version, which builds the pattern from an array, doesn't work:

 require 'pp'
 stop_array = ["the", "with"]
 str = "The force be with you." 
 pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")"
 puts pattern
 => (?thwit)
 puts str.split(/#{pattern}/i)
 => The force be with you.
 pp pattern
 => "(?:\bthe\b|\bwith\b)"
 pp str.split(/#{pattern}/i)
 => ["The force be with you."]

Update: Using the comments below, I modified my original script. I also created a method to split the string.

 require 'pp'

 class String
   def splitstop(stopwords = [])
     stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
     split(stopwords_regex).collect(&:strip).reject(&:empty?)
   end
 end

 stop_array = ["the", "with", "over"]

 pp "The force be with you.".splitstop stop_array
 => ["force be", "you."]
 pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array
 => ["quick brown fox jumps", "lazy dog."]
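The question mentions reading the stop words from an external file. A minimal sketch of that step, assuming one stop word per line. The helper names load_stop_words and split_on_stop_words are hypothetical, and a temporary file stands in for the real stop-word list:

```ruby
require 'pp'
require 'tempfile'

# Hypothetical helper: read one stop word per line, skipping blank lines.
def load_stop_words(path)
  File.readlines(path).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: same logic as splitstop, without reopening String.
def split_on_stop_words(text, stop_words)
  regex = /\b(?:#{Regexp.union(stop_words).source})\b/i
  text.split(regex).map(&:strip).reject(&:empty?)
end

# A temporary file stands in for the real stop-word file (an assumption).
Tempfile.create('stopwords') do |file|
  file.write("the\nwith\n\nover\n")
  file.flush
  stop_words = load_stop_words(file.path)
  pp split_on_stop_words("The quick brown fox jumps over the lazy dog.", stop_words)
  # => ["quick brown fox jumps", "lazy dog."]
end
```

Keeping the helpers as plain methods avoids monkeypatching String, which may be preferable in a larger script.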

I'd do it this way:

str = "The force be with you."     
stop_array = %w[the with]
stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it's important to watch out for the actual pattern that is generated:

/(?:#{ Regexp.union(stop_array) })/i
=> /(?:(?-mix:the|with))/i

The embedded (?-mix: turns off the case-insensitive flag inside the group, so the outer /i no longer applies to the stop words and the pattern can match the wrong things. Instead, call source to get just the pattern text, without the flags:

/(?:#{ Regexp.union(stop_array).source })/i
=> /(?:the|with)/i
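A small sketch of the difference in practice, using the same stop_array and str as above. The variable names with_flags and source_only are only for illustration:

```ruby
stop_array = %w[the with]
str = "The force be with you."

# Interpolating the Regexp object embeds its own flags, (?-mix:...),
# so the outer /i does not reach the alternation.
with_flags = /(?:#{Regexp.union(stop_array)})/i

# Interpolating only the source lets the outer /i apply.
source_only = /(?:#{Regexp.union(stop_array).source})/i

p str.split(with_flags)   # => ["The force be ", " you."]
p str.split(source_only)  # => ["", " force be ", " you."]
```

With the embedded flags, the capitalized "The" is not treated as a stop word, so only "with" splits the string.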

Here's why pattern = "(?:\bthe\b|\bwith\b)" doesn't work:

/#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i

Inside a double-quoted string, Ruby sees "\b" as a backspace character. Instead use:

pattern = "(?:\\bthe\\b|\\bwith\\b)"
/#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i
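To see the difference, compare the bytes that each string literal produces. This is a quick sketch with illustrative variable names. It also suggests why pp and puts disagreed in the question: pp prints the inspect form, where a backspace is rendered as the escape \b, while puts writes the raw backspace character to the console.

```ruby
double_single  = "\b"   # double quotes, one backslash: a backspace (byte 8)
double_escaped = "\\b"  # double quotes, two backslashes: backslash + "b"
single_quoted  = '\b'   # single quotes: backslash + "b" as well

p double_single.bytes   # => [8]
p double_escaped.bytes  # => [92, 98]
p single_quoted.bytes   # => [92, 98]

# inspect (which pp relies on) renders the backspace as an escape sequence,
# so the broken pattern looks correct when pretty-printed.
broken = "(?:\bthe\b)"
p broken.inspect.include?('\b')  # => true

broken_regex = Regexp.new(broken, Regexp::IGNORECASE)
fixed_regex  = Regexp.new("(?:\\bthe\\b)", Regexp::IGNORECASE)
p(broken_regex =~ "The force")   # => nil
p(fixed_regex =~ "The force")    # => 0
```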

You have to escape the backslashes:

"\\b#{i}\\b" 

i.e.

pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")"

And a minor improvement/simplification:

pattern = "\\b(?:" + stop_array.join("|") + ")\\b"

Then:

str.split(/#{pattern}/i) # => ["", " force be ", " you."]
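Joining the words directly assumes that none of them contains regex metacharacters. A hedged sketch of the same pattern with each word passed through Regexp.escape first. The helper name build_stop_pattern is hypothetical, and the sample stop word "a.m" is made up for illustration:

```ruby
# Hypothetical helper: build the \b(?:...)\b pattern from a word list,
# escaping each word so characters like "." are matched literally.
def build_stop_pattern(stop_array)
  escaped = stop_array.map { |word| Regexp.escape(word) }
  Regexp.new("\\b(?:" + escaped.join("|") + ")\\b", Regexp::IGNORECASE)
end

pattern = build_stop_pattern(["with", "a.m"])
p pattern
# => /\b(?:with|a\.m)\b/i
p "Raise your arm with me at 9 a.m today".split(pattern).map(&:strip)
# => ["Raise your arm", "me at 9", "today"]
```

Without the escaping step, the "." in "a.m" would match any character, so the word "arm" would also be treated as a stop word.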

If your stop list is short, I think this is the right approach.

stop_array = ["the", "with"]
re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i})

"The force be with you.".split(re) # =>
[
  "",
  "force be",
  "you."
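]

require 'pp'

s = "the force be with you."
stop_words = %w|the with is|
# dynamically create a case-insensitive regexp that matches whole words only,
# so that "is" does not match inside a word such as "this"
regexp = Regexp.new("\\b(?:#{Regexp.union(stop_words).source})\\b", Regexp::IGNORECASE)
result = []
while (match = regexp.match(s))
  # keep the text before the stop word, unless it is blank
  phrase = match.pre_match.strip
  result << phrase unless phrase.empty?
  s = match.post_match
end
# the last unmatched content, if any
result << s.strip unless s.strip.empty?

pp result
=> ["force be", "you."]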
