Ruby regex extracting words

I'm struggling to come up with a regex that splits a string into words, where a word is either a sequence of characters surrounded by whitespace or a sequence of characters enclosed in double quotes. I'm using String#scan.

For instance, the string:

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes by using:

/"([^\"]*)"/

but I can't figure out how to also match the words surrounded by whitespace, so that I get 'hello', 'is', and 'Tom' without breaking up 'my name'.
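
For reference, here is a minimal sketch of what the quote-only pattern returns with String#scan on the sample string. The variable names text and quoted are illustrative only and do not come from the question.

```ruby
text = '   hello "my name" is    "Tom"'

# With a single capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of the quoted contents.
quoted = text.scan(/"([^"]*)"/).flatten

p quoted # => ["my name", "Tom"]
```

The unquoted words hello and is are skipped entirely, which is the gap described above.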

Any help with this would be appreciated!

result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

will work for you. It prints:

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty string at the start, which comes from the leading whitespace.

Explanation

\s            # Match a single whitespace character (space, tab, or line break)
   +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=           # Assert that the regex below can be matched, starting at this position (positive lookahead)
   (?:           # Match the regular expression below (non-capturing group)
      [^"]          # Match any character that is NOT a double quote
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match a double quote literally
      [^"]          # Match any character that is NOT a double quote
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match a double quote literally
   )*            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^"]          # Match any character that is NOT a double quote
      *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $             # Assert position at the end of a line (at the end of the string or before a line break character)
)

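Put differently, the lookahead only lets a run of whitespace act as a separator when an even number of double quotes remains after it, that is, when the position is outside any quoted section. The sketch below checks this on the sample string; even_quotes_after? is a hypothetical helper written for illustration and is not part of the answer.

```ruby
text  = '   hello "my name" is    "Tom"'
split = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/

# Hypothetical helper: true when an even number of double quotes remains
# from index to the end of the string, which is what the lookahead asserts.
def even_quotes_after?(string, index)
  string[index..-1].count('"').even?
end

# For each whitespace run, record where it starts and whether the
# lookahead condition holds right after it.
runs = []
text.scan(/\s+/) do
  m = Regexp.last_match
  runs << [m.begin(0), even_quotes_after?(text, m.end(0))]
end

p runs              # => [[0, true], [8, true], [12, false], [18, true], [21, true]]
p text.split(split) # => ["", "hello", "\"my name\"", "is", "\"Tom\""]
```

Only the space at index 12, inside "my name", has an odd number of quotes after it, so it is the one whitespace run that does not split the string.
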
You can use reject like this to drop the empty strings:

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

This prints:

=> ["hello", "\"my name\"", "is", "\"Tom\""]
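
The quoted words still carry their double quotes here, whereas the question asks for my name and Tom without them. One possible follow-up step, sketched under the assumption that each token has at most one leading and one trailing quote, is to strip them with a map. The variable name words is illustrative.

```ruby
text = '   hello "my name" is    "Tom"'

words = text
  .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
  .reject(&:empty?)
  .map { |s| s.gsub(/\A"|"\z/, '') } # drop a leading and a trailing quote

p words # => ["hello", "my name", "is", "Tom"]
```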

text = '   hello "my name" is    "Tom"'

text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}

Produces:

hello
my name
is
Tom

Explanation:

0 or more spaces, followed by

either

some words within double quotes, OR

a single word,

followed by 0 or more spaces
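
If an array is more useful than printed lines, the same pattern can feed a map instead of each. This is a small sketch of that variation; the block parameter names outer and inner are illustrative.

```ruby
text = '   hello "my name" is    "Tom"'

# scan yields [outer, inner] per match: outer is group 1 (the whole token),
# inner is group 2 (the text between the quotes, or nil for a bare word).
words = text.scan(/\s*("([^"]+)"|\w+)\s*/).map { |outer, inner| inner || outer }

p words # => ["hello", "my name", "is", "Tom"]
```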

You can try this regex:

/\b(\w+)\b/

which uses \b to find word boundaries. The site http://rubular.com/ is also helpful.
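
A quick check of this pattern on the sample string from the question, as a sketch: because \b knows nothing about quotes, the quoted phrase comes back as two separate words, so this pattern alone does not keep my name together.

```ruby
text = '   hello "my name" is    "Tom"'

# Each match is a one-element array (the single capture group), so flatten it.
words = text.scan(/\b(\w+)\b/).flatten

p words # => ["hello", "my", "name", "is", "Tom"]
```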
