
Ruby string split on more than one character

I have a string, say "Hello_World I am Learning,Ruby". I would like to split this string into its distinct words. What's the best way?

Thanks! C.

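You could use \W to match any non-word character. Because \W does not match the underscore, add _ to the character class:

"Hello_World I am Learning,Ruby".split(/[\W_]/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

Add + so that a run of consecutive delimiters counts as a single split point:

"Hello_World I am Learning,   Ruby".split(/[\W_]+/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]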
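To see why the + matters, here is a small sketch (the variable names are mine, not part of the answer above). With a one-character pattern, a comma followed by a space leaves an empty string between the two delimiters:

```ruby
# A comma directly followed by a space: two delimiters in a row.
text = "Hello_World I am Learning, Ruby"

# One-character pattern: the gap between "," and " " becomes an empty string.
single = text.split(/[\W_]/)

# With +, the whole run ", " is treated as one delimiter.
grouped = text.split(/[\W_]+/)

p single   # => ["Hello", "World", "I", "am", "Learning", "", "Ruby"]
p grouped  # => ["Hello", "World", "I", "am", "Learning", "Ruby"]
```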

You can use String#split with a regex pattern as the parameter. Like this:

"Hello_World I am Learning,Ruby".split(/[ _,.!?]/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

ruby-1.9.2-p290 :022 > str = "Hello_World I am Learning,Ruby"
ruby-1.9.2-p290 :023 > str.split(/\s|,|_/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
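The two patterns above are not quite interchangeable: the first lists a literal space, while \s also matches tabs and newlines. A quick sketch with a tab in the input (my own test string, not the asker's) shows the difference, and that the alternation can also be written as a character class:

```ruby
# Same sentence, but with a tab between "World" and "I".
text = "Hello_World\tI am Learning,Ruby"

# Literal space in the class: the tab is not a delimiter here.
by_literal = text.split(/[ _,.!?]/)

# \s covers space, tab and newline.
by_alternation = text.split(/\s|,|_/)

# The same set of delimiters written as a single character class.
by_class = text.split(/[\s,_]/)

p by_literal      # => ["Hello", "World\tI", "am", "Learning", "Ruby"]
p by_alternation  # => ["Hello", "World", "I", "am", "Learning", "Ruby"]
p by_class        # => ["Hello", "World", "I", "am", "Learning", "Ruby"]
```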

String#scan seems to be an appropriate method for this task:

irb(main):018:0> "Hello_World    I am Learning,Ruby".scan(/[a-z]+/i)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

Or you might use the built-in matcher \w. Note that \w also matches the underscore, so Hello_World stays in one piece:

irb(main):020:0> "Hello_World    I am Learning,Ruby".scan(/\w+/)
=> ["Hello_World", "I", "am", "Learning", "Ruby"]
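One practical difference between scan and split, sketched here with an input of my own that starts with spaces: split with a regex keeps a leading empty string when the text begins with a delimiter, whereas scan only ever returns the matches.

```ruby
# Input that begins with delimiters.
text = "  Hello_World I am Learning,Ruby"

# split on the delimiters: the leading run of spaces yields an empty first element.
split_words = text.split(/[\W_]+/)

# scan for the words themselves: nothing to clean up afterwards.
scan_words = text.scan(/[a-z]+/i)

p split_words  # => ["", "Hello", "World", "I", "am", "Learning", "Ruby"]
p scan_words   # => ["Hello", "World", "I", "am", "Learning", "Ruby"]
```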

Whilst the above examples work, I think that when splitting a string into words it's probably better to split on any character that can't be part of a word. To do this, I did this:

str = "Hello_World I am Learning,Ruby"
str.split(/[^a-zA-Z]/).reject(&:empty?).compact

This statement does the following:

  1. Splits the string on characters that are not in the alphabet
  2. Then rejects anything that is an empty string
  3. And removes all nil values from the array (split never returns nil elements, so this step is only a harmless safeguard)

It would then handle most combinations of words. The above examples require you to list out all the characters you want to split on. It's far easier to specify which characters belong to a word and split on everything else.
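As a rough check of the steps above, here is a sketch with a slightly messier input than the original (the extra punctuation is my addition). It shows where the empty string comes from, that compact changes nothing, and that scan with the positive character class gives the same result:

```ruby
# ", " puts two non-letters in a row, and "!" ends the string.
str = "Hello_World, I am Learning,Ruby!"

# Step 1: split on every non-letter; adjacent delimiters leave an empty string.
pieces = str.split(/[^a-zA-Z]/)

# Step 2: drop the empty strings.
words = pieces.reject(&:empty?)

# Step 3: compact removes nil, but split never produces nil.
compacted = words.compact

# Equivalent approach: match the letters instead of splitting on the rest.
scanned = str.scan(/[a-zA-Z]+/)

p pieces   # => ["Hello", "World", "", "I", "am", "Learning", "Ruby"]
p words    # => ["Hello", "World", "I", "am", "Learning", "Ruby"]
```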

Just for fun, a Unicode-aware version for 1.9 (or 1.8 with Oniguruma):

>> "This_µstring has words.and thing's".split(/[^\p{Word}']|\p{Connector_Punctuation}/)
=> ["This", "µstring", "has", "words", "and", "thing's"]

Or maybe:

>> "This_µstring has words.and thing's".split(/[^\p{Word}']|_/)
=> ["This", "µstring", "has", "words", "and", "thing's"]

The real problem is determining what sequence of characters constitutes a "word" in this context. You might want to have a look at the Oniguruma docs for the character properties that are supported; Wikipedia has some notes on the properties as well.
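To illustrate how much the definition of a "word" matters, here is a sketch comparing two possible definitions on the same string. Treating a word as "Unicode letters plus apostrophe" is my assumption for the example, not something the answers above settle on; it deliberately excludes digits and underscores.

```ruby
text = "This_µstring has words.and thing's"

# ASCII letters only: "µ" and the apostrophe both act as separators.
ascii_words = text.scan(/[a-zA-Z]+/)

# Unicode letters (\p{L}) plus apostrophe: "µstring" and "thing's" survive intact.
unicode_words = text.scan(/[\p{L}']+/)

p ascii_words    # => ["This", "string", "has", "words", "and", "thing", "s"]
p unicode_words  # => ["This", "µstring", "has", "words", "and", "thing's"]
```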
