Ruby string split on more than one character
I have a string, say "Hello_World I am Learning,Ruby". I would like to split this string into each distinct word. What's the best way?
Thanks!
C.
You could use \W, which matches any non-word character. The underscore counts as a word character, so it has to be added to the character class explicitly:
"Hello_World I am Learning,Ruby".split /[\W_]/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
"Hello_World I am Learning, Ruby".split /[\W_]+/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
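The two patterns above only differ when delimiters sit next to each other, as with the comma followed by a space in the second string. A minimal sketch of that difference (the method names are made up for illustration):

```ruby
# Splits on every single non-word character or underscore.
def split_each_delimiter(text)
  text.split(/[\W_]/)
end

# Splits on runs of one or more non-word characters or underscores.
def split_delimiter_runs(text)
  text.split(/[\W_]+/)
end

sample = "Hello_World I am Learning, Ruby"
p split_each_delimiter(sample)  # contains "" between "Learning" and "Ruby"
p split_delimiter_runs(sample)  # no empty strings
```

Without the +, the comma and the space are treated as two separate delimiters, so an empty string appears between them.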
You can use String#split with a regex pattern as the parameter. Like this:
"Hello_World I am Learning,Ruby".split /[ _,.!?]/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
ruby-1.9.2-p290 :022 > str = "Hello_World I am Learning,Ruby"
ruby-1.9.2-p290 :023 > str.split(/\s|,|_/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
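If the delimiters live in a list rather than in a hand-written pattern, the same alternation can probably be built with Regexp.union. A sketch, assuming a hypothetical DELIMITERS constant and helper name:

```ruby
# Hypothetical list of delimiters; adjust it to the input at hand.
DELIMITERS = [" ", "_", ","].freeze

# Regexp.union escapes each string and joins them into one alternation.
DELIMITER_PATTERN = Regexp.union(DELIMITERS)

def split_on_delimiters(text)
  text.split(DELIMITER_PATTERN)
end

p split_on_delimiters("Hello_World I am Learning,Ruby")
```

This keeps the delimiter set in one place, at the cost of one more constant to maintain.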
String#scan seems to be an appropriate method for this task:
irb(main):018:0> "Hello_World I am Learning,Ruby".scan(/[a-z]+/i)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
Or you might use the built-in matcher \w, although it treats the underscore as a word character, so Hello_World stays in one piece:
irb(main):020:0> "Hello_World I am Learning,Ruby".scan(/\w+/)
=> ["Hello_World", "I", "am", "Learning", "Ruby"]
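A possible middle ground between the two scan calls above is the class [^\W_], which means "word characters minus the underscore". Unlike [a-z] it also keeps digits. A small sketch (the method names are invented for this example):

```ruby
# ASCII letters only, case-insensitive.
def scan_letters(text)
  text.scan(/[a-z]+/i)
end

# Word characters except the underscore, so digits are kept as well.
def scan_word_chars_without_underscore(text)
  text.scan(/[^\W_]+/)
end

p scan_letters("Ruby_1.9 rocks")
p scan_word_chars_without_underscore("Ruby_1.9 rocks")
p scan_word_chars_without_underscore("Hello_World I am Learning,Ruby")
```

Which of the two is right depends on whether digits should count as part of a word.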
Whilst the above examples work, I think that when splitting a string into words it's probably better to split on the characters that are not considered part of any kind of word. To do this, I did this:
str = "Hello_World I am Learning,Ruby"
str.split(/[^a-zA-Z]/).reject(&:empty?).compact
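This statement splits the string on every character that is not an ASCII letter, rejects the empty strings left between adjacent delimiters, and finally calls compact to drop any nil entries (split does not return nil elements, so that last call is only a safeguard).
It would then handle most combinations of words. The above examples require you to list out all the characters you want to match against. It's far easier to say which characters belong to a word and split on everything else.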
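The split-and-reject approach can be wrapped in a small helper to see how it behaves with leading and repeated delimiters. This is only a sketch; the name split_words is hypothetical, and the redundant compact call is left out:

```ruby
# Split on anything that is not an ASCII letter, then drop the empty
# strings produced by leading or adjacent delimiters.
def split_words(text)
  text.split(/[^a-zA-Z]/).reject(&:empty?)
end

p split_words("Hello_World I am Learning,Ruby")
p split_words(",Hello__World,  I am")
```

Because the pattern is limited to a-z and A-Z, digits and non-ASCII letters act as delimiters too.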
Just for fun, a Unicode-aware version for 1.9 (or 1.8 with Oniguruma):
>> "This_µstring has words.and thing's".split(/[^\p{Word}']|\p{Connector_Punctuation}/)
=> ["This", "µstring", "has", "words", "and", "thing's"]
Or maybe:
>> "This_µstring has words.and thing's".split(/[^\p{Word}']|_/)
=> ["This", "µstring", "has", "words", "and", "thing's"]
The real problem is determining what sequence of characters constitutes a "word" in this context. You might want to have a look at the Oniguruma docs for the character properties that are supported; Wikipedia has some notes on the properties as well.
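As one illustration of how the choice of property changes the result, the same sentence can be matched with scan and the letter property \p{L} instead of being split. Treating letters plus the apostrophe as word characters is an assumption made for this sketch, and unicode_words is a made-up name:

```ruby
# Letters (\p{L}) plus the apostrophe count as word characters here;
# that choice is an assumption, not a rule.
def unicode_words(text)
  text.scan(/[\p{L}']+/)
end

p unicode_words("This_µstring has words.and thing's")
```

Since the underscore is connector punctuation rather than a letter, it separates words here without needing a second alternative in the pattern.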