
Ruby string split on more than one character

I have a string, say "Hello_World I am Learning,Ruby". I would like to split this string into its individual words. What's the best way?

Thanks! C.

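You could use \W, which matches any non-word character. Because the underscore counts as a word character, add it to the character class explicitly: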

"Hello_World I am Learning,Ruby".split /[\W_]/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

"Hello_World I am Learning,   Ruby".split /[\W_]+/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]
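To make the effect of the `+` quantifier visible, here is a small sketch that runs both patterns against the same input with repeated separators. The variable names are only illustrative and are not part of the original answer.

```ruby
# Illustrative only: variable names are assumptions, not from the answer.
str = "Hello_World I am Learning,   Ruby"

# Without `+`, every separator character is its own delimiter, so a run of
# separators leaves empty strings between them.
single = str.split(/[\W_]/)

# With `+`, a whole run of separators is treated as one delimiter.
grouped = str.split(/[\W_]+/)

p single
p grouped
```

The first call returns three empty strings between "Learning" and "Ruby" (one for each gap between the comma and the three spaces), while the second returns only the six words.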

You can use String#split with a regex pattern as the argument, like this:

"Hello_World I am Learning,Ruby".split /[ _,.!?]/
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

ruby-1.9.2-p290 :022 > str = "Hello_World I am Learning,Ruby"
ruby-1.9.2-p290 :023 > str.split(/\s|,|_/)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

String#scan seems to be an appropriate method for this task:

irb(main):018:0> "Hello_World    I am Learning,Ruby".scan(/[a-z]+/i)
=> ["Hello", "World", "I", "am", "Learning", "Ruby"]

Or you might use the built-in \w matcher. Note that \w also matches the underscore, so "Hello_World" stays in one piece:

irb(main):020:0> "Hello_World    I am Learning,Ruby".scan(/\w+/)
=> ["Hello_World", "I", "am", "Learning", "Ruby"]
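One practical difference between the two approaches is how they treat separators at the start of the string. The sketch below compares them on an input with leading whitespace. The input string and variable names are made up for illustration.

```ruby
# Illustrative input with leading spaces and trailing punctuation.
str = "  Hello_World    I am Learning,Ruby!"

# split works on delimiters: a leading delimiter yields a leading empty
# string (trailing empty strings are dropped by default).
by_split = str.split(/[^a-z]+/i)

# scan collects the matches themselves, so no empty strings can appear.
by_scan = str.scan(/[a-z]+/i)

p by_split
p by_scan
```

With scan you describe what a word looks like rather than what separates words, so there is nothing to clean up afterwards.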

While the examples above work, I think that when splitting a string into words it is better to split on every character that cannot be part of a word. Here is what I did:

str =  "Hello_World I am Learning,Ruby"
str.split(/[^a-zA-Z]/).reject(&:empty?).compact

This statement does the following:

  1. Splits the string by characters that are not in the alphabet
  2. Then rejects anything that is an empty string
  3. And removes any nil values from the array (String#split never produces nil elements, so this step is redundant but harmless)

This handles most combinations of words. The examples above require you to list every character you want to split on. It is far easier to define the separators by negation: everything that is not a letter.
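As a rough sketch, the same idea can be wrapped in a small helper. The method name `words` is hypothetical, and this version uses `+` in the pattern and leaves out `compact`, which changes nothing in the result.

```ruby
# Hypothetical helper; the name `words` is not from the original answer.
def words(str)
  # Split on runs of non-letters, then drop the empty string that a
  # leading separator would otherwise leave at the front.
  str.split(/[^a-zA-Z]+/).reject(&:empty?)
end

p words("Hello_World I am Learning,Ruby")
p words("  --leading and trailing!!  ")
p words("")
```

Keep in mind that the class [^a-zA-Z] is ASCII-only, so digits and accented letters are treated as separators too.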

Just for fun, a Unicode-aware version for 1.9 (or 1.8 with Oniguruma):

>> "This_µstring has words.and thing's".split(/[^\p{Word}']|\p{Connector_Punctuation}/)
=> ["This", "µstring", "has", "words", "and", "thing's"]

Or maybe:

>> "This_µstring has words.and thing's".split(/[^\p{Word}']|_/)
=> ["This", "µstring", "has", "words", "and", "thing's"]

The real problem is determining what sequence of characters constitutes a "word" in this context. You might want to look at the Oniguruma docs for the character properties that are supported; Wikipedia has some notes on the properties as well.
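If you decide that a "word" is a run of Unicode letters plus apostrophes, one way to say that directly is with scan and the \p{L} property. This is only a sketch under that assumed definition. The pattern is not from the answer above, and other definitions (digits, hyphens, combining marks) would need a different class.

```ruby
# Assumed definition of a word: one or more Unicode letters or apostrophes.
text = "This_µstring has words.and thing's"

# \p{L} matches any Unicode letter; the underscore is not a letter, so it
# acts as a separator here without being listed explicitly.
unicode_words = text.scan(/[\p{L}']+/)

p unicode_words
```

For this input the result is the same list of six words as the split-based versions.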
