
How can I use regex in Ruby to split a string into an array of the words it contains?

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:

  1. It must split the string on all dashes, spaces, underscores, and periods.
  2. When several of the aforementioned characters appear together, it must split only once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick'])
  3. It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown'])
  4. It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o'])
  5. It must use only lowercase letters in the split array.

If it is working properly, the following should be true:

"theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words == 
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

So far, I have been able to get almost there, with the only issue being that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"]

I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.

Here's what I have so far:

class String
  def split_words 
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
    map(&:downcase).
    reject(&:empty?)
  end 
end

Which when called on the string from the test above returns:

["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]

How can I update this method to meet all of the above specs?

You can change the regex slightly so that it doesn't split before every capital, but only before a sequence of letters that starts with a capital and continues in lowercase. This just involves putting a [a-z]+ after the [A-Z]+ in the lookahead:

string = "theQuick--brown_fox JumpsOver___the.lazy  DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
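The snippet above leaves the words in their original case. As a minimal sketch, assuming the same String#split_words monkey patch as in the question, the adjusted lookahead can be combined with downcase to meet the full spec. Two changes here are assumptions rather than part of the answer: the commas are dropped from the character class (inside [...] they match a literal comma, which the rules do not ask for), and a + is added so a run of separators is consumed as a single delimiter.

```ruby
# Sketch only: reopens String as the question does.
class String
  def split_words
    # [-_ .]+               -> one or more dashes, underscores, spaces, periods
    # (?=[A-Z]+[a-z]+)      -> zero-width split before capitals that are
    #                          followed by at least one lowercase letter
    split(/[-_ .]+|(?=[A-Z]+[a-z]+)/)
      .reject(&:empty?)   # guards against an empty first element, e.g. "_foo"
      .map(&:downcase)
  end
end

p "theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words
p "LETS_GO".split_words
```

The reject(&:empty?) call is kept because a string that starts with a separator would otherwise yield an empty first element.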

You may use a matching approach instead of splitting: extract chunks that are either two or more uppercase letters, or a single letter followed by zero or more lowercase letters:

s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)

See the Ruby demo and the Rubular demo.

The regex matches:

  • \p{Lu}{2,} - 2 or more uppercase letters
  • | - or
  • \p{L} - any letter
  • \p{Ll}* - 0 or more lowercase letters.

With map(&:downcase), the items returned by .scan() are converted to lowercase.
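To check the matching approach against the question's test string, here is a small self-contained sketch. The method name split_words_scan is hypothetical, chosen only for illustration.

```ruby
# Sketch: scan-based word extraction; split_words_scan is a hypothetical name.
def split_words_scan(str)
  # \p{Lu}{2,}   -> a run of two or more uppercase letters ("DOG", "LETS")
  # \p{L}\p{Ll}* -> one letter followed by zero or more lowercase letters
  str.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
end

p split_words_scan("theQuick--brown_fox JumpsOver___the.lazy  DOG")
p split_words_scan("LETS_GO")
p split_words_scan("for $5")
```

Because the pattern matches letters only, separators never need to be listed, but digits and symbols are silently dropped: the last call returns only ["for"].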

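Another option is to split on runs of separator characters, or at the boundary between a lowercase letter and an uppercase letter:

r = /
    [- _.]+      # match one or more dashes, spaces, underscores
                 # and periods, in any combination
    |            # or
    (?<=\p{Ll})  # require a preceding lowercase letter (positive lookbehind)
    (?=\p{Lu})   # require a following uppercase letter (positive lookahead)
    /x           # free-spacing regex definition mode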

str = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX for $5"

str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]

If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. See the entry for "[[:punct:]]" in the Regexp documentation for reference.
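A minimal sketch of that substitution follows. The names r_punct and str_punct are hypothetical, and the regex is written on one line, without free-spacing mode, purely for brevity. The sample string avoids symbols such as $, since whether [[:punct:]] matches them can depend on the Ruby version.

```ruby
# Sketch: split on runs of spaces/punctuation, or at a lowercase-to-uppercase
# boundary. r_punct and str_punct are hypothetical names.
r_punct = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

str_punct = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX"

result = str_punct.split(r_punct).map(&:downcase)
p result
```

Here the comma after "dog" is consumed as part of the delimiter, so the word comes back as "dog" rather than "dog,".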

