How can I use regex in Ruby to split a string into an array of the words it contains?

Question

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:

It must split the string on all dashes, spaces, underscores, and periods.
When several of the aforementioned characters show up together, it must only split once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick']).
It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown']).
It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o'])
It must use only lowercase letters in the split array.

If it is working properly, the following should be true:

"theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words == 
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

So far, I have been able to get almost there, with the only issue being that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"]

I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of those and use only regex.

Here's what I have so far:

class String
  def split_words 
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
    map(&:downcase).
    reject(&:empty?)
  end 
end

When called on the string from the test above, it returns:

["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]

How can I update this method to meet all of the above specs?

Answer 1

You can slightly change the regex so that it no longer splits before every capital, but only before a run of capitals that is followed by lowercase letters. This just involves putting [a-z]+ after the [A-Z]+ in the lookahead:

string = "theQuick--brown_fox JumpsOver___the.lazy  DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
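
The output above is still mixed-case, so the question's last rule needs the downcase step put back. Here is a minimal sketch of the full method using this answer's regex, written as a standalone helper (the name split_words_lookahead is made up for illustration):

```ruby
# Hypothetical helper; the regex is the one from this answer.
def split_words_lookahead(string)
  boundary = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/

  string.split(boundary)   # split on separators and before capitalized words
        .reject(&:empty?)  # drop empty strings left by adjacent separators
        .map(&:downcase)   # lowercase every word
end

p split_words_lookahead("theQuick--brown_fox JumpsOver___the.lazy  DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

Note that the commas inside the character class are literal, so this pattern also treats commas as separators.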

Answer 2

You can use a matching approach instead of splitting: extract chunks that are either 2 or more uppercase letters, or any letter followed by 0 or more lowercase letters:

s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)

See the Ruby demo and the Rubular demo.

The regex matches:

\p{Lu}{2,} - 2 or more uppercase letters
| - or
\p{L} - any letter
\p{Ll}* - 0 or more lowercase letters.

With map(&:downcase), the items you get with .scan() are turned to lower case.
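
Because scan returns only the matched words, no reject(&:empty?) step is needed. As a minimal sketch, this is how the expression could slot into the question's String#split_words, checked against the examples from the question:

```ruby
class String
  # Returns the lowercased words of the string.
  # A run of 2+ uppercase letters counts as one word; otherwise a word is
  # one letter followed by any number of lowercase letters.
  def split_words
    scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
  end
end

p "theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
p "LETS_GO".split_words      # => ["lets", "go"]
p "the--.quick".split_words  # => ["the", "quick"]
```

Anything that is not a letter, digits included, is skipped by this pattern rather than kept as part of a word.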

Answer 3

r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode

str = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX for $5"

str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]

If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. See "[[:punct:]]" in the Regexp documentation for reference.
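
The punctuation variant could look roughly like this. It is a sketch only: the helper name split_on_punctuation is made up for illustration, the regex is written on one line without the /x flag, and the sample input sticks to dashes, underscores, periods, commas and spaces.

```ruby
# Hypothetical helper: same approach as above, with the separator class
# swapped for spaces plus POSIX punctuation.
def split_on_punctuation(str)
  regex = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/
  str.split(regex).map(&:downcase)
end

p split_on_punctuation("theQuick--brown_dog, JumpsOver___the.--lazy   FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

Unlike the original pattern, the comma after "dog" is now consumed as a separator. If the input can start with a separator, split returns a leading empty string, so a reject(&:empty?) step may still be wanted.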

3 answers

solution1
5 2018-06-01 17:53:06

solution2
4 ACCEPTED 2018-06-01 18:01:03

solution3
2 2018-06-01 18:10:50
