I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:
If it is working properly, the following should be true:
"theQuick--brown_fox JumpsOver___the.lazy DOG".split_words ==
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
So far I have been able to get almost there. The only issue is that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].
I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.
Here's what I have so far:
class String
  def split_words
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
      map(&:downcase).
      reject(&:empty?)
  end
end
Which when called on the string from the test above returns:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]
How can I update this method to meet all of the above specs?
You can slightly change the regex so that it doesn't split before every capital, but only before a sequence of letters that starts with a capital and continues with lowercase letters. This just involves putting a [a-z]+ after the [A-Z]+ in the lookahead:
string = "theQuick--brown_fox JumpsOver___the.lazy DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
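As a minimal sketch, the adjusted pattern can be dropped back into a method like the one in the question, with the downcasing restored. The names WORD_BOUNDARY and split_words_by_split are hypothetical, and a plain method is used instead of reopening String only to keep the example self-contained. Note that the commas inside the character class are literal, so this pattern also splits on commas.

```ruby
# Pattern from the answer above: split on a separator character, or before a
# run of capitals that is followed by at least one lowercase letter.
WORD_BOUNDARY = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/

# Hypothetical helper name, used only for illustration.
def split_words_by_split(str)
  str.split(WORD_BOUNDARY).reject(&:empty?).map(&:downcase)
end

p split_words_by_split("theQuick--brown_fox JumpsOver___the.lazy DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

The reject(&:empty?) step is still needed here, because adjacent separators such as "--" produce empty strings between them.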
You may use a matching approach instead: extract chunks of 2 or more uppercase letters, or any letter followed by 0 or more lowercase letters:
s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
See the Ruby demo and the Rubular demo.
The regex matches:
\p{Lu}{2,} - 2 or more uppercase letters
| - or
\p{L} - any letter
\p{Ll}* - 0 or more lowercase letters.
With map(&:downcase), the items you get with .scan() are turned to lower case.
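A small sketch of the matching approach applied to the string from the question. The names WORD and scan_words are hypothetical and only wrap the scan call shown above.

```ruby
# Two or more uppercase letters, or any letter followed by lowercase letters.
WORD = /\p{Lu}{2,}|\p{L}\p{Ll}*/

# Hypothetical helper name, used only for illustration.
def scan_words(str)
  str.scan(WORD).map(&:downcase)
end

p scan_words("theQuick--brown_fox JumpsOver___the.lazy DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
p scan_words("DOG")
# => ["dog"]
```

Because separators are simply never matched, no reject(&:empty?) step is needed. One thing to be aware of: the uppercase run is greedy, so an acronym directly followed by a capitalized word, such as "HTMLParser", comes out as ["htmlp", "arser"] with this pattern.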
r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode
str = "theQuick--brown_dog, JumpsOver___the.--lazy FOX for $5"
str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]
If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" at Regexp for the reference.
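A minimal sketch of that substitution, written on one line without the free-spacing flag. The names PUNCT_BOUNDARY and split_on_punct are hypothetical. The sample string omits the "$5" token, since this sketch makes no assumption about whether "$" is treated as punctuation.

```ruby
# One or more spaces or punctuation characters, or the zero-width position
# between a lowercase letter and an uppercase letter.
PUNCT_BOUNDARY = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

# Hypothetical helper name, used only for illustration.
def split_on_punct(str)
  str.split(PUNCT_BOUNDARY).map(&:downcase)
end

p split_on_punct("theQuick--brown_dog, JumpsOver___the.--lazy FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

With this variant the comma after "dog" is consumed as a separator instead of staying attached to the word.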