Partition/split a string by character set in Ruby

Question

How can I separate different character sets in my string? For example, if I had these charsets:

[a-z]
[A-Z]
[0-9]
[\s]
{everything else}

And this input:

thisISaTEST***1234pie

Then I want to separate the different character sets, for example, if I used a newline as the separating character:

this
IS
a
TEST
***
1234
pie

I've tried this regex, with a positive lookahead:

'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")

But apparently the +s aren't being greedy, because I'm getting:

t
h
# (snip)...
S
T***
1
# (snip)...
e

I snipped out the irrelevant parts, but as you can see each character is counting as its own charset, except the {everything else} charset.

How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.

Answer 1

The difficult part is matching whatever does not match the rest of the regex. Forget about that, and instead think of a way to mix the non-matching parts together with the matching parts.

"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
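This works because String#split includes the text matched by a capture group in its result, so the captured runs come back interleaved with the pieces between them (the "everything else" runs). A minimal sketch of that idea follows; the constant CHARSET_RUNS and the helper split_by_charset are hypothetical names used only for illustration.

```ruby
# Hypothetical names, for illustration only.
CHARSET_RUNS = /([a-z]+|[A-Z]+|\d+|\s+)/

def split_by_charset(str)
  # The pattern is wrapped in a capture group, so split returns the matched
  # runs as well as the text between them. Adjacent matches leave empty
  # strings behind, which reject(&:empty?) removes.
  str.split(CHARSET_RUNS).reject(&:empty?)
end

p split_by_charset("thisISaTEST***1234pie")
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
p split_by_charset("ab CD_+12")
# => ["ab", " ", "CD", "_+", "12"]
```

Without the parentheses around the alternation, split would discard the matched runs and return only the leftover pieces.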

Answer 2

In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched with the property construct \p{punct}.

To split your string into sequences of a single category, you can write

str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)

output

["this", "IS", "a", "TEST", "***", "1234", "pie"]

Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this

p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
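As a rough illustration of the property-based pattern on non-ASCII input, here is a small sketch. The sample string and the constant name UNICODE_RUNS are assumptions made up for this example, and the pattern uses \p{space}+ so that a run of whitespace stays together as one element.

```ruby
# Hypothetical constant name; the pattern follows the property-based approach.
UNICODE_RUNS = /\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/

# Made-up sample containing accented letters and non-ASCII punctuation.
sample = "straßeÉCOLE  ¿42?"

p sample.scan(UNICODE_RUNS)
# => ["straße", "ÉCOLE", "  ", "¿", "42", "?"]
```

Here ß and É fall into the lowercase and uppercase runs, which the ASCII ranges [a-z] and [A-Z] would not cover.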

Answer 3

Here are two solutions.

String#scan with a regular expression

str = "thisISa\n TEST*$*1234pie"

r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

Because of the ^ at the beginning of [^a-zA-Z\d\s], that character class matches any character other than letters (lower and upper case), digits and whitespace.

Use Enumerable#slice_when ¹

First, a helper method:

def type(c)
  case c
  when /[a-z]/ then 0
  when /[A-Z]/ then 1
  when /\d/    then 2
  when /\s/    then 3
  else              4
  end
end

For example,

type "f"   #=> 0
type "P"   #=> 1
type "3"   #=> 2
type "\n"  #=> 3
type "*"   #=> 4

Then

str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

¹ slice_when made its debut in Ruby v2.2.
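The same grouping can also be sketched with Enumerable#chunk_while (Ruby 2.3+), which is slice_when with the condition inverted, or with Enumerable#chunk, which additionally returns the label of each run. The helper charset_of below is a hypothetical stand-in for the type method, returning symbols instead of integers; this is one possible variation, not the only way to write it.

```ruby
# Hypothetical helper: maps a character to a symbol naming its character set.
def charset_of(c)
  case c
  when /[a-z]/ then :lower
  when /[A-Z]/ then :upper
  when /\d/    then :digit
  when /\s/    then :space
  else              :other
  end
end

str = "thisISa\n TEST*$*1234pie"

# chunk_while keeps adjacent characters together while the block is true.
runs = str.each_char.chunk_while { |a, b| charset_of(a) == charset_of(b) }.map(&:join)
p runs
# => ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

# chunk groups consecutive characters by the block's value, so each run
# comes back paired with its label.
labelled = str.each_char.chunk { |c| charset_of(c) }.map { |kind, chars| [kind, chars.join] }
p labelled.first(3)
# => [[:lower, "this"], [:upper, "IS"], [:lower, "a"]]
```

The labelled form is handy when later code needs to know which character set each piece belongs to.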

Answer 4

Non-word, non-space chars can be covered with [^\w\s], so:

"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
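One caveat worth checking: \w also matches the underscore, so an underscore is not matched by any alternative in this pattern and scan silently skips it. The sketch below compares it with an explicit negated class; the variable names and the sample string are made up for illustration.

```ruby
# Pattern using [^\w\s] for the "everything else" set.
pattern_w = /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
# Pattern using an explicit negated class instead.
pattern_class = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/

sample = "snake_CASE-42"

p sample.scan(pattern_w)
# => ["snake", "CASE", "-", "42"]
p sample.scan(pattern_class)
# => ["snake", "_", "CASE", "-", "42"]
```

If underscores can appear in the input and must be kept, the explicit class is the safer choice.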

solution1
4 ACCEPTED 2013-08-27 00:07:24

solution2
1 2013-08-27 01:02:42

solution3
0 2020-11-25 09:12:20

solution4
-1 2013-08-27 01:34:41
