Partition/split a string by character set in Ruby

How can I separate different character sets in my string? For example, if I had these charsets:

[a-z]
[A-Z]
[0-9]
[\s]
{everything else}

And this input:

thisISaTEST***1234pie

Then I want to separate the different character sets, for example, if I used a newline as the separating character:

this
IS
a
TEST
***
1234
pie

I've tried this regex, with a positive lookahead:

'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")

But apparently the +s aren't being greedy, because I'm getting:

t
h
# (snip)...
S
T***
1
# (snip)...
e

I snipped out the irrelevant parts, but as you can see each character is counting as its own charset, except the {everything else} charset.

How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.
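The output above is consistent with how lookaheads behave rather than with a greediness problem: a lookahead is zero-width, so it succeeds at every position that is followed by at least one matching character. A minimal sketch on a shorter, made-up input (`sample` and `marked` are illustrative names):

```ruby
# A lookahead consumes nothing, so gsub inserts the marker before every
# character that can start a match, not once per run.
sample = "ab*12"
marked = sample.gsub(/(?=[a-z]+|[0-9]+)/, "|")
puts marked  # => |a|b*|1|2
```

The `*` gets no marker because no alternative in the lookahead matches it, which is why `T***` stayed together in the attempt above.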

The difficult part is matching whatever does not match the rest of the regex. Forget about that, and instead think of a way to mix the non-matching parts together with the matching parts. String#split includes captured delimiters in its result, which does exactly that:

"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
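To see why the capture group matters here, the following sketch compares the same split with and without it. The variable names are illustrative only.

```ruby
input = "thisISaTEST***1234pie"

# Without a capture group, the matched runs are treated as delimiters
# and discarded; only the text between them survives.
delimiters_dropped = input.split(/[a-z]+|[A-Z]+|\d+|\s+/).reject(&:empty?)

# With a capture group, String#split also returns the matched runs,
# interleaved with the text between them.
delimiters_kept = input.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)

p delimiters_dropped  # => ["***"]
p delimiters_kept     # => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
```

The `reject(&:empty?)` step is needed because adjacent matches leave empty strings between them.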

In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched by the property construct \p{punct}.

To split your string into sequences of a single category, you can write

str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)

output

["this", "IS", "a", "TEST", "***", "1234", "pie"]

Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this

p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
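As a rough illustration of the difference on non-ASCII input, the sketch below runs an ASCII-range pattern and a property-based pattern over the same accented string. The sample text and variable names are assumptions made for this example, and the property pattern uses `+` on the space alternative so that runs of whitespace stay together.

```ruby
# "élanÉCOLE42", written with \u escapes so the source file encoding
# does not affect the example.
text = "\u00e9lan\u00c9COLE42"

ascii_ranges = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
properties   = /\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/

ascii_pieces    = text.scan(ascii_ranges)
property_pieces = text.scan(properties)

p ascii_pieces     # => ["é", "lan", "É", "COLE", "42"]
p property_pieces  # => ["élan", "ÉCOLE", "42"]
```

With ASCII ranges the accented letters fall into the "everything else" class, while the property classes treat them as lowercase and uppercase letters.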

Here are two solutions.

String#scan with a regular expression

str = "thisISa\n TEST*$*1234pie"
r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

Because of the ^ at the beginning of [^a-zA-Z\d\s], that character class matches any character other than letters (lower and upper case), digits and whitespace.
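Since the last alternative is the complement of the other four, every character belongs to exactly one class and nothing should be lost. One way to check this, sketched with a few sample inputs chosen for this example, is to confirm that joining the pieces reproduces the original string:

```ruby
pattern = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/

# Sample inputs chosen for illustration, including the empty string.
samples = ["thisISaTEST***1234pie", "thisISa\n TEST*$*1234pie", ""]

# If every character is matched by some alternative, the scanned pieces
# concatenate back to the input.
all_lossless = samples.all? { |s| s.scan(pattern).join == s }

puts all_lossless  # => true
```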

Use Enumerable#slice_when [1]

First, a helper method:

def type(c)
  case c
  when /[a-z]/ then 0
  when /[A-Z]/ then 1
  when /\d/    then 2
  when /\s/    then 3
  else              4
  end
end

For example,

type "f"   #=> 0
type "P"   #=> 1
type "3"   #=> 2
type "\n"  #=> 3
type "*"   #=> 4    

Then

str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

[1] slice_when made its debut in Ruby v2.2.
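A related option, shown here only as a sketch, is Enumerable#chunk, which groups consecutive elements by the block's return value and also hands that value back. The helper `char_category` below is a hypothetical variant of the classifier above that returns symbols instead of integers, so each piece comes labelled with its class.

```ruby
# Hypothetical helper: names the category of a single character.
def char_category(c)
  case c
  when /[a-z]/ then :lower
  when /[A-Z]/ then :upper
  when /\d/    then :digit
  when /\s/    then :space
  else              :other
  end
end

# chunk yields [key, elements] pairs for each run of equal keys.
labelled = "thisISaTEST***1234pie".each_char
  .chunk { |c| char_category(c) }
  .map { |category, chars| [category, chars.join] }

p labelled
# => [[:lower, "this"], [:upper, "IS"], [:lower, "a"], [:upper, "TEST"],
#     [:other, "***"], [:digit, "1234"], [:lower, "pie"]]
```

Dropping the labels with `labelled.map(&:last)` gives the same array of strings as the slice_when version for this input.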

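Non-word, non-space chars can be covered with [^\w\s]. Note that \w also matches the underscore, so an underscore in the input is matched by none of the alternatives and is skipped. So: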

"thisISaTEST***1234pie".scan(/[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/)
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
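One caveat with [^\w\s] is worth checking: \w includes the underscore, so an underscore matches no alternative and scan skips it. A small sketch with a made-up input containing an underscore:

```ruby
sample = "snake_case42"

# \w covers [a-zA-Z0-9_], so "_" is excluded from [^\w\s] and dropped.
with_word_class = sample.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/)

# Spelling out the ranges keeps "_" in the "everything else" class.
explicit_ranges = sample.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/)

p with_word_class  # => ["snake", "case", "42"]
p explicit_ranges  # => ["snake", "_", "case", "42"]
```

If underscores can appear in the input and must be preserved, the explicit form is the safer choice.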
