
Partition/split a string by character set in Ruby

How can I separate different character sets in my string? For example, if I had these charsets:

[a-z]
[A-Z]
[0-9]
[\s]
{everything else}

And this input:

thisISaTEST***1234pie

Then I want to separate the different character sets. For example, if I used a newline as the separating character:

this
IS
a
TEST
***
1234
pie

I've tried this regex, with a positive lookahead:

'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")

But apparently the +s aren't being greedy, because I'm getting:

t
h
# (snip)...
S
T***
1
# (snip)...
e

I snipped out the irrelevant parts, but as you can see, each character is counted as its own charset, except for the {everything else} charset.

How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.

The difficult part is matching whatever does not match the rest of the regex. Forget about that, and instead think of a way to mix the non-matching parts in with the matching parts.

"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
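The split above works because a capture group in the pattern makes String#split keep the matched delimiters in the result, so the "everything else" runs simply fall out as the pieces between matches. A minimal sketch of that idea, wrapped in a helper whose name (split_by_charset) is made up for illustration, and joined with newlines as the question asked:

```ruby
# Hypothetical helper name; it only wraps the split shown above.
def split_by_charset(str)
  # The parentheses form a capture group, so String#split returns the
  # matched runs as well as the text between them. Adjacent matches leave
  # empty strings behind, which reject(&:empty?) removes.
  str.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
end

pieces = split_by_charset("thisISaTEST***1234pie")
p pieces
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]

# Newline-separated form, as requested in the question.
puts pieces.join("\n")
```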

In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched with the property construct \p{punct}.

To split your string into sequences of a single category, you can write

str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)

Output:

["this", "IS", "a", "TEST", "***", "1234", "pie"]

Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this:

p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
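To see what the property-based pattern buys you, here is a small sketch run on a string with accented letters. The helper name (unicode_runs) and the sample string are assumptions made for illustration; the non-ASCII characters are written as \u escapes so the snippet does not depend on the source file's encoding.

```ruby
# Hypothetical helper name; the pattern is the property-based one, with a
# + quantifier on every alternative so each run is returned as one piece.
def unicode_runs(str)
  str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
end

# "\u00e9" is a lowercase e with acute accent, "\u00c9" its uppercase form.
sample = "\u00e9t\u00e9\u00c9T\u00c9 42"

p unicode_runs(sample)
# => ["été", "ÉTÉ", " ", "42"]
```

With the ASCII-only classes [a-z] and [A-Z], the accented letters would instead land in the "everything else" group and break up the words.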

Here are two solutions.

String#scan with a regular expression

str = "thisISa\n TEST*$*1234pie"
r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

Because of the ^ at the beginning of [^a-zA-Z\d\s], that character class matches any character other than letters (lower and upper case), digits and whitespace.

Use Enumerable#slice_when (note 1)

First, a helper method:

def type(c)
  case c
  when /[a-z]/ then 0
  when /[A-Z]/ then 1
  when /\d/    then 2
  when /\s/    then 3
  else              4
  end
end

For example,

type "f"   #=> 0
type "P"   #=> 1
type "3"   #=> 2
type "\n"  #=> 3
type "*"   #=> 4    

Then

str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
  #=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]

1. slice_when made its debut in Ruby v2.2.
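As a variation on the helper-plus-slice_when approach above, Enumerable#chunk groups consecutive characters by the helper's return value and also hands back that value, which is handy if you want to know which category each run belongs to. This is only a sketch: the constant CATEGORIES and the method names category and labelled_runs are invented here, not part of the answer above.

```ruby
# Hypothetical category table; insertion order decides which pattern is
# tried first.
CATEGORIES = {
  lower: /[a-z]/,
  upper: /[A-Z]/,
  digit: /\d/,
  space: /\s/
}.freeze

# Returns the name of the first category whose pattern matches the
# single-character string c, or :other when none does.
def category(c)
  CATEGORIES.each { |name, re| return name if c =~ re }
  :other
end

# Enumerable#chunk yields [key, elements] pairs for each run of
# consecutive elements that share the same key.
def labelled_runs(str)
  str.each_char.chunk { |c| category(c) }.map { |name, chars| [name, chars.join] }
end

p labelled_runs("thisISaTEST***1234pie")
# => [[:lower, "this"], [:upper, "IS"], [:lower, "a"], [:upper, "TEST"],
#     [:other, "***"], [:digit, "1234"], [:lower, "pie"]]
```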

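Non-word, non-space chars can be covered with [^\w\s]. Note that \w also matches the underscore, so an underscore in the input is not matched by any alternative and is skipped. For the input here: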

"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
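One thing to watch with [^\w\s]: \w includes the underscore, so an underscore is not matched by any alternative and scan silently drops it. The comparison below is a small sketch on a made-up input; the variable names are illustrative only.

```ruby
str = "snake_case 42"

# [^\w\s] excludes the underscore (it counts as a word character), and no
# other alternative matches it either, so scan skips over it.
with_word_class = str.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/)
p with_word_class
# => ["snake", "case", " ", "42"]

# Spelling the class out keeps the underscore in the "everything else" group.
with_explicit_class = str.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/)
p with_explicit_class
# => ["snake", "_", "case", " ", "42"]
```

If the input can contain underscores and they must be preserved, the explicit class is the safer choice.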

