Python string split by multiple delimiters
Given a string: s = FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE
The delimiting characters are P, Q, D and E.
I want to be able to split the string on these characters.
Based on: Is it possible to split a string on multiple delimiters in order?
I have the following:
def splits(s, seps):
    # Split on the first occurrence of the first separator, then recurse on the remainder.
    l, _, r = s.partition(seps[0])
    if len(seps) == 1:
        return [l, r]
    return [l] + splits(r, seps[1:])

seps = ['P', 'D', 'Q', 'E']
sequences = splits(s, seps)
This gives me:
['FFFFRRFFFFFFF',
'PRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLF',
'RRFRRFFFFFFFFR',
'',
'E']
As we can see, the second entry still contains many P characters. What I want is the run of characters after the last P, not everything after the first one (i.e., RFFFFFFFLF).
Also, the order in which the delimiting characters occur is not fixed.
I am looking for solutions or hints on how to achieve this.
Update: The desired output is the set of strings between these delimiters (similar to the one shown), but subject to the last-occurrence condition above.
Update2: Expected output:
['FFFFRRFFFFFFF',
'RFFFFFFFLF', # << this is where the output differs
'RRFRRFFFFFFFFR',
'',
''] # << the last E is 2 consecutive E with no other letters, hence should be empty
It sounds like you want to split on the span that runs from a delimiter's first appearance through its last appearance.
([PDQE])(?:.*\1)?
([PDQE]) captures one of the characters in the class.
(?:.*\1)? optionally matches any number of characters up to the last occurrence of the captured character.
Have a try with the split pattern at regex101 and a PHP demo at 3v4l.org (it should be similar in Python).
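Since the answer only links a PHP demo, here is a minimal Python sketch of the same pattern. One Python-specific detail is assumed to matter here: re.split also returns the text of capture groups, so the captured delimiter shows up as every second item and is sliced away. The name `parts` is only illustrative.

```python
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

# The capture group is required for the backreference \1, but re.split
# also returns captured text, so odd positions hold the delimiter itself.
parts = re.split(r'([PDQE])(?:.*\1)?', s)

# Keep the text between matches (even positions), drop the captured delimiters.
sequences = parts[::2]
print(sequences)
# ['FFFFRRFFFFFFF', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '']
```

Note that `.*` is greedy and does not look at the other delimiters, so if a D, Q or E appeared between the first and last P it would be swallowed by the same match; whether that is acceptable depends on the data.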
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

def get_sequences(s):
    # For each delimiter, store (text preceding its first occurrence, order of first appearance).
    seen_delimiters = {c: ('', None) for c in 'PDQE'}
    order = 0
    for g in re.finditer(r'(.*?)([PDQE]|\Z)', s):
        if g[2]:  # empty when the match ended at \Z rather than at a delimiter
            if seen_delimiters[g[2][0]][1] is None:
                seen_delimiters[g[2][0]] = (g[1], order)
                order += 1
    return seen_delimiters

for k, (seq, order) in get_sequences(s).items():
    print('{}: order: {} seq: {}'.format(k, order, seq))
Prints:
P: order: 0 seq: FFFFRRFFFFFFF
D: order: 1 seq: RFFFFFFFLF
Q: order: 2 seq: RRFRRFFFFFFFFR
E: order: 3 seq:
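If a plain list in order of first appearance is more convenient than the dictionary above, the same idea can be sketched more compactly. This is a variant, not the answer's code: `first_seen_sequences` is a hypothetical helper name, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
import re

def first_seen_sequences(text, delimiters='PDQE'):
    """Hypothetical helper: the text preceding the first occurrence of each
    delimiter, listed in the order the delimiters first appear."""
    seen = {}
    # re.escape keeps the character class valid if a delimiter is e.g. ']' or '^'.
    pattern = r'(.*?)([{}])'.format(re.escape(delimiters))
    for m in re.finditer(pattern, text):
        # setdefault only stores the value the first time a delimiter is met.
        seen.setdefault(m[2], m[1])
    return list(seen.values())

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"
print(first_seen_sequences(s))
# ['FFFFRRFFFFFFF', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '']
```

Unlike the expected output in the question, this list has no trailing empty string for the text after the final E, and it matches the desired result only when each delimiter's first occurrence comes after the previous delimiter's last occurrence, as in this input.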
Update (to print the sequences together with the delimiters that follow them):
import re
s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"
for g in re.finditer(r'(.*?)([PDQE]+|\Z)', s):
    print(g[1], g[2])
Prints:
FFFFRRFFFFFFF PP
RRRRRRLLRLLRLLL PP
F PP
L PP
L PP
LF PP
FF P
FLR P
FFRRLLR P
F P
RFFFFFFFLF D
RRFRRFFFFFFFFR QEE
Use re.split with a character class [PQDE]:
import re
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
sequences = re.split(r'[PQDE]', s)
print(sequences)
Output:
['FFFFRRFFFFFFF', '', 'RRRRRRLLRLLRLLL', '', 'F', '', 'L', '', 'L', '', 'LF', '', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '', '']
If you want to split on runs of one or more consecutive delimiters:
import re
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
sequences = re.split(r'[PQDE]+', s)
print(sequences)
Output:
['FFFFRRFFFFFFF', 'RRRRRRLLRLLRLLL', 'F', 'L', 'L', 'LF', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '']
If you want to capture the delimiters:
import re
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
sequences = re.split(r'([PQDE])', s)
print(sequences)
Output:
['FFFFRRFFFFFFF', 'P', '', 'P', 'RRRRRRLLRLLRLLL', 'P', '', 'P', 'F', 'P', '', 'P', 'L', 'P', '', 'P', 'L', 'P', '', 'P', 'LF', 'P', '', 'P', 'FF', 'P', 'FLR', 'P', 'FFRRLLR', 'P', 'F', 'P', 'RFFFFFFFLF', 'D', 'RRFRRFFFFFFFFR', 'Q', '', 'E', '', 'E', '']
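When the delimiters arrive as a list, like seps in the question, the pattern can be built from it instead of being hard-coded. The sketch below assumes a non-empty list of non-empty delimiter strings; `split_on_any` is a hypothetical helper name, not part of the re module.

```python
import re

def split_on_any(text, delimiters):
    """Hypothetical helper: split text on runs of any of the given delimiters."""
    # re.escape guards against delimiters that are regex metacharacters, e.g. '.' or '|'.
    alternation = '|'.join(re.escape(d) for d in delimiters)
    # The non-capturing group keeps the delimiters out of the result.
    return re.split('(?:{})+'.format(alternation), text)

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
print(split_on_any(s, ['P', 'D', 'Q', 'E']))
# ['FFFFRRFFFFFFF', 'RRRRRRLLRLLRLLL', 'F', 'L', 'L', 'LF', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '']
```

The trailing empty string appears because the input ends with a delimiter; filter it out with a list comprehension if it is not wanted.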
This solution applies the delimiters one at a time, so you can control the order in which each one is applied:
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
spliters = 'PDQE'
for sp in spliters:
    if isinstance(s, str):
        s = s.split(sp)
    else:  # s is already a list of pieces
        s = [x.split(sp) for x in s]
        s = [item for sublist in s for item in sublist if item != '']  # flatten the list and drop empty strings
Output:
['FFFFRRFFFFFFF',
'RRRRRRLLRLLRLLL',
'F',
'L',
'L',
'LF',
'FF',
'FLR',
'FFRRLLR',
'F',
'RFFFFFFFLF',
'RRFRRFFFFFFFFR']
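For comparison, the same list can be produced without a loop or a regex by mapping every delimiter to whitespace first. This is an alternative sketch rather than the answer's method, and it assumes the input string contains no whitespace of its own.

```python
s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
delimiters = 'PDQE'

# Translation table that turns every delimiter character into a space.
table = str.maketrans(delimiters, ' ' * len(delimiters))

# str.split() with no argument splits on runs of whitespace and
# never returns empty strings, so no extra filtering is needed.
sequences = s.translate(table).split()
print(sequences)
```

This prints the same twelve strings as the output above.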