Python string split by multiple delimiters

Given a string: s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'

The delimiting characters are P, Q, D and E.

I want to be able to split the string on these characters.

Based on "Is it possible to split a string on multiple delimiters in order?", I have the following:

def splits(s,seps):
    l,_,r = s.partition(seps[0])
    if len(seps) == 1:
        return [l,r]
    return [l] + splits(r,seps[1:])

seps = ['P', 'D', 'Q', 'E']

sequences = splits(s, seps)

This gives me:

['FFFFRRFFFFFFF',
 'PRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLF',
 'RRFRRFFFFFFFFR',
 '',
 'E']

As we can see, the second entry still contains many P characters.

What I want is the run of characters that follows the last P, not the one that follows the first P (i.e., RFFFFFFFLF).

Also, the order in which the delimiting characters occur is not fixed.

Any solutions or hints on how to achieve this?

Update: The desired output is the set of strings between these delimiters (similar to the one shown), but subject to the last-occurrence condition above.

Update 2: Expected output:

['FFFFRRFFFFFFF',
 'RFFFFFFFLF',   # << this is where the output differs
 'RRFRRFFFFFFFFR',
 '',
 '']   # << the last E is 2 consecutive E with no other letters, hence should be empty

It sounds like you want to split on the whole span from a delimiter's first appearance to its last appearance.

([PDQE])(?:.*\1)?

Try the split pattern at regex101 and the PHP demo at 3v4l.org (it should behave similarly in Python).
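For Python, the pattern above can be passed to re.split. The following is only a sketch, not part of the original answer: because the backreference \1 needs a capturing group, re.split also returns the captured delimiter between the pieces, so every second element is dropped. The names parts and sequences are illustrative.

```python
import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

# ([PDQE]) captures one delimiter; (?:.*\1)? greedily extends the match
# to the last occurrence of that same delimiter, if there is one.
parts = re.split(r'([PDQE])(?:.*\1)?', s)

# parts alternates text, delimiter, text, ...; keep only the text pieces.
sequences = parts[::2]
print(sequences)
# ['FFFFRRFFFFFFF', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '']
```

Note that everything between the first and last occurrence of a delimiter is consumed, so a different delimiter that falls inside that span would be swallowed as well. For the sample string this does not happen.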

import re

s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"

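def get_sequences(s):
    # Map each delimiter to (text just before its first occurrence, order of appearance).
    seen_delimiters = {c: ('', None) for c in 'PDQE'}
    order = 0
    # Each match is a lazy run of text followed by one delimiter (or end of string).
    for g in re.finditer(r'(.*?)([PDQE]|\Z)', s):
        if g[2]:  # skip the final end-of-string match, which has no delimiter
            if seen_delimiters[g[2]][1] is None:
                seen_delimiters[g[2]] = (g[1], order)
                order += 1
    return seen_delimiters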

for k, (seq, order) in get_sequences(s).items():
    print('{}: order: {} seq: {}'.format(k, order, seq))

Prints:

P: order: 0 seq: FFFFRRFFFFFFF
D: order: 1 seq: RFFFFFFFLF
Q: order: 2 seq: RRFRRFFFFFFFFR
E: order: 3 seq: 

Update (to print each sequence together with the delimiters that follow it):

import re
s = "FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE"
for g in re.finditer(r'(.*?)([PDQE]+|\Z)', s):
    print(g[1], g[2])

Prints:

FFFFRRFFFFFFF PP
RRRRRRLLRLLRLLL PP
F PP
L PP
L PP
LF PP
FF P
FLR P
FFRRLLR P
F P
RFFFFFFFLF D
RRFRRFFFFFFFFR QEE

Use re.split with a character class [PQDE]:

import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'    
sequences = re.split(r'[PQDE]', s)
print(sequences)

Output:

['FFFFRRFFFFFFF', '', 'RRRRRRLLRLLRLLL', '', 'F', '', 'L', '', 'L', '', 'LF', '', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '', '', '']

If you want to split on one or more consecutive delimiters:

import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'    
sequences = re.split(r'[PQDE]+', s)
print(sequences)

Output:

['FFFFRRFFFFFFF', 'RRRRRRLLRLLRLLL', 'F', 'L', 'L', 'LF', 'FF', 'FLR', 'FFRRLLR', 'F', 'RFFFFFFFLF', 'RRFRRFFFFFFFFR', '']

If you want to capture the delimiters:

import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'    
sequences = re.split(r'([PQDE])', s)
print(sequences)

Output:

['FFFFRRFFFFFFF', 'P', '', 'P', 'RRRRRRLLRLLRLLL', 'P', '', 'P', 'F', 'P', '', 'P', 'L', 'P', '', 'P', 'L', 'P', '', 'P', 'LF', 'P', '', 'P', 'FF', 'P', 'FLR', 'P', 'FFRRLLR', 'P', 'F', 'P', 'RFFFFFFFLF', 'D', 'RRFRRFFFFFFFFR', 'Q', '', 'E', '', 'E', '']
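Since the captured result alternates between text and delimiter, the two can be paired up. This is a small sketch rather than part of the original answer; the names tokens and pairs are illustrative.

```python
import re

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'

# With a capturing group, re.split returns text, delimiter, text, ..., text.
tokens = re.split(r'([PQDE])', s)

# Pair each piece of text with the delimiter that ends it.
# The trailing text after the last delimiter has no partner and is dropped by zip.
pairs = list(zip(tokens[0::2], tokens[1::2]))

for text, delim in pairs:
    print(repr(text), delim)
```

From these pairs it is easy to pick, for example, the text in front of the only D, which is ('RFFFFFFFLF', 'D').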

This solution iterates over the delimiters one by one, so you control the order in which each of them is applied:

s = 'FFFFRRFFFFFFFPPRRRRRRLLRLLRLLLPPFPPLPPLPPLFPPFFPFLRPFFRRLLRPFPRFFFFFFFLFDRRFRRFFFFFFFFRQEE'
splitters = 'PDQE'
parts = [s]  # start with the whole string as a single piece
for sp in splitters:
    # split every current piece on this delimiter, flatten, and drop empty pieces
    parts = [item for piece in parts for item in piece.split(sp) if item != '']
print(parts)

Output:

['FFFFRRFFFFFFF',
 'RRRRRRLLRLLRLLL',
 'F',
 'L',
 'L',
 'LF',
 'FF',
 'FLR',
 'FFRRLLR',
 'F',
 'RFFFFFFFLF',
 'RRFRRFFFFFFFFR']
