Python split string by multiple delimiters following a hierarchy

I want to split a string only once, using multiple delimiters such as "and", "&" and "-", tried in that order of priority. Examples:

'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']

I tried the code below to achieve this:

import re
re.split('and|&|-', string, maxsplit=1)

It works for all cases except the last one. Because it does not follow the hierarchy, for the last one it returns:

'dsfsd - adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']

How can I achieve this?

This would be impractical with a single regular expression. You could get it to work with negative lookaheads, but the pattern gets more complicated with each additional delimiter. It's simple to do with plain str.split() and a short loop. Try the delimiters in priority order and check whether splitting on the current one gives you two elements. If it does, that's your answer. If not, move on to the next delimiter:

def split_new(inp, delims):
    # Try each delimiter in priority order; the first one that splits wins.
    for d in delims:
        result = inp.split(d, maxsplit=1)
        if len(result) == 2:
            return result

    return [inp]  # If nothing worked, return the input

To test this:

teststrs = ['121 34 adsfd' , 'dsfsd and adfd', 'dsfsd & adfd' , 'dsfsd - adfd' , 'dsfsd and adfd and adsfa' , 'dsfsd and adfd - adsfa' , 'dsfsd - adfd and adsfa' ]
for t in teststrs:
    print(repr(t), '->', split_new(t, ['and', '&', '-']))

Output:

'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']
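
One caveat with a plain substring split: "and" also matches inside longer words, so split_new('sand & sea', ['and', '&', '-']) returns ['s', ' & sea']. If that matters for your data, a possible variation is sketched below. The name split_whole_word is hypothetical, and treating only purely alphabetic delimiters as whole words is an assumption, not something the question specifies.

```python
import re

def split_whole_word(inp, delims):
    """Like split_new, but alphabetic delimiters match only as whole words."""
    for d in delims:
        # \b is only meaningful next to word characters, so apply it to
        # alphabetic delimiters such as "and" and leave symbols untouched.
        if d.isalpha():
            pattern = rf"\b{re.escape(d)}\b"
        else:
            pattern = re.escape(d)
        result = re.split(pattern, inp, maxsplit=1)
        if len(result) == 2:
            return result
    return [inp]  # no delimiter matched


print(split_whole_word('sand & sea', ['and', '&', '-']))
print(split_whole_word('dsfsd - adfd and adsfa', ['and', '&', '-']))
```

The priority order is still decided by the loop; the regex is used only to find one delimiter at a time.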

Try:

import re

tests = [
    ["121 34 adsfd", ["121 34 adsfd"]],
    ["dsfsd and adfd", ["dsfsd ", " adfd"]],
    ["dsfsd & adfd", ["dsfsd ", " adfd"]],
    ["dsfsd - adfd", ["dsfsd ", " adfd"]],
    ["dsfsd and adfd and adsfa", ["dsfsd ", " adfd and adsfa"]],
    ["dsfsd and adfd - adsfa", ["dsfsd ", " adfd - adsfa"]],
    ["dsfsd - adfd and adsfa", ["dsfsd - adfd ", " adsfa"]],
]

for s, result in tests:
    res = re.split(r"and|&(?!.*and)|-(?!.*and|.*&)", s, maxsplit=1)
    print(res)
    assert res == result

Prints:

['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']

Explanation:

The regex and|&(?!.*and)|-(?!.*and|.*&) uses 3 alternatives:

  1. We always match "and", or:
  2. We match "&" only if there is no "and" ahead of it (using the negative look-ahead (?!...)), or:
  3. We match "-" only if there is no "and" or "&" ahead of it.

We're using this pattern in re.split with maxsplit=1, so the string is split only on the first match.
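
Writing the look-aheads by hand gets tedious as the delimiter list grows, so the same idea can be generated from an ordered list. This is only a sketch: build_priority_pattern and split_by_priority are hypothetical helper names, and the re.DOTALL flag is an added assumption so that the look-ahead also sees delimiters after a newline.

```python
import re

def build_priority_pattern(delims):
    """Each delimiter may match only if no higher-priority one follows it."""
    parts = []
    for i, d in enumerate(delims):
        escaped = re.escape(d)
        higher = delims[:i]  # delimiters that outrank d
        if higher:
            ahead = "|".join(re.escape(h) for h in higher)
            parts.append(f"{escaped}(?!.*(?:{ahead}))")
        else:
            parts.append(escaped)  # top priority: always allowed to match
    return "|".join(parts)

def split_by_priority(s, delims):
    if not delims:
        return [s]
    pattern = build_priority_pattern(delims)
    # maxsplit=1 splits on the first (leftmost) permitted match only
    return re.split(pattern, s, maxsplit=1, flags=re.DOTALL)


for s in ['dsfsd - adfd and adsfa', 'dsfsd & adfd', '121 34 adsfd']:
    print(repr(s), '->', split_by_priority(s, ['and', '&', '-']))
```

A lower-priority delimiter that appears before a higher-priority one is rejected by its look-ahead, so the leftmost match that survives is the first occurrence of the highest-priority delimiter present.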

You can keep a list of the delimiters, ordered by priority. Then you can combine re.findall with re.split: re.findall collects the delimiters that actually occur in the string, and re.split then splits only on the one that ranks first in ops (the highest priority):

import re

def split_order(s):
    ops = ['and', '&', '-']  # delimiters, highest priority first
    # All delimiters found in s ("and" only when surrounded by whitespace)
    r = re.findall(r'(?<=\s)and(?=\s)|&|-', s)
    if not r:
        return [s]  # no delimiter present
    best = min(r, key=ops.index)  # highest-priority delimiter found
    # Split once, on that delimiter only
    return re.split(re.escape(best), s, maxsplit=1)


vals = ['121 34 adsfd' , 'dsfsd and adfd', 'dsfsd & adfd' , 'dsfsd - adfd' , 'dsfsd and adfd and adsfa' , 'dsfsd and adfd - adsfa' , 'dsfsd - adfd and adsfa' ]
for i in vals:
   print(split_order(i))

Output:

['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']
