Split a Python String Using Multiple Delimiters

I have a somewhat complex filename following the pattern s[num][alpha1][alpha2].ext that I'm trying to tokenize. The lexicons from which alpha1 and alpha2 are drawn are contained in two lists.

I found the question at https://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters useful, but it didn't solve my problem.

Between [num] and [alpha1], a number precedes a letter (a fairly easy regex), but between [alpha1] and [alpha2], I'm splitting between two words.

Given the filename s13LoremIpsum.ext, for instance, I'd want ("s", "13", "Lorem", "Ipsum").
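The digit-to-letter boundary on its own can be handled with a short regex. The following is a minimal sketch, assuming the prefix is always a literal s followed by one or more digits as in the example; the helper name split_prefix is hypothetical and not part of the original question.

```python
import re

def split_prefix(filename):
    """Split off the leading 's' and the number; return (s, num, rest)."""
    # (s) is the literal prefix, (\d+) the number, (.*) everything after it.
    m = re.match(r"(s)(\d+)(.*)", filename)
    if m is None:
        raise ValueError("unexpected filename: %r" % filename)
    return m.groups()

print(split_prefix("s13LoremIpsum.ext"))  # ('s', '13', 'LoremIpsum.ext')
```

This leaves the harder part, splitting LoremIpsum into two lexicon words, untouched.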

What would be the best way to accomplish this?

Note that in this particular case, [alpha2] is a single letter, but I'm interested in solutions for both this case and the general case where [alpha1] and [alpha2] are words of arbitrary length. Note also that the general case could introduce ambiguity if combining words from the respective lexicons allows more than one possible split, e.g.:

alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
# How will we split?
splitString == ("a", "bc")
# --OR--
splitString == ("ab", "c")

Solving this ambiguity is a secondary concern, however.
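To see when the ambiguity actually arises, one option is to enumerate every valid split instead of stopping at the first. This is only a sketch: the function name all_splits is hypothetical, and the input "abc" is an assumption inferred from the two candidate splits above ("a" + "bc" and "ab" + "c").

```python
def all_splits(text, alpha1, alpha2):
    """Return every (a1, a2) pair from the lexicons with a1 + a2 == text."""
    second = set(alpha2)  # set lookup for the remainder
    return [
        (a1, text[len(a1):])
        for a1 in alpha1
        if text.startswith(a1) and text[len(a1):] in second
    ]

alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
print(all_splits("abc", alpha1, alpha2))  # [('a', 'bc'), ('ab', 'c')]
```

A result with more than one pair means the filename is ambiguous under the given lexicons; an empty list means it cannot be split at all.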

import re

alpha1, alpha2 = ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]
# One capture group per token; each lexicon becomes a regex alternation.
pattern = re.compile(r"(s)(\d+)(" + "|".join(alpha1) + ")(" + "|".join(alpha2) + ")")
data = "s13LoremIpsum.ext"
# groups() returns the four captured tokens in order.
result = list(pattern.match(data).groups())
print(result)

Output

['s', '13', 'Lorem', 'Ipsum']

The actual compiled pattern can be checked like this:

print(pattern.pattern)

which prints

(s)(\d+)(a|ab|Lorem)(bc|c|Ipsum)
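Two caveats apply to a pattern built this way: lexicon words containing regex metacharacters (such as . or +) would be interpreted as regex syntax, and Python tries the alternatives in the order listed, so a filename like s7abc.ext would match as "a" followed by "bc". The following is a hedged variation, not the original answer's code: it assumes Python 3, escapes each word with re.escape, sorts each lexicon longest-first (a policy choice that prefers the longest [alpha1]), and requires the whole filename to match, with the extension assumed to be a dot followed by word characters. The names build_pattern and tokenize are hypothetical.

```python
import re

def build_pattern(alpha1, alpha2):
    """Compile s<num><alpha1><alpha2>.<ext> from escaped, longest-first lexicons."""
    def alternation(words):
        # Longest words first so a short word cannot shadow a longer one;
        # re.escape guards against regex metacharacters in the lexicon.
        ordered = sorted(words, key=len, reverse=True)
        return "|".join(re.escape(w) for w in ordered)
    return re.compile(
        r"(s)(\d+)(%s)(%s)\.\w+" % (alternation(alpha1), alternation(alpha2))
    )

def tokenize(filename, alpha1, alpha2):
    """Return the four tokens as a list, or None if the filename does not fit."""
    m = build_pattern(alpha1, alpha2).fullmatch(filename)
    return list(m.groups()) if m else None

alpha1, alpha2 = ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]
print(tokenize("s13LoremIpsum.ext", alpha1, alpha2))  # ['s', '13', 'Lorem', 'Ipsum']
print(tokenize("s7abc.ext", alpha1, alpha2))          # ['s', '7', 'ab', 'c']
```

Because the regex engine backtracks, a longest-first [alpha1] that leaves no valid [alpha2] is abandoned in favor of a shorter one, so the sort only decides which split wins when several are possible.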
