Splitting a Python string with multiple delimiters

Question

I have a somewhat complex filename following the pattern s[num][alpha1][alpha2].ext that I'm trying to tokenize. The lexicons from which alpha1 and alpha2 are drawn are contained in two lists.

I found the question at https://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters useful, but it didn't solve my problem.

Between [num] and [alpha1], a number precedes a letter (a fairly easy regex), but between [alpha1] and [alpha2], I'm splitting between two words.

Given the filename s13LoremIpsum.ext, for instance, I'd want ("s", "13", "Lorem", "Ipsum").

What would be the best way to accomplish this?

Note that in this particular case, [alpha2] is a single letter, but I'm interested in solutions for both this case and the general case where [alpha1] and [alpha2] are words of arbitrary length. Note also that the general case could introduce ambiguity if combining words from the respective lexicons allows more than one possible split, e.g.

alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
# How will we split "abc"?
# splitString == ("a", "bc")
# --OR--
# splitString == ("ab", "c")

Solving this ambiguity is a secondary concern, however.
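To make the ambiguity concrete, here is a minimal sketch that lists every way a string can be split into one word from each lexicon. The helper name all_splits is hypothetical and not part of the original question; it assumes the string consists of exactly one alpha1 word followed by one alpha2 word.

```python
def all_splits(text, alpha1, alpha2):
    """Return every (first, second) pair whose concatenation equals text."""
    second_words = set(alpha2)
    return [
        (first, text[len(first):])
        for first in alpha1
        # Keep the candidate only if the remainder is a known alpha2 word.
        if text.startswith(first) and text[len(first):] in second_words
    ]


print(all_splits("abc", ["a", "ab"], ["bc", "c"]))
# [('a', 'bc'), ('ab', 'c')]
```

A result with more than one pair signals an ambiguous input, which the caller can then resolve by whatever rule suits the data.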

Answer 1

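import re
alpha1, alpha2 = ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]
# Build one capture group per token; each lexicon becomes an alternation.
pattern = re.compile(r"(s)(\d+)(" + "|".join(alpha1) + ")(" + "|".join(alpha2) + ")")
data = "s13LoremIpsum.ext"
# match() anchors at the start of the string; groups 1-4 are the tokens.
result = [pattern.match(data).group(i) for i in range(1, 5)]
print(result)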

Output

['s', '13', 'Lorem', 'Ipsum']
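The snippet above raises an AttributeError when the filename does not match, and it would misbehave if a lexicon word contained a regex metacharacter. A possible hardening, offered only as a sketch: the function name tokenize and the assumption that the extension is a dot followed by word characters are mine, not the answer's.

```python
import re


def tokenize(filename, alpha1, alpha2):
    """Split 's<num><alpha1><alpha2>.<ext>' into four tokens, or return None."""
    # re.escape keeps lexicon words literal even if they contain regex syntax.
    first = "|".join(re.escape(word) for word in alpha1)
    second = "|".join(re.escape(word) for word in alpha2)
    # Assumption: the extension is a dot followed by word characters.
    tokenizer = re.compile(rf"(s)(\d+)({first})({second})\.\w+")
    match = tokenizer.fullmatch(filename)
    return match.groups() if match else None


lexicon1, lexicon2 = ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]
print(tokenize("s13LoremIpsum.ext", lexicon1, lexicon2))
# ('s', '13', 'Lorem', 'Ipsum')
print(tokenize("s13Unknown.ext", lexicon1, lexicon2))
# None
```

Using fullmatch means the whole filename has to fit the pattern, so trailing junk is rejected instead of silently ignored.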

The actual compiled pattern can be checked like this:

print(pattern.pattern)

which prints

(s)(\d+)(a|ab|Lorem)(bc|c|Ipsum)
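Python's re module tries the alternatives in a group from left to right, so for an ambiguous input the order of the words in each list decides which split is returned. The sketch below illustrates this on the ambiguous lexicons from the question; first_split is a hypothetical helper, and sorting longest-first is just one possible tie-breaking rule, not something the answer prescribes.

```python
import re


def first_split(text, alpha1, alpha2):
    """Return the first split the regex engine finds, or None."""
    first = "|".join(map(re.escape, alpha1))
    second = "|".join(map(re.escape, alpha2))
    match = re.fullmatch(f"({first})({second})", text)
    return match.groups() if match else None


# With "a" listed before "ab", the shorter first word wins.
print(first_split("abc", ["a", "ab"], ["bc", "c"]))
# ('a', 'bc')

# Sorting alpha1 longest-first makes the engine prefer the longer first word.
longest_first = sorted(["a", "ab"], key=len, reverse=True)
print(first_split("abc", longest_first, ["bc", "c"]))
# ('ab', 'c')
```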

Answer 1
3 2014-01-14 17:53:15
