Python regex to split on certain patterns with skip patterns
I want to split a Python string on certain patterns but not others. For example, I have the string
Joe, Dave, Professional, Ph.D. and Someone else
I want to split on \sand\s and on , but not on , Ph.D.
How can this be accomplished with a Python regex?
You can use:
re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', 'Joe, Dave, Professional, Ph.D. and Someone else')
Result:
['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
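To see what the negative lookahead (?!\s*Ph\.D\.) contributes, here is a small sketch that runs the same split with and without it. The variable names are made up for illustration and are not part of the answer.

```python
import re

text = 'Joe, Dave, Professional, Ph.D. and Someone else'

# Without the lookahead, every comma is a separator.
naive = re.split(r'\s+and\s+|,\s*', text)

# With the lookahead, a comma followed by optional whitespace and
# 'Ph.D.' is not treated as a separator.
guarded = re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', text)

print(naive)    # ['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']
print(guarded)  # ['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
```

The lookahead consumes no characters, so 'Ph.D.' stays in the output attached to 'Professional'.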
You can do this either with regular expressions or with plain string operations (i.e. str.split()).
Here's an example that shows how to do it using only plain string operations:
>>> DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
>>> IGNORE_THESE = frozenset([',', 'and'])
>>> PRUNED_DATA = [d.strip(',') for d in DATA.split(' ') if d not in IGNORE_THESE]
>>> print(PRUNED_DATA)
['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone', 'else']
I'm sure there's some complex regular expression you could use, but this seems very straightforward to me and quite maintainable.
I hope you're not trying to parse natural language; for that, I would use another library such as NLTK.
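The snippet above splits on every space, so 'Someone else' and 'Professional, Ph.D.' come apart. As a rough sketch of how the same no-regex idea could keep them together, the following assumes the separators are exactly ', ' and ' and ', and that the suffixes to keep attached are known in advance. SUFFIXES and split_plain are hypothetical names, not part of the answer.

```python
DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

# Assumption: the set of suffixes that must stay glued to the previous item.
SUFFIXES = frozenset(['Ph.D.'])

def split_plain(text):
    """Split on ', ' and ' and ' using only str methods."""
    parts = []
    # Normalise ' and ' to ', ' so a single split handles both separators.
    for piece in text.replace(' and ', ', ').split(', '):
        if piece in SUFFIXES and parts:
            # Re-attach the suffix to the item it belongs to.
            parts[-1] += ', ' + piece
        else:
            parts.append(piece)
    return parts

print(split_plain(DATA))
# ['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
```

This only works for input that is as regular as the example; irregular spacing would need extra handling.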
import re
DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
regx = re.compile(r'\s*(?:,|and)\s*')
print(regx.split(DATA))
Result:
['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']
Where's the difficulty?
Note that with (?:,|and) the delimiters don't appear in the result, while with (,|and) the result would be
['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else']
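A minimal side-by-side sketch of that difference, using the same DATA string (the variable names are illustrative only): re.split keeps the text matched by a capturing group in its output and drops it for a non-capturing group.

```python
import re

DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

# Non-capturing group: delimiters are discarded.
without_delims = re.split(r'\s*(?:,|and)\s*', DATA)

# Capturing group: each matched delimiter is inserted into the result list.
with_delims = re.split(r'\s*(,|and)\s*', DATA)

print(without_delims)
# ['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']
print(with_delims)
# ['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else']
```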
Errr... the difficulty is that with
DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'
the result is
['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else']
Corrected:
regx = re.compile(r'\s+and\s+|\s*,\s*')
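As a quick check of the correction, this sketch runs both patterns on the string containing 'Handicaped'. The names loose and strict are chosen here for illustration.

```python
import re

DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'

# First attempt: 'and' may match inside a word because \s* allows zero spaces.
loose = re.compile(r'\s*(?:,|and)\s*')

# Corrected pattern: 'and' must have whitespace on both sides.
strict = re.compile(r'\s+and\s+|\s*,\s*')

print(loose.split(DATA))
# ['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else']
print(strict.split(DATA))
# ['Joe', 'Dave', 'Professional', 'Handicaped', 'Ph.D.', 'Someone else']
```

Requiring \s+ around 'and' is what prevents the split inside 'Handicaped'; a word boundary (\band\b) would be another way to get a similar effect.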
Errr... ahem... ahem...
Sorry, I hadn't noticed that Professional, Ph.D. must not be split. But what is the criterion for not splitting on the comma in this string?
I chose this criterion: split only on "a comma not followed by a string having dots before the next comma".
Another problem arises when whitespace and the word 'and' are mixed.
And there's also the problem of leading and trailing whitespace.
Finally, I managed to write a regex pattern that handles many more cases than the previous one, even if some cases are somewhat artificial (like a stray 'and' at the end of a string; why not at the beginning too? etc.):
import re
regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*|\s*and[,\s]+|[.\s]*\Z|\A\s*')
DATA = ' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
print(repr(DATA))
print()
print(regx.split(DATA))
Result:
' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
['', 'Joe', '', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else', '', '']
(On Python 3.7 and later, re.split also splits on the empty match at the end of the string, so the list ends with one more empty string.)
With print([x for x in regx.split(DATA) if x]), we obtain:
['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else']
Compare this with the result of Qtax's regex on the same string:
[' Joe ', 'and', 'Dave ', 'Professional, Ph.D.', 'Handicapped', 'handyman ', 'and Someone else', '. .']
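Because the long pattern is hard to read on one line, here is the same pattern laid out with re.VERBOSE and commented piece by piece. This is only a restatement for readability; the comments are my reading of each branch, and SPLITTER and split_fields are hypothetical names.

```python
import re

SPLITTER = re.compile(r"""
      \s* ,                            # a comma, with any whitespace before it,
      (?! (?:[^,.]+?\.)+ [^,.]*? , )   # unless dotted text comes before the next comma
      (?: \s and [,\s]+ | \s )*        # then swallow whitespace and ' and' runs
    | \s* and [,\s]+                   # or: 'and' followed by commas or whitespace
    | [.\s]* \Z                        # or: trailing dots and whitespace
    | \A \s*                           # or: leading whitespace
""", re.VERBOSE)

def split_fields(text):
    """Split text with SPLITTER and drop the empty strings."""
    return [part for part in SPLITTER.split(text) if part]

DATA = ' Joe ,and, Dave , Professional, Ph.D., Handicapped and handyman , and Someone else and . .'
print(split_fields(DATA))
# ['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else']
```

Filtering out empty strings also makes the result independent of how a given Python version treats empty matches in re.split.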