
python regex to split on certain patterns with skip patterns

I want to split a Python string on certain patterns but not others. For example, I have the string

Joe, Dave, Professional, Ph.D. and Someone else

I want to split on \sand\s and on commas, but not on the comma in ", Ph.D."

How can this be accomplished with a Python regex?

You can use:

re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', 'Joe, Dave, Professional, Ph.D. and Someone else')

Result:

['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
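As a minimal sketch of how this pattern works: the negative lookahead (?!\s*Ph\.D\.) stops the comma alternative from matching when the comma is immediately followed by "Ph.D.", while \s+and\s+ handles the word delimiter.

```python
import re

# The comma alternative only matches when the comma is NOT followed by
# optional whitespace and the literal 'Ph.D.' (negative lookahead).
pattern = re.compile(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*')
parts = pattern.split('Joe, Dave, Professional, Ph.D. and Someone else')
print(parts)  # ['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
```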

You can do this with either regular expressions or plain string manipulation operations (i.e. str.split()).

Here's an example that shows how to do this using only plain string manipulation operations:

>>> DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
>>> IGNORE_THESE = frozenset([',', 'and'])

>>> PRUNED_DATA = [d.strip(',') for d in DATA.split(' ') if d not in IGNORE_THESE]
>>> print(PRUNED_DATA)
['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone', 'else']

I'm sure there is some complex regular expression you could use, but this seems very straightforward to me and quite maintainable.

I hope you're not trying to parse natural language; for that, I would use another library such as NLTK.

import re

DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

regx = re.compile(r'\s*(?:,|and)\s*')

print(regx.split(DATA))

Result:

['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']

Where's the difficulty?

Note that with (?:,|and) the delimiters don't appear in the result, while with (,|and) the result would be

['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else'] 
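A sketch of that difference (assuming the capturing-group variant meant here is (,|and)): re.split returns the text captured by each group as extra elements of the result list.

```python
import re

# A capturing group (,|and) makes re.split keep each matched delimiter
# in the output, unlike the non-capturing (?:,|and) form.
regx = re.compile(r'\s*(,|and)\s*')
parts = regx.split('Joe, Dave, Professional, Ph.D. and Someone else')
print(parts)
# ['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else']
```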

Edit 1

errrr.... the difficulty is that with

DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'

the result is

['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else']


Corrected:

regx = re.compile(r'\s+and\s+|\s*,\s*')
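A sketch of the corrected pattern applied to the troublesome string: requiring whitespace on both sides of 'and' keeps it from matching inside a word.

```python
import re

# '\s+and\s+' only matches 'and' as a standalone word, so the 'and'
# inside 'Handicaped' is left alone; commas still split as before.
regx = re.compile(r'\s+and\s+|\s*,\s*')
DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'
parts = regx.split(DATA)
print(parts)
# ['Joe', 'Dave', 'Professional', 'Handicaped', 'Ph.D.', 'Someone else']
```

Note that this version still splits "Professional, Ph.D." at the comma, which is what Edit 2 below goes on to address.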


Edit 2

errrr.. ahem... ahem...

Sorry, I hadn't noticed that "Professional, Ph.D." must not be split. But what is the criterion for not splitting at the comma in this string?

I chose this criterion: "a comma not followed by a string containing dots before the next comma".

Another problem is when whitespace and the word 'and' are mixed together.

And there's also the problem of leading and trailing whitespace.

Finally I succeeded in writing a regex pattern that handles many more cases than the previous one, even if some of those cases are somewhat artificial (like a stray 'and' at the end of the string; why not at the beginning too? etc.):

import re

regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*|\s*and[,\s]+|[.\s]*\Z|\A\s*')

DATA = '   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman  , and   Someone else  and  . .'
print(repr(DATA))
print()
print(regx.split(DATA))

Result:

'   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman  , and   Someone else  and  . .'

['', 'Joe', '', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else', '', '']


With print([x for x in regx.split(DATA) if x]), we obtain:

['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else']


Compared to the result of Qtax's regex on the same string:

['   Joe ', 'and', 'Dave ', 'Professional, Ph.D.', 'Handicapped', 'handyman  ', 'and   Someone else', '. .']

Statement: the technical posts on this site are licensed under CC BY-SA 4.0; if you need to republish, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 