
python regex to split on certain patterns with skip patterns

I want to split a Python string on certain patterns but not others. For example, I have the string

Joe, Dave, Professional, Ph.D. and Someone else

I want to split on \sand\s and on commas, but not on the comma in ", Ph.D."

How can this be accomplished with a Python regex?

You can use:

re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', 'Joe, Dave, Professional, Ph.D. and Someone else')

Result:

['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
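As a minimal sketch of how this pattern works: the negative lookahead (?!\s*Ph\.D\.) stops the comma alternative from matching when the comma is immediately followed by "Ph.D.", while \s+and\s+ handles the word delimiter.

```python
import re

# The comma alternative only matches when the comma is NOT followed by
# optional whitespace and the literal 'Ph.D.' (negative lookahead).
pattern = re.compile(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*')
parts = pattern.split('Joe, Dave, Professional, Ph.D. and Someone else')
print(parts)  # ['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']
```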

You can do this with either regular expressions or plain string manipulation operations (i.e. str.split()).

Here's an example that shows how to do this using only plain string manipulation operations:

>>> DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'
>>> IGNORE_THESE = frozenset([',', 'and'])

>>> PRUNED_DATA = [d.strip(',') for d in DATA.split(' ') if d not in IGNORE_THESE]
>>> print(PRUNED_DATA)
['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone', 'else']

I'm sure there is some complex regular expression you could use, but this seems very straightforward to me and quite maintainable.

I hope you're not trying to parse natural language; for that, I would use another library such as NLTK.

import re

DATA = 'Joe, Dave, Professional, Ph.D. and Someone else'

regx = re.compile(r'\s*(?:,|and)\s*')

print(regx.split(DATA))

Result:

['Joe', 'Dave', 'Professional', 'Ph.D.', 'Someone else']

Where's the difficulty?

Note that with (?:,|and) the delimiters don't appear in the result, while with (,|and) the result would be

['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else'] 
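A sketch of that difference (assuming the capturing-group variant meant here is (,|and)): re.split returns the text captured by each group as extra elements of the result list.

```python
import re

# A capturing group (,|and) makes re.split keep each matched delimiter
# in the output, unlike the non-capturing (?:,|and) form.
regx = re.compile(r'\s*(,|and)\s*')
parts = regx.split('Joe, Dave, Professional, Ph.D. and Someone else')
print(parts)
# ['Joe', ',', 'Dave', ',', 'Professional', ',', 'Ph.D.', 'and', 'Someone else']
```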

Edit 1

errrr.... the difficulty is that with

DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'

the result is

['Joe', 'Dave', 'Professional', 'H', 'icaped', 'Ph.D.', 'Someone else']


Corrected:

regx = re.compile(r'\s+and\s+|\s*,\s*')
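A sketch of the corrected pattern applied to the troublesome string: requiring whitespace on both sides of 'and' keeps it from matching inside a word.

```python
import re

# '\s+and\s+' only matches 'and' as a standalone word, so the 'and'
# inside 'Handicaped' is left alone; commas still split as before.
regx = re.compile(r'\s+and\s+|\s*,\s*')
DATA = 'Joe, Dave, Professional, Handicaped, Ph.D. and Someone else'
parts = regx.split(DATA)
print(parts)
# ['Joe', 'Dave', 'Professional', 'Handicaped', 'Ph.D.', 'Someone else']
```

Note that this version still splits "Professional, Ph.D." at the comma, which is what Edit 2 below goes on to address.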


Edit 2

errrr.. ahem... ahem...

Sorry, I hadn't noticed that "Professional, Ph.D." must not be split. But what is the criterion for not splitting at the comma in this string?

I chose this criterion: "a comma not followed by a string containing dots before the next comma".

Another problem is when whitespace and the word 'and' are mixed together.

And there's also the problem of leading and trailing whitespace.

Finally I succeeded in writing a regex pattern that handles many more cases than the previous one, even if some of those cases are somewhat artificial (like a stray 'and' at the end of the string; why not at the beginning too? etc.):

import re

regx = re.compile(r'\s*,(?!(?:[^,.]+?\.)+[^,.]*?,)(?:\sand[,\s]+|\s)*|\s*and[,\s]+|[.\s]*\Z|\A\s*')

DATA = '   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman  , and   Someone else  and  . .'
print(repr(DATA))
print()
print(regx.split(DATA))

Result:

'   Joe ,and, Dave , Professional, Ph.D., Handicapped   and handyman  , and   Someone else  and  . .'

['', 'Joe', '', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else', '', '']


With print([x for x in regx.split(DATA) if x]), we obtain:

['Joe', 'Dave', 'Professional, Ph.D.', 'Handicapped', 'handyman', 'Someone else']


Compared to the result of Qtax's regex on the same string:

['   Joe ', 'and', 'Dave ', 'Professional, Ph.D.', 'Handicapped', 'handyman  ', 'and   Someone else', '. .']

Statement: the technical posts on this site are licensed under CC BY-SA 4.0; if you need to republish, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 