
Regex to split a string on comma ",", but only if the comma is not between digits

How can I split this string into separate items?

Given string s = "Consumer notes, State Consumer Forum, Rs.50,000 penatly against ICICI,Andhra Pradesh"

I want the result to be ["Consumer notes", "State Consumer Forum", "Rs.50,000 penatly against ICICI", "Andhra Pradesh"]

I am new to regex and have not been able to write a pattern for this.

Currently I am doing this:

s = "Consumer notes, State Consumer Forum, Rs.50,000 penatly against ICICI,Andhra Pradesh"
result = set(s.split(','))  # str.split takes a literal separator, not a regex
print(result)

Result:
set(['Andhra Pradesh', ' Rs.50', 'Consumer notes', '000 penatly against ICICI', ' State Consumer Forum'])

This gives me 5 items, because it also splits the amount Rs.50,000 into two parts, and I do not want that split. How can I solve it?

In [1]: s = "Consumer notes, State Consumer Forum, Rs.50,000 penatly against ICICI,Andhra Pradesh"

In [2]: import re

In [3]: re.split(r'(?<!\d),(?!\d)',s)
Out[3]: 
['Consumer notes',
 ' State Consumer Forum',
 ' Rs.50,000 penatly against ICICI',
 'Andhra Pradesh']

You can use re.split(r'(?<!\d),\s*(?!\d)', s) to also remove the spaces after the commas.
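As a quick check of that whitespace-consuming variant, here is a minimal Python 3 sketch. The names `pattern` and `parts` are my own, not from the answer. Inside a raw string the lookarounds are written with single backslashes.

```python
import re

s = "Consumer notes, State Consumer Forum, Rs.50,000 penatly against ICICI,Andhra Pradesh"

# A comma not preceded by a digit, then any run of whitespace,
# with the next character not being a digit.
pattern = re.compile(r'(?<!\d),\s*(?!\d)')

parts = pattern.split(s)
print(parts)
# ['Consumer notes', 'State Consumer Forum', 'Rs.50,000 penatly against ICICI', 'Andhra Pradesh']
```

Because the whitespace is consumed by the separator itself, no extra strip() call is needed on the pieces for this input.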

You can use either

(?<!\d),|,(?!\d)

Or

,(?!(?<=\d.)\d)

See the regex #1 demo and regex #2 demo.

Details

  • (?<!\d), - a comma not immediately preceded by a digit
  • | - or
  • ,(?!\d) - a comma not immediately followed by a digit
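To make the effect of the alternation concrete, the sketch below compares it with the stricter pattern (?<!\d),(?!\d) from the earlier answer, which requires both conditions at once. The test string `t` and the variable names are made up for illustration. The two patterns only differ when a comma has a digit on exactly one side.

```python
import re

# Splits unless the comma has a digit on BOTH sides.
alternation = re.compile(r'(?<!\d),|,(?!\d)')
# Splits only when the comma has a digit on NEITHER side.
conjunction = re.compile(r'(?<!\d),(?!\d)')

t = "Rs.50, ICICI,5 branches, 1,000 total"

loose = [p.strip() for p in alternation.split(t)]
strict = [p.strip() for p in conjunction.split(t)]

print(loose)   # ['Rs.50', 'ICICI', '5 branches', '1,000 total']
print(strict)  # ['Rs.50, ICICI,5 branches', '1,000 total']
```

For the string in the question both patterns give the same four pieces. The difference only matters if a number can sit directly next to a separating comma.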

The first pattern is not that efficient, for two reasons: 1) the alternation, and 2) the lookbehind at the start of the pattern, which makes the regex engine test every position in the string.

  • , - a comma that is...
  • (?!(?<=\d.)\d) - not immediately followed by a digit (the (?!...\d) part) that is itself immediately preceded by a digit and any one character (that character is in fact the comma, so . and , would work the same here).

The second pattern is much more efficient because the regex engine only needs to test the positions of the commas in the text.
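The following sketch checks that the two patterns split a few sample strings identically, then times them with timeit. The sample strings, `long_text`, and the variable names are assumptions of mine. The timings depend on the regex engine and the input, so no particular speed-up is claimed here.

```python
import re
import timeit

alternation = re.compile(r'(?<!\d),|,(?!\d)')
comma_first = re.compile(r',(?!(?<=\d.)\d)')

samples = [
    "Consumer notes, State Consumer Forum, Rs.50,000 penatly against ICICI,Andhra Pradesh",
    "Rs.50, ICICI,5 branches, 1,000 total",
    ",leading and trailing,",
    "1,2,3",
]

# Both patterns should produce identical splits on every sample.
for text in samples:
    assert alternation.split(text) == comma_first.split(text)

# Rough timing on a long input with few commas.
long_text = ("some words without any commas " * 50 + "Rs.50,000, next item ") * 100
for name, rx in [("alternation", alternation), ("comma_first", comma_first)]:
    seconds = timeit.timeit(lambda: rx.split(long_text), number=20)
    print(f"{name}: {seconds:.4f}s")
```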
