Python regex, conditional searching
I am trying to split this sentence:
text = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot " \
"for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this " \
"isn't true... Well, with a probability of .9 it isn't."
into the list below:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
Code:
print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )[^a-z]', text))
Output:
['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 "Adam Jones Jr. thinks he didn't. "]
OK, good, but it missed some sentences. Since the last [^a-z] isn't part of my group, is there a way to tell Python to continue searching from there?
EDIT:
This was achieved with a lookahead assertion, as suggested by @sputnick.
print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text))
Output:
['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 'Did he mind? ',
 "Adam Jones Jr. thinks he didn't. ",
 "In any case, this isn't true... "]
But we still need the last sentence. Any ideas?
Try this:
print(re.findall(r'([A-Z]+[^.].*?[a-z.][.?!] )(?=[^a-z])', text))
This uses a positive lookahead, so the character after the sentence is checked but not consumed. See http://www.regular-expressions.info/lookaround.html
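To see why the lookahead matters, here is a minimal sketch on a made-up string (the sample text and variable names are illustrative assumptions, not taken from the question). re.findall returns non-overlapping matches, so a trailing character class that is part of the match consumes the first letter of the next sentence, and that sentence can no longer match. Wrapping the class in (?=...) only asserts the character without consuming it.

```python
import re

# Illustrative sample text (an assumption, not the text from the question).
sample = "Aa. Bb. Cc. Dd."

# The trailing [A-Z] is part of the match, so it consumes the first
# letter of the next "sentence" and every other one is skipped.
consuming = re.findall(r'([A-Z][a-z]\. )[A-Z]', sample)

# Same pattern, but the trailing [A-Z] is only asserted, not consumed.
lookahead = re.findall(r'([A-Z][a-z]\. )(?=[A-Z])', sample)

print(consuming)  # ['Aa. ', 'Cc. ']
print(lookahead)  # ['Aa. ', 'Bb. ', 'Cc. ']
```

In both cases the final "Dd." is still missing, because nothing follows it for the trailing space and lookahead to match. That is the same reason the last sentence is dropped in the question.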
(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))
Try this. See the demo at https://regex101.com/r/gQ3kS4/45 and grab the captured groups with re.findall.
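As a rough sketch of how this pattern might be applied to the text from the question with re.findall (the variable names and the strip() call are my additions, not part of the answer): every capture after the first begins with the space that separates the sentences, so each one is stripped here.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

# A sentence ends at "." or "?" that is followed by whitespace or the end
# of the string, unless the mark follows something like "Mr" or "i.e".
pattern = r'(.+?)(?<=(?<![A-Z][a-z])(?<![a-z]\.[a-z])(?:\.|\?)(?=\s|$))'

# findall returns the single capture group; strip the leading separator space.
sentences = [s.strip() for s in re.findall(pattern, text)]

for sentence in sentences:
    print(sentence)
```

Unlike the pattern in the question, this one also returns the last sentence, because the lookahead accepts the end of the string as well as whitespace.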
Finally:
print(re.findall(r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])|.*.$', text))
The above works perfectly as needed and includes the last sentence. But I have no idea why |.*.$ worked. Please help me understand.
Output:
['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. ',
 'Did he mind? ',
 "Adam Jones Jr. thinks he didn't. ",
 "In any case, this isn't true... ",
 "Well, with a probability of .9 it isn't."]
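A possible explanation, sketched below with helper names (first, fallback, last) that are mine: the | makes .*.$ a second alternative that is tried only at positions where the first alternative fails. The first alternative needs a space after the closing punctuation, so it cannot match the final sentence. The fallback .*.$ then matches any run of characters up to the end of the string. Note that the second dot is unescaped, so it matches any character, not just a literal period.

```python
import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot "
        "for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this "
        "isn't true... Well, with a probability of .9 it isn't.")

first = r'[A-Z]+[^.].*?[a-z.][.?!] (?=[^a-z])'  # sentence followed by a space
fallback = r'.*.$'                               # anything up to the end of the string
last = "Well, with a probability of .9 it isn't."

# The first alternative needs a trailing space, so it fails on the last sentence.
first_on_last = re.search(first, last)

# The fallback matches the whole remainder, whatever it ends with.
fallback_on_last = re.fullmatch(fallback, last)
no_period = re.findall(fallback, "no period here")

# Combined, the fallback is only used where the first alternative fails.
sentences = re.findall(first + '|' + fallback, text)

print(first_on_last)                 # None
print(fallback_on_last is not None)  # True
print(no_period)                     # ['no period here']
print(len(sentences))                # 5
```

One consequence of this design is that the fallback does not check for a sentence at all. It happens to work here because every earlier sentence is matched by the first alternative, leaving only the last sentence for .*.$ to pick up.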