Python regex split but put end part of regex match back into string?

I'd like to find a regex that can break up paragraphs (long strings, no newline characters to worry about) into sentences, using the simple rule that any one of {., ?, !} followed by a whitespace character and then a capital letter marks the end of a sentence (I realize this is not a good rule for real life).

I've got something partly working, but it doesn't quite do the job:

import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
matchObj = re.split(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
print(matchObj)

prints

['', 'a b c FFF!', '', ' a b a a FFF. gegtat FFF.', '']

whereas I'd like to get:

['a b c FFF!', 'D a b a a FFF. gegtat FFF.']

So two questions.

  • Why are there empty members ('') in the results?

  • I understand why the D gets cut out of the split result: it's part of the first match. How can I structure my search differently so that the capital letter coming after the punctuation is put back and included with the next sentence? In this case, how can I get D to turn up in the second element of the split result?

I know I could accomplish this with some sort of for loop, peeling off the first result, adding back the capital letter and then doing it all over again, but this seems not so Pythonic. If regex is not the way to go here, is there something that still avoids the for loop?

Thanks for any suggestions.

  1. To solve the first problem (empty strings in the result returned by split()), use either findall() or finditer():

     >>> re.findall(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
     ['a b c FFF!', ' a b a a FFF. gegtat FFF.']

    You were seeing empty strings in the output because that's what split() is supposed to do: it splits the input string wherever the pattern matches, and when the pattern contains a capturing group, the captured text is returned as well. Here the two matches sit back to back and cover the whole string, so the pieces before, between and after them are all empty strings.

  2. For the second problem (the D missing from the output), use a lookahead assertion (?=...):

     >>> re.findall(r'(.*?\sFFF[\.|\?|\!])\s(?=[A-Z])', line)
     ['a b c FFF!', 'D a b a a FFF. gegtat FFF.']

    Lookaheads, negative lookaheads, lookbehinds and negative lookbehinds are four kinds of assertions that you can use to say "match this group only if it is followed/preceded by this other group, but don't consume the string".

  3. Reading your expression carefully, it seems you have misunderstood the syntax of the [...] operator. It seems you want to match one of ., ? and !.

    If that is the case, then you can rewrite [\.|\?|\!] as [.?!]:

     >>> re.findall(r'(.*?\sFFF[.?!])\s(?=[A-Z])', line)
     ['a b c FFF!', 'D a b a a FFF. gegtat FFF.']

    [.?!] is not only more compact, but also more correct: with [\.|\?|\!] you were also matching the | character (so that 'abc FFF|' was a valid match)!
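
For anyone who wants to run the three points above end to end, here is a minimal, self-contained sketch. The names SENTENCE, sentences, via_finditer and pieces are made up for illustration. The re.split() call with a lookbehind is an alternative that does not appear in the answer; it is included only to show that split() can also work once the delimiter is reduced to the whitespace itself, and unlike the findall() version it also returns the trailing 'A'.

```python
import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'

# Pattern from point 3: lazy prefix, "FFF" plus one of . ? !, then one
# whitespace character that must be followed by a capital letter.
# The lookahead (?=[A-Z]) checks for the capital without consuming it.
SENTENCE = re.compile(r'(.*?\sFFF[.?!])\s(?=[A-Z])')

# findall() returns the text of the single capturing group for each match.
sentences = SENTENCE.findall(line)
print(sentences)

# finditer() yields match objects instead; group(1) is the same captured text.
via_finditer = [m.group(1) for m in SENTENCE.finditer(line)]
print(via_finditer)

# Alternative (an assumption, not from the answer): split only on the
# whitespace that sits between end punctuation and a capital letter.
# No capturing group is used, so nothing extra is inserted into the result.
pieces = re.split(r'(?<=[.?!])\s+(?=[A-Z])', line)
print(pieces)

# The character class [\.|\?|\!] treats "|" as a literal character,
# so it accepts "FFF|", while [.?!] does not.
print(re.findall(r'FFF[\.|\?|\!]', 'abc FFF|'))
print(re.findall(r'FFF[.?!]', 'abc FFF|'))
```

Compiling the pattern once is just a convenience here; calling re.findall() directly with the same pattern string behaves the same way.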
