
Python match a paragraph with multiple line-breaks using regex

I am trying to match paragraphs using Python and the re module.

An example of the text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.

two or more line breaks here

Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

two or more line breaks here

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

This expression almost does the job:

paragraphs = re.findall(r'(?s)((?:[^\n][\n]?)+)', textContent)

But I want a split to happen only where there are two or more line breaks. Currently it matches too often.

Edit:

ART. WEFWEFEW
  1 SDVSDRG: **<at the moment it breaks here, but it shouldn't>**
     a. wevvdfvdfd
     b. sdfsdfsdfsdfsdfsdghtrhrth

Edit2:

ART. WEFWEFEW
   1 SDVSDRG: 
      **here are two line-breaks, but don't split this paragraph**
      **at the moment it breaks here, but it shouldn't**
     a. wevvdfvdfd
     b. sdfsdfsdfsdfsdfsdghtrhrth

Check out this regex, (?m)(?:.+(?:\n.)?)+, on RegEx101, where you can also get an explanation of it.
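To make the pattern easier to read without visiting RegEx101, here is a minimal sketch that spells it out with re.VERBOSE and runs it on a short, made-up string. The names PARAGRAPH and sample are illustrative only and are not part of the original answer. The (?m) flag is left out of the sketch because it only changes the behaviour of ^ and $, which this pattern does not use.

```python
import re

# Verbose form of the pattern above (the name PARAGRAPH is illustrative).
PARAGRAPH = re.compile(r"""
    (?:
        .+          # the rest of the current line (. does not match a newline)
        (?:\n.)?    # optionally one newline plus the first character of the next line
    )+              # repeat, so single line breaks stay inside the match
""", re.VERBOSE)

# Made-up sample: one single line break, then a run of three line breaks.
sample = "first line\nsecond line\n\n\nnext paragraph"
paragraphs = PARAGRAPH.findall(sample)
print(paragraphs)
```

Because \n. requires a non-newline character right after the line break, a match can continue across a single line break but stops as soon as two or more line breaks follow each other.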

Sample Python code that uses this regex:

import re
import pprint

textContent = '''Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et
accusam et justo duo dolores et ea rebum.

Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet.


Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna
aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren, no
sea takimata sanctus est Lorem ipsum dolor sit amet.



ART. WEFWEFEW
  1 SDVSDRG:
     a. wevvdfvdfd
     b. sdfsdfsdfsdfsdfsdghtrhrth'''

# Each match is one paragraph: runs of lines joined by single line breaks.
pprint.pprint(re.findall(r'(?m)(?:.+(?:\n.)?)+', textContent))

Output:

['Lorem ipsum dolor sit amet, consetetur sadipscing elitr,\n'
 'sed diam nonumy eirmod tempor invidunt ut labore et dolore\n'
 'magna aliquyam erat, sed diam voluptua. At vero eos et\n'
 'accusam et justo duo dolores et ea rebum.',
 'Stet clita kasd gubergren, no sea takimata sanctus est Lorem\n'
 'ipsum dolor sit amet.',
 'Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam\n'
 'nonumy eirmod tempor invidunt ut labore et dolore magna\n'
 'aliquyam erat, sed diam voluptua. At vero eos et accusam et\n'
 'justo duo dolores et ea rebum. Stet clita kasd gubergren, no\n'
 'sea takimata sanctus est Lorem ipsum dolor sit amet.',
 'ART. WEFWEFEW\n'
 '  1 SDVSDRG:\n'
 '     a. wevvdfvdfd\n'
 '     b. sdfsdfsdfsdfsdfsdghtrhrth']
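As a point of comparison, and not part of the original answer, the same separation can be sketched with re.split by treating any run of two or more line breaks as the delimiter. The helper name split_paragraphs and the sample string are assumptions made for illustration.

```python
import re

def split_paragraphs(text):
    """Split text wherever two or more consecutive line breaks occur.

    Hypothetical helper for illustration; not taken from the original answer.
    """
    # \n{2,} matches a run of at least two newlines; blank pieces are dropped.
    return [part for part in re.split(r"\n{2,}", text) if part.strip()]

sample = "line one\nline two\n\nsecond paragraph\n\n\n\nthird paragraph"
result = split_paragraphs(sample)
print(result)
```

This variant splits on the delimiter instead of matching the paragraphs themselves, so it assumes plain \n line endings and blank lines that contain no spaces or tabs.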

Demo on Rextester.

