Python: match a paragraph with multiple line breaks using regex
I am trying to match paragraphs using Python and the re module.
An example text:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
two or more line breaks here
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
two or more line breaks here
Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
This expression seems to almost do the job:
paragraphs = re.findall(r'(?s)((?:[^\n][\n]?)+)', textContent)
But I want to make sure it only splits where there are two or more line breaks. Currently it matches too often.
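A small reproduction can make it easier to see where this expression splits. The sketch below runs the expression exactly as written above on a made-up sample string (the sample is an assumption for illustration, not the real input) and prints each match with repr so the line breaks are visible.

```python
import re

# Hypothetical sample: two lines, one blank line, then a third line.
sample = "one\ntwo\n\nthree"

# The expression from the question, unchanged.
matches = re.findall(r'(?s)((?:[^\n][\n]?)+)', sample)

# repr() shows the newlines that each match contains.
for m in matches:
    print(repr(m))
```

On this sample the expression splits at the blank line and keeps the trailing newline in the first match. The (?s) flag changes nothing here, because it only affects the dot and the pattern contains none.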
Edit:
ART. WEFWEFEW
1 SDVSDRG: **<at the moment it breaks here, but it shouldn't>**
a. wevvdfvdfd
b. sdfsdfsdfsdfsdfsdghtrhrth
Edit 2:
ART. WEFWEFEW
1 SDVSDRG:
**here are two line breaks, but don't split this paragraph**
**at the moment it breaks here, but it shouldn't**
a. wevvdfvdfd
b. sdfsdfsdfsdfsdfsdghtrhrth
Check out this regex, (?m)(?:.+(?:\n.)?)+, on RegEx101, where you can also get an explanation of it.
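To make the pieces of that pattern easier to read, here is a sketch of the same expression written with re.VERBOSE. The (?m) flag is left out, since MULTILINE only changes how ^ and $ behave and the pattern uses neither. The names PARAGRAPH and sample are made up for illustration.

```python
import re

# Verbose form of the pattern; whitespace and comments inside it are ignored.
PARAGRAPH = re.compile(r"""
    (?:             # one unit, repeated one or more times
        .+          # one or more characters other than a newline
        (?:\n.)?    # optionally a newline plus the first character of the next line
    )+
""", re.VERBOSE)

# Hypothetical sample: a two-line paragraph, a blank line, a one-line paragraph.
sample = "first line\nsecond line\n\nnext paragraph"

paragraphs = PARAGRAPH.findall(sample)
print(paragraphs)
```

Because the optional group needs a non-newline character right after the newline, the match cannot continue across an empty line, which is what ends a paragraph.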
Sample Python code that uses this regex:
import re
import pprint

# Paragraphs are separated by blank lines; the last block has indented lines.
textContent = '''Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et
accusam et justo duo dolores et ea rebum.

Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet.

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna
aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren, no
sea takimata sanctus est Lorem ipsum dolor sit amet.

ART. WEFWEFEW
 1 SDVSDRG:
 a. wevvdfvdfd
 b. sdfsdfsdfsdfsdfsdghtrhrth'''

# findall returns one string per matched paragraph.
pprint.pprint(re.findall(r'(?m)(?:.+(?:\n.)?)+', textContent))
Output:
['Lorem ipsum dolor sit amet, consetetur sadipscing elitr,\n'
'sed diam nonumy eirmod tempor invidunt ut labore et dolore\n'
'magna aliquyam erat, sed diam voluptua. At vero eos et\n'
'accusam et justo duo dolores et ea rebum.',
'Stet clita kasd gubergren, no sea takimata sanctus est Lorem\n'
'ipsum dolor sit amet.',
'Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam\n'
'nonumy eirmod tempor invidunt ut labore et dolore magna\n'
'aliquyam erat, sed diam voluptua. At vero eos et accusam et\n'
'justo duo dolores et ea rebum. Stet clita kasd gubergren, no\n'
'sea takimata sanctus est Lorem ipsum dolor sit amet.',
'ART. WEFWEFEW\n'
' 1 SDVSDRG:\n'
' a. wevvdfvdfd\n'
' b. sdfsdfsdfsdfsdfsdghtrhrth']
Demo on Rextester.
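One thing to watch with the pattern above: because (?:\n.)? consumes the first character of the following line, a line that is exactly one character long leaves nothing for the next .+ and ends the match early. The sketch below shows two alternatives that avoid this. They are not part of the original answer, and the function names and sample string are made up for illustration. Both treat a run of two or more newlines as the paragraph separator.

```python
import re

def split_paragraphs(text):
    """Split on runs of two or more consecutive newlines."""
    # Strip outer newlines first so no empty leading/trailing piece appears.
    parts = re.split(r"\n{2,}", text.strip("\n"))
    return [p for p in parts if p]

def find_paragraphs(text):
    """findall variant: a non-empty line plus any directly following non-empty lines."""
    return re.findall(r".+(?:\n.+)*", text)

# Hypothetical sample containing one-character lines ("c" and "d").
sample = "ab\nc\nd\n\n\nsecond paragraph\n"

print(split_paragraphs(sample))
print(find_paragraphs(sample))
# For comparison, the pattern from the answer on the same sample:
print(re.findall(r"(?:.+(?:\n.)?)+", sample))
```

As in the original pattern, a line that contains only spaces still counts as non-empty here, so it does not split a paragraph.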