Python regex to replace single newlines and ignore sequences of two or more newlines
I'm using Python 3.6 through 3.8.
I'm trying to replace any instance of a single newline with a single space in text read from a file. My goal is to compress paragraphs into single lines of text for re-wrapping by textwrap. Since textwrap only works on a single paragraph, I need an easy way to detect/delineate paragraphs, and compressing each one into a single line of text seems the most expedient. For this to work, any sequence of two or more newlines defines a paragraph boundary and should be left alone.
My first try was with lookahead/lookbehind assertions, to insist that any newline I replace not be adjacent to another newline:
re.sub(r'(?<!\n)\n(?!\n)', ' ', input_text)
This works fine in most circumstances. However, I quickly ran into a case where someone had a paragraph separator that contained other whitespace.
This is some sample text beginning with a short paragraph.\n\nThis second paragraph is long enough to be split across lines, so it contains\na single newline in the middle.\n \nThis third paragraph has an unusual separator before it; a newline followed by\na space followed by another newline. It's a special case that needs to be\nhandled.
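To make the failure concrete, here is a minimal sketch that applies the lookaround pattern to the sample text. The variable names `sample` and `result` are only illustrative.

```python
import re

sample = (
    "This is some sample text beginning with a short paragraph.\n\n"
    "This second paragraph is long enough to be split across lines, so it contains\n"
    "a single newline in the middle.\n \n"
    "This third paragraph has an unusual separator before it; a newline followed by\n"
    "a space followed by another newline. It's a special case that needs to be\n"
    "handled."
)

# Replace a newline only when neither neighbour is a newline.
result = re.sub(r'(?<!\n)\n(?!\n)', ' ', sample)

# The plain "\n\n" separator survives ...
print("paragraph.\n\nThis second" in result)
# ... but each newline of "\n \n" looks like a single newline, so the
# separator before the third paragraph collapses into three spaces.
print("middle.   This third" in result)
```

Both checks print `True`: the second and third paragraphs end up merged into one line.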
My lookahead/lookbehind tactic won't work here, because the required lookbehind would need to be of indeterminate length (maybe the space is there, maybe it isn't), and that's not allowed.
# this is an error
re.sub(r'(?<!\n\s*)\n(?!\s*\n)', ' ', input_text)
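As a quick check of that claim, the sketch below compiles the pattern and reports what happens. It assumes the standard `re` module; the variable `message` is only illustrative.

```python
import re

try:
    # The lookahead (?!\s*\n) is fine; the variable-width
    # lookbehind (?<!\n\s*) is what the re module rejects.
    re.compile(r'(?<!\n\s*)\n(?!\s*\n)')
except re.error as exc:
    message = str(exc)
    print("rejected:", message)
else:
    message = None
    print("compiled")
```

The pattern is rejected at compile time, before any input text is examined.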
My next try was to do this in two passes, first removing any non-newline whitespace between newlines, but I can't find a regex that will do that perfectly. This works, sort of, but it compresses any run of more than two newlines.
# this compresses "\n\n\n" or "\n\n \n" into "\n\n"
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n\s*\n', '\n\n', input_text))
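The side effect is easy to reproduce. In this sketch the helper name `two_pass` is hypothetical and simply wraps the two substitutions shown above.

```python
import re

def two_pass(text):
    # Pass 1: normalise anything between two newlines to "\n\n".
    # Because \s also matches "\n", longer runs are swallowed too.
    normalized = re.sub(r'\n\s*\n', '\n\n', text)
    # Pass 2: turn the remaining lone newlines into spaces.
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', normalized)

print(repr(two_pass("a\nb\n \nc")))   # 'a b\n\nc'
print(repr(two_pass("a\n\n\n\nb")))   # 'a\n\nb'
```

The first call handles the odd separator correctly, while the second shows four newlines reduced to two.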
I'd like to avoid this, because extra blank lines between paragraphs may be intentional; they should be left alone.
The Unicode definition of \s isn't specific enough to allow me to construct a character set of "all whitespace except newlines", so I can't do something like this:
# this only works for ASCII
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n[ \t\r\f\v]*\n', '\n\n', input_text))
To do that I need a way to express "\s except \n" for Unicode, and I don't think that exists. I tried [\s!\n] on a lark and, bizarrely, it seems to do the right thing in 3.6.5 and 3.8.0. This is despite the fact that ! has no documented effect inside a character set in either version, and that the documentation for re.escape() explicitly states that, as of 3.7, ! is no longer escaped by the method because it's not a special character.
# this appears to work, but the docs say it shouldn't
re.sub(r'(?<!\n)\n(?!\n)', ' ', re.sub(r'\n[\s!\n]\n', '\n\n', input_text))
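One reading that agrees with the documentation is that `!` is an ordinary character inside a set, so `[\s!\n]` means "any whitespace, or `!`, or a newline" rather than "whitespace except newline". The sketch below probes that reading; the names `char_class` and `collapse` are hypothetical.

```python
import re

char_class = re.compile(r'[\s!\n]')

# The set accepts whitespace (newline included) and a literal "!".
for ch in (' ', '\n', '!'):
    print(repr(ch), bool(char_class.fullmatch(ch)))

def collapse(text):
    # Exactly one character from the set is allowed between the newlines.
    return re.sub(r'\n[\s!\n]\n', '\n\n', text)

print(repr(collapse("a\n \nb")))    # 'a\n\nb'   the case that looked right
print(repr(collapse("a\n!\nb")))    # 'a\n\nb'   a line holding "!" is lost
print(repr(collapse("a\n\n\nb")))   # 'a\n\nb'   three newlines still shrink
print(repr(collapse("a\n  \nb")))   # 'a\n  \nb' two spaces are not handled
```

On this reading the pattern only appears to work because the sample separator contains exactly one space.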
Even though it seems to work, I don't want to rely on the behaviour, for obvious reasons. I should probably report it as a bug in either the code or the documentation.
Assuming that last one is not supposed to be supported, what other approach am I missing?
You can capture runs of two or more newlines so that they are kept when matched, and simply match every other newline:
import re
text = "This is some sample text beginning with a short paragraph.\n\nThis second paragraph is long enough to be split across lines, so it contains\na single newline in the middle.\n \nThis third paragraph has an unusual separator before it; a newline followed by\na space followed by another newline. It's a special case that needs to be\nhandled."
print( re.sub(r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*', lambda x: x.group(1) or ' ', text) )
See the Python demo.
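The building block of this pattern is `[^\S\n]`, a double negative: "not a non-whitespace character and not a newline", which leaves whitespace other than `\n`. Since `\s` and `\S` are Unicode-aware for `str` patterns, this should also cover non-ASCII spaces. A small sketch, with the name `non_newline_ws` and the sample characters chosen only for illustration:

```python
import re

non_newline_ws = re.compile(r'[^\S\n]')

# Space, tab, carriage return, no-break space, em space, newline, letter.
for ch in (' ', '\t', '\r', '\u00a0', '\u2003', '\n', 'x'):
    print(repr(ch), bool(non_newline_ws.fullmatch(ch)))
```

Only the newline and the letter are rejected; every other character in the list matches.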
Details
- ([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*) - Group 1: zero or more whitespace characters other than a newline, a newline, then one or more occurrences of zero or more non-newline whitespace characters followed by a newline (so at least two newlines are matched), and then again zero or more non-newline whitespace characters
- | - or
- [^\S\n]*\n[^\S\n]* - zero or more whitespace characters other than a newline, a newline, and again zero or more non-newline whitespace characters
The replacement is lambda x: x.group(1) or ' ': if Group 1 matched, the match is put back unchanged; otherwise it is replaced with a space.
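To tie this back to the textwrap goal in the question, here is a minimal sketch under some assumptions: the helper names `unwrap_paragraphs` and `rewrap`, the `width=40` default, and the `sample` string are all hypothetical, and the line-by-line re-wrapping step is one possible way to use the result, not part of the answer above.

```python
import re
import textwrap

# Pattern from the answer: group 1 captures paragraph separators (two or
# more newlines with optional non-newline whitespace around them); the
# second alternative matches a lone newline and its surrounding blanks.
_NEWLINES = re.compile(
    r'([^\S\n]*\n(?:[^\S\n]*\n)+[^\S\n]*)|[^\S\n]*\n[^\S\n]*'
)

def unwrap_paragraphs(text):
    """Join the lines of each paragraph; leave separators untouched."""
    return _NEWLINES.sub(lambda m: m.group(1) or ' ', text)

def rewrap(text, width=40):
    """Unwrap, then re-wrap every non-blank line (one paragraph each)."""
    lines = []
    for line in unwrap_paragraphs(text).split('\n'):
        # Blank or whitespace-only lines belong to a separator: keep them.
        lines.append(textwrap.fill(line, width=width) if line.strip() else line)
    return '\n'.join(lines)

sample = "First paragraph.\n\nSecond paragraph, split\nacross two lines.\n \n\nThird."
print(rewrap(sample))
```

After unwrapping, every non-blank line is a whole paragraph, so each can be handed to `textwrap.fill` on its own while the blank lines between paragraphs keep their original count and content.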