What's the most pythonic way of normalizing line endings in a string?
Given a text string of unknown source, how does one best rewrite it to have a known line-ending convention?
I usually do:
lines = text.splitlines()
text = '\n'.join(lines)
... but this doesn't handle "mixed" text files of utterly confused conventions (yes, they still exist!).
The one-liner of what I'm doing is of course:
'\n'.join(text.splitlines())
... that's not what I'm asking about.
The total number of lines should be the same afterwards, so no stripping of empty lines.
Splitting
'a\nb\n\nc\nd'
'a\r\nb\r\n\r\nc\r\nd'
'a\rb\r\rc\rd'
'a\rb\n\rc\rd'
'a\rb\r\nc\nd'
'a\nb\r\nc\rd'
... should all result in 5 lines. In a mixed context, splitlines() assumes that '\r\n' is a single logical newline, leading to 4 lines for the last two test cases.
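To make those counts concrete, here is a small sketch that runs str.splitlines() over the six test strings above. The names `samples` and `counts` are only illustrative.

```python
# The six test strings from the question.
samples = [
    'a\nb\n\nc\nd',
    'a\r\nb\r\n\r\nc\r\nd',
    'a\rb\r\rc\rd',
    'a\rb\n\rc\rd',
    'a\rb\r\nc\nd',
    'a\nb\r\nc\rd',
]

# str.splitlines() treats '\r\n' as one line break, so the last two
# strings split into 4 lines rather than 5.
counts = [len(s.splitlines()) for s in samples]
print(counts)  # [5, 5, 5, 5, 4, 4]
```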
Hm, a mixed context that contains '\r\n' can be detected by comparing the result of splitlines() and split('\n'), and/or split('\r')...
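One possible way to detect a mixed context is sketched below. Note that it counts the three conventions directly rather than comparing split results, which is a swapped-in technique, and that `newline_styles` and `is_mixed` are hypothetical helper names, not anything from the standard library.

```python
def newline_styles(text):
    """Return the set of line-ending conventions that occur in text."""
    crlf = text.count('\r\n')
    lone_cr = text.count('\r') - crlf  # CRs that are not part of a CRLF
    lone_lf = text.count('\n') - crlf  # LFs that are not part of a CRLF
    found = set()
    if crlf:
        found.add('\r\n')
    if lone_cr:
        found.add('\r')
    if lone_lf:
        found.add('\n')
    return found


def is_mixed(text):
    """True if more than one line-ending convention is present."""
    return len(newline_styles(text)) > 1
```

With this sketch, is_mixed('a\rb\r\nc\nd') is true, while a string that uses only '\r\n' is reported as not mixed.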
mixed.replace('\r\n', '\n').replace('\r', '\n')
should handle all possible variants.
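Wrapped up as a function, the replace chain might look like the sketch below. The name `normalize_newlines` and its `newline` parameter are assumptions added for illustration.

```python
def normalize_newlines(text, newline='\n'):
    """Rewrite every line ending in text to the given convention."""
    # Collapse CRLF first so that its CR is not converted a second time.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    if newline != '\n':
        text = text.replace('\n', newline)
    return text


print(repr(normalize_newlines('a\rb\r\nc\nd')))  # 'a\nb\nc\nd'
```

Like splitlines(), this treats '\r\n' as a single line break, so the last two test strings from the question still come out as 4 lines. Unlike '\n'.join(text.splitlines()), it keeps a trailing newline if the input had one.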
... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!)
Actually it should work fine:
>>> s = 'hello world\nline 1\r\nline 2'
>>> s.splitlines()
['hello world', 'line 1', 'line 2']
>>> '\n'.join(s.splitlines())
'hello world\nline 1\nline 2'
What version of Python are you using?
EDIT: I still don't see how splitlines() is not working for you:
>>> s = '''\
... First line, with LF\n\
... Second line, with CR\r\
... Third line, with CRLF\r\n\
... Two blank lines with LFs\n\
... \n\
... \n\
... Two blank lines with CRs\r\
... \r\
... \r\
... Two blank lines with CRLFs\r\n\
... \r\n\
... \r\n\
... Three blank lines with a jumble of things:\r\n\
... \r\
... \r\n\
... \n\
... End without a newline.'''
>>> s.splitlines()
['First line, with LF', 'Second line, with CR', 'Third line, with CRLF', 'Two blank lines with LFs', '', '', 'Two blank lines with CRs', '', '', 'Two blank lines with CRLFs', '', '', 'Three blank lines with a jumble of things:', '', '', '', 'End without a newline.']
>>> print('\n'.join(s.splitlines()))
First line, with LF
Second line, with CR
Third line, with CRLF
Two blank lines with LFs


Two blank lines with CRs


Two blank lines with CRLFs


Three blank lines with a jumble of things:



End without a newline.
As far as I know, splitlines() doesn't split the list twice or anything.
Can you paste a sample of the kind of input that's giving you trouble?
Are there even more conventions than \r\n and \n? Simply replacing \r\n is enough if you don't want lines.
only_newlines = mixed.replace('\r\n','\n')
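A quick check of that single replace on one of the question's test strings, as a sketch (the `has_leftover_cr` name is only illustrative): it covers '\r\n' and '\n', but a lone '\r' passes through unchanged.

```python
mixed = 'a\rb\r\nc\nd'

# Only CRLF pairs are rewritten; a bare CR is left in place.
only_newlines = mixed.replace('\r\n', '\n')
print(repr(only_newlines))  # 'a\rb\nc\nd'

has_leftover_cr = '\r' in only_newlines
print(has_leftover_cr)  # True
```

So this is sufficient only when the input is known to contain no bare CR line endings.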