What's the most pythonic way of normalizing lineends in a string?

Given a text string of unknown source, how does one best rewrite it to use a known line-ending convention?

I usually do:

lines = text.splitlines()
text = '\n'.join(lines)

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!).

Edit

The oneliner of what I'm doing is of course:

'\n'.join(text.splitlines())

... that's not what I'm asking about.

The total number of lines should be the same afterwards, so no stripping of empty lines.

Testcases

Splitting

'a\nb\n\nc\nd'
'a\r\nb\r\n\r\nc\r\nd'
'a\rb\r\rc\rd'
'a\rb\n\rc\rd'
'a\rb\r\nc\nd'
'a\nb\r\nc\rd'

... should all result in 5 lines. In a mixed context, splitlines() assumes that '\r\n' is a single logical newline, leading to 4 lines for the last two testcases.
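To make those counts concrete, here is a small check (a sketch for Python 3; the names `testcases` and `counts` are only for illustration) that reports how many lines splitlines() yields for each of the six strings:

```python
# The six test cases from the question.
testcases = [
    'a\nb\n\nc\nd',
    'a\r\nb\r\n\r\nc\r\nd',
    'a\rb\r\rc\rd',
    'a\rb\n\rc\rd',
    'a\rb\r\nc\nd',
    'a\nb\r\nc\rd',
]

# Number of lines str.splitlines() reports for each case.
counts = [len(t.splitlines()) for t in testcases]

for t, n in zip(testcases, counts):
    print(repr(t), '->', n)
```

The first four cases come out as 5 lines and the last two as 4, because each '\r\n' pair is consumed as one line break.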

Hm, a mixed context that contains '\r\n' can be detected by comparing the result of splitlines() and split('\n'), and/or split('\r')...
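One possible way to make that detection idea concrete is sketched below. Note that it counts terminators directly instead of comparing split results, which is a substitution on my part, and the helper names `lineend_counts` and `is_mixed` are hypothetical:

```python
def lineend_counts(text):
    """Count CRLF pairs, lone CRs, and lone LFs in text."""
    crlf = text.count('\r\n')
    cr = text.count('\r') - crlf   # CRs that are not part of a CRLF pair
    lf = text.count('\n') - crlf   # LFs that are not part of a CRLF pair
    return {'\r\n': crlf, '\r': cr, '\n': lf}


def is_mixed(text):
    """True if more than one line-ending convention occurs in text."""
    return sum(1 for n in lineend_counts(text).values() if n) > 1


print(lineend_counts('a\nb\r\nc\rd'))
print(is_mixed('a\nb\r\nc\rd'))
print(is_mixed('a\r\nb\r\n\r\nc\r\nd'))
```

This treats '\r\n' as one terminator, the same way splitlines() does, so a '\n\r' sequence counts as one LF followed by one CR.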

mixed.replace('\r\n', '\n').replace('\r', '\n')

This should handle all possible variants.
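Wrapped up as a function, the replace chain could look like the sketch below. The name `normalize_lineends` and the optional `newline` parameter are my own additions for illustration, not part of the answer:

```python
def normalize_lineends(text, newline='\n'):
    """Rewrite every CRLF, lone CR, and LF in text as `newline`."""
    # Collapse CRLF first so its CR is not converted a second time,
    # then convert whatever lone CRs remain.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    if newline != '\n':
        text = text.replace('\n', newline)
    return text


print(repr(normalize_lineends('a\rb\r\nc\nd')))
print(repr(normalize_lineends('a\rb\r\nc\nd', newline='\r\n')))
```

Unlike '\n'.join(text.splitlines()), this keeps a trailing line ending if the input has one. It still treats '\r\n' as a single line break, so the last two test cases in the question come out as 4 lines, not 5.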

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!)

Actually it should work fine:

>>> s = 'hello world\nline 1\r\nline 2'

>>> s.splitlines()
['hello world', 'line 1', 'line 2']

>>> '\n'.join(s.splitlines())
'hello world\nline 1\nline 2'

What version of Python are you using?

EDIT: I still don't see how splitlines() is not working for you:

>>> s = '''\
... First line, with LF\n\
... Second line, with CR\r\
... Third line, with CRLF\r\n\
... Two blank lines with LFs\n\
... \n\
... \n\
... Two blank lines with CRs\r\
... \r\
... \r\
... Two blank lines with CRLFs\r\n\
... \r\n\
... \r\n\
... Three blank lines with a jumble of things:\r\n\
... \r\
... \r\n\
... \n\
... End without a newline.'''

>>> s.splitlines()
['First line, with LF', 'Second line, with CR', 'Third line, with CRLF', 'Two blank lines with LFs', '', '', 'Two blank lines with CRs', '', '', 'Two blank lines with CRLFs', '', '', 'Three blank lines with a jumble of things:', '', '', '', 'End without a newline.']

>>> print('\n'.join(s.splitlines()))
First line, with LF
Second line, with CR
Third line, with CRLF
Two blank lines with LFs


Two blank lines with CRs


Two blank lines with CRLFs


Three blank lines with a jumble of things:



End without a newline.

As far as I know splitlines() doesn't split the list twice or anything.
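To see exactly where splitlines() cuts, you can pass a true value for its keepends argument so each piece keeps its terminator. A small sketch with a made-up sample string:

```python
s = 'LF\nCR\rCRLF\r\nend'

# With keepends set to True, each piece retains its original terminator.
pieces = s.splitlines(True)

print(pieces)
```

Each '\r\n' stays attached to a single piece, which shows that it is treated as one line break and not two.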

Can you paste a sample of the kind of input that's giving you trouble?

Are there even more conventions than \r\n and \n? Simply replacing \r\n is enough if you don't need the text split into lines.

only_newlines = mixed.replace('\r\n','\n')
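For what it's worth, the question's own test cases include a third convention, a lone \r. A quick sketch (the variable names are only for illustration) of what replacing just \r\n leaves behind on one of those strings:

```python
mixed = 'a\rb\r\nc\nd'

# Replacing only CRLF leaves the lone CR after 'a' untouched.
only_crlf_replaced = mixed.replace('\r\n', '\n')

# A second replace is needed to catch the remaining lone CRs.
fully_normalized = only_crlf_replaced.replace('\r', '\n')

print(repr(only_crlf_replaced))
print(repr(fully_normalized))
```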
