Removing odd \n, \t, \r and space combinations from a given string in Python

I have a long string which contains various combinations of \n, \r, \t and spaces in between words and other characters.

  • I'd like to reduce all multiple spaces to a single space.
  • I want to reduce all \n, \r, \t combos to a single new-line character.
  • I want to reduce all \n, \r, \t and space combinations to a single new-line character as well.

I've tried ''.join(str.split()) in various ways, without success.

  • What is the correct Pythonic way here?

  • Would the solution be different for Python 3.x?

Ex. string:

ex_str = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\n    word5'

Desired output (new-line = \n):

new_str = u'Word\nword2 word3\nword4\nword5'

Use a combination of str.splitlines() and splitting on all whitespace with str.split():

'\n'.join([' '.join(line.split()) for line in ex_str.splitlines() if line.strip()])

This treats each line separately, removes empty lines, and then collapses all whitespace per line into single spaces.
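
To see what each step contributes, the one-liner can be unpacked into a small function. This is only an illustrative sketch; the helper name `normalize_whitespace` is made up here and is not part of the answer.

```python
def normalize_whitespace(text):
    """Collapse whitespace inside each line and drop blank lines."""
    cleaned = []
    for line in text.splitlines():   # splits on \n, \r, \r\n (and a few other line boundaries)
        words = line.split()         # no argument: any run of whitespace is one separator
        if words:                    # skip lines that held nothing but whitespace
            cleaned.append(' '.join(words))
    return '\n'.join(cleaned)


ex_str = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\n    word5'
print(repr(normalize_whitespace(ex_str)))
```

Testing `words` for emptiness plays the same role as the `if line.strip()` filter in the one-liner.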

Provided the input is a Unicode string (u'...' in Python 2, a plain str in Python 3), the same solution works in both Python 2 and Python 3.

Demo:

>>> ex_str = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\n    word5'
>>> '\n'.join([' '.join(line.split()) for line in ex_str.splitlines() if line.strip()])
u'Word\nword2 word3\nword4\nword5'

To preserve tabs, you'd need to strip and split on just spaces and filter out empty strings:

'\n'.join([' '.join([s for s in line.split(' ') if s]) for line in ex_str.splitlines() if line.strip(' ')])

Demo:

>>> '\n'.join([' '.join([s for s in line.split(' ') if s]) for line in ex_str.splitlines() if line.strip(' ')])
u'Word\n\t\nword2 word3\nword4\nword5'
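
The difference between the two variants comes down to how `split()` and `strip()` behave with no argument versus an explicit `' '`. A quick sketch on a made-up sample line (the variable `line` is just for illustration):

```python
line = ' \t  a   b '

# No argument: any run of whitespace (spaces, tabs, ...) is one separator,
# and leading/trailing whitespace produces no empty strings.
print(line.split())             # ['a', 'b']

# Explicit ' ': every single space is a separator, so consecutive spaces
# yield empty strings and the tab survives as its own item.
print(line.split(' '))          # ['', '\t', '', 'a', '', '', 'b', '']

# strip() removes all whitespace; strip(' ') removes only spaces,
# so a line holding just a tab stays truthy and is kept by the filter.
print(repr(' \t '.strip()))     # ''
print(repr(' \t '.strip(' ')))  # '\t'
```

That is why the tab-preserving version has to filter out the empty strings itself and must test lines with `strip(' ')` rather than `strip()`.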

Use simple regexps:

import re
new_str = re.sub(r'[^\S\n]+', ' ', re.sub(r'\s*[\n\t\r]\s*', '\n', ex_str))
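
The same two passes can be written out with comments for readability. This is a sketch only; the names `BREAKS`, `BLANKS` and `collapse` are illustrative and do not come from the answer.

```python
import re

# Pass 1: any whitespace run containing at least one \n, \t or \r
# becomes a single newline.
BREAKS = re.compile(r'\s*[\n\t\r]\s*')
# Pass 2: remaining runs of whitespace other than \n become a single space.
# [^\S\n] reads as "not non-whitespace and not \n".
BLANKS = re.compile(r'[^\S\n]+')


def collapse(text):
    return BLANKS.sub(' ', BREAKS.sub('\n', text))


ex_str = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\n    word5'
print(repr(collapse(ex_str)))
```

Note that this version does not trim the ends of the string: `collapse('  a \n')` gives `' a\n'`, so add a `.strip()` if leading or trailing whitespace matters.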

Use a regex:

>>> import re
>>> s = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\t    word5'
>>> re.sub(r'[\n\r\t ]{2,}| {2,}', lambda x: '\n' if x.group().strip(' ') else ' ', s)
u'Word\nword2 word3\nword4\nword5'
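
The callback decides per match: a run made only of spaces becomes one space, anything else becomes one newline. Below is a sketch with the lambda spelled out as a named function (`replace_run` is a hypothetical name). The ` {2,}` alternative is left out here, on the observation that the first alternative already matches any run of two or more spaces.

```python
import re


def replace_run(match):
    # Only spaces in the run: collapse to one space.
    # Any \n, \r or \t in the run: collapse to one newline.
    return '\n' if match.group().strip(' ') else ' '


s = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\t    word5'
print(repr(re.sub(r'[\n\r\t ]{2,}', replace_run, s)))

# Because of {2,}, a single tab or carriage return is not matched at all:
print(repr(re.sub(r'[\n\r\t ]{2,}', replace_run, u'a\tb')))   # 'a\tb'
# With + instead, lone \t and \r are converted too (a lone space maps to itself):
print(repr(re.sub(r'[\n\r\t ]+', replace_run, u'a\tb')))      # 'a\nb'
```

Whether `{2,}` or `+` is the better quantifier depends on whether an isolated tab between two words should count as a line break.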

Another regex solution, which replaces tabs with a space (as in u'word1\t\tword2'). Or do you really want a line break there too?

import re
new_str = re.sub(r"[\n\ ]{2,}", "\n", re.sub(r"[\t\r\ ]+", " ", ex_str))
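
The difference only shows up when tabs sit between words on the same line. Here is a small check of the `u'word1\t\tword2'` case, with the two substitutions split into steps; the helper name `tabs_to_spaces` is purely illustrative.

```python
import re


def tabs_to_spaces(text):
    # Step 1: runs of tab, carriage return and space become one space.
    flat = re.sub(r"[\t\r ]+", " ", text)
    # Step 2: any run of two or more newlines/spaces becomes one newline.
    return re.sub(r"[\n ]{2,}", "\n", flat)


print(repr(tabs_to_spaces(u'word1\t\tword2')))     # 'word1 word2'
print(repr(tabs_to_spaces(u'word1 \r\n word2')))   # 'word1\nword2'
```

Step 1 leaves single spaces around each newline, which is why step 2 has to include the space in its character class.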
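
Alternatively, split on all whitespace and join with a new-line. Note that this also puts word2 and word3 on separate lines, which differs from the desired output:

'\n'.join(ex_str.split())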

Output:

u'Word\nword2\nword3\nword4\nword5'
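
As a sanity check, the approaches in this thread can be compared against the desired output in one go. The dictionary labels are just descriptive names chosen here, not anything from the answers.

```python
import re

ex_str = u'Word   \n \t \r   \n\n\n word2    word3   \r\r\r\r\nword4\n    word5'
expected = u'Word\nword2 word3\nword4\nword5'

candidates = {
    'splitlines + split': '\n'.join(
        ' '.join(line.split()) for line in ex_str.splitlines() if line.strip()),
    'two-pass regex': re.sub(
        r'[^\S\n]+', ' ', re.sub(r'\s*[\n\t\r]\s*', '\n', ex_str)),
    'tabs to spaces first': re.sub(
        r"[\n ]{2,}", "\n", re.sub(r"[\t\r ]+", " ", ex_str)),
    'split + join': '\n'.join(ex_str.split()),
}

# True means the approach reproduces the desired output exactly.
results = {name: out == expected for name, out in candidates.items()}
for name, ok in results.items():
    print(name, ok)
```

On this input the first three match the desired output; the plain split-and-join does not, because it also breaks the line between word2 and word3.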
