
Remove leading/ending and internal multiple spaces but NOT tabs, newlines, or return characters, in Python

The answer to the question at Python remove all whitespace in a string shows separate ways to remove leading/ending, duplicated, and all spaces, respectively, from a string in Python. But strip() removes tabs and newlines, and lstrip() only affects leading spaces. The solution using .join(sentence.split()) also appears to remove Unicode whitespace characters.

Suppose I have a list of strings, in this case scraped from a website using Scrapy, like this:

['\n                        \n                    ',
         '\n                        ',
         'Some text',
         ' and some more text\n',
         ' and on another a line some more text',
         '                ']

The newlines preserve the formatting of the text when I use it in other contexts, but all the extra spaces are a nuisance. How do I remove all the leading, trailing, and duplicated internal spaces while preserving the newline characters (and any \r or \t characters, if there are any)?

The result I want (after I join the individual strings) would then be:

['\n\n\nSome text and some more text\nand on another line some more text']

No sample code is provided because all I've tried so far are the suggestions on the page referenced above, which give the results I'm trying to avoid.

In that case str.strip() won't help you, even if you pass " " as an argument: it only removes spaces at the start and end of each string, not the runs inside it, and it would also remove the single space before "and".
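As a quick illustration of that behavior, here is a minimal sketch using one fragment from the question. The variable names are made up for the example; only standard str methods are used.

```python
# The scraped fragment from the question that starts with one meaningful space.
piece = ' and some more text\n'

# strip(' ') touches only the ends: the leading space is lost, the "\n" survives.
stripped_spaces = piece.strip(' ')

# strip() with no argument also eats the trailing newline.
stripped_all = piece.strip()

# Neither variant touches runs of spaces inside a string.
inner = 'a      line'.strip(' ')

# split() with no argument splits on any whitespace, newlines included.
collapsed = ' '.join('one  two\nthree'.split())

print(repr(stripped_spaces))  # 'and some more text\n'
print(repr(stripped_all))     # 'and some more text'
print(repr(inner))            # 'a      line'
print(repr(collapsed))        # 'one two three'
```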

Instead, use regex to remove 2 or more spaces from your strings:

import re

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text\n',
     ' and on another a line some more text']

# Delete every run of two or more spaces inside each string, then join.
result = "".join([re.sub("  +", "", x) for x in l])

print(repr(result))

prints:

'\n\n\nSome text and some more text\n and on another a line some more text'

EDIT: if we apply the regex to each string separately, we cannot see a \n that sits in a neighboring string, as you noted, so a space next to a string boundary is missed in some cases. The alternative, more complex solution is to join the strings before applying the regex, and to use a more elaborate pattern (note that I changed the test list of strings to add more corner cases):

import re

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text \n',
     '\n and on another a line some more text ']

# Join first, so the pattern can see newlines across string boundaries.
result = re.sub("(^ |(?<=\n) |  +| (?=\n)| $)", "", "".join(l))

print(repr(result))

prints:

'\n\n\nSome text and some more text\n\nand on another a line some more text'

The regex now removes 5 kinds of match:

  • a single space at the start of the string
  • a single space right after a newline
  • a run of 2 or more spaces
  • a single space right before a newline
  • a single space at the end of the string
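Two corner cases slip past that pattern: a line starting with exactly two spaces keeps one of them, and deleting "  +" outright glues together words separated by several spaces. One possible way around both, sketched here under the assumption that only "\n" separates lines, is to let re.MULTILINE handle the line edges and collapse interior runs in a second pass. The helper name tidy_spaces and the shortened test list are made up for illustration.

```python
import re

def tidy_spaces(text):
    # Pass 1: delete runs of spaces at the start or end of every line.
    # With re.MULTILINE, ^ matches right after each "\n" and $ right before one.
    text = re.sub(r"^ +| +$", "", text, flags=re.MULTILINE)
    # Pass 2: squeeze any remaining run of two or more spaces down to one,
    # so words separated by several spaces stay separated.
    return re.sub(r" {2,}", " ", text)

parts = ['\n      \n    ', '\n   ', 'Some text', ' and some more text \n',
         '\n  and on another a      line some more text ']
print(repr(tidy_spaces("".join(parts))))
# '\n\n\nSome text and some more text\n\nand on another a line some more text'
```

Tabs and carriage returns are never matched, because both patterns contain only the literal space character.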

Afterthought: this looks (and is) complicated. There is a non-regex solution after all, which gives exactly the same result (provided there aren't multiple spaces between words):

result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))

Just join the strings, split on newlines, apply strip with " " as the argument so that tabs are preserved, and join again with newlines.

Chain with re.sub("  +", " ", x.strip(" ")) to take care of possible repeated spaces between words:

result = "\n".join([re.sub("  +"," ",x.strip(" ")) for x in "".join(l).split("\n")])

You can also do the whole thing with built-in string operations if you like.

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text\n',
     ' and on another a      line some more text',
     '              ']


def remove_duplicate_spaces(line):
    # split(' ') yields '' for every extra space; dropping those collapses the runs
    words = [w for w in line.split(' ') if w != '']
    return ' '.join(words)

lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)

print(repr(u))

gives

'\n\n\nSome text and some more text\nand on another a line some more text'

You can also collapse the whole thing into a one-liner:

s = '\n'.join([' '.join([s for s in x.strip(' ').split(' ') if s!='']) for x in ''.join(l).split('\n')])

# OR

t = '\n'.join(map(lambda x: ' '.join(filter(lambda s: s!='', x.strip(' ').split(' '))), ''.join(l).split('\n')))
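Since the question also asks that tabs survive, here is a small check of the same split-and-filter idea on input containing a tab. This is only a sketch: the function name squeeze_spaces and the sample string are hypothetical, not taken from the answers above.

```python
def squeeze_spaces(text):
    """Trim and collapse plain spaces on each line; leave tabs and newlines alone."""
    cleaned = []
    for line in text.split('\n'):
        # split(' ') yields '' for every extra space, so filtering out the
        # empty pieces both trims the ends and collapses interior runs.
        words = [w for w in line.split(' ') if w]
        cleaned.append(' '.join(words))
    return '\n'.join(cleaned)

sample = '  \tSome   text \n  next\tline  '
print(repr(squeeze_spaces(sample)))  # '\tSome text\nnext\tline'
```

Because the split is on the literal ' ' rather than on all whitespace, a tab is treated like any other character inside a word.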
