Remove leading/ending and internal multiple spaces but NOT tabs, newlines, or return characters, in Python

Question

The answers to the question "Python remove all whitespace in a string" show separate ways to remove leading/trailing, duplicated, and all spaces, respectively, from a string in Python. But strip() also removes tabs and newlines, and lstrip() only affects the leading end of the string. The solution using " ".join(sentence.split()) also appears to remove Unicode whitespace characters.
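
A minimal sketch of the behaviour described above, using a made-up sample string (the name sample is only for illustration):

```python
sample = "\t  Some   text \n"

# Default strip() removes every kind of leading/trailing whitespace,
# including the tab and the newline.
stripped = sample.strip()

# split() with no argument splits on any run of whitespace, so joining
# the pieces also discards the tab and the newline.
collapsed = " ".join(sample.split())

print(repr(stripped))
print(repr(collapsed))
```

Both results lose the tab and the newline, which is exactly what the question wants to avoid.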

Suppose I have a string, in this case scraped from a website using Scrapy, like this:

['\n                        \n                    ',
 '\n                        ',
 'Some text',
 ' and some more text\n',
 ' and on another a line some more text',
 '                 ']

The newlines preserve the formatting of the text when I use it in other contexts, but all the extra space is a nuisance. How do I remove all the leading, trailing, and duplicated internal spaces while preserving the newline characters (in addition to any \r or \t characters, if there are any)?

The result I want (after I join the individual strings) would then be:

['\n\n\nSome text and some more text\nand on another line some more text']

No sample code is provided because what I've tried so far is just the suggestions on the page referenced above, which give the results I'm trying to avoid.

Answer 1

In that case str.strip() won't help you, even if you pass " " as the argument: it won't remove the spaces inside the string, only those at the start and end, and it would also remove the single space before "and".
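
A small sketch of both problems, using two fragments from the question (the variable names are only for illustration):

```python
fragments = ["Some text", " and some more text\n"]

# strip(" ") removes only space characters, and only at the two ends.
# Applied per fragment, it also eats the single space that separates
# "text" from "and" once the fragments are joined.
per_fragment = "".join(x.strip(" ") for x in fragments)

# Spaces inside a string are never touched by strip().
inner = "a    b".strip(" ")

print(repr(per_fragment))
print(repr(inner))
```

The first result runs "text" and "and" together, and the second still contains all four internal spaces.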

Instead, use regex to remove 2 or more spaces from your strings:

import re

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text\n',
     ' and on another a line some more text']

# Delete every run of two or more spaces in each string, then join.
result = "".join([re.sub("  +", "", x) for x in l])

print(repr(result))

prints:

'\n\n\nSome text and some more text\n and on another a line some more text'

EDIT: if we apply the regex to each string separately, we cannot detect a \n in a neighbouring string in some cases, as you noted. So the alternative, more complex solution is to join the strings before applying the regex, and to use a more elaborate pattern (note that I changed the test list of strings to add more corner cases):

import re

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text \n',
     '\n and on another a line some more text ']

# Join first, so that spaces next to a newline in a neighbouring string are seen.
result = re.sub("(^ |(?<=\n) |  +| (?=\n)| $)", "", "".join(l))

print(repr(result))

prints:

'\n\n\nSome text and some more text\n\nand on another a line some more text'

The regex now removes five cases:

a single space at the start of the string
a single space immediately after a newline
a run of 2 or more spaces
a single space immediately before a newline
a single space at the end of the string
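
To make the five alternatives easier to read, the same pattern can be written with re.VERBOSE. This is only a sketch: the name SPACE_RULES and the sample strings are made up for illustration, and [ ] is used for a literal space because verbose mode ignores bare spaces in the pattern.

```python
import re

SPACE_RULES = re.compile(
    r"""
      ^[ ]          # 1. one space at the very start of the string
    | (?<=\n)[ ]    # 2. one space right after a newline
    | [ ]{2,}       # 3. a run of two or more spaces
    | [ ](?=\n)     # 4. one space right before a newline
    | [ ]$          # 5. one space at the very end of the string
    """,
    re.VERBOSE,
)

# Each of the five cases occurs once here; the tab is left alone.
all_cases = SPACE_RULES.sub("", " a  b \n c\td ")

# Alternatives are tried left to right, so with exactly two spaces after
# a newline, rule 2 consumes one and the remaining single space survives.
two_after_newline = SPACE_RULES.sub("", "\n  x")

print(repr(all_cases))
print(repr(two_after_newline))
```

Two things are worth noticing in the output: rule 3 deletes the run outright, so "a  b" becomes "ab", and the two-spaces-after-newline case still leaves one space behind.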

Afterthought: this looks (and is) complicated. There is a non-regex solution after all, which gives exactly the same result (if there aren't multiple spaces between words):

result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))

Just join the strings, split on newlines, apply strip with " " as the argument so that tabs are preserved, and join again with newlines.

Chain with re.sub("  +"," ",x.strip(" ")) to take care of possible multiple spaces between words:

result = "\n".join([re.sub("  +"," ",x.strip(" ")) for x in "".join(l).split("\n")])
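
The last variant can be wrapped in a small helper. This is a sketch only: the function name normalize_spaces and the sample list are made up for illustration, and " {2,}" is just another way of writing "two or more spaces".

```python
import re

def normalize_spaces(parts):
    """Join parts, then trim and collapse spaces on every line.

    Only the space character is affected; tabs and carriage returns
    are left in place, and newlines are restored by the final join.
    """
    lines = "".join(parts).split("\n")
    return "\n".join(re.sub(" {2,}", " ", line.strip(" ")) for line in lines)

parts = ["\n      \n   ", "Some  text", " and\tmore \n", " last   line "]
print(repr(normalize_spaces(parts)))
```

Here the tab between "and" and "more" survives, while the double and triple spaces between words become single spaces.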

Answer 2

You can also do the whole thing in terms of built-in string operations if you like.

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text\n',
     ' and on another a      line some more text',
     '              ']


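def remove_duplicate_spaces(line):
    # Splitting on ' ' yields an empty string for every extra space; drop them.
    words = [w for w in line.split(' ') if w != '']
    return ' '.join(words)

# Join the fragments, then process the text one line at a time.
lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)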

print(repr(u))

gives

'\n\n\nSome text and some more text\nand on another a line some more text'

You can also collapse the whole thing into a one-liner:

s = '\n'.join([' '.join([w for w in x.strip(' ').split(' ') if w != '']) for x in ''.join(l).split('\n')])

# OR

t = '\n'.join(map(lambda x: ' '.join(filter(lambda s: s!='', x.strip(' ').split(' '))), ''.join(l).split('\n')))
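
One edge case these snippets do not cover: if the text uses \r\n line endings, splitting on '\n' leaves the \r at the end of each line, so a space just before it is no longer trailing and survives (for example, "a  b \r\n c" comes out as "a b \r\nc"). A possible workaround, offered as a sketch rather than as part of the answer above, is to split on the line breaks while keeping them. The name squeeze_spaces is made up for illustration.

```python
import re

def squeeze_spaces(text):
    # The capturing group makes re.split keep the separators: line content
    # ends up at the even indices, line breaks at the odd ones.
    pieces = re.split(r"(\r\n|\r|\n)", text)
    for i in range(0, len(pieces), 2):
        # Drop the empty strings produced by extra spaces, then rejoin.
        pieces[i] = " ".join(w for w in pieces[i].split(" ") if w)
    return "".join(pieces)

print(repr(squeeze_spaces("a  b \r\n c")))
print(repr(squeeze_spaces("\n   \n x\ty ")))
```

The original line breaks are written back unchanged, and tabs inside a line are untouched because only the space character is used for splitting.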

2 answers

solution1
4 ACCEPTED 2017-06-28 19:44:25

solution2
2 2017-06-28 20:37:32