
Replace multiple spaces by a single space if they don't appear between quotes?

I have a use case where I want to replace multiple spaces with a single space unless they appear within quotes. For example

Original

this is the first    a   b   c
this is the second    "a      b      c"

After

this is the first a b c
this is the second "a      b      c"

I believe a regular expression should be able to do the trick but I don't have much experience with them. Here's some of the code I already have

import re

str = 'this is the second    "a      b      c"'
# Replace all multiple spaces with single space
print(re.sub(r'\s\s+', ' ', str))

# Doesn't work, but something like this
# print(re.sub('[\"]^.*\s\s+.*[\"]^', '\s', str))

I understand why my second one above doesn't work, so would just like some alternative approaches. If possible, could you explain the parts of your regex solution. Thanks

Assuming there are no " characters inside the quoted substring:

import re
str = 'a    b    c  "d   e   f"'  
str = re.sub(r'("[^"]*")|[ \t]+', lambda m: m.group(1) if m.group(1) else ' ', str)

print(str)
#'a b c "d   e   f"'

The regex ("[^"]*")|[ \t]+ matches either a quoted substring or a run of one or more spaces or tabs. Alternatives are tried left to right at each position, and a match consumes the text it covers, so once a quoted substring has been matched, the whitespace inside it is never offered to the second alternative [ \t]+ and is left untouched.

The pattern that matches the quoted substring is enclosed in () so the callback can check whether it was matched. If it was, m.group(1) is truthy and its value is returned unchanged. If not, m.group(1) is None, which means whitespace was matched, so a single space is returned as the replacement value.
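Since the question asks for the parts of the regex to be explained, here is the same pattern written with re.VERBOSE so each piece carries its own comment. This is only a sketch: the names PATTERN and collapse_spaces are made up for illustration, and it still assumes no quote characters inside a quoted substring.

```python
import re

# Same pattern as above, spread out with re.VERBOSE.
# Whitespace inside a character class such as [ \t] is still significant
# in verbose mode, so the literal space there is kept.
PATTERN = re.compile(r'''
    ("[^"]*")   # group 1: opening quote, any non-quote characters, closing quote
    |           # OR
    [ \t]+      # a run of one or more spaces or tabs outside quotes
''', re.VERBOSE)

def collapse_spaces(text):
    # Hypothetical helper: keep quoted substrings as they are,
    # replace every other run of spaces/tabs with a single space.
    return PATTERN.sub(lambda m: m.group(1) or ' ', text)

print(collapse_spaces('this is the first    a   b   c'))
# this is the first a b c
print(collapse_spaces('this is the second    "a      b      c"'))
# this is the second "a      b      c"
```

Here m.group(1) or ' ' is just a shorter spelling of the conditional expression used above.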

Without the lambda:

def repl(match):
    quoted = match.group(1)
    return quoted if quoted else ' '

str = re.sub(r'("[^"]*")|[ \t]+', repl, str)

If you want a solution that works reliably whatever the input, including quoted substrings that contain escaped quotes, write a simple parser rather than using a regular expression or splitting on quotes.

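def parse(s):
    last = ''      # the previous character that was kept
    result = ''
    toggle = 0     # 1 while inside a quoted substring
    for c in s:
        # A double quote not preceded by a backslash opens or closes a quoted substring
        if c == '"' and last != '\\':
            toggle ^= 1
        # Outside quotes, skip a space that follows another space
        if c == ' ' and toggle == 0 and last == ' ':
            continue
        result += c
        last = c
    return result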

test = r'"  <  >"test   1   2   3 "a \"<   >\"  b  c"'
print(test)
print(parse(test))
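If backslash-escaped quotes are the only extra complication, the regex from the first answer can probably be stretched to cover them as well. The following is a rough sketch, not a replacement for the parser: the names ESCAPE_AWARE and collapse_outside_quotes are invented here, and unlike parse it also collapses tabs, and it may behave differently on unbalanced quotes.

```python
import re

# Inside quotes, accept either a character that is neither a quote nor a
# backslash, or a backslash together with the character that follows it.
# That way \" does not end the quoted substring.
ESCAPE_AWARE = re.compile(r'("(?:[^"\\]|\\.)*")|[ \t]+')

def collapse_outside_quotes(text):
    # Hypothetical helper: quoted substrings are returned untouched,
    # any other run of spaces/tabs becomes a single space.
    return ESCAPE_AWARE.sub(lambda m: m.group(1) or ' ', text)

test = r'"  <  >"test   1   2   3 "a \"<   >\"  b  c"'
print(collapse_outside_quotes(test))
# "  <  >"test 1 2 3 "a \"<   >\"  b  c"
```

On this particular test string the result is the same as what parse returns.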

