
How to remove commas within double quotes in Python?

I want to remove comma/commas within double quotes from a string.

i.e., the following string

string = '100,"def,ghi","jkl,mno\"pqr,stu","stu,vwx"'

should output:

'100,"defghi","jklmno"pqrstu","stuvwx"'

I tried this regex:

re.sub(',(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)', "", string) 

But it fails if there are double quotes within double quotes.
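To make the failure concrete: the lookahead removes a comma whenever an odd number of quotes follows it up to the end of the string, so one stray inner quote flips that parity for every comma before it. A minimal sketch, with the inner quote written unescaped and with variable names chosen only for illustration:

```python
import re

# The regex from above: drop a comma if an odd number of quotes follows it.
pattern = r',(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)'

s = '100,"def,ghi","jkl,mno"pqr,stu","stu,vwx"'   # 7 quotes in total
wanted = '100,"defghi","jklmno"pqrstu","stuvwx"'

result = re.sub(pattern, "", s)
print(result)            # separators before the inner quote are removed instead
print(result == wanted)  # False
```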

Another example:

string = 'null,"re,\"move,\"comma"'

output:

'null,"re"move"comma"'

This is so that I can then split the string on commas and get a list like:

['100','"defghi"','"jklmno"pqrstu"','"stuvwx"']

You could split the string on non-escaped quotes and remove the commas from every odd-indexed substring, then join the substrings back together with quotes as the separator.

For example:

'100,"def,ghi","jkl,mno\"pqr,stu","stu,vwx"' would be split into:

  0. 100,
  1. def,ghi <-- odd index (remove comma)
  2. ,
  3. jkl,mno\"pqr,stu <-- odd index (remove comma)
  4. ,
  5. stu,vwx <-- odd index (remove comma)
  6. <empty string>
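The split itself can be checked with a short sketch like the one below. It assumes the escaped form of the input (a raw string containing `\"`), and the names `s` and `parts` are only illustrative:

```python
import re

s = r'100,"def,ghi","jkl,mno\"pqr,stu","stu,vwx"'

# Split on every double quote that is NOT preceded by a backslash.
parts = re.split(r'(?<!\\)"', s)

# Indices are zero-based; the odd ones are the text that was inside quotes.
for i, part in enumerate(parts):
    note = '  <-- odd index (remove comma)' if i % 2 else ''
    print(i, repr(part) + note)
```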

...

import re

def remComma(S):
    # split on non-escaped quotes; odd-indexed parts were inside quotes
    return '"'.join(s.replace(",","") if i%2 else s
                    for i,s in enumerate(re.split(r'(?<!\\)"',S)))

Output:

string = r'100,"def,ghi","jkl,mno\"pqr,stu","stu,vwx"'
print(remComma(string).replace(r'\"','"'))
# 100,"defghi","jklmno"pqrstu","stuvwx"

string = r'null,"re,\"move,\"comma"'
print(remComma(string).replace(r'\"','"'))
# null,"re"move"comma"

Note that your expected output also unescapes the quotes. I do that separately in the print statement, since it is not part of the comma removal itself.

You could also do this without using a regular expression by changing escaped quotes to something else before doing a basic split on quotes (then restore the escaped quotes as you join the parts back together):

def remComma(S):
    parts = S.replace(r'\"','\0').split('"')               
    parts[1::2] = (p.replace(',','') for p in parts[1::2])  
    return '"'.join(p.replace('\0','"') for p in parts)
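If the end goal is the list of fields from the question, the same placeholder idea can be taken one step further. The sketch below is only an illustration: `split_fields` and its `placeholder` parameter are hypothetical names, and it assumes the input never contains a NUL character and always escapes inner quotes as `\"`.

```python
def split_fields(s, placeholder='\0'):
    # Hide escaped quotes so a plain split on '"' only sees real delimiters.
    parts = s.replace(r'\"', placeholder).split('"')
    # Odd-indexed parts were inside quotes: drop their commas.
    parts[1::2] = [p.replace(',', '') for p in parts[1::2]]
    cleaned = '"'.join(parts)
    # The remaining commas separate fields; restore the hidden quotes last.
    return [f.replace(placeholder, '"') for f in cleaned.split(',')]

print(split_fields(r'100,"def,ghi","jkl,mno\"pqr,stu","stu,vwx"'))
# ['100', '"defghi"', '"jklmno"pqrstu"', '"stuvwx"']

print(split_fields(r'null,"re,\"move,\"comma"'))
# ['null', '"re"move"comma"']
```

Restoring the quotes only after the final split keeps the placeholder from ever being mistaken for a field delimiter.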

[EDIT] Solution changed after removal of quote escaping from question.

Having unescaped quotes allows for situations where parts of the string may need to be 'tentatively' parsed to ensure that the balance of quotes and separators produces a valid list of parts. This will be very hard to implement using a regular expression. But you can achieve it with a recursive function.

The function would isolate the first part, remove commas if it is quoted and attempt to clean up the commas in the remaining parts recursively. If the rest of the string produces an invalid CSV split because of unbalanced quotes/commas, then the first part is extended by treating its closing quotes as the beginning of a quoted part (rather than the end):

def remComma(S):
    if S.startswith('"'):            # first part is quoted
        p = S.find('",',1)           # find end of quote
        if p<0:                      # last quoted part or error
            return S.replace(',','') if S.endswith('"') else None   
        rest = remComma(S[p+2:])     # clean rest of parts
        if rest is None:             # invalid -> not the real end of quote
            rest = remComma(S[p:])   # retry with the quote kept inside the part
            return None if rest is None else S[:p].replace(',','')+rest
        return S[:p+1].replace(',','')+ "," + rest
    p = S.find(',"')                   # find next quoted part
    if p<0:                            # if no quoted part
        return None if '"' in S else S # invalid if contains quotes
    rest = remComma(S[p+1:])           # clean rest of parts
    return None if rest is None else S[:p+1]+rest  # unquoted part + cleaned rest


string = r'100,"def,ghi","jkl,mno"pqr,stu","stu,vwx"'
print(remComma(string))
# 100,"defghi","jklmno"pqrstu","stuvwx"

string = r'null,"re,"move,"comma"'
print(remComma(string))
# null,"re"move"comma"

string = r'100,"abc",123,456, "mno",xyz"'
print(remComma(string))
# 100,"abc"123456 "mno"xyz"


 