
Conditional regular expression to split on commas

I am splitting a string in Python, and my goal is to split on commas except those between quotation marks. I am using

fields = line.strip().split(",")

but some strings are like the following one:

10,20,"Installations, machines",3,5

How can I use regular expressions to accomplish this?

Although I agree that regular expressions may not be the best tool for the job, I found the problem quite interesting on its own.

import re
split_on_commas = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall

This regexp consists of four alternatives, tried in this order:

  1. any number of non-commas, followed by a substring enclosed between double quotes, followed by any number of non-commas;
  2. at least one non-comma;
  3. an empty substring following a comma;
  4. an empty substring at the start of the string, and followed by a comma.
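To make the four alternatives easier to see, the same pattern can be laid out with re.VERBOSE, one alternative per line. This is only a sketch: the name split_on_commas_verbose is introduced here for illustration, and the pattern is intended to be equivalent to the one-line version above.

```python
import re

# Same four alternatives as above, one per line. In VERBOSE mode the
# whitespace and the trailing comments are ignored by the regex engine.
split_on_commas_verbose = re.compile(r'''
    [^,]* ".*" [^,]*   # 1. quoted substring, with optional non-comma text around it
  | [^,]+              # 2. plain field: at least one non-comma
  | (?<=,)             # 3. empty field right after a comma
  | ^(?=,)             # 4. empty field at the start, followed by a comma
''', re.VERBOSE).findall

print(split_on_commas_verbose('10,20,"aaa, bbb",3,5'))
```

The order matters: the quoted alternative has to be tried before the plain one, otherwise `[^,]+` would stop at the comma inside the quotes.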

Some tests:

assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,20,"aaa, bbb",3,5') == ['10', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,,20,"aaa, bbb",3,5') == ['10', '', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas(',10,20,"aaa, bbb",3,5') == ['', '10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb",3,5,') == ['10', '20', '"aaa, bbb"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', '"aaa, bbb" ccc', '3', '5']
assert split_on_commas('10,20,ccc "aaa, bbb",3,5') == ['10', '20', 'ccc "aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb" "ccc",3,5,') == ['10', '20', '"aaa, bbb" "ccc"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" "ccc, ddd",3,5,') == ['10', '20', '"aaa, bbb" "ccc, ddd"', '3', '5', '']
assert split_on_commas('10,20,"aaa, "bbb",3,5') == ['10', '20', '"aaa, "bbb"', '3', '5']
assert split_on_commas('10,20,"",3,5') == ['10', '20', '""', '3', '5']
assert split_on_commas('10,20,",",3,5') == ['10', '20', '","', '3', '5']
assert split_on_commas(',,,') == ['', '', '', '']
assert split_on_commas('') == []
assert split_on_commas(',') == ['', '']
assert split_on_commas('","') == ['","']
assert split_on_commas('",') == ['"', '']
assert split_on_commas(',"') == ['', '"']
assert split_on_commas('"') == ['"']
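One case the tests above do not cover is a line with two separate quoted fields. Because `.*` is greedy, the first alternative runs from the first double quote to the last one on the line. The sketch below shows this, together with a possible variant that matches each quoted run as `"[^"]*"`. The names greedy_split and per_quote_split are hypothetical, and the variant assumes the quotes are balanced.

```python
import re

# The pattern from this answer (greedy ".*").
greedy_split = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall

# Hypothetical variant: a quoted run is "[^"]*", so it cannot extend past
# its own closing quote. Assumes every opening quote has a closing quote.
per_quote_split = re.compile(r'(?:"[^"]*"|[^,"])+|(?<=,)|^(?=,)').findall

line = '1,"a, b",2,"c, d",3'
print(greedy_split(line))     # ['1', '"a, b",2,"c, d"', '3']
print(per_quote_split(line))  # ['1', '"a, b"', '2', '"c, d"', '3']
```

The variant is not a drop-in replacement: input with unbalanced quotes, such as the `"aaa, "bbb"` test above, is handled differently from the original pattern.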

Update: comparison with the csv module solution

Similar questions have been asked many times on SO, and each time the best or accepted answer was "Just use the csv module". It may be useful to point out some differences between that recommended solution and my re-based proposal. First, let's wrap csv in a function with the same interface as the regex split above (not idiomatic, but consistent with the original requirement):

import csv
split_on_commas = lambda s: next(csv.reader([s]))  # next() works in Python 2.6+ and 3

The first thing to be aware of is that csv.reader does more than a smart split: the enclosing double quotes are removed from the field:

assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']

Which can lead to some strange behaviours:

assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', 'aaa, bbb ccc', '3', '5']
assert split_on_commas('10,20,aaa", bbb ccc",3,5') == ['10', '20', 'aaa"', ' bbb ccc"', '3', '5']

I am sure this is not a problem with a generated CSV, since the offending double quotes would be escaped.
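As an illustration of that escaping, here is a small round trip through csv.writer and csv.reader. It is a sketch written for Python 3 (io.StringIO as the buffer), and the sample row is made up for the example.

```python
import csv
import io

# A row whose fields contain a double quote and a comma (made-up data).
row = ['10', 'aaa" bbb', 'x, y']

# csv.writer quotes the fields that need it and doubles embedded quotes.
buf = io.StringIO()
csv.writer(buf).writerow(row)
line = buf.getvalue()
print(repr(line))  # '10,"aaa"" bbb","x, y"\r\n'

# csv.reader undoes the quoting and gives back the original fields.
parsed = next(csv.reader([line]))
print(parsed)      # ['10', 'aaa" bbb', 'x, y']
```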

More surprising is the fact that, in Python 2, this module does not support Unicode input:

split_on_commas(u'10,20,"Juan, Chô",3,5')

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-83-a0ef82b5fc26> in <module>()
----> 1 split_on_commas(u'10,20,"Juan, Chô",3,5')

<ipython-input-81-18a2b4070348> in <lambda>(s)
      1 if __name__ == "__main__":
      2     import csv
----> 3     split_on_commas = lambda s: csv.reader([s]).next()
      4 
      5     assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 15: ordinal not in range(128)
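For comparison, the traceback above comes from Python 2. In Python 3 the csv module works on str, so the same line parses without an encoding error. A minimal check, assuming a Python 3 interpreter:

```python
import csv

# In Python 3, csv.reader accepts str rows, including non-ASCII text.
fields = next(csv.reader(['10,20,"Juan, Chô",3,5']))
print(fields)  # ['10', '20', 'Juan, Chô', '3', '5']
```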

But there is of course a third difference: my solution has not been thoroughly tested, and is not guaranteed to work in cases I didn't think of. Still, since this approach seems to have several real use cases (e.g., non-TSV files, non-ASCII input), I would be glad if some regex guru, rather than dismissing it as dangerous, could help find its limitations and improve it.

This is how I'd do it:

import re

data = "my string \"string is nice\" other string "
print(re.findall(r'(\w+|".*?")', data))

The output will be:

['my', 'string', '"string is nice"', 'other', 'string']

I don't think there's anything to explain here as the regex speaks for itself. Anyway, if you have any doubts I recommend regex101

\w+ - matches one or more word characters [a-zA-Z0-9_]
" - matches the character " literally
.*? - matches any character (except newline), as few times as possible (lazy)
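The `?` after `.*` is what keeps each quoted string separate. A short comparison of the lazy and greedy forms, using a made-up sample string:

```python
import re

text = 'a "b" c "d"'

# Lazy: stops at the first closing quote, so each quoted string is separate.
lazy = re.findall(r'".*?"', text)
print(lazy)    # ['"b"', '"d"']

# Greedy: runs on to the last quote in the string.
greedy = re.findall(r'".*"', text)
print(greedy)  # ['"b" c "d"']
```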

If you also want to get rid of the square brackets, do this:

import re

string = "my string \"string is nice\" other string "
parsed_string = re.findall(r'(\w+|".*?")', string)

print(", ".join(parsed_string))

The output will be:

my, string, "string is nice", other, string

As jonrsharpe and Alan Moore mentioned, Python's built-in csv module would be a much better solution.

As per their own example:

import csv
with open('some.csv', newline='') as f:  # Python 3: open in text mode with newline=''
    reader = csv.reader(f)
    for row in reader:
        print(row)
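If the lines are already in memory as strings, as in the question, no file is needed: csv.reader accepts any iterable of strings. A small sketch for Python 3, where the second sample line is made up for the example:

```python
import csv

lines = [
    '10,20,"Installations, machines",3,5',  # the line from the question
    '1,2,plain,4,5',                        # made-up line without quotes
]

# Each input string becomes one list of fields.
rows = list(csv.reader(lines))
for row in rows:
    print(row)
```

Note that, unlike the regex approach, the enclosing double quotes are removed from the quoted field.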

Regular expressions will not work well here.

You can split by comma and then recombine... Or use the csv module as suggested in the comments...

line = '10,20,"Installations, machines",3,5'
fields = line.strip().split(",")

result = []
tmpfield = ''
for checkfield in fields:
    tmpfield = checkfield if tmpfield=='' else tmpfield +','+ checkfield
    if tmpfield.strip().startswith('"'):
        if tmpfield.strip().endswith('"'):
            result.append(tmpfield)
            tmpfield = ''
    else:
        result.append(tmpfield)
        tmpfield = ''

if tmpfield != '':
    result.append(tmpfield)

print(result)  
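The same split-and-recombine idea can be wrapped in a function so it is easier to test. This is a sketch, not the code above verbatim: the name split_and_recombine is hypothetical, it uses None rather than an empty string to mean "no field in progress", and it does not treat a piece consisting of a single double quote as both opened and closed.

```python
def split_and_recombine(line):
    """Split on commas, then glue back the pieces of a quoted field."""
    result = []
    pending = None  # field being rebuilt, or None if none is in progress
    for piece in line.strip().split(","):
        pending = piece if pending is None else pending + "," + piece
        stripped = pending.strip()
        opened = stripped.startswith('"')
        # A lone '"' opens a quoted field but does not close it.
        closed = len(stripped) > 1 and stripped.endswith('"')
        if not opened or closed:
            result.append(pending)
            pending = None
    if pending is not None:  # unterminated quote: keep what was collected
        result.append(pending)
    return result


print(split_and_recombine('10,20,"Installations, machines",3,5'))
# ['10', '20', '"Installations, machines"', '3', '5']
```

Like the loop above, this only looks at quotes at the start and end of a field, so it keeps the quotes in the output and does not handle escaped quotes.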
