I am splitting a string in Python and my goal is to split by commas, except those between quotation marks. I am using
fields = line.strip().split(",")
but some strings are like the following one:
10,20,"Installations, machines",3,5
How can I use regular expressions to accomplish this?
Although I agree that regular expressions may not be the best tool for the job, I found the problem quite interesting on its own.
import re
split_on_commas = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall
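This regexp consists of four alternatives, tried in this order:
[^,]*".*"[^,]* - a field containing a quoted part, optionally with non-comma text before and after it
[^,]+ - a non-empty field without commas
(?<=,) - an empty field right after a comma
^(?=,) - an empty field at the start of the string, right before a comma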
Some tests:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,20,"aaa, bbb",3,5') == ['10', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,,20,"aaa, bbb",3,5') == ['10', '', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas(',10,20,"aaa, bbb",3,5') == ['', '10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb",3,5,') == ['10', '20', '"aaa, bbb"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', '"aaa, bbb" ccc', '3', '5']
assert split_on_commas('10,20,ccc "aaa, bbb",3,5') == ['10', '20', 'ccc "aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb" "ccc",3,5,') == ['10', '20', '"aaa, bbb" "ccc"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" "ccc, ddd",3,5,') == ['10', '20', '"aaa, bbb" "ccc, ddd"', '3', '5', '']
assert split_on_commas('10,20,"aaa, "bbb",3,5') == ['10', '20', '"aaa, "bbb"', '3', '5']
assert split_on_commas('10,20,"",3,5') == ['10', '20', '""', '3', '5']
assert split_on_commas('10,20,",",3,5') == ['10', '20', '","', '3', '5']
assert split_on_commas(',,,') == ['', '', '', '']
assert split_on_commas('') == []
assert split_on_commas(',') == ['', '']
assert split_on_commas('","') == ['","']
assert split_on_commas('",') == ['"', '']
assert split_on_commas(',"') == ['', '"']
assert split_on_commas('"') == ['"']
csv module solution
Similar questions have been asked many times on SO, and each time the best / accepted answer was "Just use the csv module". Perhaps it's useful to point out some differences between the recommended solution and my re proposition. But first, devise a csv function with the same interface as split (not idiomatic, but consistent with the original requirement):
import csv
split_on_commas = lambda s: csv.reader([s]).next()
The first thing to be aware of is that csv.reader does more than a smart split. The external delimiters are suppressed:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
Which can lead to some strange behaviours:
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', 'aaa, bbb ccc', '3', '5']
assert split_on_commas('10,20,aaa", bbb ccc",3,5') == ['10', '20', 'aaa"', ' bbb ccc"', '3', '5']
I am sure this is not a problem with a generated CSV, since the offending double quotes would be escaped.
More shocking is the fact that, on Python 2, this module still does not support Unicode:
split_on_commas(u'10,20,"Juan, Chô",3,5')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-83-a0ef82b5fc26> in <module>()
----> 1 split_on_commas(u'10,20,"Juan, Chô",3,5')
<ipython-input-81-18a2b4070348> in <lambda>(s)
1 if __name__ == "__main__":
2 import csv
----> 3 split_on_commas = lambda s: csv.reader([s]).next()
4
5 assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 15: ordinal not in range(128)
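For comparison, the error above is specific to Python 2, where csv works on byte strings. On Python 3 the module accepts str (Unicode) input directly, and an iterator is advanced with the built-in next() rather than a .next() method. A minimal sketch, assuming Python 3; csv_split is a hypothetical name, chosen here only to avoid clashing with split_on_commas:

```python
import csv


def csv_split(line):
    """Parse one CSV-formatted line and return its fields as a list of str."""
    # csv.reader accepts any iterable of lines; a one-element list is enough.
    return next(csv.reader([line]))


# Non-ASCII input is handled; the enclosing quotes are still stripped.
print(csv_split('10,20,"Juan, Chô",3,5'))
```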
But there is of course a third difference: my solution has not been thoroughly tested, and is not guaranteed to work in the cases I didn't think of... Now, since this approach seems to have several real use cases (e.g., non-TSV files, non-ASCII input), I would be glad if some regex guru, far from dismissing it as dangerous, could help to find out its limitations and improve it.
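One limitation that can be checked directly: because ".*" is greedy, two quoted fields on the same line are merged into a single field, together with everything between them. The sketch below shows this, next to a possible variant that treats a field as a run of plain characters and complete "..." groups. The variant and the names greedy_split and balanced_split are assumptions made for illustration, not part of the answer above. It assumes balanced quotes, and on an unmatched quote it behaves differently from the original pattern.

```python
import re

# Pattern from the answer above: ".*" is greedy, so it runs to the LAST quote.
greedy_split = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall

# Hypothetical variant: a field is one or more of
#   - a character that is neither a comma nor a quote, or
#   - a complete quoted group "..." (which may contain commas).
# The two empty-field alternatives are unchanged. Assumes balanced quotes.
balanced_split = re.compile(r'(?:[^",]|"[^"]*")+|(?<=,)|^(?=,)').findall

line = '1,"a,b",2,"c,d",3'
print(greedy_split(line))    # the two quoted fields end up in one item
print(balanced_split(line))  # each quoted field is its own item
```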
This is how I'd do it:
import re
data = "my string \"string is nice\" other string "
print(re.findall(r'(\w+|".*?")', data))
The output will be:
['my', 'string', '"string is nice"', 'other', 'string']
I don't think there's anything to explain here as the regex speaks for itself. Anyway, if you have any doubts I recommend regex101
\w+ - matches one or more word characters [a-zA-Z0-9_]
" - matches the character " literally
.*? - matches any character (except newline), as few times as possible
If you also want to get rid of the square brackets, do this:
import re
string = "my string \"string is nice\" other string "
parsed_string = re.findall(r'(\w+|".*?")', string)
print(", ".join(parsed_string))
The output will be:
my, string, "string is nice", other, string
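Applied to the comma-separated line from the question, the same pattern gives the expected fields, but only because every unquoted field there is a single word. A quick check, where tokenize is a hypothetical name for the compiled pattern:

```python
import re

# Same pattern as above: a run of word characters, or a lazily matched "..." group.
tokenize = re.compile(r'\w+|".*?"').findall

print(tokenize('10,20,"Installations, machines",3,5'))
print(tokenize('10,,20'))           # the empty field is dropped
print(tokenize('10,two words,20'))  # the unquoted field is split at the space
```

Since the pattern never looks at the commas themselves, it is better suited to whitespace-separated words than to CSV-like input.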
As jonrsharpe and Alan Moore mentioned, Python's built-in csv module would be a much better solution.
As per their own example:
import csv
# 'rb' is the Python 2 idiom; on Python 3 use open('some.csv', newline='') instead
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Regular expressions will not work well here.
You can split by comma and then recombine... Or use the csv module as suggested in the comments...
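line = '10,20,"Installations, machines",3,5'
fields = line.strip().split(",")
result = []
tmpfield = ''  # pieces of a quoted field collected so far
for checkfield in fields:
    # Glue the piece back onto an unfinished quoted field, if there is one.
    tmpfield = checkfield if tmpfield == '' else tmpfield + ',' + checkfield
    if tmpfield.strip().startswith('"'):
        # Quoted field: complete only once the closing quote is reached.
        if tmpfield.strip().endswith('"'):
            result.append(tmpfield)
            tmpfield = ''
    else:
        # Unquoted field: complete as is.
        result.append(tmpfield)
        tmpfield = ''
if tmpfield != '':  # unterminated quoted field left over
    result.append(tmpfield)
print(result)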
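To try the split-and-recombine idea on more inputs, the loop can be wrapped in a function. The following is a sketch rather than a drop-in replacement: split_recombine is a hypothetical name, and it uses None instead of an empty string to mark "no quoted field in progress", so that a piece consisting of a lone quote (as in a quoted field that is just a comma) is not mistaken for a complete field.

```python
def split_recombine(line):
    """Split on commas, then glue back the pieces of each quoted field."""
    result = []
    pending = None  # quoted field still waiting for its closing quote
    for piece in line.strip().split(","):
        if pending is not None:
            pending += "," + piece      # continue the open quoted field
        elif piece.strip().startswith('"'):
            pending = piece             # a quoted field starts here
        else:
            result.append(piece)        # plain field, complete as is
            continue
        # The quoted field is complete once it ends with a quote and is
        # more than just the opening quote itself.
        stripped = pending.strip()
        if len(stripped) > 1 and stripped.endswith('"'):
            result.append(pending)
            pending = None
    if pending is not None:             # unterminated quote: keep what we have
        result.append(pending)
    return result


print(split_recombine('10,20,"Installations, machines",3,5'))
```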