I need to read a CSV file in Pandas which has data in the following format (double 'double quotes' for one of the fields)
"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"
I expect the correctly parsed dataframe to be like
column1 column2 column3 column4
10 AB ABCD abcd
11 "CD,E" CDEF abcd
12 WER "DEF,31" abcd
I tried using
df = pd.read_csv('sample.txt', quotechar='""', quoting=csv.QUOTE_ALL)
and
df = pd.read_csv('sample.txt', quotechar='"', quoting=csv.QUOTE_ALL)
but I get
TypeError: "quotechar" must be a 1-character string
and
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5
Is there a way for me to read this file as-is, without having to preprocess it to remove the double 'double quotes' in the data?
When column2 has no commas, I'm able to read the data with some extra quotes, which I can strip in a later processing step. I run into parsing problems only when column2 contains a comma.
I'm not sure pandas can do this by itself, since the data contains both unescaped separators and unescaped quotes.
However, you should be able to parse it after modifying the data with a regex that escapes the quotes that are part of a field.
import re
import pandas as pd
from io import StringIO
data = """
"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"
"""
data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', data, flags=re.M)
pd.read_csv(StringIO(data), escapechar='\\')
If you are reading from a file, then:
with open('path/to/csv', 'r') as f:
    data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', f.read(), flags=re.M)

df = pd.read_csv(StringIO(data), escapechar='\\')
The regex looks for quotes that are not at the start (?<!^) or end (?!$) of a line, and that are neither a field-opening quote (?<!,") nor a field-closing quote (?!,").
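For example, applied to a single line from the sample data above, only the inner quotes get escaped:

```python
import re

line = '"11",""CD,E"","CDEF","abcd"'
# only the inner quotes around CD,E are escaped;
# the field-delimiting quotes are left untouched
fixed = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', line, flags=re.M)
print(fixed)  # "11","\"CD,E\"","CDEF","abcd"
```

With escapechar='\\', pandas then reads column2 as "CD,E", quotes included.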
This should do the trick for you:
import csv
import pandas as pd

df = pd.read_csv("so.txt", encoding='utf-8',
                 names=["column1", "column2", "column3", "column4"],
                 sep='",', header=0, quoting=csv.QUOTE_ALL)
Note that a multi-character sep is treated as a regular expression and forces the slower python parsing engine.
Pre-processing is not necessarily a bad thing. If done through a system pipe, it won't add much overhead (unlike the StringIO approach above, which reads the whole file into memory). It can also save lots of time chasing known issues (e.g. DOS CR, no EOL at EOF, NULL bytes in the middle, etc., depending on your system).
The approach below handles this case plus some other common issues: extra quotes, stray spaces and commas, and embedded commas. While pandas does have a doublequote parameter, it's not flexible enough here.
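To see why the doubled-quote convention doesn't recover these fields, here is a minimal sketch using the standard csv module on a made-up line: in standard CSV, "" inside a quoted field means a literal ", so ""X,Y"" is not read back as one field.

```python
import csv
import io

# ""X,Y"" is not parsed as a single field containing X,Y:
# the embedded comma still acts as a separator
row = next(csv.reader(io.StringIO('"1",""X,Y""\n')))
print(len(row))  # 3 fields instead of the expected 2
```

pandas' C parser trips over the same ambiguity, which is what produces the "Expected 4 fields ... saw 5" error in the question.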
Using a system pipe should be efficient for large files on Linux:
import os
import pandas as pd

df = pd.read_csv(
    os.popen('sed -r "s/^\s+|(^[,[:space:]]*|\s*)(#.*)?$//g; s/\s+,/,/g; s/\\"\\"/\\"/g" %s' % fname),
    quotechar='"', skipinitialspace=True)
Or, using a Python pipe:
import re
import pandas as pd
from io import StringIO

with open(fname) as f:
    # strip leading whitespace, comments, and empty/comma-only lines,
    # remove spaces before commas, and reduce "" to "
    data = re.sub('""', '"', re.sub('[ \t]+,', ',',
           re.sub('^[ \t]+|(^[ \t,]*|[ \t]*)(#.*)?$', '', f.read(), flags=re.M)))

df = pd.read_csv(StringIO(data), quotechar='"', skipinitialspace=True)
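For example, applied to one line in the style of the sample input below (a minimal sketch of the three chained substitutions):

```python
import re

line = '40, 60, 75, ""x, q"" # double quoting'
# drop the trailing comment, strip spaces before commas, reduce "" to "
clean = re.sub('""', '"', re.sub('[ \t]+,', ',',
        re.sub('^[ \t]+|(^[ \t,]*|[ \t]*)(#.*)?$', '', line, flags=re.M)))
print(clean)  # 40, 60, 75, "x, q"
```

The remaining spaces after the commas are handled by skipinitialspace=True in read_csv.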
Input file with comments and issues
a, b, c, d # header w/ trailing spaces
,, ,,,, # commas + spaces, no data
# extra space before data
1, 2, 3.5, 4k
3, " 5 " , 7.6, "n, m" # extra spaces, comma inside
10, "20" , 30.5, w z
40, 60, 75, ""x, q"" # double quoting
Output:
a b c d
0 1 2 3.5 4k
1 3 5 7.6 n, m
2 10 20 30.5 w z
3 40 60 75.0 x, q
It's now clean and properly formatted:
a int64
b int64
c float64
d object
list(df['d']): ['4k', 'n, m', 'w z', 'x, q']