
Reading CSV file in Pandas with double 'double quotes' and embedded commas

I need to read a CSV file in Pandas which has data in the following format (double 'double quotes' for one of the fields):

"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"

I expect the correctly parsed dataframe to look like:

column1  column2  column3   column4
10       AB       ABCD      abcd
11       "CD,E"   CDEF      abcd
12       WER      "DEF,31"  abcd

I tried using

df= pd.read_csv('sample.txt',quotechar='""', quoting=csv.QUOTE_ALL)

and

df= pd.read_csv('sample.txt',quotechar='"', quoting=csv.QUOTE_ALL)

but I'm getting

TypeError: "quotechar" must be a 1-character string

and

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

Is there a way for me to read this file as is without having to preprocess and remove the double 'double quotes' in the data?

When column2 has no commas, I'm able to read the data with some extra quotes, which I can replace in further processing steps. I only run into parsing problems when column2 contains a comma.

I'm not sure if pandas can do this by itself since you also have both unescaped separators and quotes in your data.

However, you should be able to parse it after modifying the data with regex by escaping quotes that are part of the field.

import re
import pandas as pd
from io import StringIO

data = """
"column1","column2","column3","column4"
"10",""AB"","ABCD","abcd"
"11",""CD,E"","CDEF","abcd"
"12",""WER"","DEF,31","abcd"
"""

data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', data, flags=re.M)

pd.read_csv(StringIO(data), escapechar='\\')

If you are reading from a file, then:

with open('path/to/csv', 'r') as f:
    data = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', f.read(), flags=re.M)
    df = pd.read_csv(StringIO(data), escapechar='\\')

The regex looks for quotes that are not at the start (?<!^) or end (?!$) of a line, and that are not field delimiters: (?<!,") skips a quote immediately preceded by a comma (the opening quote of a field), and (?!,") skips a quote immediately followed by `,"` (the closing quote of a field). Only the inner, data-level quotes are left to match, and each of those gets a backslash in front of it.
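To see the substitution in action on a single row (the column names c1..c4 below are just stand-ins), note that only the inner quotes get escaped, so the embedded comma ends up safely inside a quoted field:

```python
import re
import pandas as pd
from io import StringIO

line = '"11",""CD,E"","CDEF","abcd"'
escaped = re.sub('(?<!^)"(?!,")(?<!,")(?!$)', '\\"', line, flags=re.M)
print(escaped)  # "11","\"CD,E\"","CDEF","abcd"

# With escapechar set, pandas reads the escaped quotes as literal characters:
df = pd.read_csv(StringIO('"c1","c2","c3","c4"\n' + escaped), escapechar='\\')
print(df['c2'][0])  # "CD,E"
```

The leftover literal quotes around CD,E match the expected output in the question and can be stripped in a later processing step if they are not wanted.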

This should do the trick:

import csv
import pandas as pd

df = pd.read_csv("so.txt", encoding='utf-8', names=["column1","column2","column3","column4"],
                 sep='",', header=0, quoting=csv.QUOTE_ALL)

Note that a multi-character sep makes pandas fall back to the slower Python parsing engine.

Pre-processing is not necessarily a bad thing. If done through a system pipe, it won't add much overhead (unlike the StringIO approach above, which reads the whole file into memory first). It can also save lots of time chasing some known issues (e.g. DOS CR line endings, no EOL at EOF, NULL bytes in the middle, etc., depending on your system).

The approach below handles this case plus some other common issues: extra quotes, extra spaces, stray commas, and embedded commas. Pandas does have a doublequote parameter, but it's not flexible enough for data like this.

Using a system pipe should be efficient for large files on Linux:

import os
df = pd.read_csv(
    os.popen(r'sed -r "s/^\s+|(^[,[:space:]]*|\s*)(#.*)?$//g; s/\s+,/,/g; s/\"\"/\"/g" %s' % fname),
    quotechar='"', skipinitialspace=True)

OR: using a python pipe

import re
from io import StringIO
with open(fname) as f:
    data = re.sub('""', '"', re.sub('[ \t]+,', ',',
                  re.sub('^[ \t]+|(^[ \t,]*|[ \t]*)(#.*)?$', '', f.read(), flags=re.M)))
    df = pd.read_csv(StringIO(data), quotechar='"', skipinitialspace=True)

Input file with comments and issues

a, b, c, d         # header w/ trailing spaces
,, ,,,,            # commas + spaces, no data
# extra space before data
  1, 2, 3.5, 4k     
3, " 5 " , 7.6, "n, m" # extra spaces, comma inside
10, "20" , 30.5, w z
40, 60, 75, ""x, q""   # double quoting

Output:

    a   b     c     d
0   1   2   3.5    4k
1   3   5   7.6  n, m
2  10  20  30.5   w z
3  40  60  75.0  x, q

It's now clean and properly formatted:

a int64
b int64
c float64
d object

list(df['d']):  ['4k', 'n, m', 'w z', 'x, q']
