.read_csv() in pandas isn't reading escape characters properly

I'm trying to build an ETL pipeline with pandas that writes the data out as CSV, but I'm running into problems with escape characters.

For example, if my data is '\\"', the escapechar is '\\', and the quotechar is '"', then after reading the file back my data becomes "\\", with one escapechar missing.

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd
import csv

escape_char_defined = '\\'
quote_defined = '"'
separator = "|"

sample_data = []

for i in range(1,11):
    sample_data.append(i*escape_char_defined + quote_defined)

initial_df = pd.DataFrame(sample_data,columns=['column'])


csv_text = initial_df.to_csv(sep=separator,columns=None,header=None,index=False,doublequote=False,quoting=csv.QUOTE_ALL,quotechar=quote_defined,escapechar=escape_char_defined,encoding='utf-8')

csv_text = StringIO(csv_text)

final_df = pd.read_csv(csv_text,sep=separator,escapechar=escape_char_defined,quoting=csv.QUOTE_ALL,header=None,doublequote=False,encoding='utf-8')

if not final_df.equals(initial_df):
    raise Exception("Dataframes are not equal!")    

I don't think this is expected behaviour, since I'm using the same tool to write and read the CSV text.

Has anyone else run into this problem?

Here is the fixed code, if I understood correctly what you need.

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd
import csv

escape_char_defined = "\\"
quote_defined = '"'
separator = "|"

sample_data = []

for i in range(1,11):
    sample_data.append(i*escape_char_defined + quote_defined)

initial_df = pd.DataFrame(sample_data,columns=['column'])

csv_text = initial_df.to_csv(sep=separator,columns=None,quoting=csv.QUOTE_NONE,header=None,index=False,doublequote=False,quotechar=quote_defined,escapechar=escape_char_defined)
csv_text = StringIO(csv_text)
final_df = pd.read_csv(csv_text,names=["column"],sep=separator,quoting=csv.QUOTE_NONE,escapechar=escape_char_defined,quotechar=quote_defined,header=None,doublequote=False)

if not final_df.equals(initial_df):
    raise Exception("Dataframes are not equal!")    

I replaced quoting=csv.QUOTE_ALL with quoting=csv.QUOTE_NONE in both the to_csv() and pd.read_csv() calls.

The csv.QUOTE_NONE option tells the writer never to quote fields. When the current delimiter occurs in the output data, it is preceded by the current escapechar. If escapechar is not set, the writer raises an error if it encounters any characters that require escaping.
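The QUOTE_NONE behaviour described above can be seen in isolation with the standard-library csv module. This is only a minimal sketch: the write_row helper and the sample row are made up for illustration, and it deliberately uses data without quotes or backslashes, so it does not reproduce the case from the question.

```python
import csv
import io


def write_row(row, escapechar=None):
    """Serialize one row with a '|' delimiter and quoting disabled."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="|",
        quoting=csv.QUOTE_NONE,
        escapechar=escapechar,
        lineterminator="\n",
    )
    writer.writerow(row)
    return buf.getvalue()


# With an escapechar, the delimiter inside the first field is escaped.
escaped = write_row(["a|b", "c"], escapechar="\\")
print(repr(escaped))  # 'a\\|b|c\n'

# Without an escapechar, the writer refuses to emit the row.
try:
    write_row(["a|b", "c"])
    raised = False
except csv.Error as exc:
    raised = True
    print("csv.Error:", exc)

# Reading back with the same settings restores the original fields.
reader = csv.reader(
    io.StringIO(escaped),
    delimiter="|",
    quoting=csv.QUOTE_NONE,
    escapechar="\\",
)
round_trip = next(reader)
print(round_trip)  # ['a|b', 'c']
```

The reader only undoes the escaping if it is given the same delimiter, quoting and escapechar as the writer.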

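In pd.read_csv() I also set the column name to 'column' via names, so that the column labels match those of initial_df; DataFrame.equals() compares labels as well as values.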
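The column name matters for the final check because DataFrame.equals() returns False when the column labels differ, even if every value is the same. A small sketch with made-up frames (a and b are illustrative names, not part of the code above):

```python
import pandas as pd

# Same values, but "b" gets the default integer label 0,
# just like a frame read with header=None and no names.
a = pd.DataFrame(["x", "y"], columns=["column"])
b = pd.DataFrame(["x", "y"])

equal_with_default_label = a.equals(b)

# After giving "b" the same label, the frames compare equal.
b.columns = ["column"]
equal_with_same_label = a.equals(b)

print(equal_with_default_label, equal_with_same_label)  # False True
```

So without names=["column"], the comparison in the snippets above would fail regardless of how the escape characters are handled.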
