
.read_csv() in pandas isn't reading escape characters properly

I'm trying to build an ETL pipeline with pandas that writes the data out as CSV, but I'm running into problems with some escape characters.

For example, if my data is '\\"', the escapechar is '\\', and the quotechar is '"', then when I read the file back my data turns into "\\": one escapechar is missing.

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd
import csv

escape_char_defined = '\\'
quote_defined = '"'
separator = "|"

sample_data = []

for i in range(1,11):
    sample_data.append(i*escape_char_defined + quote_defined)

initial_df = pd.DataFrame(sample_data,columns=['column'])


# Write the frame to CSV text, quoting every field and escaping embedded quotes.
csv_text = initial_df.to_csv(
    sep=separator,
    columns=None,
    header=None,
    index=False,
    doublequote=False,
    quoting=csv.QUOTE_ALL,
    quotechar=quote_defined,
    escapechar=escape_char_defined,
    encoding='utf-8',
)

csv_text = StringIO(csv_text)

# Read the text back with the same dialect settings.
final_df = pd.read_csv(
    csv_text,
    sep=separator,
    escapechar=escape_char_defined,
    quoting=csv.QUOTE_ALL,
    header=None,
    doublequote=False,
    encoding='utf-8',
)

if not final_df.equals(initial_df):
    raise Exception("Dataframes are not equal!")    

I don't think this is expected behaviour, since I'm using the same tool to write and read the CSV text.

Has anyone run into this problem before?

Here is the fixed code, if I have understood correctly what you need.

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd
import csv

escape_char_defined = "\\"
quote_defined = '"'
separator = "|"

sample_data = []

for i in range(1,11):
    sample_data.append(i*escape_char_defined + quote_defined)

initial_df = pd.DataFrame(sample_data,columns=['column'])

# Write without quoting; special characters are escaped with escapechar instead.
csv_text = initial_df.to_csv(
    sep=separator,
    columns=None,
    quoting=csv.QUOTE_NONE,
    header=None,
    index=False,
    doublequote=False,
    quotechar=quote_defined,
    escapechar=escape_char_defined,
)
csv_text = StringIO(csv_text)
# Read back with the same settings and restore the original column name.
final_df = pd.read_csv(
    csv_text,
    names=["column"],
    sep=separator,
    quoting=csv.QUOTE_NONE,
    escapechar=escape_char_defined,
    quotechar=quote_defined,
    header=None,
    doublequote=False,
)

if not final_df.equals(initial_df):
    raise Exception("Dataframes are not equal!")    

I have replaced quoting=csv.QUOTE_ALL with quoting=csv.QUOTE_NONE in both the DataFrame.to_csv() and pd.read_csv() calls.

The csv.QUOTE_NONE option tells the writer never to quote fields. When the current delimiter occurs in the output data, it is preceded by the current escapechar. If escapechar is not set, the writer raises an error when it encounters any character that requires escaping.
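
To make that behaviour concrete, here is a minimal sketch that uses only the standard-library csv module rather than pandas. The helper name write_row and the sample fields are made up for illustration; the sketch only assumes the documented csv.writer and csv.reader parameters.

```python
import csv
import io


def write_row(fields, escapechar=None):
    """Write one row with QUOTE_NONE and return the raw CSV text (illustrative helper)."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="|",
        quoting=csv.QUOTE_NONE,
        escapechar=escapechar,
    )
    writer.writerow(fields)
    return buf.getvalue()


# With an escapechar set, the delimiter inside the first field is escaped, not quoted.
text = write_row(["a|b", "c"], escapechar="\\")

# Reading back with the same settings recovers the original fields.
rows = list(
    csv.reader(
        io.StringIO(text),
        delimiter="|",
        quoting=csv.QUOTE_NONE,
        escapechar="\\",
    )
)

# Without an escapechar, the writer cannot represent the embedded delimiter.
try:
    write_row(["a|b", "c"])
    raised = False
except csv.Error:
    raised = True
```

Here text holds the escaped delimiter in the first field, rows contains the original two fields, and raised is True because the writer had no way to escape the embedded delimiter.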

In the pd.read_csv() call I have also set the column name to 'column' via names=["column"].
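
The column name matters for the final check because DataFrame.equals() compares column labels as well as values. The sketch below uses made-up integer data instead of the escape-character strings from the question, so it illustrates only the labelling effect and not the quoting behaviour.

```python
from io import StringIO

import pandas as pd

# Illustrative frame; the values are arbitrary.
expected = pd.DataFrame([1, 2, 3], columns=["column"])
csv_text = expected.to_csv(header=False, index=False)

# With header=None alone, pandas labels the single column with the integer 0.
unnamed = pd.read_csv(StringIO(csv_text), header=None)

# Passing names= restores the original label.
named = pd.read_csv(StringIO(csv_text), header=None, names=["column"])

labels_unnamed = list(unnamed.columns)
labels_named = list(named.columns)
equal_unnamed = unnamed.equals(expected)
equal_named = named.equals(expected)
```

Only the frame read with names=["column"] compares equal to the original, even though both frames hold the same values.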


 