
Pandas read_csv - How to handle a comma inside double quotes that are themselves inside double quotes

This is not the same question as "double quoted elements in csv cant read with pandas".

The difference is that in that question, "ABC,DEF" was breaking the code.

Here, "ABC "DE" ,F" is breaking the code.

The whole string should be parsed as 'ABC "DE", F'. Instead, the inner double quotes lead to the issue described below.

I am working with a CSV file that contains entries of the following type:

header1, header2, header3,header4

2001-01-01,123456,"abc def",V4

2001-01-02,789012,"ghi "jklm" n,op",V4

The second row of data breaks the code with the following error:

ParserError: Error tokenizing data. C error: Expected 4 fields in line 1234, saw 5

I have tried various sep, delimiter, and quoting arguments, but nothing seems to work.

Can someone please help with this? Thank you!

Based on the two rows you have provided, here is one option: read the text file into a Series object, then use regex extraction via Series.str.extract() to get the information you want into a DataFrame:

import pandas as pd

# Read the raw lines directly instead of going through read_csv
with open('so.txt') as f:
    contents = f.readlines()

s = pd.Series(contents)

s now looks like the following:

0    header1, header2, header3,header4\n
1    \n
2    2001-01-01,123456,"abc def",V4\n
3    \n
4    2001-01-02,789012,"ghi "jklm" n,op",V4

Now you can use regex extraction to get what you want into a DataFrame:

# raw string so that \w reaches the regex engine unchanged; one capture group per column
df = s.str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$')

# remove empty rows
df = df.dropna(how='all')

df looks like the following:

            0       1                   2   3
2  2001-01-01  123456           "abc def"  V4
4  2001-01-02  789012  "ghi "jklm" n,op"  V4

You can then set the column names with df.columns = ['header1', 'header2', 'header3', 'header4'].
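
Putting the steps above together, here is a minimal end-to-end sketch. It assumes the file contains only rows shaped like the two in the question, and it replaces so.txt with an in-memory list of lines so it runs without a file on disk. The final step, stripping the outer pair of quotes from the third column, is an addition that goes beyond the steps above; it assumes those outer quotes are only CSV wrapping and not part of the value.

```python
import pandas as pd

# Sample lines copied from the question (stand-in for reading so.txt).
lines = [
    'header1, header2, header3,header4\n',
    '\n',
    '2001-01-01,123456,"abc def",V4\n',
    '\n',
    '2001-01-02,789012,"ghi "jklm" n,op",V4',
]

# One capture group per column: date, integer id, free text, two-character code.
pattern = r'^([0-9]{4}-[0-9]{2}-[0-9]{2}),([0-9]+),(.+),(\w{2})$'

# Non-matching lines (the header and the blank lines) become all-NaN rows and are dropped.
df = pd.Series(lines).str.extract(pattern).dropna(how='all')
df.columns = ['header1', 'header2', 'header3', 'header4']

# Assumption: the outermost quotes are CSV wrapping, so remove exactly that pair
# while leaving the inner quotes and the inner comma untouched.
df['header3'] = df['header3'].str.replace(r'^"(.*)"$', r'\1', regex=True)
df = df.reset_index(drop=True)

print(df)
```

All four columns come out as strings, so convert them afterwards if you need dates or integers. The pattern is tied to this row layout: a date, a number, free text, and a two-character code. A file with differently shaped columns would need its own pattern.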
