
Python csv module: read a CSV split by commas but ignore commas inside double or single quotes

I have a .csv file in which some column values contain commas. Examples are below:

Header: ID     Value           Content                                            Date
        1      34             "market, business"                               12/20/2013
        2      15             "market, business", yesterday, metric            11/21/2014
        3      18             "market," business and yesterday                 10/20/2014
        4      19              yesterday, today,                               11/22/2014

That is the layout of the .csv file. When I open it in Sublime Text, it looks like this:

1, 34, "market, business", 12/20/2013
2, 15, "market, business", "yesterday, metric, 11/21/2014
3, 18, "market," business and yesterday, 10/20/2014
4, 19, yesterday, today, 11/22/2014

But what I want the Python csv reader to produce is:

[1, 34, "market, business", 12/20/2013]
[2, 15, "market, business" "yesterday metric, 11/21/2014]
[3, 18, "market," business and yesterday, 10/20/2014]
[4, 19, yesterday today, 11/22/2014]

This is just sample data. The "Content" column is the headache here, because the csv module uses "," as the separator. I used

reader = csv.reader(f, skipinitialspace=True)

It works for the first row, where the whole value is inside one pair of double quotes. But it doesn't work for the second and third rows, where there are commas outside the quotes (single or double).
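To make the difference concrete, here is a small sketch that feeds two of the sample rows to `csv.reader` from an in-memory list rather than a file (the list name `lines` is only for illustration). The row whose commas are all inside one pair of double quotes comes back as four fields, while the row with unquoted commas is split into five.

```python
import csv

# Two of the sample rows from the question, held in memory instead of a file.
lines = [
    '1, 34, "market, business", 12/20/2013',
    '4, 19, yesterday, today, 11/22/2014',
]

# csv.reader accepts any iterable of strings, not only file objects.
rows = list(csv.reader(lines, skipinitialspace=True))

for row in rows:
    # Print the field count next to the parsed fields.
    print(len(row), row)
```

Note that the reader also strips the surrounding double quotes from the quoted field, so even the row that parses into four fields does not keep the quotes shown in the desired output.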

How can I solve this problem? Right now I'm just using the standard csv module in Python. Can pandas solve it?

Thanks.

I made some updates. I think what I want is a way to treat commas differently depending on where they appear... Now that I've pasted it here, that seems unreasonable, because I can't find any way in the csv module to tell a separator "," from a "," inside a field. Even Excel can't...

Any ideas?

If we can assume

  • each line begins with two ints separated by commas,
  • each line ends with a date, separated from the rest by a comma,
  • everything remaining (in the middle) belongs in the third column

then your data could be parsed this way:

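data = []
with open('data') as f:
    for line in f:
        # Split off the first two fields (Id, Value); the rest stays in parts[2].
        parts = line.split(',', 2)
        # Split the remainder once from the right to separate Content from Date.
        parts[2:4] = parts[2].rsplit(',', 1)
        # Convert Id and Value to integers.
        parts[:2] = map(int, parts[:2])
        # Trim surrounding whitespace (including the newline) from Content and Date.
        parts[2:] = map(str.strip, parts[2:])
        data.append(parts)

for row in data:
    print(row)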

yields

[1, 34, '"market, business"', '12/20/2013']
[2, 15, '"market, business", "yesterday, metric', '11/21/2014']
[3, 18, '"market," business and yesterday', '10/20/2014']
[4, 19, 'yesterday, today', '11/22/2014']

You could then make a DataFrame like this:

import pandas as pd
df = pd.DataFrame(data, columns=['Id','Value','Content','Date'])
print(df)

yields

   Id  Value                                 Content        Date
0   1     34                      "market, business"  12/20/2013
1   2     15  "market, business", "yesterday, metric  11/21/2014
2   3     18        "market," business and yesterday  10/20/2014
3   4     19                        yesterday, today  11/22/2014
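If some lines might not fit the three assumptions (blank lines, a missing date, non-numeric leading fields), the same idea can be written as one regular expression that rejects such lines instead of raising. This is only a sketch: the pattern `LINE_RE`, the helper `parse_line`, and the choice to return `None` for bad lines are illustrative assumptions, not part of the csv module or pandas, and the date format `%m/%d/%Y` is inferred from the sample data.

```python
import re
from datetime import datetime

# Hypothetical pattern: two integers, a free-form middle, then an m/d/Y date.
LINE_RE = re.compile(
    r'^\s*(\d+)\s*,\s*(\d+)\s*,(.*),\s*(\d{1,2}/\d{1,2}/\d{4})\s*$'
)

def parse_line(line):
    """Return [id, value, content, date], or None if the line does not fit."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    id_, value, content, date_text = match.groups()
    try:
        # Assumes month/day/year, as in the sample data.
        when = datetime.strptime(date_text, '%m/%d/%Y').date()
    except ValueError:
        # Looks like a date but is not a valid one (e.g. month 13).
        return None
    return [int(id_), int(value), content.strip(), when]

sample = [
    '1, 34, "market, business", 12/20/2013',
    '3, 18, "market," business and yesterday, 10/20/2014',
    '4, 19, yesterday, today, 11/22/2014',
    '',  # blank line: skipped rather than raising
]

# Keep only the lines that matched the expected shape.
parsed = [row for row in map(parse_line, sample) if row is not None]

for row in parsed:
    print(row)
```

Because the middle group is greedy, it keeps every comma up to the last one before the date, which mirrors what `rsplit(',', 1)` does above. The resulting rows can be passed to `pd.DataFrame` in the same way as `data`.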
