pandas.read_csv(): error tokenizing data because of commas inside the data

Question

I am having trouble reading a csv whose row values contain commas.

An example row containing the data that (afaik) causes the problem:

['true',47,'y','descriptive_evidence','n','true',66,[81,65]]

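I think the [81,65] entry is scanned literally and is therefore treated as two entries, [81 and 65]. Is there a way to override this in pandas, or do I have to replace the commas manually before reading the data into a DataFrame?

From reading other answers, I know it is possible to skip rows with something like error_bad_lines=False, but in this case I cannot skip these entries.

Best wishes :)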

Answer 1

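You can try passing a regular expression as sep, but that forces the python engine instead of the c engine, which can cost more memory and time. If you want to go that route, here is a solution for a file like this: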

1,2,3,4,5,6,7,8
'true',47,'y','descriptive_evidence','n','true',66,[81,65]

import pandas as pd
pd.read_csv("./file_name.csv", sep=r",(?![^[]*\])", engine="python")  # split only on commas that are not inside [...]

|     | 1      | 2   | 3   | 4                      | 5   | 6      | 7   | 8       |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0   | 'true' | 47  | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66  | [81,65] |
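
To try this without a file on disk, the same separator can be run against an in-memory copy of the sample row. This is only a sketch: `io.StringIO` stands in for the file, and the `ast.literal_eval` step at the end is an extra that is not part of the answer above; it assumes the bracketed field is a valid Python list literal. The pattern also assumes the brackets are never nested.

```python
import ast
import io

import pandas as pd

# In-memory stand-in for the sample file shown above.
sample = (
    "1,2,3,4,5,6,7,8\n"
    "'true',47,'y','descriptive_evidence','n','true',66,[81,65]\n"
)

# Split on a comma only when it is not inside a [...] group: the negative
# lookahead rejects commas that are followed by a ']' with no '[' in between.
df = pd.read_csv(io.StringIO(sample), sep=r",(?![^[]*\])", engine="python")

# The bracketed field arrives as the string "[81,65]"; turn it into a list.
# (Extra step, not part of the original answer.)
parsed = [ast.literal_eval(value) for value in df["8"]]

print(df.shape)
print(parsed)
```

Note that the single quotes around values such as 'true' stay in the data with this approach, as the table above shows.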

Answer 2

This approach normalizes your file a bit and then loads it into pandas.

A sample file:

['Bool','low_number','char','string','char2','bool','high_number','list_using_quotechar']
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]

Code to normalize the file and load it:

import pandas as pd

with open('data_with_quote.csv') as original_file:
    with open('data_fixed.csv', 'w') as new_file:
        for line in original_file:
            line = line.rstrip('\n')  # remove the newline so every line ends the same way
            line = line[1:-1]  # remove the first and last characters, '[' and ']' respectively
            line = line.replace('[', '"')  # replace '[' with a quote character that pandas understands
            line = line.replace(']', '"')  # replace ']' with the same quote character
            new_file.write(line + '\n')

your_data_as_df = pd.read_csv('data_fixed.csv', quotechar='"')  # load the file using the quote character from above
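
If writing a second file is not desirable, the same normalization could be done in memory. The following is a minimal sketch under a few assumptions: `normalize_line` is a hypothetical helper name, the sample text is the file shown above, and the last two steps (stripping the single quotes from the column names and converting "81,65" into integers) are additions that the answer itself does not perform.

```python
import io

import pandas as pd

# In-memory copy of the sample file from this answer.
raw = """['Bool','low_number','char','string','char2','bool','high_number','list_using_quotechar']
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
"""


def normalize_line(line):
    """Hypothetical helper: drop the outer brackets, quote the inner list."""
    inner = line.strip()[1:-1]  # remove the outer '[' and ']'
    return inner.replace("[", '"').replace("]", '"')  # [81,65] -> "81,65"


fixed = "\n".join(normalize_line(l) for l in raw.splitlines() if l.strip())
df = pd.read_csv(io.StringIO(fixed), quotechar='"')

# Extra steps, not in the original answer: tidy the header names and
# convert the quoted "81,65" strings into lists of integers.
df.columns = [name.strip("'") for name in df.columns]
numbers = [[int(x) for x in value.split(",")] for value in df["list_using_quotechar"]]

print(df.shape)
print(numbers)
```

As in the file-based version, string values such as 'true' keep their single quotes, because only the double quote is registered as the quote character.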

Answer 1: 1 (accepted), 2022-06-30 14:37:22

Answer 2: 0, 2022-06-30 14:38:18
