Removing duplicate rows from a .csv file with Python when all values in the row are identical

Question

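I'm working with highly unstructured .csv reports and am struggling with the drop_duplicates function. My dataset has 4084 rows and 39 columns.

My task is fairly simple: I want to use drop_duplicates so that it removes every row in which all 39 column values are identical to another row's, and nothing else.

I tried the following code block, in which the new file without duplicates would be saved as "crm_pre_eidup", but I just get TypeError: 'tuple' object is not callable.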

import pandas as pd
from csv import reader
crm_preprocessed = "CRM_kaikki_data_Pekka1.csv"
crm_pre_eidup = "CRM_kaikki_data_eidup.csv"

df = pd.read_csv(file_name, sep="\t or ,", engine='python')

# Notes:
# - the `subset=None` means that every column is used 
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone  
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
#df=pd.DataFrame(list(reader(crm_pre_eidup)))
df.to_csv(crm_pre_eidup)
df.head()

I'm fairly sure the solution simply lies in using: DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)

Answer 1

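Can you try the following changes?

Separate the multiple delimiters with | because a separator longer than one character is interpreted as a regular expression
Pass crm_preprocessed to read_csv; file_name is never defined in your code
drop_duplicates uses all columns by default, so you can drop the subset argument
Make sure your working directory is set correctly, or specify the full path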

import pandas as pd

crm_preprocessed = "CRM_kaikki_data_Pekka1.csv"
crm_pre_eidup = "CRM_kaikki_data_eidup.csv"

# A separator longer than one character is treated as a regex, so
# '\t|,' splits on either a tab or a comma (needs the python engine)
df = pd.read_csv(crm_preprocessed, sep=r'\t|,', engine='python')

# Notes:
# - with no `subset` argument, every column is used
#   to determine if two rows are different; to change that specify
#   the columns as a list
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone
df.drop_duplicates(inplace=True)

# Write the results to a different file
df.to_csv(crm_pre_eidup)
df.head()
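
To check the behaviour without touching the real report, here is a minimal sketch that runs the same two calls on a small in-memory sample. The sample rows and column names are made up for illustration; only the sep pattern and the drop_duplicates call are taken from the code above.

```python
import io

import pandas as pd

# Hypothetical sample that mixes tabs and commas as separators.
# Rows 1 and 2 are identical in every column; row 3 differs in "city" only.
raw = (
    "id\tname,city\n"
    "1\tAnna,Helsinki\n"
    "1\tAnna,Helsinki\n"
    "1\tAnna,Turku\n"
    "2,Pekka\tEspoo\n"
)

# The regex separator splits on either a tab or a comma.
df = pd.read_csv(io.StringIO(raw), sep=r"\t|,", engine="python")

# With no arguments, only rows that match in all columns are dropped.
deduped = df.drop_duplicates()

print(len(df), len(deduped))
print(deduped)
```

The row that differs only in one column survives, which is the "all 39 columns must match" behaviour asked for in the question.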

References: "Multiple delimiters in single CSV file" and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
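
For the other parameters in the drop_duplicates signature quoted in the question, a short sketch on a made-up frame may help show what each one changes. The data here is hypothetical, and ignore_index assumes pandas 1.0 or newer.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are fully identical.
df = pd.DataFrame({"a": [1, 1, 2, 1], "b": ["x", "x", "y", "z"]})

# Default: compare all columns, keep the first copy of each duplicate.
first = df.drop_duplicates()

# keep=False drops every copy of a duplicated row, not just the extras.
none_kept = df.drop_duplicates(keep=False)

# ignore_index=True renumbers the result 0..n-1.
renumbered = df.drop_duplicates(ignore_index=True)

# subset restricts the comparison to the listed columns.
by_a = df.drop_duplicates(subset=["a"])

print(first, none_kept, renumbered, by_a, sep="\n\n")
```

For the task in the question, the default call with no arguments is the one that matches "remove a row only when all columns are identical".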
