Removing duplicates when all values in .csv file row are identical with Python
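I am working with highly unstructured .csv reports and am struggling with the drop_duplicates function. My dataset has 4084 rows and 39 columns.
My task is fairly simple: I want to use drop_duplicates so that it removes every row whose 39 column values are all identical to those of another row, and nothing else.
I tried the following code block, in which the new file without duplicates should be saved as "crm_pre_eidup", but I just get "TypeError: 'tuple' object is not callable".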
import pandas as pd
from csv import reader
crm_preprocessed = "CRM_kaikki_data_Pekka1.csv"
crm_pre_eidup = "CRM_kaikki_data_eidup.csv"
df = pd.read_csv(file_name, sep="\t or ,", engine='python')
# Notes:
# - the `subset=None` means that every column is used
# to determine if two rows are different; to change that specify
# the columns as an array
# - the `inplace=True` means that the data structure is changed and
# the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)
# Write the results to a different file
#df=pd.DataFrame(list(reader(crm_pre_eidup)))
df.to_csv(crm_pre_eidup)
df.head()
I am fairly sure the solution simply lies in using: DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)
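To make the defaults in that signature concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical, not taken from the CRM report). It assumes only that pandas is installed, and shows that with `subset=None` a row is dropped only when every column matches an earlier row.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical in every column,
# while rows 2 and 3 share "id" and "name" but differ in "city".
df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "name": ["Anna", "Anna", "Ben", "Ben"],
        "city": ["Oulu", "Oulu", "Turku", "Espoo"],
    }
)

# subset=None (the default) compares all columns;
# keep="first" (the default) keeps the first copy of each duplicate group.
deduped = df.drop_duplicates()

# duplicated() flags exactly the rows that drop_duplicates() would remove.
n_removed = int(df.duplicated().sum())

print(deduped)
print("rows removed:", n_removed)
```

Rows 2 and 3 both survive because they differ in one column, which is the behaviour asked for in the question. Returning a new DataFrame instead of using `inplace=True` is only a style choice in this sketch.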
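Could you try the following changes? read_csv now receives crm_preprocessed rather than file_name, and the separator is the regular expression '\t|,' (tab or comma).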
import pandas as pd
from csv import reader
crm_preprocessed = "CRM_kaikki_data_Pekka1.csv"
crm_pre_eidup = "CRM_kaikki_data_eidup.csv"
df = pd.read_csv(crm_preprocessed , sep='\t|,', engine='python')
# Notes:
# - the `subset=None` means that every column is used
# to determine if two rows are different; to change that specify
# the columns as an array
# - the `inplace=True` means that the data structure is changed and
# the duplicate rows are gone
df.drop_duplicates(inplace=True)
# Write the results to a different file
#df=pd.DataFrame(list(reader(crm_pre_eidup)))
df.to_csv(crm_pre_eidup)
df.head()
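The separator change can be checked without the real report. The sketch below reads a small in-memory sample (the contents of `raw` are invented for illustration) whose rows mix tabs and commas, then removes fully identical rows. Passing `index=False` to `to_csv` is an addition of this sketch, not part of the code above; it keeps the row index from being written as an extra column.

```python
import io

import pandas as pd

# Hypothetical sample: the header uses commas, the data rows mix tabs and commas.
raw = "a,b,c\n1\t2,3\n1\t2,3\n4,5\t6\n"

# A separator longer than one character is interpreted as a regular expression,
# which is why engine="python" is specified. r"\t|," matches a tab or a comma.
df = pd.read_csv(io.StringIO(raw), sep=r"\t|,", engine="python")

# Drop rows that are identical in every column.
deduped = df.drop_duplicates()

# With no path given, to_csv returns the CSV text; index=False omits the row index.
out = deduped.to_csv(index=False)
print(out)
```

With a real file, the same calls would take the input and output paths in place of the `io.StringIO` object and the returned string.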
References: "Multiple delimiters in single CSV file" and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html