
Removing duplicates when all values in .csv file row are identical with Python

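I am working with a highly unstructured .csv report and am struggling with the drop_duplicates function. My dataset has 4084 rows and 39 columns.

My task is fairly simple: I want to use drop_duplicates so that it removes every row whose 39 column values are all identical to those of another row, but nothing else.

I tried the following block of code, where the new file without duplicates would be saved as "crm_pre_eidup", but I just get TypeError: 'tuple' object is not callable.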

import pandas as pd
from csv import reader
crm_preprocessed = "CRM_kaikki_data_Pekka1.csv"
crm_pre_eidup = "CRM_kaikki_data_eidup.csv"

df = pd.read_csv(file_name, sep="\t or ,", engine='python')

# Notes:
# - the `subset=None` means that every column is used 
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone  
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
#df=pd.DataFrame(list(reader(crm_pre_eidup)))
df.to_csv(crm_pre_eidup)
df.head()

I'm fairly sure the solution lies simply in using: DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)

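Could you try the following changes?

  • Multiple separators need to be joined with |, because a separator longer than one character is interpreted as a regular expression
  • file_name is never defined; pass crm_preprocessed to read_csv instead
  • drop_duplicates uses all columns by default, so you can drop the subset argument
  • Make sure your working folder is set correctly, or specify the full path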


import pandas as pd

crm_preprocessed = "CRM_kaikki_data_Pekka1.csv"
crm_pre_eidup = "CRM_kaikki_data_eidup.csv"

# A separator longer than one character is treated as a regular expression,
# so '\t|,' splits on either a tab or a comma (this needs engine='python')
df = pd.read_csv(crm_preprocessed, sep='\t|,', engine='python')

# Notes:
# - by default (`subset=None`) every column is used
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone
df.drop_duplicates(inplace=True)

# Write the results to a different file
df.to_csv(crm_pre_eidup)
df.head()
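
To check the behaviour without touching the real report, here is a minimal sketch on made-up data. The sample rows, column names and variable names below are assumptions for illustration only; they are not taken from the CRM file. It reads text that mixes tabs and commas, drops only the rows that match another row in every column, and writes the result without the index column.

```python
import io

import pandas as pd

# Hypothetical sample data: each line mixes a tab and a comma as separators.
# Rows 0 and 1 are identical in every column; row 2 differs only in "city".
raw = (
    "id\tname,city\n"
    "1\tAnna,Helsinki\n"
    "1\tAnna,Helsinki\n"
    "1\tAnna,Turku\n"
    "2\tPekka,Espoo\n"
)

# The regular expression r"\t|,", used with the python engine, splits on
# either a tab or a comma.
df = pd.read_csv(io.StringIO(raw), sep=r"\t|,", engine="python")

# With no `subset`, all columns are compared, so only fully identical rows
# are dropped. The default keep="first" keeps the first occurrence.
deduped = df.drop_duplicates()

# ignore_index=True renumbers the remaining rows 0..n-1 (pandas >= 1.0).
deduped_reset = df.drop_duplicates(ignore_index=True)

# index=False leaves the row index out of the written CSV. An in-memory
# buffer stands in for the output file here.
out = io.StringIO()
deduped.to_csv(out, index=False)

print(len(df), "rows before,", len(deduped), "rows after")
print(out.getvalue())
```

Row 2 survives because it differs in one column, which is the "nothing else" behaviour asked for in the question. Whether to pass index=False depends on whether the row numbers are wanted in the output file.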

References: Multiple delimiters in single CSV file; https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

