How to prevent de-duplication of a particular combination of entries in an array in Python?

I am writing a piece of Python code that works with arrays. I am loading data row-wise from a CSV file into my array. The data looks somewhat like this:

aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh

The first and third rows do not match exactly, but my columns of interest are column 4 and column 8. That is, if two rows have the same data in these columns, as in the example, they should be treated as duplicate entries, so my array should contain only the first and second rows and not the third.

result=[]
for file in input_file:
    f=open(file,'r')
    reader = csv.reader(f, quotechar='"')#read csv 
    for row in reader:
        if row:
            #do some operations on the elements of row
                if(row[3] and row[7] not in result):#
                    result.append(row)#load result in array
                else:
                    continue

I expect the result array to look like this:

aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888

whereas the actual output is:

aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh

1: Load your CSV using pandas.
2: Pick the columns you are interested in.
3: Call drop_duplicates() on the DataFrame, passing those columns as the subset.

See https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/ for details.

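import pandas as pd

df = pd.read_csv("YOUR_FILE_NAME")
# keep="first" retains the first row of each duplicate group;
# keep=False would drop every row that has a duplicate, including the first one.
df.drop_duplicates(subset=["first_interested_column", "second_interested_column"],
                   keep="first", inplace=True)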

The data you want to examine for duplicates is a pair of values (columns 3 and 7, using zero-based numbering). A set named seen is often used for that purpose. The basic idea is:

seen = set()
for row in reader:
    data = (row[3], row[7])
    if data in seen:
        continue
    seen.add(data)
    # process row
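
The fragment above can be fleshed out into a small runnable sketch. The function name dedupe_rows and its key_columns parameter are hypothetical names introduced here for illustration, and the sample data is the one from the question. It assumes every non-empty row has at least eight columns.

```python
import csv
from io import StringIO

# Sample data from the question; in real use this would be an open file object.
SAMPLE = """\
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
"""


def dedupe_rows(lines, key_columns=(3, 7)):
    """Return the rows whose key-column values were not seen earlier (first occurrence wins)."""
    seen = set()
    result = []
    for row in csv.reader(lines, quotechar='"'):
        if not row:
            continue  # skip blank lines
        key = tuple(row[i] for i in key_columns)
        if key in seen:
            continue  # same key columns as an earlier row: treat as duplicate
        seen.add(key)
        result.append(row)
    return result


rows = dedupe_rows(StringIO(SAMPLE))
for row in rows:
    print(",".join(row))
```

Because membership tests on a set take roughly constant time, this stays fast even when the number of rows is large.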

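The problem with your code is that the test for duplicates is incorrect. Python parses row[3] and row[7] not in result as row[3] and (row[7] not in result), and because result holds whole rows (lists), a single string such as row[7] is never found in it.
Here's a version that I think does it correctly: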

import csv
from io import StringIO
from pprint import pprint

input_file = ['''
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
''',]

result=[]
for file in input_file:
#    f=open(file,'r')
    f = StringIO(file)
    reader = csv.reader(f, quotechar='"')  # read csv
    for row in reader:
        if row and not any((row[3] == r[3] and row[7] == r[7]) for r in result):
            result.append(row)  # load result in array

pprint(result)

Output:

[['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh'],
 ['111', '222', '333', '444', '555', '666', '777', '888']]
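
To see why the original condition lets the third row through, here is a short sketch that evaluates it with the first row already stored. The variable names original_test and is_duplicate are introduced only for this illustration.

```python
# State after the first row has been appended, and the third row about to be tested.
result = [["aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh"]]
row = ["abb", "acc", "add", "ddd", "vvv", "bxc", "nyc", "hhh"]

# Python parses the original condition as: row[3] and (row[7] not in result)
original_test = row[3] and row[7] not in result

# result contains lists, so the string "hhh" is never one of its elements.
print(row[7] not in result)  # True
print(original_test)         # True, so the duplicate row gets appended

# Comparing both key columns against each stored row gives the intended answer.
is_duplicate = any(row[3] == r[3] and row[7] == r[7] for r in result)
print(is_duplicate)          # True
```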
