
Read the csv file and add only new entries in another csv file

I have a csv file, and both duplicate and unique data get added to it on a daily basis. This results in too many duplicates, and I have to remove the duplicates based on specific columns. For example:

csvfile1:

title1 title2 title3 title4 title5
abcdef 12     13     14     15
jklmn  12     13     56     76
abcdef 12     13     98     89
bvnjkl 56     76     86     96

Now, based on title1, title2 and title3, I have to remove duplicates and add only the unique entries to a new csv file. As you can see, the abcdef row is not unique and repeats based on title1, title2 and title3, so it should be removed, and the output should look like:

Expected Output CSV File:

title1 title2 title3 title4 title5
jklmn  12     13     56     76
bvnjkl 56     76     86     96

My tried code is here below - CSV input file:

import csv

f = open("1.csv", 'a+')
writer = csv.writer(f)
writer.writerow(("t1", "t2", "t3"))

a = [["a", 'b', 'c'], ["g", "h", "i"], ['a', 'b', 'c']] # this list changes daily, so new and duplicate data get added daily

for i in range(2):
    writer.writerow((a[i]))

f.close()

Duplicate removal script:

import csv

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen: continue # skip duplicate
        out_file.write(line)

My Output: 2.csv:

t1 t2 t3
a  b  c
g  h  i

Now, I do not want abc in 2.csv, since it is duplicated based on t1 and t2; only the ghi row, which is unique based on t1 and t2, should be written.

Some issues in your code -

  1. In the python file that creates the csv, you are only iterating till range(2); range is not inclusive of its end, so it only writes the first two rows of a into the csv, not the third one. You can iterate directly over the list rather than indexing each element. Also, you do not need that many brackets in writer.writerow(). Example -

     for i in a: writer.writerow(i)
  2. In your duplicate removal script, you are actually never adding anything into seen, so you would never end up removing anything (as written, line not in seen is always true, so every line is skipped by continue). When you want to remove duplicates based on a subset of the elements of a list, add just those elements (in a specific order) to the seen set as a tuple, not a list, since set() only accepts hashable elements. Then, when checking for containment in the set, check only that subset that you added. Example -

     import csv

     with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
         seen = set()
         seentwice = set()
         reader = csv.reader(in_file)
         writer = csv.writer(out_file)
         rows = []
         for row in reader:
             if (row[0], row[1]) in seen:
                 seentwice.add((row[0], row[1]))
             seen.add((row[0], row[1]))
             rows.append(row)
         for row in rows:
             if (row[0], row[1]) not in seentwice:
                 writer.writerow(row)

This would completely remove any rows that are duplicated based on the first and second column. It would not store even a single copy of such rows, and I am guessing that is what you want.

seen - set - This is used to store the (row[0], row[1]) key of every row that we have already seen.

seentwice - set - This set is only populated with a row's key if we encounter a row whose key was already added to seen, which means that that row is duplicated.
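To make the two sets concrete, here is a minimal runnable trace (not part of the original answer) using the rows that the fixed input script above would produce; the rows list is hard-coded here purely for illustration -

     rows = [['t1', 't2', 't3'], ['a', 'b', 'c'], ['g', 'h', 'i'], ['a', 'b', 'c']]

     seen = set()
     seentwice = set()
     for row in rows:
         key = (row[0], row[1])   # dedup key: first two columns
         if key in seen:
             seentwice.add(key)   # key met a second time -> row is duplicated
         seen.add(key)

     print(seen)       # {('t1', 't2'), ('a', 'b'), ('g', 'h')} (order may vary)
     print(seentwice)  # {('a', 'b')}
     print([r for r in rows if (r[0], r[1]) not in seentwice])
     # [['t1', 't2', 't3'], ['g', 'h', 'i']]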

Now, in the end, we only want to write the rows that are not inside seentwice, since any row in seentwice is duplicated - there are at least two different rows with the same values at row[0] and row[1].
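As a side note, if you are able to use pandas, the same "drop every copy of a duplicated key" behaviour is available in one line with drop_duplicates and keep=False - a sketch, assuming 1.csv has the t1/t2/t3 header row shown earlier:

     import pandas as pd

     df = pd.read_csv('1.csv')
     # keep=False drops *all* rows sharing a duplicated key, not just the later copies
     df.drop_duplicates(subset=['t1', 't2'], keep=False).to_csv('2.csv', index=False)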
