Python extract unique values from CSV

I am using the following Python script to remove duplicates from a CSV file:

with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

I am trying to modify it so that, instead of writing the de-duplicated list to final.csv, it outputs only the unique values that were found.

Kind of the opposite of what it does now. Anyone got an example?

Use a dict to keep track of how many times each line occurs. You can then process the dict, add only the lines that occur exactly once to the seen set, and write those to final.csv:

from collections import defaultdict
uniques = defaultdict(int)  # line -> number of occurrences
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    seen = set() # lines that occur exactly once
    for line in in_file:
        uniques[line] += 1
    for k, v in uniques.items():  # Python 3; use iteritems() on Python 2
        if v == 1:  # keep only lines seen exactly once
            seen.add(k)
            out_file.write(k)

Or:

from collections import defaultdict
uniques = defaultdict(int)
with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    for line in in_file:
        uniques[line] += 1

    # keep only the lines that occurred exactly once (a set does not preserve file order)
    seen = set(k for k in uniques if uniques[k] == 1)
    for itm in seen:
        out_file.write(itm)

Or, using Counter:

from collections import Counter

with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    lines = Counter(in_file.readlines())  # line -> number of occurrences
    seen = set(k for k in lines if lines[k] == 1)  # lines that occur exactly once
    for itm in seen:
        out_file.write(itm)

This outputs only the lines that appear exactly once. Depending on what you mean by "unique", that may or may not be what you want. If you instead want every distinct line written once, however many times it occurs, adapt the last method:

with open('test.csv','r') as in_file, open('final.csv','w') as out_file:
    lines = Counter(in_file.readlines())  # Counter imported from collections, as above
    for itm in lines:  # each distinct line, written once
        out_file.write(itm)
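
To make the difference between the two readings of "unique" concrete, here is a small sketch that runs on an in-memory list instead of a file. The sample data and the names once_only and deduplicated are made up for illustration. It relies on Counter keeping keys in first-seen order, which holds on Python 3.7 and later.

```python
from collections import Counter

# Hypothetical sample standing in for the lines of test.csv.
sample = ["x\n", "y\n", "x\n", "z\n"]
counts = Counter(sample)  # line -> number of occurrences

# Reading 1: lines that occur exactly once in the input.
once_only = [line for line, n in counts.items() if n == 1]

# Reading 2: every distinct line, written a single time.
deduplicated = list(counts)

print(once_only)     # ['y\n', 'z\n']
print(deduplicated)  # ['x\n', 'y\n', 'z\n']
```

Building lists from the Counter, rather than a set, keeps the output in the same order as the input file.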

You can collect the duplicates in another variable and use them to remove the non-unique values from the set.
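
One possible reading of that suggestion, as a rough sketch: track every line in one set and every repeated line in a second set, then drop the repeated ones. The function name lines_appearing_once and the sample data are hypothetical. The input is assumed to be a list (for example from in_file.readlines()), because it is iterated twice.

```python
def lines_appearing_once(lines):
    """Return the lines that occur exactly once, in first-seen order."""
    seen = set()        # every line encountered so far
    duplicates = set()  # lines encountered more than once
    for line in lines:
        if line in seen:
            duplicates.add(line)
        else:
            seen.add(line)
    # Filter the original list rather than computing seen - duplicates,
    # so the result keeps the order of the input.
    return [line for line in lines if line not in duplicates]


sample = ["a,1\n", "b,2\n", "a,1\n", "c,3\n"]
print(lines_appearing_once(sample))  # ['b,2\n', 'c,3\n']
```

If order does not matter, seen - duplicates gives the same lines as a set.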
