Extract a list without duplicates from a CSV file

I have a dataset which looks like this:

id,created_at,username
1,2006-10-09T18:21:51Z,hey
2,2007-10-09T18:30:28Z,bob
3,2008-10-09T18:40:33Z,bob
4,2009-10-09T18:47:42Z,john
5,2010-10-09T18:51:04Z,brad
...

It contains 1M+ lines. I'd like to extract the list of usernames, without duplicates, from it using Python. So far my code looks like this:

import csv

file1 = file("sample.csv", 'r')
file2 = file("users.csv", 'w')

reader = csv.reader(file1)
writer = csv.writer(file2)

rownum = 0
L = []
for row in reader:
    if not rownum == 0:
        if not row[2] in L:
            L.append(row[2])
            writer.writerow(row[2])

    rownum += 1

I have several questions:

1 - My output in users.csv looks like this:

h,e,y
b,o,b
j,o,h,n
b,r,a,d

How do I remove the commas between each letter?

2 - My code is not very elegant. Is there any way to import the CSV file as a matrix, select the last column, and then use an elegant library (like underscore.js in JavaScript) to remove the duplicates?

Many thanks

You can use a set here; it provides O(1) average-case item lookup, compared to O(N) for lists.

seen = set()
add_  = seen.add
next(reader) #skip header
writer.writerows([row[-1]] for row in reader if row[-1] not in seen
                                                        and not add_(row[-1]))
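
The one-liner above leans on the fact that `set.add` always returns `None`, so `not add_(x)` is always true and is only evaluated when `x` is not yet in `seen`. A minimal sketch of the same idea on an in-memory list follows; the function name `unique_in_order` and the sample data are made up for illustration:

```python
def unique_in_order(items):
    """Return the items without duplicates, in first-seen order."""
    seen = set()
    add_ = seen.add  # bound method; set.add always returns None
    # `x not in seen` short-circuits for duplicates, so add_ runs only for new items;
    # `not add_(x)` is then always True and records x as a side effect.
    return [x for x in items if x not in seen and not add_(x)]


usernames = ["hey", "bob", "bob", "john", "brad"]
print(unique_in_order(usernames))  # ['hey', 'bob', 'john', 'brad']
```

The trick is compact, but it hides a side effect inside a condition; an explicit loop does the same work and may be easier to read.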

And always use the with statement for handling files; it will automatically close the file for you:

with open("sample.csv", 'r') as file1, open("users.csv", 'w') as file2:
    # Do stuff with file1 and file2 here
    pass
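
Putting the two suggestions together (a set for membership tests and `with` for file handling), a minimal sketch might look like the following. It assumes Python 3, that the username is the last column, and that the first row is a header; the function name `extract_unique_usernames` is hypothetical:

```python
import csv


def extract_unique_usernames(in_path, out_path):
    """Write each distinct username from in_path to out_path, in first-seen order."""
    seen = set()
    # newline="" is what the csv module documentation recommends in Python 3
    with open(in_path, "r", newline="") as infile, \
         open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        next(reader, None)  # skip the header row; tolerate an empty file
        for row in reader:
            if not row:
                continue  # ignore blank lines
            username = row[-1]
            if username not in seen:
                seen.add(username)
                writer.writerow([username])  # one-element list, so one field per row
    return len(seen)
```

Calling `extract_unique_usernames("sample.csv", "users.csv")` would then stream the input once, holding only the distinct usernames in memory.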

Change

writer.writerow(row[2])

to

writer.writerow([row[2]])
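
The reason this helps: `writerow` expects a sequence of fields, and a string is itself a sequence of characters, so a bare string is written one character per column. A small sketch that shows the difference, writing to an in-memory buffer instead of a file (the buffer and the `lineterminator` setting are only there to keep the demonstration self-contained):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")

writer.writerow("bob")    # a bare string is iterated character by character
writer.writerow(["bob"])  # a one-element list is written as a single field

print(buf.getvalue())
# b,o,b
# bob
```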

Also, checking for membership in a list is computationally expensive [O(n)]. If you will be checking for membership in a large collection of items, and doing it often, use a set [O(1)]:

L = set()
reader.next() # Skip the header
for row in reader:
    if row[2] not in L:
        L.add(row[2])
        writer.writerow([row[2]])

Alternatively

If you're okay with using a few megabytes of memory, just do this:

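import csv

with open("sample.csv", "rb") as infile:  # "rb"/"wb" is the Python 2 idiom for csv files
    reader = csv.reader(infile)
    reader.next()  # skip the header row
    # keep only the username column; whole rows never repeat because every id is unique
    no_duplicates = set(row[2] for row in reader)

    with open("users.csv", "wb") as outfile:
        # wrap each name in a list so it is written as a single field
        csv.writer(outfile).writerows([name] for name in no_duplicates)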

If order is important, use an OrderedDict instead of a set:

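import csv
from collections import OrderedDict

with open("sample.csv", "rb") as infile:  # "rb"/"wb" is the Python 2 idiom for csv files
    reader = csv.reader(infile)
    reader.next()  # skip the header row
    # keys keep first-seen order; only the username column is deduplicated
    no_duplicates = OrderedDict.fromkeys(row[2] for row in reader)

    with open("users.csv", "wb") as outfile:
        # wrap each name in a list so it is written as a single field
        csv.writer(outfile).writerows([name] for name in no_duplicates)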
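
On Python 3.7 and later, plain dicts also preserve insertion order, so `dict.fromkeys` can stand in for `OrderedDict.fromkeys` for this purpose. A minimal sketch on an in-memory list (the sample data is made up to mirror the question):

```python
from collections import OrderedDict

usernames = ["hey", "bob", "bob", "john", "brad"]

ordered = list(OrderedDict.fromkeys(usernames))
plain = list(dict.fromkeys(usernames))  # relies on Python 3.7+ insertion-ordered dicts

print(ordered)  # ['hey', 'bob', 'john', 'brad']
print(plain)    # ['hey', 'bob', 'john', 'brad']
```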

Easy and short!

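for line in reader:
    string = str(line)              # e.g. "['1', '2006-10-09T18:21:51Z', 'hey']"
    split = string.split(",", 2)    # split on the first two commas only
    username = split[2][2:-2]       # strip the leading " '" and trailing "']"; same value as line[2]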
