Extract a list without duplicates from a CSV file
I have a dataset that looks like this:
id,created_at,username
1,2006-10-09T18:21:51Z,hey
2,2007-10-09T18:30:28Z,bob
3,2008-10-09T18:40:33Z,bob
4,2009-10-09T18:47:42Z,john
5,2010-10-09T18:51:04Z,brad
...
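It contains more than 1M rows. I would like to use Python to extract the list of usernames from it without duplicates. So far my code looks like this: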
import csv
file1 = file("sample.csv", 'r')
file2 = file("users.csv", 'w')
reader = csv.reader(file1)
writer = csv.writer(file2)
rownum = 0
L = []
for row in reader:
    if not rownum == 0:
        if not row[2] in L:
            L.append(row[2])
            writer.writerow(row[2])
    rownum += 1
I have a couple of questions. 1. My output in users.csv looks like this:
h,e,y
b,o,b
j,o,h,n
b,r,a,d
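How can I get rid of the commas between the letters?
2. My code is not very elegant. Is there a way to import the CSV file as a matrix, select the last column, and then remove the duplicates with an elegant library, the way underscore.js does in JavaScript?
Many thanks.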
You can use a set here. It provides O(1) average-case item lookup, compared with O(N) for a list.
seen = set()
add_ = seen.add
next(reader)  # skip header
writer.writerows([row[-1]] for row in reader
                 if row[-1] not in seen and not add_(row[-1]))
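The one-liner above works because set.add returns None, so `not add_(...)` is always true and is there only to record the value. A more explicit sketch of the same idea is a small generator. The name unique_in_order and the sample list are made up for illustration and are not part of the original answer.

```python
def unique_in_order(values):
    """Yield each value the first time it appears, skipping later repeats."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


# Illustrative data: the usernames from the sample rows in the question
usernames = ["hey", "bob", "bob", "john", "brad"]
result = list(unique_in_order(usernames))
print(result)
```

Because it is a generator, it can be fed straight from the reader, for example `unique_in_order(row[-1] for row in reader)`, without building an intermediate list.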
Also, always use a with statement when working with files. It closes the files for you automatically:
with open("sample.csv", newline="") as file1, open("users.csv", "w", newline="") as file2:
    # Do stuff with file1 and file2 here
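Putting the set and the with statement together, a complete run could look roughly like the sketch below. It assumes Python 3. The function name extract_unique_usernames and the temporary directory are illustrative choices, and the sample rows are copied from the question.

```python
import csv
import os
import tempfile


def extract_unique_usernames(src_path, dst_path):
    """Write each username from src_path to dst_path once, in first-seen order."""
    seen = set()
    # newline="" is what the csv module documentation recommends for file objects
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        next(reader, None)  # skip the header row, if there is one
        for row in reader:
            username = row[-1]
            if username not in seen:
                seen.add(username)
                writer.writerow([username])
    return len(seen)


# Small demonstration on a temporary copy of the sample data
sample = (
    "id,created_at,username\n"
    "1,2006-10-09T18:21:51Z,hey\n"
    "2,2007-10-09T18:30:28Z,bob\n"
    "3,2008-10-09T18:40:33Z,bob\n"
    "4,2009-10-09T18:47:42Z,john\n"
    "5,2010-10-09T18:51:04Z,brad\n"
)
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "sample.csv")
    dst_path = os.path.join(tmp, "users.csv")
    with open(src_path, "w", newline="") as f:
        f.write(sample)
    count = extract_unique_usernames(src_path, dst_path)
    with open(dst_path, newline="") as f:
        written = f.read().splitlines()
print(count, written)
```

The file is read one row at a time, so memory use grows with the number of distinct usernames, not with the 1M+ rows.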
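Change
writer.writerow(row[2])
to
writer.writerow([row[2]])
writerow expects a sequence of fields. A bare string is itself a sequence of characters, so each character is written as its own column.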
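The difference is easy to reproduce in memory. This is only a small sketch: the io.StringIO buffer stands in for users.csv.

```python
import csv
import io

buffer = io.StringIO()  # in-memory stand-in for users.csv
writer = csv.writer(buffer)

writer.writerow("hey")    # a string is iterated character by character
writer.writerow(["hey"])  # a one-element list is written as a single field

lines = buffer.getvalue().splitlines()
print(lines)  # ['h,e,y', 'hey']
```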
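Also, checking membership in a list is computationally expensive [O(n)]. If you need to check membership among a large number of items, and do it often, use a set [O(1)]: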
L = set()
next(reader)  # Skip the header
for row in reader:
    if row[2] not in L:
        L.add(row[2])
        writer.writerow([row[2]])
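If the hard-coded index 2 and the manual header skip feel fragile, csv.DictReader can look the column up by name. The following is a sketch rather than part of the original answer. It assumes the header row names the column exactly "username", as in the sample, and uses an in-memory copy of the sample data.

```python
import csv
import io

# In-memory copy of the sample rows from the question
sample = io.StringIO(
    "id,created_at,username\n"
    "1,2006-10-09T18:21:51Z,hey\n"
    "2,2007-10-09T18:30:28Z,bob\n"
    "3,2008-10-09T18:40:33Z,bob\n"
    "4,2009-10-09T18:47:42Z,john\n"
    "5,2010-10-09T18:51:04Z,brad\n"
)

seen = set()
unique_names = []
# DictReader consumes the header row itself and maps each row to {column: value}
for row in csv.DictReader(sample):
    name = row["username"]
    if name not in seen:
        seen.add(name)
        unique_names.append(name)

print(unique_names)
```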
If you can spare a few megabytes of memory, you can do it like this:
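import csv
with open("sample.csv", newline="") as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the header
    # Collect only the username column: whole rows never repeat because id is unique
    no_duplicates = set(row[2] for row in reader)
with open("users.csv", "w", newline="") as outfile:
    csv.writer(outfile).writerows([name] for name in no_duplicates)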
If the order matters, use an OrderedDict instead of a set:
import csv
from collections import OrderedDict
with open("sample.csv", newline="") as infile:
    reader = csv.reader(infile)
    next(reader)  # skip the header
    # Keys keep their first-seen order; only the username column is deduplicated
    no_duplicates = OrderedDict.fromkeys(row[2] for row in reader)
with open("users.csv", "w", newline="") as outfile:
    csv.writer(outfile).writerows([name] for name in no_duplicates)
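On Python 3.7 and later, plain dicts are guaranteed to preserve insertion order, so dict.fromkeys can stand in for OrderedDict.fromkeys. A minimal sketch, with the usernames taken from the sample rows:

```python
# Usernames in file order, taken from the sample rows in the question
usernames = ["hey", "bob", "bob", "john", "brad"]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7)
ordered_unique = list(dict.fromkeys(usernames))

# A set also removes duplicates, but makes no promise about iteration order
unordered_unique = set(usernames)

print(ordered_unique)
```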
Simple and short!
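for line in reader:
    string = str(line)            # e.g. "['1', '2006-10-09T18:21:51Z', 'hey']"
    split = string.split(",", 2)  # split on the first two commas only
    username = split[2][2:-2]     # drop the leading " '" and the trailing "']"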