Find and remove duplicates in a CSV file
I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numeric value. The problem is that some rows are duplicates, but with the two strings swapped. Example:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
The desired output would look like this:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
The third row was removed because it contains the same information as the first row.
Edit
The data basically looks like the sample above (the first two columns are strings, the third is numeric) and has about 40 million rows.
Could you manage with awk?
$ awk -F, '++seen[$3]==1' file
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
Explanation (in awk, a pattern that evaluates to true with no action block prints the current record):
$ awk -F, '    # set comma as the field separator
++seen[$3]==1  # count occurrences of the third field in a hash; only the first occurrence is printed
' file
Update:
$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
It hashes every combination of the first and second fields it meets in a canonical order, so that "ABC,DEF" == "DEF,ABC", counts them, and prints only the first occurrence. The expression ($1<$2?$1 FS $2:$2 FS $1) builds the key: if the first field sorts before the second, it hashes 1st,2nd; otherwise it hashes 2nd,1st.
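The same canonical-key idea ports to a streaming Python sketch. This is only an illustration, not part of the original answer; the file names "in.csv" and "out.csv" are assumptions, and only one key per unique pair is kept in memory:
# Minimal sketch of the awk canonical-key dedup; file names are assumed.
seen = set()
with open("in.csv") as src, open("out.csv", "w") as dst:
    for line in src:
        c1, c2, _ = line.rstrip("\n").split(",", 2)
        key = (c1, c2) if c1 < c2 else (c2, c1)  # order-independent key, like ($1<$2?...)
        if key not in seen:
            seen.add(key)
            dst.write(line)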
From the problem description, a row must be omitted when its first and second fields, concatenated in either order, are no longer unique. If so, the awk below would help:
awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename
Sample input
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123
Sample output
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123
If you want to use the csv library itself, you can use DictReader and DictWriter.
import csv

def main():
    """Read a csv file, delete duplicates and write the result."""
    with open('test.csv', 'r', newline='') as inputfile:
        with open('testout.csv', 'w', newline='') as outputfile:
            duplicatereader = csv.DictReader(inputfile, delimiter=',')
            uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
            uniquewrite.writeheader()
            keysread = set()  # a set makes the membership test O(1) instead of O(n)
            for row in duplicatereader:
                key = (row['date'], row['price'])
                if key not in keysread:
                    print(row)  # debug output for each new unique row
                    keysread.add(key)
                    uniquewrite.writerow(row)

if __name__ == '__main__':
    main()
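The field names above come from an earlier revision of the question (see the note below). A hypothetical adaptation to the question's Col1/Col2/Col3 schema would key on the unordered pair of the first two columns; the file names here are assumptions:
import csv

# Hypothetical adaptation to the Col1/Col2/Col3 schema: key on the
# unordered (Col1, Col2) pair so swapped duplicates collide.
with open('test.csv', newline='') as inputfile, \
        open('testout.csv', 'w', newline='') as outputfile:
    reader = csv.DictReader(inputfile)
    writer = csv.DictWriter(outputfile, fieldnames=['Col1', 'Col2', 'Col3'])
    writer.writeheader()
    seen = set()
    for row in reader:
        key = frozenset((row['Col1'], row['Col2']))
        if key not in seen:
            seen.add(key)
            writer.writerow(row)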
Note: this answer was written before the OP changed the question's tags from python to awk.
If you don't mind the order of the elements, you can do the following:
with open("in.csv", "r") as file:
lines = set()
for line in file:
lines.add(frozenset(line.strip("\n").split(",")))
with open("out.csv", "w") as file:
for line in lines:
file.write(",".join(line)+"\n")
Output:
Col2,Col1,Col3
EFG,454,ABC
DEF,123,ABC
Note that you may want to treat the first line (the header) specially, so that it does not lose its order.
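For instance, a minimal variant of the code above that passes the header through unchanged (the field order of the remaining rows is still not preserved):
# Sketch: write the header line as-is, then deduplicate the rest as above.
with open("in.csv") as file:
    header = next(file)
    lines = set()
    for line in file:
        lines.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as file:
    file.write(header)  # header keeps its original column order
    for line in lines:
        file.write(",".join(line) + "\n")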
However, if the order is important, you can use code that maintains the order of the elements while keying on a frozenset:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    # itertools recipe: yield unique elements, preserving order.
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

with open("in.csv", "r") as file:
    lines = []
    for line in file:
        lines.append(line.strip("\n").split(","))

with open("out.csv", "w") as file:
    for line in unique_everseen(lines, key=frozenset):
        file.write(",".join(line) + "\n")
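As a quick usage check with the unique_everseen defined above, key=frozenset drops the swapped row from the question's sample:
rows = [["ABC", "DEF", "123"], ["ABC", "EFG", "454"], ["DEF", "ABC", "123"]]
print(list(unique_everseen(rows, key=frozenset)))
# [['ABC', 'DEF', '123'], ['ABC', 'EFG', '454']]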
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
The OP said that neither of these two codes seems to work on the large file (1.8 GB). I think the likely reason is that both store the file in a list in RAM, and a 1.8 GB file may take up all the available memory.
To solve that, I made more attempts. Sadly, I must say they are all extremely slow compared with the first ones. The first codes sacrifice RAM consumption for speed, while the following ones sacrifice speed, CPU, and hard-drive usage for lower RAM consumption (instead of holding the whole file in RAM, they take less than 50 MB).
Since all of these examples lean heavily on the hard drive, it is advisable to put the input and output files on different physical drives.
My first attempt at using less RAM is with the shelve module:
import shelve, os

with shelve.open("tmp") as db:
    with open("in.csv", "r") as file:
        for line in file:
            l = line.strip("\n").split(",")
            l.sort()             # canonical order so swapped rows map to the same key
            db[",".join(l)] = l  # the shelf deduplicates on disk instead of in RAM
    with open("out.csv", "w") as file:
        for v in db.values():
            file.write(",".join(v) + "\n")

os.remove("tmp.bak")
os.remove("tmp.dat")
os.remove("tmp.dir")
Sadly, this code takes about a hundred times longer than the first two codes that work in RAM.
Another attempt is:
with open("in.csv", "r") as fileRead:
# total = sum(1 for _ in fileRead)
# fileRead.seek(0)
# i = 0
with open("out.csv", "w") as _:
pass
with open("out.csv", "r+") as fileWrite:
for lineRead in fileRead:
# i += 1
line = lineRead.strip("\n").split(",")
lineSet = set(line)
write = True
fileWrite.seek(0)
for lineWrite in fileWrite:
if lineSet == set(lineWrite.strip("\n").split(",")):
write = False
if write:
pass
fileWrite.write(",".join(line)+"\n")
# if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")
This is slightly faster, but not by much: it rescans everything written so far for every new line, so its cost grows quadratically.
If your computer has several cores, you could try to use multiprocessing:
from multiprocessing import Process, Queue, cpu_count
from os import remove

def slave(number, qIn, qOut):
    name = f"slave-{number}.csv"
    with open(name, "w") as file:
        pass  # truncate this worker's partial file
    with open(name, "r+") as file:
        while True:
            if not qIn.empty():
                get = qIn.get()
                if get is False:  # sentinel: no more work
                    qOut.put(name)
                    break
                else:
                    write = True
                    file.seek(0)
                    for line in file:  # rescan this worker's own output
                        if set(line.strip("\n").split(",")) == get[1]:
                            write = False
                            break
                    if write:
                        file.write(get[0])

def master():
    qIn = Queue(1)
    qOut = Queue()
    slaves = cpu_count()
    slavesList = []

    for n in range(slaves):
        slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
    for s in slavesList:
        s.start()

    with open("in.csv", "r") as file:
        for line in file:
            lineSet = set(line.strip("\n").split(","))
            qIn.put((line, lineSet))

    for _ in range(slaves):
        qIn.put(False)  # one sentinel per worker
    for s in slavesList:
        s.join()

    # Merge: deduplicate the other workers' files into the first one.
    slavesList = []
    with open(qOut.get(), "r+") as fileMaster:
        for x in range(slaves - 1):
            file = qOut.get()
            with open(file, "r") as fileSlave:
                for lineSlave in fileSlave:
                    lineSet = set(lineSlave.strip("\n").split(","))
                    write = True
                    fileMaster.seek(0)
                    for lineMaster in fileMaster:
                        if set(lineMaster.strip("\n").split(",")) == lineSet:
                            write = False
                            break
                    if write:
                        fileMaster.write(lineSlave)
            slavesList.append(Process(target=remove, daemon=True, args=(file,)))
            slavesList[-1].start()
    for s in slavesList:
        s.join()
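One usage note: multiprocessing starts fresh interpreter processes on Windows (and on macOS, where spawn is the default start method), so the entry point must be guarded or the workers would re-execute the module's top-level code:
# Required guard when the start method is "spawn":
if __name__ == "__main__":
    master()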
As you can see, it is my disappointing duty to tell you that both of my attempts are extremely slow. I hope you find a better approach; otherwise, it will take hours or even days to process 1.8 GB of data (the real time mostly depends on the number of repeated values, which reduces the time).
A new attempt: instead of storing every portion in files, this one stores the active portion in memory and then writes it out to a file in order to process chunks faster. Afterwards, the chunks must be read again using one of the methods above:
lines = set()
maxLines = 1000  # Number of lines kept in RAM at once. Higher is faster but requires more RAM.
perfect = True

with open("in.csv", "r") as fileRead:
    total = sum(1 for _ in fileRead)
    fileRead.seek(0)
    i = 0
    with open("tmp.csv", "w") as fileWrite:
        for line in fileRead:
            if len(lines) >= maxLines:
                # The chunk is full: flush it to disk and start a new one.
                perfect = False
                for storedLine in lines:
                    fileWrite.write(",".join(storedLine) + "\n")
                lines = set()
            lines.add(frozenset(line.strip("\n").split(",")))
            i += 1
            if i / total * 100 % 1 == 0:
                print(f"Reading {i / total * 100}% ({i} / {total})")
        if not perfect and lines:
            # Flush the last, partially filled chunk as well.
            for storedLine in lines:
                fileWrite.write(",".join(storedLine) + "\n")

if not perfect:
    use_one_of_the_above_methods()  # Remember to read tmp.csv and not in.csv
This might improve the speed. You can change maxLines to your liking; just remember that higher numbers are faster (I am not sure whether really large numbers reverse that) but consume more RAM.