How do I search through a very large csv file?
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639, 524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint, tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint, tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['', '']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R', 'R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()
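A quick back-of-the-envelope check explains the MemoryError: `numpy.ndarray((6639, 524525))` allocates a dense array whose default dtype is float64 (8 bytes per cell), which is far larger than typical RAM:

```python
# Rough memory estimate for the dense array allocated in SNP().
# numpy.ndarray((6639, 524525)) defaults to float64: 8 bytes per cell.
rows, cols = 6639, 524525
bytes_needed = rows * cols * 8
print(f"{bytes_needed / 1024**3:.1f} GiB")  # ~25.9 GiB
```

That is roughly 26 GiB for an array that (per the answers below) holds only ~138k ones, which is why a sparse representation is the natural fix.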
Files: https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing
Your dataset really isn't that big, but it is relatively sparse. The problem is that you aren't using a sparse structure to store the data.
Just use a set of tuples to store the seen data; the lookup on that set is O(1), e.g.:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
    common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
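The snippet above only collects the common rows; the question also asks for an output file with a blank line per match and 'R,R' otherwise. A minimal sketch of that final step, using `io.StringIO` with made-up sample rows in place of the real files so it runs standalone (swap in `open("SL05_AO_RO.tab")` etc. for the actual data):

```python
import csv
import io

# Stand-ins for the real files; replace with open(...) calls in practice.
tab_data = io.StringIO("1049\t7280\n1073\t39198\n")
csv_data = io.StringIO("1049\t7280\n9999\t1\n")

# Build the O(1)-lookup set of seen (col0, col1) pairs.
seen = set(map(tuple, csv.reader(tab_data, dialect="excel-tab")))

# Stream the second file row by row and emit '' or 'R,R' per row.
out = io.StringIO()
writer = csv.writer(out, delimiter=",")
for line in csv.reader(csv_data, dialect="excel-tab"):
    if tuple(line) in seen:
        writer.writerow(["", ""])    # match: blank line
    else:
        writer.writerow(["R", "R"])  # no match
```

Because the second file is streamed rather than loaded whole, peak memory is just the size of the set built from the first file.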
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')
# Sort rows lexicographically. Note: f1.sort(axis=0) would sort each
# column independently and break the row pairing.
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]
i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1
Load the data into an ndarray to optimize memory usage. Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
Using a set() will not reduce memory usage per unique entry, so I do not recommend using it with large datasets.
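The per-entry overhead this answer is warning about can be seen with `sys.getsizeof` (exact figures vary by platform and CPython build): a tuple of two short strings, as stored in the set-based answer, costs far more than the 16 bytes the same pair occupies inside a float64 ndarray.

```python
import sys

# One entry as the set-based solution stores it: a tuple of two strings.
pair = ("1049", "7280")
tuple_cost = sys.getsizeof(pair) + sum(sys.getsizeof(x) for x in pair)
print(tuple_cost)  # typically well over 100 bytes on 64-bit CPython
print(2 * 8)       # 16 bytes for the same pair inside a float64 ndarray
```

This also ignores the set's own hash-table overhead, so the real gap is larger still.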
Since CSV is just a DB dump, import it into any SQL DB and then run a query on it. This is a very efficient way.
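A sketch of this approach with the stdlib sqlite3 module: load both files into tables and use a LEFT JOIN to decide blank vs. 'R,R' per row. The table and column names (`seen`, `sites`, `a`, `b`) are made up for illustration, and inline sample rows keep the snippet self-contained; in real use you would `executemany` over `csv.reader` on the actual files.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for data that outlives the process
con.execute("CREATE TABLE seen (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
con.execute("CREATE TABLE sites (a INTEGER, b INTEGER)")

# In real use, load from the files, e.g.:
#   with open("SL05_AO_RO.tab") as tsv:
#       con.executemany("INSERT OR IGNORE INTO seen VALUES (?, ?)",
#                       csv.reader(tsv, dialect="excel-tab"))
con.executemany("INSERT INTO seen VALUES (?, ?)", [(1049, 7280), (1073, 39198)])
con.executemany("INSERT INTO sites VALUES (?, ?)", [(1049, 7280), (9999, 1)])

# LEFT JOIN: a NULL on the 'seen' side means the row has no match.
rows = con.execute("""
    SELECT CASE WHEN seen.a IS NULL THEN 'R,R' ELSE '' END
    FROM sites LEFT JOIN seen
      ON sites.a = seen.a AND sites.b = seen.b
    ORDER BY sites.rowid
""").fetchall()
print([r[0] for r in rows])  # ['', 'R,R']
```

The PRIMARY KEY on `seen` gives the database an index, so each lookup in the join is a logarithmic B-tree probe rather than a scan.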