How do I search through a very large csv file?
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:
Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?
I am using Jupyter Notebook. My code:
import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639, 524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint, tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint, tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['', '']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R', 'R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()
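A quick back-of-the-envelope check explains the MemoryError: `numpy.ndarray((6639, 524525))` allocates a dense array whose default dtype is float64 (8 bytes per cell), which is far larger than typical RAM:

```python
# Rough memory estimate for the dense array allocated in SNP().
# numpy.ndarray((6639, 524525)) defaults to float64: 8 bytes per cell.
rows, cols = 6639, 524525
bytes_needed = rows * cols * 8
print(f"{bytes_needed / 1024**3:.1f} GiB")  # ~25.9 GiB
```

That is roughly 26 GiB for an array that (per the answers below) holds only ~138k ones, which is why a sparse representation is the natural fix.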
Files: https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing
Your dataset really isn't that big, but it is relatively sparse. The problem is that you aren't using a sparse structure to store the data.
Just use a set of tuples to store the seen data; the lookup on that set is O(1), e.g.:
In [1]:
import csv
with open("SL05_AO_RO.tab") as tsv:
    seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
with open("covbreadth_common_sites.csv") as tsv:
    common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
common[:10]
Out[1]:
[['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]
10 loops, best of 3: 150 ms per loop
In [2]:
len(common), len(seen)
Out[2]:
(190, 138205)
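The snippet above only collects the common rows; the question also asks for an output file with a blank line per match and 'R,R' otherwise. A minimal sketch of that final step, using `io.StringIO` with made-up sample rows in place of the real files so it runs standalone (swap in `open("SL05_AO_RO.tab")` etc. for the actual data):

```python
import csv
import io

# Stand-ins for the real files; replace with open(...) calls in practice.
tab_data = io.StringIO("1049\t7280\n1073\t39198\n")
csv_data = io.StringIO("1049\t7280\n9999\t1\n")

# Build the O(1)-lookup set of seen (col0, col1) pairs.
seen = set(map(tuple, csv.reader(tab_data, dialect="excel-tab")))

# Stream the second file row by row and emit '' or 'R,R' per row.
out = io.StringIO()
writer = csv.writer(out, delimiter=",")
for line in csv.reader(csv_data, dialect="excel-tab"):
    if tuple(line) in seen:
        writer.writerow(["", ""])    # match: blank line
    else:
        writer.writerow(["R", "R"])  # no match
```

Because the second file is streamed rather than loaded whole, peak memory is just the size of the set built from the first file.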
I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.
import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')
# Sort rows lexicographically. Note: f1.sort(axis=0) would sort each
# column independently and break the row pairing.
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]
i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1
Load the data into an ndarray to optimize memory usage. Total complexity is O(n*log(n) + m*log(m)), where n and m are the sizes of the input files.
Using a set() will not reduce memory usage per unique entry, so I do not recommend using it with large datasets.
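The per-entry overhead this answer is warning about can be seen with `sys.getsizeof` (exact figures vary by platform and CPython build): a tuple of two short strings, as stored in the set-based answer, costs far more than the 16 bytes the same pair occupies inside a float64 ndarray.

```python
import sys

# One entry as the set-based solution stores it: a tuple of two strings.
pair = ("1049", "7280")
tuple_cost = sys.getsizeof(pair) + sum(sys.getsizeof(x) for x in pair)
print(tuple_cost)  # typically well over 100 bytes on 64-bit CPython
print(2 * 8)       # 16 bytes for the same pair inside a float64 ndarray
```

This also ignores the set's own hash-table overhead, so the real gap is larger still.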
Since CSV is just a DB dump, import it into any SQL DB and then run a query on it. This is a very efficient way.
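A sketch of this approach with the stdlib sqlite3 module: load both files into tables and use a LEFT JOIN to decide blank vs. 'R,R' per row. The table and column names (`seen`, `sites`, `a`, `b`) are made up for illustration, and inline sample rows keep the snippet self-contained; in real use you would `executemany` over `csv.reader` on the actual files.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for data that outlives the process
con.execute("CREATE TABLE seen (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
con.execute("CREATE TABLE sites (a INTEGER, b INTEGER)")

# In real use, load from the files, e.g.:
#   with open("SL05_AO_RO.tab") as tsv:
#       con.executemany("INSERT OR IGNORE INTO seen VALUES (?, ?)",
#                       csv.reader(tsv, dialect="excel-tab"))
con.executemany("INSERT INTO seen VALUES (?, ?)", [(1049, 7280), (1073, 39198)])
con.executemany("INSERT INTO sites VALUES (?, ?)", [(1049, 7280), (9999, 1)])

# LEFT JOIN: a NULL on the 'seen' side means the row has no match.
rows = con.execute("""
    SELECT CASE WHEN seen.a IS NULL THEN 'R,R' ELSE '' END
    FROM sites LEFT JOIN seen
      ON sites.a = seen.a AND sites.b = seen.b
    ORDER BY sites.rowid
""").fetchall()
print([r[0] for r in rows])  # ['', 'R,R']
```

The PRIMARY KEY on `seen` gives the database an index, so each lookup in the join is a logarithmic B-tree probe rather than a scan.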