How do I search through a very large csv file?

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file. My current algorithm does the following:

  1. Scan each row of the second file (two integers each), go to the position of those two integers in a 2D array (so if the integers are 2 and 3, I'll go to position [2,3]) and assign a value of 1.
  2. Go through each row of the first file, check whether the position given by the two integers of that row has a value of 1 in the array, and then print the corresponding output to a third csv file.

Unfortunately the csv files are very large, so I instantly get "MemoryError:" when running this. What is an alternative for scanning through large csv files?

I am using Jupyter Notebook. My code:

import csv
import numpy

def SNP():
    thelines = numpy.ndarray((6639,524525))
    tempint = 0
    tempint2 = 0
    with open("SL05_AO_RO.tab") as tsv:
        for line in csv.reader(tsv, dialect="excel-tab"):
            tempint = int(line[0])
            tempint2 = int(line[1])
            thelines[tempint,tempint2] = 1
    return thelines

def common_sites():
    tempint = 0
    tempint2 = 0
    temparray = SNP()
    print('Checkpoint.')
    with open('output_SL05.csv', 'w', newline='') as fp:
        with open("covbreadth_common_sites.csv") as tsv:
            for line in csv.reader(tsv, dialect="excel-tab"):
                tempint = int(line[0])
                tempint2 = int(line[1])
                if temparray[tempint,tempint2] == 1:
                    a = csv.writer(fp, delimiter=',')
                    data = [['','']]
                    a.writerows(data)
                else:
                    a = csv.writer(fp, delimiter=',')
                    data = [['R','R']]
                    a.writerows(data)
    print('Done.')
    return

common_sites()

Files: https://drive.google.com/file/d/0B5v-nJeoVouHUjlJelZtV01KWFU/view?usp=sharing and https://drive.google.com/file/d/0B5v-nJeoVouHSDI4a2hQWEh3S3c/view?usp=sharing

Your dataset really isn't that big, but it is relatively sparse. The problem is that you are storing it in a dense array rather than a sparse structure.
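To make the sparsity point concrete, here is a rough back-of-the-envelope sketch. It assumes the array shape from the question's SNP() function and 8 bytes per element (the size of float64, which numpy.ndarray uses when no dtype is given); the count of occupied cells is the len(seen) value reported in this answer.

```python
# Rough memory estimate for the dense array in the question (a sketch, not a measurement).
rows, cols = 6639, 524525      # shape passed to numpy.ndarray in SNP()
itemsize = 8                   # assumption: float64, the default dtype, is 8 bytes

dense_bytes = rows * cols * itemsize
print(f"dense array: {dense_bytes / 1e9:.1f} GB")   # prints "dense array: 27.9 GB"

occupied = 138205              # distinct rows in the .tab file, per len(seen) in this answer
fraction = occupied / (rows * cols)
print(f"fraction of cells actually set: {fraction:.6f}")
```

Almost all of that memory would hold cells that are never set, which is why a structure that stores only the rows actually seen fits comfortably.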
Just use a set of tuples to store the seen data; a lookup on that set is then O(1) on average, e.g.:

In [1]:
  import csv
  with open("SL05_AO_RO.tab") as tsv:
      seen = set(map(tuple, csv.reader(tsv, dialect="excel-tab")))
  with open("covbreadth_common_sites.csv") as tsv:
      common = [line for line in csv.reader(tsv, dialect="excel-tab") if tuple(line) in seen]
  common[:10]
Out[1]:
  [['1049', '7280'], ['1073', '39198'], ['1073', '39218'], ['1073', '39224'], ['1073', '39233'],
   ['1098', '661'], ['1098', '841'], ['1103', '15100'], ['1103', '15107'], ['1103', '28210']]

10 loops, best of 3: 150 ms per loop

In [2]:
  len(common), len(seen)
Out[2]:
  (190, 138205)
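The session above only collects the common rows. A minimal sketch of how the same set lookup could produce the output file described in the question might look like this. The function name mark_common_rows and its parameters are hypothetical, and both inputs are assumed to be tab-separated, as in the question's code. Rows are compared as strings, so '01' and '1' would not match; convert to int first if that matters.

```python
import csv

def mark_common_rows(lookup_path, query_path, out_path):
    """For each row of query_path, write an empty row if the row also occurs
    in lookup_path, otherwise write R,R. Returns the number of matches."""
    # Hold only the lookup file in memory, as a set of tuples.
    with open(lookup_path, newline="") as fh:
        seen = {tuple(row) for row in csv.reader(fh, dialect="excel-tab") if row}

    matches = 0
    # Stream the query file row by row so it never has to fit in memory.
    with open(query_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=",")
        for row in csv.reader(src, dialect="excel-tab"):
            if tuple(row) in seen:
                writer.writerow(["", ""])
                matches += 1
            else:
                writer.writerow(["R", "R"])
    return matches
```

With the file names from the question, a call would be mark_common_rows("SL05_AO_RO.tab", "covbreadth_common_sites.csv", "output_SL05.csv"). The output keeps the row order of the query file, which the question's output format depends on.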

I have 2 csv files (well, one of them is .tab), both of them with 2 columns of numbers. My job is to go through each row of the first file, and see if it matches any of the rows in the second file. If it does, I print a blank line to my output csv file. Otherwise, I print 'R,R' to the output csv file.

import numpy as np

f1 = np.loadtxt('SL05_AO_RO.tab')
f2 = np.loadtxt('covbreadth_common_sites.csv')

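# Sort whole rows lexicographically: by column 0, then by column 1.
# (ndarray.sort(axis=0) would sort each column independently and break up the pairs.)
f1 = f1[np.lexsort((f1[:, 1], f1[:, 0]))]
f2 = f2[np.lexsort((f2[:, 1], f2[:, 0]))]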

i, j = 0, 0
while i < f1.shape[0]:
    while j < f2.shape[0] and f1[i][0] > f2[j][0]:
        j += 1
    while j < f2.shape[0] and f1[i][0] == f2[j][0] and f1[i][1] > f2[j][1]:
        j += 1
    if j < f2.shape[0] and np.array_equal(f1[i], f2[j]):
        print()
    else:
        print('R,R')
    i += 1

  1. Load the data into an ndarray to optimize memory usage
  2. Sort the data
  3. Find matches in the sorted arrays

Total complexity is O(n*log(n) + m*log(m)), where n and m are the numbers of rows in the two input files.

Using set() does not reduce the memory used per unique entry, so I do not recommend it for large datasets.
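As an illustration of the sort-then-search idea with a small per-entry footprint, here is a sketch that uses only the standard library instead of NumPy. It packs each pair into one 64-bit integer, keeps the keys in a sorted array, and answers lookups by binary search, so query rows can be checked in their original order. The helpers pack, build_index and contains are hypothetical names, and the packing assumes non-negative values with the first column below 2**31 and the second below 2**32.

```python
from array import array
from bisect import bisect_left

def pack(a, b):
    # Assumption: 0 <= a < 2**31 and 0 <= b < 2**32, so the pair fits in a signed 64-bit int.
    return (a << 32) | b

def build_index(pairs):
    """Return a sorted array of packed keys, 8 bytes per stored pair.
    Note: sorted() builds a temporary list, so peak memory is higher while building."""
    return array("q", sorted(pack(a, b) for a, b in pairs))

def contains(index, a, b):
    """Binary search for the pair (a, b): O(log n) per lookup."""
    key = pack(a, b)
    pos = bisect_left(index, key)
    return pos < len(index) and index[pos] == key

index = build_index([(1098, 661), (1049, 7280), (1073, 39198)])
print(contains(index, 1073, 39198))   # True
print(contains(index, 1073, 39199))   # False
```

Building the index costs O(n*log(n)) and checking m query rows costs O(m*log(n)), which is in line with the complexity stated above.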

Since a CSV file is essentially a database dump, import it into any SQL database and then run a query on it. This is a very efficient approach.
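One way this could look, using the sqlite3 module from Python's standard library, is sketched below. The table names lookup_rows and query_rows, the column names a and b, and the helper functions are all assumptions made for illustration; both files are assumed to be tab-separated with two integer columns, as in the question.

```python
import csv
import sqlite3

def load_pairs(conn, table, path):
    """Create a two-column table and bulk-insert the rows of a tab-separated file.
    The table name is interpolated into SQL, so pass only trusted, hard-coded names."""
    conn.execute(f"CREATE TABLE {table} (a INTEGER, b INTEGER)")
    with open(path, newline="") as fh:
        rows = ((int(r[0]), int(r[1])) for r in csv.reader(fh, dialect="excel-tab") if r)
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

def common_rows(conn):
    """Return the rows of query_rows that also occur in lookup_rows, in file order."""
    # The index lets SQLite answer each EXISTS probe without scanning the whole table.
    conn.execute("CREATE INDEX idx_lookup ON lookup_rows (a, b)")
    return conn.execute(
        "SELECT q.a, q.b FROM query_rows AS q "
        "WHERE EXISTS (SELECT 1 FROM lookup_rows AS l WHERE l.a = q.a AND l.b = q.b) "
        "ORDER BY q.rowid"
    ).fetchall()
```

A typical call sequence would be conn = sqlite3.connect(":memory:"), then load_pairs(conn, "lookup_rows", "SL05_AO_RO.tab") and load_pairs(conn, "query_rows", "covbreadth_common_sites.csv"), followed by common_rows(conn). Passing a file path instead of ":memory:" keeps the tables on disk if the data does not fit in RAM.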
