Python: Fastest way of parsing first column of large table in array

I have two very big tables that I would like to compare (9 columns and approximately 30 million rows).

#!/usr/bin/python
import sys
import csv


def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter = "\t")
        reader2 = csv.reader(s2, delimiter = "\t")
        writer  = csv.writer(out, delimiter = "\t")
        list = []
        for line in reader1:
            list.append(line[0])
        list = set(list)

        for line in reader2:
            for field in line:
                if field not in list:
                    writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])

The first column contains the identifier of each row, and I would like to know which identifiers are only in sam1.

This is the code I am currently working with, but it takes ages. Is there any way to speed it up?

I already tried to speed it up by converting the list to a set, but there was no big difference.

Edit: Now it is running much quicker, but I still have to get the whole lines out of my input table and write the lines with an exclusive ID to the output file. How could I manage this in a quick way?

A few suggestions:

  • Rather than creating a list that you then turn into a set, just work with a set directly:

     sam1_identifiers = set()
     for line in reader1:
         sam1_identifiers.add(line[0])

    This is probably more memory efficient, because you have a single set rather than a list and a set. That might make it a bit faster.

    Note also that I've changed the variable name: list is the name of a Python built-in type, so you shouldn't use it for your own variables.

  • Since you want to find the identifiers that are only in sam1, you don't need the nested if/for statements. Just compare the two sets of IDs and throw away every identifier from sam1 that is also found in sam2.

     sam2_identifiers = set()
     for line in reader2:
         sam2_identifiers.add(line[0])
     print(sam1_identifiers - sam2_identifiers)

    or even

     for line in reader2:
         sam1_identifiers.discard(line[0])
     print(sam1_identifiers)

    I suspect that's faster than the nested loops.

  • Perhaps I've missed something, but don't you look through every column for each line of sam2? Isn't it sufficient just to look at line[0] for the identifier, as with sam1?
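
For the follow-up in the edit (writing out the whole lines, not just the IDs), one possible approach is to collect only the sam2 identifiers in a set and then stream sam1 once, copying each row whose first column is not in that set. This is a minimal sketch rather than a tested solution for your data: it assumes Python 3, tab-separated input, and that the set of sam2 identifiers fits in memory. The function name write_rows_only_in_first is made up for illustration.

```python
import csv


def write_rows_only_in_first(sam1_path, sam2_path, output_path):
    """Copy every row of sam1 whose first-column ID never appears in sam2.

    Returns the number of rows written. (Hypothetical helper, Python 3.)
    """
    # Pass 1: keep only the identifiers of the second table in memory.
    with open(sam2_path, newline="") as s2:
        sam2_identifiers = {
            row[0] for row in csv.reader(s2, delimiter="\t") if row
        }

    # Pass 2: stream the first table and write rows with an unseen ID.
    written = 0
    with open(sam1_path, newline="") as s1, \
            open(output_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for row in csv.reader(s1, delimiter="\t"):
            # One set lookup per row; only the first column is inspected.
            if row and row[0] not in sam2_identifiers:
                writer.writerow(row)
                written += 1
    return written
```

You would call it as write_rows_only_in_first(sys.argv[1], sys.argv[2], sys.argv[3]) in place of compare. Only one set is held in memory, and each row of sam1 costs a single membership test instead of one test per column.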
