Python: fastest way to parse the first column of a large table into an array

Question

I have two very big tables that I would like to compare (9 columns and approximately 30 million rows).

#!/usr/bin/python
import sys
import csv


def compare(sam1, sam2, output):
    with open(sam1, "r") as s1, open(sam2, "r") as s2, open(output, "w") as out:
        reader1 = csv.reader(s1, delimiter = "\t")
        reader2 = csv.reader(s2, delimiter = "\t")
        writer  = csv.writer(out, delimiter = "\t")
        list = []
        for line in reader1:
            list.append(line[0])
        list = set(list)

        for line in reader2:
            for field in line:
                if field not in list:
                    writer.writerow(line)

if __name__ == '__main__':
    compare(sys.argv[1], sys.argv[2], sys.argv[3])

The first column contains the identifier of my rows and I would like to know which ones are only in sam1.

This is the code I am currently working with, but it takes ages. Is there any way to speed it up?

I already tried to speed it up by converting the list to a set, but there was no big difference.

Edit: It is now running much quicker, but I still have to get the whole lines out of my input table and write the lines with an exclusive ID to the output file. How could I manage this quickly?

Answer 1

A few suggestions:

Rather than creating a list that you then turn into a set, just work with a set directly:
```
sam1_identifiers = set()
for line in reader1:
    sam1_identifiers.add(line[0])
```
This is probably more memory efficient, because you have a single set rather than a list and a set. That might also make it a bit faster.
Note also that I've changed the variable name: list is the name of a Python builtin, so you shouldn't use it for your own variables.
Since you want the identifiers that are only in sam1, drop the nested if/for statements: collect the identifiers from sam2 and remove them from the set of IDs from sam1.
```
sam2_identifiers = set()
for line in reader2:
    sam2_identifiers.add(line[0])
# Set difference: IDs that are in sam1 but not in sam2.
print(sam1_identifiers - sam2_identifiers)
```
or even
```
# No second set needed: remove each sam2 ID from the sam1 set as it is read.
for line in reader2:
    sam1_identifiers.discard(line[0])
print(sam1_identifiers)
```
I suspect that's faster than the nested loops.
Perhaps I've missed something, but aren't you looking through every column of each line of sam2? Isn't it sufficient to look at line[0] for the identifier, as with sam1?
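
For the follow-up in the edit (writing out the whole lines whose ID is exclusive to sam1), one possible approach is to read only the IDs of sam2 into a set and then stream sam1 once, writing each row whose first field is not in that set. The following is a minimal sketch rather than a tested solution for your data: the function name `write_exclusive_rows` and its parameters are made up for illustration, it is written for Python 3, and it assumes that the set of sam2 IDs fits in memory.

```python
import csv


def write_exclusive_rows(sam1_path, sam2_path, output_path):
    """Write the rows of sam1 whose first-column ID never appears in sam2.

    Hypothetical helper: the name and signature are illustrative only.
    Returns the number of rows written.
    """
    # Only the IDs of sam2 are kept in memory; its other columns are ignored.
    with open(sam2_path, newline="") as s2:
        sam2_ids = {row[0] for row in csv.reader(s2, delimiter="\t") if row}

    written = 0
    with open(sam1_path, newline="") as s1, \
            open(output_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        # Stream sam1 row by row, so the full table is never held in memory.
        for row in csv.reader(s1, delimiter="\t"):
            # A single set lookup on the ID replaces the loop over all fields.
            if row and row[0] not in sam2_ids:
                writer.writerow(row)
                written += 1
    return written
```

It could be called as `write_exclusive_rows(sys.argv[1], sys.argv[2], sys.argv[3])` in place of `compare`. Each row costs one set lookup, so the run time should grow roughly linearly with the total number of rows.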

Answer 1: accepted, 2015-07-14 10:59:41