
Get non-matching line numbers in Python

Hi, I wrote a simple script in Python to do the following:

I have two files summarizing genomic data. The first file has the names of the loci I want to get rid of; it looks something like this:

File_1:

R000002
R000003
R000006

The second file has the names and positions of all my loci and looks like this:

File_2:

R000001 1
R000001 2
R000001 3
R000002 10
R000002 2
R000002 3
R000003 20
R000003 3
R000004 1
R000004 20
R000004 4
R000005 2
R000005 3
R000006 10
R000006 11
R000006 123

What I wish to do is get the line numbers of all loci in File_2 that are not in File_1, so the end result should look like this:

Result:

1
2
3
9
10
11
12
13

I wrote the following simple code, and it gets the job done:

#!/usr/bin/env python

import sys

File1 = sys.argv[1]
File2 = sys.argv[2]

F1 = open(File1).readlines()
F2 = open(File2).readlines()
F3 = open(File2 + '.np', 'w')
Loci = []

# collect the locus names to exclude
for line in F1:
    Loci.append(line.strip())

# write the 1-based line number of every locus not in the exclude list
for x, y in enumerate(F2):
    y2 = y.strip().split()
    if y2[0] not in Loci:
        F3.write(str(x + 1) + '\n')

However, when I run this on my real data set, where the first file has 58470 lines and the second file has 12881010 lines, it seems to take forever. I am guessing that the bottleneck is in the

if y2[0] not in Loci:

part, where the code has to search through the whole Loci list for every line of File_2, but I have not been able to find a speedier solution.

Can anybody help me out and show a more Pythonic way of doing things?

Thanks in advance.

Here's some slightly more Pythonic code that doesn't care whether your files are ordered. I'd prefer to just print everything out and redirect it to a file ( ./myscript.py > outfile.txt ), but you could also pass in another filename and write to that.

#!/usr/bin/env python
import sys

ignore_f = sys.argv[1]
loci_f = sys.argv[2]

# build a set of locus names for O(1) membership tests
with open(ignore_f) as f:
    ignore = set(x.strip() for x in f)

# print the 1-based line number of every locus not in the ignore set
with open(loci_f) as f:
    for n, line in enumerate(f, start=1):
        if line.split()[0] not in ignore:
            print(n)

Searching for something in a list is O(n), while it takes only O(1) on average for a set. If order doesn't matter and your items are unique, use a set over a list. While this isn't optimal, it should be O(n) instead of O(n × m) like your code.
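You can see the difference for yourself with a quick (illustrative, not rigorous) timing sketch using the standard-library `timeit` module; the sizes here are made up for the demonstration:

```python
import timeit

# the same 100,000 items stored two ways
data_list = list(range(100_000))
data_set = set(data_list)

# looking up an item near the end of the list scans almost every element,
# while the set does a single hash lookup
list_time = timeit.timeit(lambda: 99_999 in data_list, number=100)
set_time = timeit.timeit(lambda: 99_999 in data_set, number=100)
print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```

On any normal machine the set lookup is orders of magnitude faster, and the gap grows with the size of the collection, which is exactly why your 58470-item ignore list hurts so much.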

You're also not closing your files, which isn't that big of a deal when reading, but it is when writing. I use context managers ( with ) so Python does that for me.

Style-wise, use descriptive variable names and avoid UpperCase names; those are typically used for classes (see PEP-8 ).

If your files are ordered, you can step through them together, ignoring lines where the locus names are the same; when they differ, take another step in your ignore file and recheck.

To make the search for matches more efficient, you can simply use a set instead of a list :

Loci = set()

for line in F1:
    Loci.add(line.strip())

The rest should work the same, but faster.

Even more efficient would be to walk down the two files in a sort of lockstep, since they're both sorted, but that would require more code and may not be necessary.
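That lockstep walk could be sketched like this. The function name `nonmatching_line_numbers` is hypothetical, and both inputs are assumed to be sorted by locus name, as the example files are; it works on any iterables of lines, so you could pass open file handles directly:

```python
def nonmatching_line_numbers(ignore_loci, loci_lines):
    """Yield 1-based line numbers whose locus is not in ignore_loci.

    Assumes both inputs are sorted (lexicographically) by locus name.
    """
    it = iter(ignore_loci)
    current = next(it, None)  # the ignore entry we are currently comparing against
    for n, line in enumerate(loci_lines, start=1):
        locus = line.split()[0]
        # advance the ignore pointer past names smaller than this locus
        while current is not None and current < locus:
            current = next(it, None)
        if locus != current:
            yield n

# the example data from the question
ignore = ["R000002", "R000003", "R000006"]
loci = ["R000001 1", "R000001 2", "R000001 3",
        "R000002 10", "R000002 2", "R000002 3",
        "R000003 20", "R000003 3",
        "R000004 1", "R000004 20", "R000004 4",
        "R000005 2", "R000005 3",
        "R000006 10", "R000006 11", "R000006 123"]
print(list(nonmatching_line_numbers(ignore, loci)))
# [1, 2, 3, 9, 10, 11, 12, 13]
```

This makes a single pass over each file and never holds either one fully in memory, but the set-based version is simpler and already fast enough for most cases.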
