Comparing two files with long lists to get the common elements and adjacent information

Question

I have two large files. File A looks like:

SNP_A-1780270 rs987435 7 78599583 - C G
SNP_A-1780271 rs345783 15 33395779 - C G
SNP_A-1780272 rs955894 1 189807684 - G T
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T

... and has 950,000 lines.

File B looks like:

SNP_A-1780274
SNP_A-1780277
SNP_A-1780278
SNP_A-1780283
SNP_A-1780285
SNP_A-1780286
SNP_A-1780287

... and has 900,000 lines.

I need to find the elements of file B that also appear in column 1 of file A, and get an output file like:

SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T

How can I do it in the most efficient way in Python?

Answer 1

I think a dict is ideal:

>>> sa = """SNP_A-1780270 rs987435 7 78599583 - C G
SNP_A-1780271 rs345783 15 33395779 - C G
SNP_A-1780272 rs955894 1 189807684 - G T
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T"""
>>> dict_lines = {}
>>> for line in sa.split('\n'):
    dict_lines[line.split()[0]] = line


>>> sb = """SNP_A-1780274
SNP_A-1780277
SNP_A-1780278
SNP_A-1780283
SNP_A-1780285
SNP_A-1780286
SNP_A-1780287"""
>>> for val in sb.split('\n'):
    line = dict_lines.get(val)
    if line:
        print(line)


SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T
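
For the real files, the same lookup idea can be applied without pasting the data into strings. The following is a minimal sketch of one possible variant, not the answerer's code: it loads the IDs of file B into a set and streams file A line by line, so only the 900,000 short IDs are held in memory. The function name filter_by_ids and the path arguments are hypothetical, and the output follows the line order of file A.

```python
def filter_by_ids(path_a, path_b, path_out):
    """Copy the lines of file A whose first column appears in file B.

    Returns the number of lines written. All three paths are placeholders
    chosen for this sketch.
    """
    # Load the IDs from file B into a set for constant-time membership tests.
    with open(path_b) as f_b:
        wanted = {line.strip() for line in f_b if line.strip()}

    kept = 0
    # Stream file A one line at a time so it never has to fit in memory.
    with open(path_a) as f_a, open(path_out, "w") as out:
        for line in f_a:
            fields = line.split(None, 1)  # split off the first column only
            if fields and fields[0] in wanted:
                out.write(line)
                kept += 1
    return kept
```

It could be called as filter_by_ids('fileA.txt', 'fileB.txt', 'fileC.txt'), assuming those file names.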

Answer 2

If file A's lines are long compared to the "key" in column 1, you could try this approach:

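positions = {}
# Binary mode keeps len(lineA) equal to the number of bytes read,
# so the running total is a valid offset for seek().
with open('fileA.txt', 'rb') as fA:
    pos = 0
    for lineA in fA:
        uid = lineA.split(b' ')[0]  # e.g. b'SNP_A-1780270'
        positions[uid] = pos
        pos += len(lineA)
with open('fileB.txt', 'rb') as fB, open('fileA.txt', 'rb') as fA, open('fileC.txt', 'wb') as out:
    for lineB in fB:
        pos = positions.get(lineB.strip())
        if pos is None:  # ID from file B that does not occur in file A
            continue
        fA.seek(pos)
        out.write(fA.readline())  # the line read already ends with a newline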

You should check whether pos += ... or file.tell() is more reliable. I think file.tell() doesn't work because buffering is involved, but it might be that pos += ... needs readjustment as well.

This needs less memory than the dict version, but is probably slower because of how file A is handled: it is scanned once to build the index and then read again with one seek per match.

Answer 3

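If you can call join filea fileb > filec from your Python code, it will give you what you need. Note that join expects both files to be sorted on the join field.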
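
As a rough sketch of how that call might look from Python, the helper below runs the join utility through subprocess and redirects its output to a file. The function name join_files and the paths are hypothetical. It assumes a POSIX join is on the PATH and that both inputs are already sorted on column 1 in byte order, which is why LC_ALL=C is set for the child process.

```python
import os
import subprocess


def join_files(path_a, path_b, path_out):
    """Run `join path_a path_b > path_out` (inputs must be sorted on column 1)."""
    # LC_ALL=C makes join compare keys bytewise, matching a plain byte-order sort.
    env = dict(os.environ, LC_ALL="C")
    with open(path_out, "w") as out:
        # check=True raises CalledProcessError if join exits with an error.
        subprocess.run(["join", path_a, path_b], stdout=out, env=env, check=True)
```

This hands the matching to an external tool, so Python only holds the file handles. If the inputs are not sorted, they would have to be sorted first.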

Answer 1: 2, 2012-12-12 09:28:04
Answer 2: 0, accepted, 2012-12-12 09:36:27
Answer 3: 0, 2012-12-13 04:05:43