Compare consecutive columns of a file and return the number of non-matching elements

Question

I have a text file which looks like this:

# sampleID  HGDP00511  HGDP00511   HGDP00512   HGDP00512   HGDP00513  HGDP00513   

M rs4124251       0       0            A            G          0          A

M rs6650104       0       A            C            T          0          0

M rs12184279      0       0            G            A          T          0

I want to compare the consecutive columns and return the number of matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but it's very slow, as I have huge data to process. I believe Python would be a faster solution. However, I am very new to Python, and so far I only have something like this:

for line in open("phased.txt"):
    columns = line.split("\t")

    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j

which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This code is completely wrong, and I guess I could use difflib, etc. But I have never coded proficiently in Python before, so I am hesitant to proceed.)

I want to compare each column (starting from the third) to every other column in the file and return the number of non-matching elements. I have 828 columns in total, so I would need 828*828 outputs. (You can think of an n*n matrix where the (i,j)th element is the number of non-matching elements between columns i and j.) My desired output for the above snippet would be:

3 4: 1

3 5: 3

3 6: 3

......

4 6: 3
..etc

Any help on this would be appreciated. Thanks.

Answer 1

I highly recommend you use pandas for this rather than writing your own code:

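import numpy as np
import pandas as pd

# The sample is assumed to be whitespace-delimited, not comma-separated;
# dtype=str keeps "0" as text so every column compares as strings.
df = pd.read_csv("phased.txt", sep=r"\s+", dtype=str)
cols = df.columns
# Count the rows where two columns differ. Positions are 0-based, so
# range(3, ...) starts at the fourth column; use 2 to include the third.
mismatch_counts = {(i, j):
                       int(np.sum(df[cols[i]] != df[cols[j]]))
                   for i in range(3, len(cols))
                   for j in range(3, len(cols))}

mismatch_counts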
{(6, 4): 3,
 (4, 7): 2,
 (4, 4): 0,
 (4, 3): 3,
 (6, 6): 0,
 (4, 5): 3,
 (5, 4): 3,
 (3, 5): 3,
 (7, 7): 0,
 (7, 5): 3,
 (3, 7): 2,
 (6, 5): 3,
 (5, 5): 0,
 (7, 4): 2,
 (5, 3): 3,
 (6, 7): 2,
 (4, 6): 3,
 (7, 6): 2,
 (5, 7): 3,
 (6, 3): 2,
 (5, 6): 3,
 (3, 6): 2,
 (3, 3): 0,
 (7, 3): 2,
 (3, 4): 3}
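
For the full n*n matrix the question asks for, the pairwise comparison can also be vectorized with NumPy broadcasting. The following is only a minimal sketch and not part of the original answer: the helper name `mismatch_matrix` is made up for illustration, and it assumes the sample columns are already in memory as a 2-D array of strings.

```python
import numpy as np

def mismatch_matrix(values):
    """Return an (n, n) array whose (i, j) entry counts the rows in which
    column i and column j differ. `values` has shape (rows, n)."""
    values = np.asarray(values)
    # Broadcasting (rows, n, 1) against (rows, 1, n) compares every column
    # with every other column, giving a boolean array of shape (rows, n, n).
    differs = values[:, :, None] != values[:, None, :]
    # Summing over the row axis turns the booleans into mismatch counts.
    return differs.sum(axis=0)

# The six sample columns from the question (an in-memory copy for illustration).
sample = np.array([
    ["0", "0", "A", "G", "0", "A"],
    ["0", "A", "C", "T", "0", "0"],
    ["0", "0", "G", "A", "T", "0"],
])
result = mismatch_matrix(sample)
```

Here `result[0, 1]` corresponds to the question's "3 4" pair. The intermediate boolean array holds rows * n * n entries, so for a very long file it would likely be necessary to process the rows in chunks and add up the partial matrices.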

Answer 2

A pure native-Python way of solving this. Let us know how it compares with Bash; 828 x 828 should be a walk in the park.

Element column counts:

I purposely wrote this with a separate step for flipping the sequences, for simplicity and illustration. You can improve it with different logic, or perhaps with class objects, function decorators, etc.

Code (Python 2.7):

shiftcol = 2  # shift columns as first two are to be ignored
with open('phased.txt') as f:
    data = [x.strip().split('\t')[shiftcol:] for x in f.readlines()][1:]

# Step 1: Flipping the data first
flip = []
for idx, rows in enumerate(data):
    for i in range(len(rows)):
        if len(flip) <= i:
            flip.append([])
        flip[i].append(rows[i])

# Step 2: counts store in temp dictionary
for idx, v in enumerate(flip):
    for e in v:
        tmp = {}
        for i, z in enumerate(flip):
            if i != idx and e != '0':
                # Dictionary to store results
                if i+1 not in tmp:  # note has_key will be deprecated
                    tmp[i+1] = {'match': 0, 'notma': 0}
                tmp[i+1]['match'] += z.count(e)
                tmp[i+1]['notma'] += len([x for x in z if x != e])

        # results compensate for column shift..
        for key, count in tmp.iteritems():
            print idx+shiftcol+1, key+shiftcol, ': ', count
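
As one possible simplification of Step 1, the flip from rows to columns can be written with the built-in `zip`. This is a sketch rather than the answer's own code; the `data` list below is a hand-copied stand-in for what the file-reading line would produce on the question's sample.

```python
# Rows as they would look after splitting each line and dropping the
# first two fields (values copied from the question's sample).
data = [
    ["0", "0", "A", "G", "0", "A"],
    ["0", "A", "C", "T", "0", "0"],
    ["0", "0", "G", "A", "T", "0"],
]

# zip(*data) groups the i-th field of every row, i.e. it yields columns.
flip = [list(column) for column in zip(*data)]
print(flip[1])
```

One difference worth keeping in mind: `zip` stops at the shortest row, whereas the explicit loop also handles rows of unequal length.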

Sample output:

>>> 3 4 :  {'match': 0, 'notma': 3}
>>> 3 5 :  {'match': 0, 'notma': 3}
>>> 3 6 :  {'match': 2, 'notma': 1}
>>> 3 7 :  {'match': 2, 'notma': 1}
>>> 3 3 :  {'match': 1, 'notma': 2}
>>> 3 4 :  {'match': 1, 'notma': 2}
>>> 3 5 :  {'match': 1, 'notma': 2}
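
For comparison, the count described in the question (the number of rows in which two columns differ, for every pair of columns) can be sketched in plain Python 3 as below. This is not the code above rewritten, just an illustration under a few assumptions: the helper name `count_mismatches` is invented, fields are split on any whitespace rather than strictly on tabs, and lines starting with `#` are treated as the header.

```python
from itertools import combinations

def count_mismatches(lines, skip=2):
    """Map (i, j) column pairs (1-based, i < j) to the number of rows in
    which the two columns differ. The first `skip` fields are ignored."""
    counts = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()[skip:]
        # Number the kept fields as in the question: the third column is 3.
        numbered = enumerate(fields, start=skip + 1)
        for (i, a), (j, b) in combinations(numbered, 2):
            counts[(i, j)] = counts.get((i, j), 0) + (a != b)
    return counts

sample = """\
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
"""
counts = count_mismatches(sample.splitlines())
for (i, j), n in sorted(counts.items()):
    print("%d %d: %d" % (i, j, n))
```

Because it reads one line at a time, the function can be given an open file object directly, for example `count_mismatches(open("phased.txt"))`, without loading the whole file into memory. Each unordered pair is counted once; the (j, i) entry of the matrix is the same number.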


Answer 1
0 2015-06-05 05:19:12

Answer 2
0 Accepted 2015-06-05 08:11:55
