Compare consecutive columns of a file and return the number of non-matching elements
I have a text file which looks like this:
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
I want to compare the consecutive columns and return the number of matching elements. I want to do this in Python. Earlier, I did it using Bash and AWK (shell scripting), but it's very slow, as I have huge data to process. I believe Python would be a faster solution for this. But I am very new to Python, and I already have something like this:
for line in open("phased.txt"):
    columns = line.split("\t")
    for i in range(len(columns)-1):
        a = columns[i+3]
        b = columns[i+4]
        for j in range(len(a)):
            if a[j] != b[j]:
                print j
which is obviously not working. As I am very new to Python, I don't really know what changes to make to get this to work. (This code is completely wrong, and I guess I could use difflib, etc. But I have never coded proficiently in Python before, so I am skeptical about proceeding.)
I want to compare each column (starting from the third) to every other column in the file and return the number of non-matching elements. I have 828 columns in total. Hence I would need 828*828 outputs. (You can think of an n*n matrix where the (i,j)-th element is the number of non-matching elements between columns i and j.) My desired output in the case of the above snippet would be:
3 4: 1
3 5: 3
3 6: 3
......
4 6: 3
..etc
Any help on this would be appreciated. Thanks.
I highly recommend you use pandas for this rather than writing your own code:
import numpy as np
import pandas as pd

df = pd.read_csv("phased.txt", sep="\t")
match_counts = {(i, j): np.sum(df[df.columns[i]] != df[df.columns[j]])
                for i in range(3, len(df.columns))
                for j in range(3, len(df.columns))}
match_counts
{(6, 4): 3,
(4, 7): 2,
(4, 4): 0,
(4, 3): 3,
(6, 6): 0,
(4, 5): 3,
(5, 4): 3,
(3, 5): 3,
(7, 7): 0,
(7, 5): 3,
(3, 7): 2,
(6, 5): 3,
(5, 5): 0,
(7, 4): 2,
(5, 3): 3,
(6, 7): 2,
(4, 6): 3,
(7, 6): 2,
(5, 7): 3,
(6, 3): 2,
(5, 6): 3,
(3, 6): 2,
(3, 3): 0,
(7, 3): 2,
(3, 4): 3}
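One note on the comprehension above: mismatch counts are symmetric, so (i, j) and (j, i) are always equal and each pair gets computed twice. A minimal sketch of computing each unordered pair once with itertools.combinations, with the question's three sample rows inlined as a DataFrame for illustration (in practice you would load phased.txt with pd.read_csv as above):

```python
import itertools

import pandas as pd

# The three sample rows from the question, inlined for a self-contained example
df = pd.DataFrame([
    ["M", "rs4124251",  "0", "0", "A", "G", "0", "A"],
    ["M", "rs6650104",  "0", "A", "C", "T", "0", "0"],
    ["M", "rs12184279", "0", "0", "G", "A", "T", "0"],
])

# Compute each unordered pair of genotype columns (index 2 onwards) once,
# then mirror the count into the (j, i) slot
mismatches = {}
for i, j in itertools.combinations(range(2, len(df.columns)), 2):
    n = int((df[i] != df[j]).sum())
    mismatches[(i, j)] = n
    mismatches[(j, i)] = n
```

For 828 columns this roughly halves the work: 828*827/2 comparisons instead of 828*828.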
A pure native-Python way of solving this - let us know how it compares with Bash; 828 x 828 should be a walk in the park.
I purposely wrote this with a separate step for flipping (transposing) the sequences, for simplicity and illustrative purposes - you can improve it with changed logic, class objects, function decorators, etc.
shiftcol = 2  # shift columns as first two are to be ignored
with open('phased.txt') as f:
    data = [x.strip().split('\t')[shiftcol:] for x in f.readlines()][1:]

# Step 1: Flipping the data first
flip = []
for idx, rows in enumerate(data):
    for i in range(len(rows)):
        if len(flip) <= i:
            flip.append([])
        flip[i].append(rows[i])

# Step 2: counts stored in a temp dictionary
for idx, v in enumerate(flip):
    for e in v:
        tmp = {}
        for i, z in enumerate(flip):
            if i != idx and e != '0':
                # Dictionary to store results
                if i+1 not in tmp:  # note has_key will be deprecated
                    tmp[i+1] = {'match': 0, 'notma': 0}
                tmp[i+1]['match'] += z.count(e)
                tmp[i+1]['notma'] += len([x for x in z if x != e])
        # results compensate for column shift..
        for key, count in tmp.iteritems():
            print idx+shiftcol+1, key+shiftcol, ': ', count
>>> 3 4 : {'match': 0, 'notma': 3}
>>> 3 5 : {'match': 0, 'notma': 3}
>>> 3 6 : {'match': 2, 'notma': 1}
>>> 3 7 : {'match': 2, 'notma': 1}
>>> 3 3 : {'match': 1, 'notma': 2}
>>> 3 4 : {'match': 1, 'notma': 2}
>>> 3 5 : {'match': 1, 'notma': 2}
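Note the code above is Python 2 (print statements, dict.iteritems). A minimal Python 3 sketch of the same flip-then-count idea, with the question's three sample rows inlined for a self-contained example; unlike the answer above it does not special-case '0' entries, so it reports plain pairwise mismatch counts as in the question's desired output:

```python
# The question's three genotype rows, with the two leading ID columns dropped
rows = [
    ["0", "0", "A", "G", "0", "A"],
    ["0", "A", "C", "T", "0", "0"],
    ["0", "0", "G", "A", "T", "0"],
]

# Step 1: flip the rows into columns (zip(*rows) transposes the table)
columns = list(zip(*rows))

# Step 2: count non-matching elements for every ordered column pair,
# labelling columns 3, 4, ... to match the question's 1-based output
mismatch = {}
for i, a in enumerate(columns):
    for j, b in enumerate(columns):
        if i != j:
            mismatch[(i + 3, j + 3)] = sum(x != y for x, y in zip(a, b))

print(mismatch[(3, 4)], mismatch[(3, 5)], mismatch[(4, 6)])  # prints: 1 3 3
```

These counts match the question's expected "3 4: 1", "3 5: 3", and "4 6: 3" lines.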