Finding matches between two files using Python

Question

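I am analyzing sequencing data and have a few candidate genes whose functions I need to find.

After editing the available human database, I want to compare my candidate genes against it and output the function of each candidate gene.

I only have basic Python skills, so I thought a script could help me speed up the work of finding my candidate genes' functions.

file1, which contains the candidate genes, looks like this: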

Gene
AQP7
RLIM
SMCO3
COASY
HSPA6

and the database, file2.csv, looks like this:

Gene   function 
PDCD6  Programmed cell death protein 6 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a

Desired output:

 Gene(from file1) ,function(matching from file2)

I tried using this code:

file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'

with open(file1) as inf:
    match = set(line.strip() for line in inf)

with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)

I only get a blank output file.

I also tried using set intersection:

with open('file1.csv', 'r') as ref:
    with open('file2.csv','r') as com:
       with open('common_genes_function','w') as output:
           same = set(ref).intersection(com)
                print same

That does not work either.

Please help, otherwise I will have to do this manually.

Answer 1

I suggest using the pandas merge function. However, it requires a clear delimiter between the "Gene" and "function" columns. In my example, I assume it is a tab:

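import pandas as pd

# example paths; adjust these to your own files
filepath1 = 'file1.csv'
filepath2 = 'file2.csv'
filepath3 = 'file3.csv'

# open files as pandas DataFrames
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')

# merge files by column 'Gene' using 'inner', so it comes up
# with the intersection of both datasets
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=('1', '2'))
# index=False keeps the pandas row index out of the output file
file3.to_csv(filepath3, sep=',', index=False)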
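If file2 turns out to be aligned with spaces rather than tabs, one possible workaround is to split each line at the first run of whitespace and rewrite it with a tab before handing it to pandas. The sketch below is only an illustration under that assumption; `split_gene_line` and `to_tab_separated` are hypothetical helper names, and the sample rows are copied from the question.

```python
def split_gene_line(line):
    """Split 'GENE   free text description' at the first run of whitespace."""
    parts = line.strip().split(None, 1)
    if not parts:
        return None  # blank line
    gene = parts[0]
    function = parts[1] if len(parts) > 1 else ""
    return gene, function


def to_tab_separated(lines):
    """Rewrite whitespace-aligned lines as 'gene<TAB>function' lines."""
    rows = (split_gene_line(line) for line in lines)
    return ["\t".join(row) for row in rows if row is not None]


# Sample rows taken from the question's file2.csv
sample = [
    "Gene   function ",
    "PDCD6  Programmed cell death protein 6 ",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a ",
]

for row in to_tab_separated(sample):
    print(row)
```

Splitting only once matters here because the function descriptions themselves contain spaces and commas, so neither a plain space nor a comma is a safe column separator.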

Answer 2

Using plain Python, you can try the following:

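import re

gene_function = {}
with open('file2.csv', 'r') as db_file:
    # skip the header line and strip surrounding whitespace
    lines = [line.strip() for line in db_file.readlines()[1:]]
    for line in lines:
        # gene symbol, a run of whitespace, then the rest of the line
        match = re.match(r"(\S+)\s+(.*)", line)
        if match is None:
            continue  # blank line or a gene without a description
        gene = match.group(1)
        function = match.group(2)
        # keep only the first description seen for each gene
        if gene not in gene_function:
            gene_function[gene] = function

with open('file1.csv', 'r') as candidates_file:
    # skip the header line
    genes = [i.strip() for i in candidates_file.readlines()[1:]]
    for gene in genes:
        if gene in gene_function:
            print("{}, {}".format(gene, gene_function[gene]))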
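The same dictionary lookup can also write the result to a file in the "Gene,function" layout the question asks for. This is a rough sketch, not a definitive implementation: `load_functions` and `annotate` are hypothetical names, the first line of each input is assumed to be a header, and CDC2 is added to the candidate list purely so that the example shows a match (it is not one of the asker's candidates). Candidates with no match are written with an empty function, which is a design choice you may want to change.

```python
import csv
import io


def load_functions(db_lines):
    """Map gene -> function, keeping the first description seen per gene."""
    lines = iter(db_lines)
    next(lines, None)  # skip the header line
    functions = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            functions.setdefault(parts[0], parts[1])
    return functions


def annotate(candidate_lines, functions, out):
    """Write one 'Gene,function' row per candidate to the open file `out`."""
    lines = iter(candidate_lines)
    next(lines, None)  # skip the header line
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Gene", "function"])
    for line in lines:
        gene = line.strip()
        if gene:
            # unmatched candidates get an empty function column
            writer.writerow([gene, functions.get(gene, "")])


# In-memory stand-ins for file2.csv and file1.csv
db = [
    "Gene   function ",
    "PDCD6  Programmed cell death protein 6 ",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a ",
    "CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a ",
]
candidates = ["Gene", "AQP7", "CDC2"]  # CDC2 is a made-up candidate

buf = io.StringIO()
annotate(candidates, load_functions(db), buf)
print(buf.getvalue(), end="")
```

With real files, `open('file2.csv')` and `open('file1.csv')` can be passed in place of the lists, and `open('file3.csv', 'w', newline='')` in place of `buf`. The `csv` module quotes any description that contains a comma, so the output stays a valid two-column CSV.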

2 answers

Answer 1
2, accepted, 2015-04-29 08:01:48

Answer 2
1, 2015-04-29 08:11:19
