How do I find matches between two lists and write output based on the matches?

Question

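I am not sure I have worded the question title appropriately, but I have tried to explain the problem below. If the question makes sense, please suggest a more suitable title.

Suppose I have two kinds of list data: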

list_headers = ['gene_id', 'gene_name', 'trans_id'] 
# these are the features to be mined from each line of `attri_values`

attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"'],
]

I am trying to build a table by matching each name in list_headers against the attributes in each line of attri_values.

output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n') # this will first write the header

# then I want to read each line
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))

        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))

        output.write('\n')

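The problem: when a name from list_headers matches a string in the values of attri_values, everything works fine, but when there is no match I get a lot of duplicated 'NA' entries.

Final expected result: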

gene_id    gene_name    trans_id
scaffold_200001.1    NA    NA
scaffold_200001.1    NA    scaffold_200001.1
scaffold_200002.1    NA    scaffold_200002.1

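Edit: this is why I wrote the elif (so that it writes 'NA' for every non-match). I tried moving the NA condition around in different ways, without success. If I remove the elif, I get this output (the NAs are missing):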

gene_id    gene_name    trans_id
scaffold_200001.1
scaffold_200001.1    scaffold_200001.1
scaffold_200002.1    scaffold_200002.1

Answer 1

Python strings have a find method, which you can use to check every header in list_headers against each entry of attri_values. Try this function:

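def Get_Match(search_space, search_string):
    # str.find returns the index of the first occurrence, or -1 if there is none
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        # everything after the matched header, e.g. ' "scaffold_200001.1"'
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is assumed to be one record of attri_values (a list of attribute strings)
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))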
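
To get exactly one cell per header (and therefore a single 'NA' when nothing matches), the same find-based check can be wrapped so that the fallback is returned only after every attribute of a record has been inspected. The following is a minimal sketch, not the answerer's code: the helper name first_match, the 'NA' default, and the shortened sample record are assumptions made for illustration.

```python
def first_match(record, header, missing='NA'):
    """Return the value of the first attribute in `record` that starts with `header`."""
    prefix = header + ' '
    for attribute in record:
        # find() == 0 means the header was found at the very start of the string
        if attribute.find(prefix) == 0:
            return attribute[len(prefix):].strip('"')
    # Reached only when no attribute matched, so 'NA' is produced once per header
    return missing


list_headers = ['gene_id', 'gene_name', 'trans_id']

# Hypothetical, shortened record in the same 'key "value"' form as the question's data
record = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"']

row = [first_match(record, header) for header in list_headers]
print('\t'.join(row))
```

Calling first_match once per header for each record gives one row per record, which is the shape of the expected table in the question.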

Answer 2

Here is an answer using pandas:

import pandas as pd

# input data
list_headers = ['gene_id', 'gene_name', 'trans_id']

attri_values = [
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]

# process input data
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values]

# Create DataFrame with the desired columns
df = pd.DataFrame(attri_values_X, columns=list_headers)

# print dataframe
print(df)

Output

               gene_id  gene_name             trans_id
0  "scaffold_200001.1"        NaN                  NaN
1  "scaffold_200001.1"        NaN  "scaffold_200001.1"
2  "scaffold_200002.1"        NaN  "scaffold_200002.1"

It is also easy without pandas. I have already given you attri_values_X, so you are nearly there: just delete the keys you do not need from each dictionary.
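
One possible pandas-free version uses the standard library's csv.DictWriter, which can skip unwanted keys (extrasaction='ignore') and fill in missing ones (restval='NA'). This is a sketch under assumptions: the helper names to_record and write_table are made up for illustration, the sample records are shortened, and the table is written to an in-memory buffer rather than to the gtf_table file.

```python
import csv
import io

list_headers = ['gene_id', 'gene_name', 'trans_id']

# Hypothetical, shortened records in the same 'key "value"' form as attri_values
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'trans_id "scaffold_200001.1"', 'exon_number "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"', 'exon_number "3"'],
]


def to_record(attributes):
    """Turn ['key "value"', ...] into {'key': 'value', ...} with the quotes stripped."""
    record = {}
    for attribute in attributes:
        key, value = attribute.split(' ', 1)
        record[key] = value.strip('"')
    return record


def write_table(records, headers, handle):
    """Write a tab-separated table containing only the columns named in `headers`."""
    writer = csv.DictWriter(
        handle,
        fieldnames=headers,
        delimiter='\t',
        restval='NA',            # value written when a record lacks a header
        extrasaction='ignore',   # silently drop keys that are not in `headers`
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
write_table([to_record(a) for a in attri_values], list_headers, buffer)
table_text = buffer.getvalue()
print(table_text)
```

To write to disk instead, pass a file opened with open('gtf_table', 'w', newline='') as the handle.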

Answer 3

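I managed to write a function that helps parse your data. I tried to modify the original code you posted; what complicates matters is the way you store the data that needs parsing, though I cannot really judge that. Anyway, here is my code: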

def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable in which title is a substring,
    without including title itself. Append 'N\A' for each string in which title is not a substring.
    Example:
             >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
             >>> searchHeader('spam', seq)
             ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')                    # If no match is found, append N\A for that string

    res = ' '.join(res)
    # res = res.replace('"', '')                  You can use this here, or apply it after calling the function
    res = res.split(' ')
    res = [x for x in res if x != title]          # Remove title string from res
    return res

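Regular expressions would also come in handy here. Use this function to parse your data, then format the result to write the table to a file. The function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.

Pass each header string to the function separately, like this: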

for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...

If possible, consider moving attri_values from a plain list to a dictionary, so that the strings can be grouped under their headers:

attri_values = {'header': ('data1', 'data2',...)}

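In my opinion this is better than using a list. Also note that your code overwrites the name list, which is not a good idea, because list is actually the built-in class that creates lists.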
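
A minimal sketch of that dictionary layout, with one entry per header and the loop variable renamed so that the built-in list stays usable, might look like this. The names columns, found and rows, and the shortened sample records, are assumptions introduced for illustration.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# Hypothetical, shortened records in the same 'key "value"' form as the question's data
records = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'trans_id "scaffold_200001.1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]

# One column of values per header: {'header': [value1, value2, ...]}
columns = {header: [] for header in list_headers}

for attributes in records:
    # Split each 'key "value"' string into a (key, value) pair with quotes removed
    found = dict(item.replace('"', '').split(' ', 1) for item in attributes)
    for header in list_headers:          # named `header`, not `list`
        columns[header].append(found.get(header, 'NA'))

# Because `list` was not overwritten, the built-in is still available here
rows = list(zip(*(columns[header] for header in list_headers)))

for row in rows:
    print('\t'.join(row))
```

Each record contributes exactly one value to every column, so a missing attribute yields a single 'NA' rather than one per failed comparison.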

Answer 1
1 2017-04-24 20:01:18

Answer 2
1 2017-04-24 21:11:16

Answer 3
1 2017-04-24 21:27:32
