How to find the match between two lists and write the output based on matches?
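I'm not sure I have titled this question appropriately, but I have tried to explain the problem below. If the question makes sense, please suggest a better title.
Suppose I have two kinds of list data: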
list_headers = ['gene_id', 'gene_name', 'trans_id']
# these are the features to be mined from each line of `attri_values`
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"'],
]
I am trying to build a table based on matches between the entries of list_headers and the attributes in each line of attri_values.
output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n') # this will first write the header
# then I want to read each line
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))
        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))
    output.write('\n')
Problem: when a string from list_headers matches something in the values of attri_values, everything works fine, but when there is no match I get a lot of repeated 'NA' entries.
Final expected result:
gene_id gene_name trans_id
scaffold_200001.1 NA NA
scaffold_200001.1 NA scaffold_200001.1
scaffold_200002.1 NA scaffold_200002.1
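Edit: this is why I wrote the elif (it writes 'NA' for every non-match). I tried moving the NA condition around in different ways, but without success. If I remove the elif, I get this output (the NA values are missing):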
gene_id gene_name trans_id
scaffold_200001.1
scaffold_200001.1 scaffold_200001.1
scaffold_200002.1 scaffold_200002.1
Python strings have a find method, which you can use to check every list header against each of the attri_values. Try this function:
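def Get_Match(search_space, search_string):
    # Return the text that follows search_string, or "N/A" if it is absent
    start_character = search_space.find(search_string)
    if start_character == -1:
        return "N/A"
    else:
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is not defined in this answer; it is assumed to hold one string per record
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))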
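As a rough sketch, str.find could be taken one step further to return only the quoted value instead of everything after the header. This assumes each record has first been joined into a single string. The helper name value_after and the shortened sample record are illustrative and not part of the original answer.

```python
def value_after(record_text, header, missing="NA"):
    """Return the quoted value that follows header in record_text, or missing."""
    marker = header + ' "'
    start = record_text.find(marker)
    if start == -1:
        return missing  # header not present in this record
    value_start = start + len(marker)
    value_end = record_text.find('"', value_start)
    if value_end == -1:
        return record_text[value_start:]  # no closing quote: take the rest
    return record_text[value_start:value_end]


# Shortened sample record, joined into one string (an assumption of this sketch)
record = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"']
record_text = "; ".join(record)

list_headers = ['gene_id', 'gene_name', 'trans_id']
row = [value_after(record_text, header) for header in list_headers]
print("\t".join(row))
```

Note that find matches a substring anywhere in the text, so a header that happens to be the tail of a longer attribute name would also match. Comparing whole keys avoids that.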
I would answer this with pandas:
import pandas as pd
# input data
list_headers = ['gene_id', 'gene_name', 'trans_id']
attri_values = [
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]
# process input data
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values]
# Create DataFrame with the desired columns
df = pd.DataFrame(attri_values_X, columns=list_headers)
# print dataframe
print(df)
Output:
gene_id gene_name trans_id
0 "scaffold_200001.1" NaN NaN
1 "scaffold_200001.1" NaN "scaffold_200001.1"
2 "scaffold_200002.1" NaN "scaffold_200002.1"
It is easy without pandas too. I have already given you attri_values_X, so all that is left is to drop the keys you don't need from each dictionary.
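A minimal sketch of that pandas-free route: parse each record into a dictionary, then look up each header once per row with dict.get so that exactly one 'NA' is written per missing column. The function names parse_record and build_rows are hypothetical, and the records are shortened from the question's data.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# Records abbreviated from the question's data
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"'],
    ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"'],
    ['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"'],
]


def parse_record(record):
    """Turn ['key "value"', ...] into {'key': 'value', ...}."""
    parsed = {}
    for attribute in record:
        key, _, value = attribute.partition(' ')
        parsed[key] = value.strip('"')
    return parsed


def build_rows(records, headers, missing='NA'):
    """Return one list of column values per record, in header order."""
    rows = []
    for record in records:
        parsed = parse_record(record)
        rows.append([parsed.get(header, missing) for header in headers])
    return rows


rows = build_rows(attri_values, list_headers)
lines = ['\t'.join(list_headers)] + ['\t'.join(row) for row in rows]
table_text = '\n'.join(lines) + '\n'
print(table_text, end='')
```

To write a file instead of printing, the same table_text could be passed to the write method of a file opened with `with open('gtf_table', 'w') as output:`.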
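I managed to write a function that helps parse your data. I tried to modify the original code you posted; what complicates things is the way you store the data that needs parsing, though that is not for me to judge. Here is my code: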
def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable in which title is a substring,
    without including title itself. Append 'N\A' for each string that does not contain title.

    Example:
    >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
    >>> searchHeader('spam', seq)
    ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')  # No match: append N\A for this string
    res = ' '.join(res)
    # res = res.replace('"', '')  # Optionally strip quotes here, or after calling the function
    res = res.split(' ')
    res = [x for x in res if x != title]  # Remove the title string from res
    return res
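Regular expressions can also come in handy in a case like this. Use this function to parse the data, then format the result to write the table to a file. The function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.
Pass each header string to the function separately, like this: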
for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...
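The regular-expression route mentioned above might look roughly like the following. This is only a sketch: the pattern, the helper name extract_value, and the assumption that each record is available as one string of `key "value"` pairs are not from the original answer.

```python
import re


def extract_value(record_text, header, missing="NA"):
    """Return the quoted value for header, or missing if the header is absent."""
    # (?<!\S) only lets the header match at the start of the text or after
    # whitespace, so 'gene_id' does not match inside a longer attribute name.
    pattern = r'(?<!\S)' + re.escape(header) + r'\s+"([^"]*)"'
    match = re.search(pattern, record_text)
    return match.group(1) if match else missing


# One record written as a single string (an assumption of this sketch)
record_text = 'gene_id "scaffold_200002.1"; gene_version "1"; trans_id "scaffold_200002.1"; exon_number "3"'

list_headers = ['gene_id', 'gene_name', 'trans_id']
row = [extract_value(record_text, header) for header in list_headers]
print('\t'.join(row))
```

Because the header is looked up once per column, each missing attribute contributes a single 'NA' to the row.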
If possible, consider moving from a plain list for attri_values to a dictionary, so that the strings can be grouped with their headers:
attri_values = {'header': ('data1', 'data2',...)}
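In my opinion this is better than using lists. Also note that your code overwrites the name list, which is not a good idea, because list is the built-in class used to create lists.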
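One way the dictionary layout suggested above could be built is sketched here, with the loop variable named header so that the built-in list is not shadowed. The function name group_by_header and the shortened records are illustrative. The sketch assumes every attribute has the form `key "value"` with a single space after the key.

```python
def group_by_header(records, headers, missing="NA"):
    """Return {header: (value for record 1, value for record 2, ...)}."""
    columns = {header: [] for header in headers}
    for record in records:
        # Each attribute splits into exactly two parts: the key and the quoted value
        parsed = dict(attribute.split(" ", 1) for attribute in record)
        for header in headers:  # 'header', not 'list', keeps the built-in usable
            columns[header].append(parsed.get(header, missing).strip('"'))
    return {header: tuple(values) for header, values in columns.items()}


# Records shortened from the question's data
records = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]
list_headers = ['gene_id', 'gene_name', 'trans_id']

columns = group_by_header(records, list_headers)
# Turn the per-header columns back into table rows
rows = list(zip(*(columns[header] for header in list_headers)))
for row in rows:
    print("\t".join(row))
```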