Extract specific entries from a blastx output file and write them to a new file

Question

I have created a script that successfully searches for keywords (specified by the user) within a Blastx output file in XML format. Now I need to write the records (query, hit, score, e-value, etc.) that contain the keyword in the alignment title to a new file.

I have created separate lists for the query titles, hit titles, e-values and alignment lengths, but cannot seem to write them to a new file.

Problem #1: what if Python errors, and one of the lists is missing a value? Then all the other lists will give wrong information with respect to the query ("line slippage", if you will).

Problem #2: even if Python doesn't error, and all the lists are the same length, how can I write them to a file so that the first items of the lists are associated with each other (and thus item #10 from each list is also associated)? Should I create a dictionary instead?

Problem #3: dictionaries have only a single value for a key, so what if my query has several different hits? I am not sure whether the value will be overwritten or skipped, or whether it will just error. Any suggestions?

My current script:

    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    import re

    #obtain full path to blast output file (*.xml)
    outfile = input("Full path to Blast output file (XML format only): ")

    #obtain string to search for
    search_string = input("String to search for: ")

    #open the output file
    result_handle = open(outfile)

    #parse the blast record
    blast_records = NCBIXML.parse(result_handle)

    #initialize lists
    query_list=[]
    hit_list=[]
    expect_list=[]
    length_list=[]

    #create 'for loop' that loops through each HIGH SCORING PAIR in each ALIGNMENT from each RECORD
    for record in blast_records:
        for alignment in record.alignments: #for description in record.descriptions???
            for hsp in alignment.hsps: #for title in description.title???
                #search for designated string
                search = re.search(search_string, alignment.title)
                #if search comes up with nothing, end
                if search is None:
                    print ("Search string not found.")
                    break
                #if search comes up with something, add it to a list of entries that match search string
                else:
                    #option to include an 'exception' (if it finds keyword then DOES NOT add that entry to list)
                    if search is "trichomonas" or "entamoeba" or "arabidopsis":
                        print ("found exception.")
                        break
                    else:
                        query_list.append(record.query)
                        hit_list.append(alignment.title)
                        expect_list.append(expect_val)
                        length_list.append(length)
                #explicitly convert 'variables' ['int' object or 'float'] to strings
                length = str(alignment.length)
                expect_val = str(hsp.expect)
                #print ("\nquery name: " + record.query)
                #print ("alignment title: " + alignment.title)
                #print ("alignment length: " + length)
                #print ("expect value: " + expect_val)
                #print ("\n***Alignment***\n")
                #print (hsp.query)
                #print (hsp.match)
                #print (hsp.sbjct + "\n\n")
        if query_len is not hit_len is not expect_len is not length_len:
            print ("list lengths don't match!")
            break
        else:
            qrylen = len(query_list)
            query_len = str(qrylen)
            hitlen = len(hit_list)
            hit_len = str(hitlen)
            expectlen = len(expect_list)
            expect_len = str(expectlen)
            lengthlen = len(length_list)
            length_len = str(lengthlen)

    outpath = str(outfile)

    #create new file
    outfile = open("__Blast_Parse_Search.txt", "w")
    outfile.write("File contains entries from [" + outpath + "] that contain [" + search_string + "]")
    outfile.close

    #write list to file
    i = 0
    list_len = int(query_len)
    for i in range(0, list_len):
        #append new file
        outfile = open("__Blast_Parse_Search.txt", "a")
        outfile.writelines(query_list + hit_list + expect_list + length_list)
        i = i + 1

    #write to disk, close file
    outfile.flush()
    outfile.close

    print ("query list length " + query_len)
    print ("hit list length " + hit_len)
    print ("expect list length " + expect_len)
    print ("length list length " + length_len + "\n\n")
    print ("first record: " + query_list[0] + " " + hit_list[0] + " " + expect_list[0] + " " + length_list[0])
    print ("last record: " + query_list[-1] + " " + hit_list[-1] + " " + expect_list[-1] + " " + length_list[-1])
    print ("\nFinished.\n")

Answer 1

If I understand your problem correctly, you could use a default value to deal with the line slippage, for example:

    try:
        x(list)  # placeholder for the call that may fail
    except Exception:
        append_default_value(list)  # placeholder: pad the list so it stays aligned

http://docs.python.org/tutorial/errors.html#handling-exceptions
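A minimal sketch of the default-value idea, assuming each hit is available as a plain dict. The `hits` list, its field names and the `DEFAULT` placeholder are made up for illustration and are not Biopython objects. Every list receives exactly one append per hit, so a missing value cannot shift the entries that follow it:

```python
# Hypothetical stand-ins for parsed BLAST hits; the second one lacks an e-value.
hits = [
    {"query": "q1", "title": "kinase A", "expect": 1e-30},
    {"query": "q2", "title": "kinase B"},
]

DEFAULT = "NA"  # placeholder written when a value cannot be read

query_list, hit_list, expect_list = [], [], []

for hit in hits:
    query_list.append(hit["query"])
    hit_list.append(hit["title"])
    try:
        expect_list.append(str(hit["expect"]))
    except KeyError:
        # Append a placeholder so all three lists keep the same length.
        expect_list.append(DEFAULT)
```

Catching the specific exception (`KeyError` here) rather than a bare `Exception` keeps unrelated bugs from being silently replaced by the placeholder.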

Or use tuples as dictionary keys, like (0,1,1), and use the get method to supply your default value.

http://docs.python.org/py3k/library/stdtypes.html#mapping-types-dict
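As a rough illustration of tuple keys combined with `get`, the sketch below assumes the key is a (query, hit title) pair. That choice, the sample values and the "NA" default are assumptions for the example, not something taken from the script in the question:

```python
# Hypothetical data: keys are (query, hit title) tuples, values are e-values.
evalues = {
    ("q1", "kinase A"): 1e-30,
    ("q1", "kinase B"): 2e-12,
    ("q2", "kinase A"): 5e-08,
}

# Two hits for the same query are stored under different keys,
# so neither overwrites the other.
q1_hits = sorted(title for (query, title) in evalues if query == "q1")

# dict.get returns the supplied default instead of raising KeyError.
missing = evalues.get(("q3", "kinase A"), "NA")
```

Assigning to a key that already exists replaces the old value without raising an error, which is why the hit title is made part of the key here.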

If you need to maintain data structures in your output files, you might try using shelve.
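A small sketch of how `shelve` could be used for this, assuming the hits have already been grouped by query name. The `records` dict and the `blast_hits` file name are invented for the example, and a temporary directory is used only so the sketch leaves no files behind:

```python
import os
import shelve
import tempfile

# Hypothetical records keyed by query name; shelve keys must be strings.
records = {
    "q1": [("kinase A", "1e-30", "250")],
    "q2": [("kinase B", "2e-12", "198"), ("kinase C", "4e-05", "120")],
}

# A temporary directory keeps the example from cluttering the working directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "blast_hits")

    # Store each query's list of hits under its name.
    with shelve.open(path) as db:
        for query, hit_rows in records.items():
            db[query] = hit_rows

    # Reopen the shelf and read the data back with its structure intact.
    with shelve.open(path) as db:
        restored = {query: db[query] for query in db}
```

The shelf is a binary file meant to be read back by Python, so it suits intermediate storage better than a report someone will open in a text editor.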

Or you could append some type of reference after each record and give each record a unique id, for example ' #32{somekey:value}#21#22#44# '

Again, you can have multiple keys using a tuple.
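Taking the tuple idea one step further, each hit can be kept as a single tuple instead of being spread across four parallel lists, which makes writing aligned lines straightforward. The sketch below is only one possible layout: the `rows` data, the `write_hits` helper and the column names are hypothetical, and `io.StringIO` stands in for a real output file:

```python
import csv
import io

# Hypothetical matches: one (query, hit title, e-value, length) tuple per hit.
rows = [
    ("q1", "kinase A", 1e-30, 250),
    ("q1", "kinase B", 2e-12, 198),
    ("q2", "kinase A", 5e-08, 120),
]

def write_hits(rows, handle):
    """Write a header line, then one tab-separated line per hit."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query", "hit", "evalue", "length"])
    writer.writerows(rows)

# io.StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_hits(rows, buffer)
lines = buffer.getvalue().splitlines()
```

Because the values of one hit travel together in one tuple, a query with several hits simply produces several lines, and there are no separate lists that could drift out of step.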

I don't know if that helps; you might clarify exactly which parts of your code you have trouble with, along the lines of "x() gives me output y but I expect z".

Answer 1
0 2012-08-22 21:52:09