Extract specific entries from blastx output file, write to new file

I have created a script that successfully searches for user-specified keywords within a Blastx output file in XML format. Now I need to write the records (query, hit, score, e-value, etc.) whose alignment title contains the keyword to a new file.

I have created separate lists for the query titles, hit titles, e-values and alignment lengths, but cannot seem to write them to a new file.

  • Problem #1: what if Python raises an error and one of the lists ends up missing a value? Then all the other lists will give wrong information for that query ("line slippage", if you will).

  • Problem #2: even if Python doesn't raise an error and all the lists are the same length, how can I write them to a file so that the first item in each list is associated with the first item in the others (and thus item #10 from each list is also associated)? Should I create a dictionary instead?

  • Problem #3: a dictionary holds only a single value per key. What if my query has several different hits? I'm not sure whether the value will be overwritten or skipped, or whether it will just raise an error.

Any suggestions? My current script:

    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    import re

    #obtain full path to blast output file (*.xml)
    outfile = input("Full path to Blast output file (XML format only): ")

    #obtain string to search for
    search_string = input("String to search for: ")

    #open the output file
    result_handle = open(outfile)

    #parse the blast record
    blast_records = NCBIXML.parse(result_handle)

    #initialize lists
    query_list=[]
    hit_list=[]
    expect_list=[]
    length_list=[]

    #create 'for loop' that loops through each HIGH SCORING PAIR in each ALIGNMENT from each RECORD
    for record in blast_records:
        for alignment in record.alignments:   #for description in record.descriptions???
            for hsp in alignment.hsps:        #for title in description.title???

                #search for designated string
                search = re.search(search_string, alignment.title)

                #if search comes up with nothing, end
                if search is None:
                    print ("Search string not found.")
                    break

                #if search comes up with something, add it to a list of entries that match search string
                else:

                    #option to include an 'exception' (if it finds keyword then DOES NOT add that entry to list)
                    if search is "trichomonas" or "entamoeba" or "arabidopsis":
                        print ("found exception.")
                        break

                    else:
                        query_list.append(record.query)
                        hit_list.append(alignment.title)
                        expect_list.append(expect_val)
                        length_list.append(length)

                    #explicitly convert 'variables' ['int' object or 'float'] to strings
                    length = str(alignment.length)
                    expect_val = str(hsp.expect)

                    #print ("\nquery name: " + record.query)
                    #print ("alignment title: " + alignment.title)
                    #print ("alignment length: " + length)
                    #print ("expect value: " + expect_val)
                    #print ("\n***Alignment***\n")
                    #print (hsp.query)
                    #print (hsp.match)
                    #print (hsp.sbjct + "\n\n")

                if query_len is not hit_len is not expect_len is not length_len:
                    print ("list lengths don't match!")
                    break

                else:
                    qrylen = len(query_list)
                    query_len = str(qrylen)
                    hitlen = len(hit_list)
                    hit_len = str(hitlen)
                    expectlen = len(expect_list)
                    expect_len = str(expectlen)
                    lengthlen = len(length_list)
                    length_len = str(lengthlen)

    outpath = str(outfile)

    #create new file
    outfile = open("__Blast_Parse_Search.txt", "w")
    outfile.write("File contains entries from [" + outpath + "] that contain [" + search_string + "]")
    outfile.close

    #write list to file
    i = 0
    list_len = int(query_len)
    for i in range(0, list_len):

        #append new file
        outfile = open("__Blast_Parse_Search.txt", "a")
        outfile.writelines(query_list + hit_list + expect_list + length_list)
        i = i + 1

    #write to disk, close file
    outfile.flush()
    outfile.close

    print ("query list length " + query_len)
    print ("hit list length " + hit_len)
    print ("expect list length " + expect_len)
    print ("length list length " + length_len + "\n\n")
    print ("first record: " + query_list[0] + " " + hit_list[0] + " " + expect_list[0] + " " + length_list[0])
    print ("last record: " + query_list[-1] + " " + hit_list[-1] + " " + expect_list[-1] + " " + length_list[-1])
    print ("\nFinished.\n")

If I understand your problem correctly, you could use a default value to deal with the line slippage, for example:

    try:
        x(some_list)                      # the operation that might fail
    except Exception:
        append_default_value(some_list)   # keep this list as long as the others

http://docs.python.org/tutorial/errors.html#handling-exceptions
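As a concrete illustration of the default-value idea, here is a minimal sketch that builds one row per matching alignment and substitutes a placeholder when a field cannot be read, so the columns cannot slip relative to each other. The helper names `field_or_default` and `collect_rows`, the `"NA"` placeholder and the stand-in demo objects are assumptions made for illustration; the attribute names (`query`, `alignments`, `title`, `length`, `hsps`, `expect`) are the ones used in the question's script. It only looks at the first HSP of each alignment, which is a simplification.

```python
import re
from types import SimpleNamespace

MISSING = "NA"  # assumed placeholder for a field that could not be read


def field_or_default(getter, default=MISSING):
    """Call getter(); return default instead of raising if the field is absent."""
    try:
        return getter()
    except (AttributeError, IndexError, KeyError, TypeError):
        return default


def collect_rows(records, pattern, excluded=()):
    """Return one (query, title, e-value, length) tuple per matching alignment."""
    rows = []
    for record in records:
        for alignment in record.alignments:
            title = alignment.title
            if not re.search(pattern, title, re.IGNORECASE):
                continue  # keyword not in this title; keep scanning the others
            if any(word in title.lower() for word in excluded):
                continue  # title mentions an excluded organism
            rows.append((
                field_or_default(lambda: record.query),
                title,
                field_or_default(lambda: alignment.hsps[0].expect),
                field_or_default(lambda: alignment.length),
            ))
    return rows


# Stand-in objects shaped like parsed records; real input would come from NCBIXML.parse().
demo = [SimpleNamespace(
    query="contig_1",
    alignments=[
        SimpleNamespace(title="kinase [Homo sapiens]", length=300,
                        hsps=[SimpleNamespace(expect=1e-20)]),
        SimpleNamespace(title="kinase [Entamoeba histolytica]", length=280, hsps=[]),
        SimpleNamespace(title="kinase [Mus musculus]", length=290, hsps=[]),
    ],
)]
rows = collect_rows(demo, "kinase", excluded=("entamoeba",))
print(rows)
```

Because every hit is stored as a single tuple, a missing value shows up as `"NA"` in its own row rather than shifting the rows that follow.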

Or use tuples as dictionary keys, like (0,1,1), and use the get method to supply your default value.

http://docs.python.org/py3k/library/stdtypes.html#mapping-types-dict
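A small sketch of the tuple-key idea: keying the dictionary on `(query, hit)` rather than on the query alone means several hits for one query do not overwrite each other, and `get` supplies a default for a pair that was never stored. The sample names and values are made up.

```python
# Keys are (query, hit title) tuples, so one query can have many hits.
hits = {}
hits[("contig_1", "kinase [Homo sapiens]")] = {"evalue": 1e-20, "length": 300}
hits[("contig_1", "kinase [Mus musculus]")] = {"evalue": 3e-15, "length": 290}

# Assigning to an existing key silently replaces the old value (no error, no skip).
hits[("contig_1", "kinase [Homo sapiens]")] = {"evalue": 1e-25, "length": 300}

# get() returns the default instead of raising KeyError for an unknown key.
missing = hits.get(("contig_2", "no such hit"), {"evalue": None, "length": None})

print(len(hits))
print(missing)
```

If one list of hits per query is more convenient than one entry per pair, `collections.defaultdict(list)` is another option.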

If you need to preserve data structures in your output files, you might try the shelve module.
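For instance, a minimal sketch of storing the hits with `shelve` and reading them back as Python objects might look like this. The sample data, the file name `blast_hits` and the use of a temporary directory are assumptions to keep the example self-contained.

```python
import os
import shelve
import tempfile

rows = {
    "contig_1": [("kinase [Homo sapiens]", 1e-20, 300)],
    "contig_2": [("actin [Mus musculus]", 2e-8, 150)],
}

# A temporary directory keeps the example self-contained; use a real path in practice.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "blast_hits")

    # Store each query's list of hits under the query name (shelf keys must be str).
    with shelve.open(path) as db:
        for query, hit_list in rows.items():
            db[query] = hit_list

    # Reopen later and get the same Python objects back.
    with shelve.open(path) as db:
        restored = {key: db[key] for key in db}

print(restored == rows)
```

Note that a shelf is a binary file meant to be read back by Python, not a human-readable report.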

Or you could append some type of reference after each record and give each record a unique id, for example ' #32{somekey:value}#21#22#44# '.
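One plain-text way to give each record a unique id is to write one tab-separated line per record with a running number in front. This is only a sketch of that idea, not the exact format quoted above. The function name `write_records`, the column names and the sample lists are assumptions; `zip()` is what ties item 1 of every list together, item 2 of every list together, and so on.

```python
import csv
import io


def write_records(rows, handle):
    """Write a header plus one tab-separated line per record, each with a running id."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "query", "hit", "evalue", "length"])
    for record_id, row in enumerate(rows, start=1):
        writer.writerow([record_id, *row])


queries = ["contig_1", "contig_1"]
hit_titles = ["kinase [Homo sapiens]", "kinase [Mus musculus]"]
evalues = [1e-20, 3e-15]
lengths = [300, 290]

# zip() pairs up the lists position by position; it stops at the shortest list.
rows = list(zip(queries, hit_titles, evalues, lengths))

buffer = io.StringIO()  # stands in for open("some_output.txt", "w", newline="")
write_records(rows, buffer)
print(buffer.getvalue())
```

Since `zip()` silently stops at the shortest list, it is worth checking that all lists have the same length before writing if they were built separately.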

Again, you can have multiple keys by using a tuple.

I don't know if that helps. You might clarify exactly which parts of your code you are having trouble with, in the form: x() gives me output y, but I expect z.
