
How to refine a python script for a bioinformatics query

I am quite new to Python and would be grateful for some assistance. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.

Below is a section of my Python script, in which I have been able to identify a 7-nucleotide sequence (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG

Assume the gap is at position 9. I am trying to refine the script so that it selects only gaps that are 20 nucleotides or more apart on both sequences, and only if the surrounding nucleotides also match:

ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
                 ||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC

This is the relevant section of my script; the top half deals with opening different files. It also prints a dictionary with the count of each sequence at the end.

import re

# Positions (0-based) of every gap character "-" in the E_C sequence.
list_of_positions = [match.start() for match in re.finditer("-", dict_seqs[E_C])]
set_of_positions = set(list_of_positions)

# Gap positions in the E_F sequence. This set is used below but was not
# defined in the snippet as posted; building it the same way is an assumption.
set_of_positions_EF = set(match.start() for match in re.finditer("-", dict_seqs[E_F]))

for position in list_of_positions:
    # The 20 alignment columns on each side of the gap, excluding the gap itself.
    set_no_indels = set(range(position - 20, position)) | set(range(position + 1, position + 21))

    # Skip this gap if either sequence has another gap inside that window.
    if set_no_indels & set_of_positions:
        continue
    if set_no_indels & set_of_positions_EF:
        continue

    print(position)

    # Six E_F nucleotides around the gap: three before it, the one opposite it, two after.
    key = dict_seqs[E_F][position - 3:position + 3]
    print(key)

    # Count how often each hexamer is seen.
    nt_dict[key] = nt_dict.get(key, 0) + 1

print(nt_dict)
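
For readers who do not have the file-handling half of the script, the filtering step can be tried in isolation. The following is a minimal sketch, not the script's actual code: the function names gap_positions, isolated_gaps, and count_opposite are hypothetical, it is written for Python 3, and it assumes both inputs are rows of the same alignment and therefore have equal length.

```python
import re
from collections import Counter


def gap_positions(seq):
    """Return the 0-based index of every '-' in an aligned sequence."""
    return [m.start() for m in re.finditer("-", seq)]


def isolated_gaps(gapped, other, window=20):
    """Gaps in `gapped` with no other gap within `window` columns in either row."""
    own = set(gap_positions(gapped))
    all_gaps = own | set(gap_positions(other))
    keep = []
    for pos in sorted(own):
        # Columns on both sides of the gap, excluding the gap column itself.
        neighbours = set(range(pos - window, pos + window + 1)) - {pos}
        if not neighbours & all_gaps:
            keep.append(pos)
    return keep


def count_opposite(gapped, other, window=20, flank=3):
    """Count the stretches of `other` that sit opposite isolated gaps in `gapped`."""
    counts = Counter()
    for pos in isolated_gaps(gapped, other, window):
        if pos < flank:
            # Too close to the start: a negative slice index would wrap around.
            continue
        counts[other[pos - flank:pos + flank]] += 1
    return counts


query = "ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA"
database = "GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG"
print(isolated_gaps(query, database))   # [20]
print(count_opposite(query, database))  # Counter({'CCGGAC': 1})
```

Calling count_opposite a second time with the arguments swapped would cover gaps in the database row. Counter is used here only to avoid the explicit "is the key already present" check.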

Essentially, I am editing the results of pairwise alignments to identify the nucleotides opposite the gaps in both the query and database sequences, in order to conduct some basic insertion/deletion analysis.

I was able to solve one of my earlier issues by increasing the minimum distance between gaps ("-") to 20 nt in an attempt to reduce noise; this has improved my results. The script above has been edited accordingly.

This is an example of my results. At the end I have a dictionary which counts the occurrences of each sequence.

ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)


ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)

However, I am still trying to fix the script so that the nucleotides around the gap match exactly, as in the example below, where the | is just to show matching nt's on each sequence:

GGTTACCG-ACCTGTATGTGAACTCAACA # query
     ||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database

9
ACCGAC

Any help with this would be greatly appreciated!

I think I understand what you are trying to do, but as @alko has said, comments in your code would definitely help a lot.

As for finding an exact match around the gap, you could run a string comparison.

Something along the lines of:

if query[position -3: position] == database[position -3: position] and query[position +1: position +3] == database[position +1: position +3]:
   # Do something

You will need to replace "query" and "database" with the names of the strings that you want to compare.
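
As a rough illustration, the comparison can be wrapped in a small helper and tried on the alignments from the question. This is only a sketch under some assumptions: the name flanks_match and its left/right parameters are made up for the example, position is the 0-based index of the gap, and the defaults (three nucleotides before the gap, two after) simply mirror the slices used above.

```python
def flanks_match(query, database, position, left=3, right=2):
    """Return True if the columns flanking the gap are identical in both rows.

    `position` is the 0-based index of the gap column in the alignment.
    """
    # Refuse gaps that sit too close to either end for a full flank.
    if position < left or position + 1 + right > len(query):
        return False
    left_ok = query[position - left:position] == database[position - left:position]
    right_ok = (query[position + 1:position + 1 + right]
                == database[position + 1:position + 1 + right])
    return left_ok and right_ok


# Flanks agree (CCG before the gap, AC after it in both rows).
print(flanks_match("GGTTACCG-ACCTGTATGTGAACTCAACA",
                   "CCTTACCGGACCGCCCAGGGCGGCCCAAG", 8))   # True

# Flanks differ (CAA vs GAG before the gap).
print(flanks_match("ATGCACAA-ACCTGTATG",
                   "ATGCAGAGGAAGAGCAAG", 8))              # False
```

Inside the loop over gap positions, a call such as flanks_match(dict_seqs[E_C], dict_seqs[E_F], position) could then decide whether a gap is kept; passing right=3 would require three matching nucleotides on both sides.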
