Extracting a set of lines from a text file

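I have a set of text files, for example https://www.uniprot.org/uniprot/A0R4Q6.txt

I am trying to write a function that takes a UniProt ID as input and outputs a dataframe (ideally one I can use as input to scikit-learn), in exactly the following format: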

UniProt-ID,Position,AA   
A0R4Q6,1,M
A0R4Q6,2,T
A0R4Q6,3,Q

This is what I am currently working with:

def get_features(ID):
    featureList=[]
    # set and open link to the UniProt website
    link="https://www.uniprot.org/uniprot/{}.txt".format(ID) 
    file = urllib.request.urlopen(link)
    #find amino acid sequence
    for line in file:
        nextLine = next(file)
        #print(nextLine)
        if b'SQ' in line:
            print(line)
            #unsure how to extract more than 1 line
            #additionally, the number of lines that
            #I will need will be variable, depending on the protein length
            
            #this is what I think the extracted lines put into a string will look like
            aaSeq='MTQMLTRPDV\tDLVNGMFYAD\tGGAREAYRWM\tRANEPVFRDR\tNGLAAATTYQ\tAVLDAERNPE\nLFSSTGGIRP\tDQPGMPYMID'
            #remove \t and \n characters
            ActualSeq=re.sub('\s+', '', aaSeq)
            print(ActualSeq)
    #now iterate through the string to create dataframe?
    p=1
    for i in ActualSeq:
        featureList.append([ID,p,i])
        p+=1
    return featureList
seq=get_features('A0R4Q6')
print(seq)

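I have two problems:

  1. Searching for b'SQ' returns nothing, although the same syntax does work if I search for b'ID' or b'FT', etc. Any idea why it fails to recognize 'SQ'?
  2. I don't know how to make this for loop return all the lines after the 'SQ' line, up to the last line containing '//', and condense them into a single string.

Also, is this approach of building the "dataframe" as a list of tuples the most efficient one, or should I be doing something completely different? The end goal is to use this dataframe as input to a scikit-learn random forest.

TIA!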

To get the exact output you requested, try the following:

import urllib.request


def get_features(ID):
    featureList = []

    # Build the URL for this entry on the UniProt website
    link = "https://www.uniprot.org/uniprot/{}.txt".format(ID)

    found_seq = False
    full_seq = ''

    # Find the amino acid sequence: it follows the 'SQ' header line
    with urllib.request.urlopen(link) as file:
        for line in file:
            if line.startswith(b'SQ   '):
                found_seq = True
            elif found_seq and line.startswith(b'     '):
                # Sequence lines are indented; drop all whitespace and append
                full_seq += ''.join(line.decode("utf-8").split())
            else:
                # Any other line (such as the closing '//') ends the sequence block
                found_seq = False

    # Enumerate residues with 1-based positions
    for i, a in enumerate(full_seq, start=1):
        featureList.append([ID, i, a])
    return featureList


seq = get_features('A0R4Q6')

for item in seq:
    print(item)

It will print the following:

['A0R4Q6', 1, 'M']
['A0R4Q6', 2, 'T']
['A0R4Q6', 3, 'Q']
['A0R4Q6', 4, 'M']
['A0R4Q6', 5, 'L']
...
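
On the first question, a likely cause is the next(file) call inside the loop in the question: it consumes a line on every iteration, so the loop body only ever checks every other line, and the 'SQ' header can land on one of the skipped ones. The sketch below reproduces that effect offline and then collects the sequence up to '//' and writes it in the UniProt-ID,Position,AA layout. It is only an illustration under assumptions: SAMPLE_ENTRY is a made-up, shortened entry rather than the real A0R4Q6 file, and the helper names (extract_sequence, to_rows, rows_to_csv, every_other_line) are hypothetical.

```python
import csv
import io

# Hypothetical, shortened entry in the flat-file layout (not the real A0R4Q6 data).
SAMPLE_ENTRY = """\
ID   EXAMPLE_ENTRY           Reviewed;          20 AA.
SQ   SEQUENCE   20 AA;  2000 MW;  0000000000000000 CRC64;
     MTQMLTRPDV DLVNGMFYAD
//
"""


def extract_sequence(lines):
    """Join the lines between the 'SQ' header and the '//' terminator."""
    in_sequence = False
    chunks = []
    for line in lines:
        if line.startswith("SQ   "):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Remove the spaces that separate the blocks of residues
            chunks.append("".join(line.split()))
    return "".join(chunks)


def to_rows(uniprot_id, sequence):
    """Build (ID, 1-based position, residue) tuples."""
    return [(uniprot_id, pos, aa) for pos, aa in enumerate(sequence, start=1)]


def rows_to_csv(rows):
    """Render the rows as CSV text with the header from the question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["UniProt-ID", "Position", "AA"])
    writer.writerows(rows)
    return buffer.getvalue()


def every_other_line(lines):
    """Mimic the question's loop, where next() swallows a line per iteration."""
    iterator = iter(lines)
    seen = []
    for line in iterator:
        next(iterator, None)  # this line is never inspected by the loop body
        seen.append(line)
    return seen


if __name__ == "__main__":
    lines = SAMPLE_ENTRY.splitlines()
    print(every_other_line(lines))  # the 'SQ' line is missing from this list
    sequence = extract_sequence(lines)
    print(rows_to_csv(to_rows("A0R4Q6", sequence)))
```

Working on a list of text lines keeps the parsing step separate from the download, so it can be tried without network access. For the real file, the decoded lines from urllib.request.urlopen could be passed to extract_sequence in the same way.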
