Extracting a set of lines from a text file

Question

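I have a set of text files, for example https://www.uniprot.org/uniprot/A0R4Q6.txt

I am trying to write a function that takes a UniProt ID as input and outputs a dataframe (ideally one I can use as input to scikit-learn), in the following format only: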

UniProt-ID,Position,AA   
A0R4Q6,1,M
A0R4Q6,2,T
A0R4Q6,3,Q

This is what I am currently working with:

def get_features(ID):
    featureList=[]
    #set and open link to uniprot website
    link="https://www.uniprot.org/uniprot/{}.txt".format(ID) 
    file = urllib.request.urlopen(link)
    #find amino acid sequence
    for line in file:
        nextLine = next(file)
        #print(nextLine)
        if b'SQ' in line:
            print(line)
            #unsure how to extract more than 1 line
            #additionally, the number of lines that
            #I will need will be variable, depending on the protein length
            
            #this is what I think the extracted lines put into a string will look like
            aaSeq='MTQMLTRPDV\tDLVNGMFYAD\tGGAREAYRWM\tRANEPVFRDR\tNGLAAATTYQ\tAVLDAERNPE\nLFSSTGGIRP\tDQPGMPYMID'
            #remove \t and \n characters
            ActualSeq=re.sub('\s+', '', aaSeq)
            print(ActualSeq)
    #now iterate through the string to create dataframe?
    p=1
    for i in ActualSeq:
        featureList.append([ID,p,i])
        p+=1
    return featureList
seq=get_features('A0R4Q6')
print(seq)

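I have two problems:

1. Searching for b'SQ' does not return anything, yet the same syntax works if I search for b'ID' or b'FT', etc. Any idea why it fails to recognise 'SQ'?
2. I don't know how to make this for loop return all the lines after the 'SQ' line, up to the final line containing '//', and condense them into a single string.

Also, is this approach of building the "dataframe" as a list of tuples the most efficient, or should I be doing something completely different? The end goal is to use this dataframe as input to a scikit-learn random forest.

TIA!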

Answer 1

To get the exact output you asked for, try the following:

import urllib.request


def get_features(ID):
    featureList = []

    # Set and open link to the UniProt website
    link = "https://www.uniprot.org/uniprot/{}.txt".format(ID)
    file = urllib.request.urlopen(link)

    found_seq = False
    full_seq = ''

    # Find the amino acid sequence: it starts on the line after 'SQ'
    # and continues on every following line that begins with spaces
    for line in file:
        if line.startswith(b'SQ   '):
            found_seq = True
        elif found_seq and line.startswith(b'     '):
            # Drop the spaces that separate the blocks of residues
            line = ''.join(line.decode("utf-8").split())
            full_seq += line
        else:
            # Any other line (e.g. the closing '//') ends the sequence
            found_seq = False

    # Number the residues, starting at position 1
    for i, a in enumerate(full_seq, start=1):
        featureList.append([ID, i, a])
    return featureList


seq = get_features('A0R4Q6')

for item in seq:
    print(item)

It prints the following:

['A0R4Q6', 1, 'M']
['A0R4Q6', 2, 'T']
['A0R4Q6', 3, 'Q']
['A0R4Q6', 4, 'M']
['A0R4Q6', 5, 'L']
...
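
On the first problem (why searching for b'SQ' finds nothing): a likely cause is the `nextLine = next(file)` call inside the loop. The response object is its own iterator, so each `next(file)` consumes the line that the `for` loop would otherwise have handed out next, and `line` only ever holds every other line. If the single SQ line falls on one of the skipped positions it is never tested, whereas ID is the first line and FT lines are numerous enough that some of them are always seen. Below is a minimal sketch of the effect, using a made-up six-line sample in `io.BytesIO` as a stand-in for the download. The sample content and the helper names are assumptions for illustration only.

```python
import io

# Made-up stand-in for a UniProt flat file (not real data);
# the SQ header is deliberately the 4th line.
SAMPLE = (
    b"ID   TEST_ENTRY\n"
    b"AC   X00000;\n"
    b"FT   CHAIN\n"
    b"SQ   SEQUENCE   20 AA;\n"
    b"     MTQMLTRPDV DLVNGMFYAD\n"
    b"//\n"
)


def lines_seen_with_next(stream):
    """Mimic the question's loop, which calls next() inside the for loop."""
    seen = []
    for line in stream:
        # The question uses next(file); a default is passed here only so the
        # sketch cannot raise StopIteration on an odd number of lines.
        next(stream, None)
        seen.append(line)
    return seen


def lines_seen_plain(stream):
    """Plain iteration: every line reaches the loop body."""
    return [line for line in stream]


skipping = lines_seen_with_next(io.BytesIO(SAMPLE))
plain = lines_seen_plain(io.BytesIO(SAMPLE))

print(any(line.startswith(b"SQ") for line in skipping))  # False
print(any(line.startswith(b"SQ") for line in plain))     # True
```

Removing the `next(file)` call, as the function above does, lets every line reach the `if` test. Matching with `startswith` rather than `in` also avoids false hits on lines that merely contain the letters "SQ" somewhere else.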

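As for the output format, the list of `[ID, position, residue]` rows can be written out in the comma-separated layout shown in the question. This is only a sketch using the standard `csv` module. The function name `rows_to_csv` is hypothetical, and the three rows are typed in by hand rather than downloaded.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise [ID, position, residue] rows under the header from the question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["UniProt-ID", "Position", "AA"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written stand-in for the first three rows of get_features('A0R4Q6')
example_rows = [["A0R4Q6", 1, "M"], ["A0R4Q6", 2, "T"], ["A0R4Q6", 3, "Q"]]

csv_text = rows_to_csv(example_rows)
print(csv_text)
```

If pandas is installed, `pandas.DataFrame(rows, columns=["UniProt-ID", "Position", "AA"])` should build a frame directly from the same list. Note that scikit-learn estimators generally expect numeric features, so the `AA` letters would likely need to be encoded (for example one-hot) before fitting a random forest.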