Extracting a set of lines from a text file
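I have a set of text files, such as https://www.uniprot.org/uniprot/A0R4Q6.txt
I am trying to write a function that takes a UniProt ID as input and outputs a dataframe (ideally one I can use as input to scikit-learn) in exactly the following format: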
UniProt-ID,Position,AA
A0R4Q6,1,M
A0R4Q6,2,T
A0R4Q6,3,Q
This is what I am working with so far:
import re
import urllib.request

def get_features(ID):
    featureList=[]
    #set and open link to uniprot website
    link="https://www.uniprot.org/uniprot/{}.txt".format(ID)
    file = urllib.request.urlopen(link)
    #find amino acid sequence
    for line in file:
        nextLine = next(file)
        #print(nextLine)
        if b'SQ' in line:
            print(line)
            #unsure how to extract more than 1 line
            #additionally, the number of lines that
            #I will need will be variable, depending on the protein length
    #this is what I think the extracted lines put into a string will look like
    aaSeq='MTQMLTRPDV\tDLVNGMFYAD\tGGAREAYRWM\tRANEPVFRDR\tNGLAAATTYQ\tAVLDAERNPE\nLFSSTGGIRP\tDQPGMPYMID'
    #remove \t and \n characters
    ActualSeq=re.sub(r'\s+', '', aaSeq)
    print(ActualSeq)
    #now iterate through the string to create dataframe?
    p=1
    for i in ActualSeq:
        featureList.append([ID,p,i])
        p+=1
    return featureList

seq=get_features('A0R4Q6')
print(seq)
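I have two questions:
First, as noted in the code comments, how do I extract the sequence lines that follow the SQ line, given that their number varies with protein length?
Also, is this approach of putting the "dataframe" into a list of tuples the most efficient one, or should I be doing something completely different? The end goal is to use this dataframe as input to a scikit-learn random forest.
TIA!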
To get the exact output you asked for, try the following:
import urllib.request

def get_features(ID):
    featureList=[]
    # Set and open link to uniprot website
    link="https://www.uniprot.org/uniprot/{}.txt".format(ID)
    file = urllib.request.urlopen(link)
    found_seq = False
    full_sec = ''
    # Find amino acid sequence: it starts on the line after the 'SQ' header
    for line in file:
        if line.startswith(b'SQ '):
            found_seq = True
        elif found_seq and line.startswith(b' '):
            # Sequence lines are indented; strip all whitespace and append
            line = ''.join(line.decode("utf-8").split())
            # print(line)
            full_sec += line
        else:
            found_seq = False
    # Enumerate items (positions are 1-based)
    for i, a in enumerate(full_sec):
        featureList.append([ID, i+1, a])
    return featureList

seq = get_features('A0R4Q6')
for item in seq:
    print(item)
It will print the following:
['A0R4Q6', 1, 'M']
['A0R4Q6', 2, 'T']
['A0R4Q6', 3, 'Q']
['A0R4Q6', 4, 'M']
['A0R4Q6', 5, 'L']
...
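For the comma-separated layout shown in the question (UniProt-ID,Position,AA), here is a minimal offline sketch of the same line-scanning idea. It assumes the entry text is already available as a string. SAMPLE_ENTRY is a shortened, made-up stand-in for the real file, and the helper names extract_sequence, to_rows and rows_to_csv are hypothetical.

```python
import csv
import io

# Hypothetical, shortened entry in UniProt flat-file layout.
# This is NOT the real A0R4Q6 record; it only mimics the SQ block.
SAMPLE_ENTRY = """\
ID   EXAMPLE_ENTRY           Reviewed;          30 AA.
SQ   SEQUENCE   30 AA;  0 MW;  0000000000000000 CRC64;
     MTQMLTRPDV DLVNGMFYAD
     GGAREAYRWM
//
"""


def extract_sequence(lines):
    """Join the indented sequence lines that follow the 'SQ' header line."""
    parts = []
    in_sequence = False
    for line in lines:
        if line.startswith("SQ"):
            in_sequence = True
        elif in_sequence and line.startswith(" "):
            # Drop the spaces that separate the blocks of residues.
            parts.append("".join(line.split()))
        else:
            in_sequence = False
    return "".join(parts)


def to_rows(uniprot_id, sequence):
    """Build (ID, 1-based position, residue) tuples."""
    return [(uniprot_id, pos, aa) for pos, aa in enumerate(sequence, start=1)]


def rows_to_csv(rows):
    """Serialize the rows with the header used in the question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["UniProt-ID", "Position", "AA"])
    writer.writerows(rows)
    return buffer.getvalue()


sequence = extract_sequence(SAMPLE_ENTRY.splitlines())
rows = to_rows("A0R4Q6", sequence)
csv_text = rows_to_csv(rows)
print(csv_text)
```

If pandas is installed, the same rows can probably be passed straight to pandas.DataFrame(rows, columns=["UniProt-ID", "Position", "AA"]). A list of rows is a reasonable intermediate at this size. Note that a random forest in scikit-learn expects numeric features, so the AA column would likely need some encoding (one-hot, for example) before fitting.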