
Converting byte type file to a workable format in Python

I have to read the table at the link below (an HTML page) into a dict() and then work on it. However, with the code below, the table still looks clumsy, and I do not understand where to start in order to turn it into a dictionary that maps each codon sequence (e.g. AGU) to its respective amino acid. Is there any way to make it look better? Maybe something like a DataFrame, or any other suggestions? Please help. Thanks.

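import urllib.request

link = "http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606&aa=1&style=N"
f = urllib.request.urlopen(link)
myfile = f.read()      # raw bytes of the HTML page
s = myfile.decode()    # bytes -> str (UTF-8 by default)
s = s.strip(" ")       # strip() returns a new string, so the result must be kept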

If you look at the page http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606&aa=1&style=N you will notice that it contains not just the codon sequences you want, but also a lot of HTML around them. To extract just the codons, the best way is probably to use BeautifulSoup:

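import urllib.request
from bs4 import BeautifulSoup

link = "http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606&aa=1&style=N"
f = urllib.request.urlopen(link)
myfile = f.read()      # raw bytes of the HTML page
s = myfile.decode()    # bytes -> str
# Name the parser explicitly to avoid bs4's "no parser specified" warning;
# the codon table is the text of the first <pre> element.
codons = BeautifulSoup(s, 'html.parser').find('pre').text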
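
If BeautifulSoup is not installed, the same idea (keep only the text inside <pre>) can be sketched with the standard library's html.parser. This is only an illustration, not part of the original answer: the class name PreTextExtractor, the helper extract_pre_text, and the sample HTML with its placeholder numbers are all made up for the example. Unlike find('pre'), it collects the text of every <pre> element rather than only the first.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <pre> elements we are currently inside
        self.chunks = []     # text fragments seen inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_pre_text(html):
    """Return the concatenated text of all <pre> elements in an HTML string."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Made-up stand-in for the downloaded page; the numbers are placeholders.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<pre>\nUUU F 0.50 10.0 (1000)\n</pre>"
    "<p>footer</p></body></html>"
)
pre_text = extract_pre_text(sample_html)
print(pre_text)
```

With the real page, the decoded string s would be passed in place of sample_html.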

Now you will probably want to process this string further to get it into the form you need: dict, list, DataFrame, whatever. Assuming you just want a dict, since that is what you mentioned:

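import re

# Each match yields two groups: group 1 is the three-character codon, and
# group 2 is the field that follows the single-character column after it.
# Check against the actual page output that this is the column you need.
entries = re.findall(r'(\w{3})\s+\w\s+(\S+)\s+\S+\s+[(]\d+[)]', codons)
codons_dict = {t[0]: t[1] for t in sorted(entries)}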
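
If the goal is specifically a codon to amino acid mapping, as in the question, the capture group has to sit on the amino acid column. The sketch below assumes, without having verified it against the live page, that each entry in the <pre> text has the form "codon amino-acid fraction frequency (count)", that stop codons are marked with "*", and that the count may be space-padded inside the parentheses. The sample string uses placeholder numbers, and the names ENTRY and codon_to_amino_acid are invented for the example.

```python
import re

# Made-up sample in the assumed layout; the numeric values are placeholders,
# only the codon -> amino acid pairs follow the standard genetic code.
sample = """
UUU F 0.50 10.0 (  1000)  UCU S 0.50 10.0 (  1000)
UAA * 0.50  1.0 (   100)  UGG W 1.00 10.0 (  1000)
"""

# Group 1: three-letter RNA codon.
# Group 2: one-character amino acid code (\S so that "*" also matches).
# The remaining fields (fraction, frequency, count) are matched but not kept.
ENTRY = re.compile(r'([ACGU]{3})\s+(\S)\s+[\d.]+\s+[\d.]+\s*\(\s*\d+\)')


def codon_to_amino_acid(text):
    """Build a {codon: amino_acid} dict from text in the assumed layout."""
    return dict(ENTRY.findall(text))


codon_table = codon_to_amino_acid(sample)
print(codon_table)
```

On the real data, codon_to_amino_acid(codons) would be called on the string extracted from the page. If a DataFrame is preferred, the list of tuples returned by ENTRY.findall can be handed to pandas instead of dict.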
