简体   繁体   English

如何从HTML代码中提取信息

[英]how can i extract information from the HTML code

This code is a Single line of .html file which is extracted from the HTML file with a unique identifier "|Rv0153c|": 此代码是单行.html文件,它是从HTML文件中提取的,具有唯一标识符“| Rv0153c |”:

<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE<br />VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND<br />AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL<br />DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA<br />AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv0153c.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv0153c.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">

i want to write a code which is able to extract the given information(given below) in a given format from this line of .html code: 我想编写一个能够从这行.html代码中以给定格式提取给定信息(如下所示)的代码:

>M. tuberculosis H37Rv|Rv0153c|ptbB
MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE
VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND
AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL
DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA
AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG

我认为您正在寻找的是一个HTML解析器: 简单的HTML和XHTML解析器

You can use Python and regular expression library. 您可以使用Python和正则表达式库。

from bs4 import BeautifulSoup
import re
sentence = '<TR><TD><small style=font-family:courier> >M. tuberculosis H37Rv|Rv0153c|ptbB<br />MAVRELPGAWNFRDVADTATALRPGRLFRSSELSRLDDAGRATLRRLGITDVADLRSSRE<br />VARRGPGRVPDGIDVHLLPFPDLADDDADDSAPHETAFKRLLTNDGSNGESGESSQSIND<br />AATRYMTDEYRQFPTRNGAQRALHRVVTLLAAGRPVLTHCFAGKDRTGFVVALVLEAVGL<br />DRDVIVADYLRSNDSVPQLRARISEMIQQRFDTELAPEVVTFTKARLSDGVLGVRAEYLA<br />AARQTIDETYGSLGGYLRDAGISQATVNRMRGVLLG<br /></small><TR><td><b><big>Blastp: <a href="http://tuberculist.epfl.ch/blast_output/Rv0153c.fasta.out"> Pre-computed results</a></big></b><TR><td><b><big>TransMembrane prediction using Hidden Markov Models: <a href="http://tuberculist.epfl.ch/tmhmm/Rv0153c.html"> TMHMM</a></big></b><base target="_blank"/><TR><td><b><big>Genomic sequence</big></b><br /><br /><form action="dnaseq.php" method="get">'   
print re.sub('<[^>]*>', '',  sentence)

HTH. HTH。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM