Extract text from an HTML file in Python

I have written some code to extract text from an HTML page. It already extracts the requested line from the webpage, and now I want to extract the sequence data as well. Unfortunately, I am not able to extract the text; the code fails with an error.

import urllib2
from HTMLParser import HTMLParser
import nltk 
from bs4 import BeautifulSoup

# Proxy information were removed  
# from these two lines 

proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)

response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')

################## BS Block ################################

soup = BeautifulSoup(response)
text = soup.get_text()
print text 

##########################################################

html = response.readline()

for l in html:
    if "|Rv0470c|" in l:
        print l       # code is running successfully till here 

raw = nltk.clean_html(html) 
print raw

How can I run this code successfully? I have already checked all the available threads and solutions, but nothing works.

I want to extract this part:

M. tuberculosis H37Rv|Rv0470c|pcaA
MSVQLTPHFGNVQAHYDLSDDFFRLFLDPTQTYSCAYFERDDMTLQEAQIAKIDLALGKLNLEPGMTLLDIGCGWGATMRRAIEKYDVNVVGLTLSENQAGHVQKMFDQMDTPRSRRVLLEGWEKFDEPVDRIVSIGAFEHFGHQRYHHFFEVTHRTLPADGKMLLHTIVRPTFKEGREKGLTLTHELVHFTKFILAEIFPGGWLPSIPTVHEYAEKVGFRVTAVQSLQLHYARTLDMWATALEANKDQAIAIQSQTVYDRYMKYLTGCAKLFRQGYTDVDQFTLEK
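
A likely reason the second half of the code in the question finds nothing: the response object is a stream that can only be read once. `BeautifulSoup(response)` reads it to the end, so the later `response.readline()` returns an empty string. On top of that, `readline()` returns a single string, and looping over a string walks its characters, not its lines. The sketch below reproduces both effects without a network call. It is written for Python 3, and `io.BytesIO` and the `PAGE` bytes are stand-ins I made up for the real response.

```python
import io

# Hypothetical stand-in for the page body; the real page is much larger.
PAGE = b"<html>\n<small>&gt;M. tuberculosis H37Rv|Rv0470c|pcaA</small>\n</html>\n"

# A byte stream behaves like the object returned by urlopen():
# once it has been read to the end, nothing is left.
response = io.BytesIO(PAGE)
first_read = response.read()     # this is what BeautifulSoup(response) does internally
leftover = response.readline()   # empty, because the stream is already exhausted

# Iterating over one string yields single characters, not lines.
chars = list("ab\ncd")

# Reading once and reusing the result avoids both problems.
response = io.BytesIO(PAGE)
page = response.read().decode("utf-8")
matches = [line for line in page.splitlines() if "|Rv0470c|" in line]

print(repr(leftover))
print(chars)
print(matches)
```

The same idea applies to the original script: read the response into a variable once, then pass that variable to both BeautifulSoup and the line loop. As far as I know, `nltk.clean_html` is no longer implemented in NLTK 3, so the last two lines of the question would fail regardless.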

I was able to extract the desired text with the following code. It has no dependencies except urllib2, and for my case it works like a charm.

import urllib2

httpProxy = {'username': '------', '-----': '-------', 'host': '------', 'port': '-----'}
proxyHandler = urllib2.ProxyHandler({'http': 'http://'+httpProxy['username']+':'+httpProxy['password']+'@'+httpProxy['host']+':'+httpProxy['port']})
proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)



response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')

html = response.readlines()

f = open("/home/zebrafish/Desktop/output.txt",'w')


for l in html:
    if "|Rv0470c|" in l:
        l =  l.split("</small>")[0].split("<TR><TD><small style=font-family:courier>")[1]
        l = l.split("<br />")
        ttl =  l[:1]
        seq =  "".join(l[1:])
        f.write("".join(ttl))
        f.write(seq)
f.close()
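
The split logic above can be tried without the network or the proxy. This is only a sketch, in Python 3, and it assumes the matching line looks like `SAMPLE_LINE` below (I guessed its shape from the split strings; the real page may differ). The names `split_record` and `to_fasta` are my own. Unlike the code above, it puts a newline between the title and the sequence, which may or may not be what you want in the output file.

```python
import html

# Markers taken from the split() calls above.
OPEN_TAG = "<TR><TD><small style=font-family:courier>"
CLOSE_TAG = "</small>"

# Assumed shape of the matching line; not copied from the real page.
SAMPLE_LINE = (
    "<TR><TD><small style=font-family:courier>"
    "&gt;M. tuberculosis H37Rv|Rv0470c|pcaA<br />MSVQLTPHFG<br />NVQAHYDLSD"
    "</small></TD></TR>\n"
)


def split_record(line, gene_id):
    """Return (title, sequence) for a matching line, or None if it does not match."""
    if "|%s|" % gene_id not in line or OPEN_TAG not in line:
        return None
    body = line.split(CLOSE_TAG)[0].split(OPEN_TAG)[1]
    parts = body.split("<br />")
    # The first piece is the title; decode "&gt;" and drop the leading ">".
    title = html.unescape(parts[0]).lstrip(">").strip()
    # Everything after the title is sequence, possibly spread over several pieces.
    sequence = "".join(piece.strip() for piece in parts[1:])
    return title, sequence


def to_fasta(title, sequence, width=60):
    """Format a record as FASTA text with fixed-width sequence lines."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (title, "\n".join(rows))


record = split_record(SAMPLE_LINE, "Rv0470c")
if record is not None:
    print(to_fasta(*record, width=10))
```

Returning `None` for non-matching lines avoids the `IndexError` that `split(...)[1]` would raise if the page layout ever changes.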

I'm not quite sure what exactly you are asking for as a whole, but here's my ad hoc take on your problem (similar to yours, actually), which does retrieve the part of the HTML you requested. Maybe you can get some ideas from it. (Adjust for Python 2.)

import requests
from bs4 import BeautifulSoup

url = 'http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c'
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('tr'):
    if "|Rv0470c|" in n.text:
        nt = n.text
        nt = nt.replace('\n', '\t')  # str.replace returns a new string, so reassign it
        nt=nt.split('\t')
        nt = [x for x in nt if "|Rv0470c|" in x][0].strip()  
        print (nt.lstrip('>'))
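
If installing requests, bs4 and lxml is not an option, roughly the same extraction can be sketched with the standard library's `html.parser` (Python 3). This swaps in a different technique from the BeautifulSoup code above, and it is not tested against the real page: `SAMPLE_HTML` is an assumed fragment, and `SmallTextParser` and `find_record` are names I made up. It collects the text inside `<small>` elements, treats `<br />` as a line break, and returns the header line together with the sequence lines that follow it.

```python
from html.parser import HTMLParser

# Assumed fragment of the page; the real markup may differ.
SAMPLE_HTML = (
    "<TABLE><TR><TD><small style=font-family:courier>"
    "&gt;M. tuberculosis H37Rv|Rv0470c|pcaA<br />MSVQLTPHFG<br />NVQAHYDLSD"
    "</small></TD></TR></TABLE>"
)


class SmallTextParser(HTMLParser):
    """Collect text inside <small> elements, turning <br> into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0     # greater than 0 while inside a <small> element
        self.chunks = []   # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "small":
            self.depth += 1
        elif tag == "br" and self.depth:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "small" and self.depth:
            self.depth -= 1
            self.chunks.append("\n")  # keep separate <small> blocks apart

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_record(page, gene_id):
    """Return (header, sequence) for gene_id, or None if no header line matches."""
    parser = SmallTextParser()
    parser.feed(page)
    parser.close()
    text = "".join(parser.chunks)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if "|%s|" % gene_id in line:
            sequence = []
            for following in lines[index + 1:]:
                if following.startswith(">"):  # next record starts here
                    break
                sequence.append(following)
            return line.lstrip(">").strip(), "".join(sequence)
    return None


print(find_record(SAMPLE_HTML, "Rv0470c"))
```

With the real page you would pass the decoded response body as `page`; whether the header and sequence actually sit inside a `<small>` element there is an assumption based on the markup quoted in the other answer.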
