Extracting text from an HTML file in Python

Question

I have written code to extract some text from an HTML file. It already extracts the requested line from the web page; now I want to extract the sequence data as well. Unfortunately, I am not able to extract the text, and the code fails with an error.

import urllib2
from HTMLParser import HTMLParser
import nltk 
from bs4 import BeautifulSoup

# Proxy information were removed  
# from these two lines 

proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)

response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')

################## BS Block ################################

soup = BeautifulSoup(response)
text = soup.get_text()
print text 

##########################################################

html = response.readline()

for l in html:
    if "|Rv0470c|" in l:
        print l       # code is running successfully till here 

raw = nltk.clean_html(html) 
print raw

How can I get this code to run successfully? I have already checked the available threads and solutions, but nothing works.

I want to extract this part:

M. tuberculosis H37Rv|Rv0470c|pcaA
MSVQLTPHFGNVQAHYDLSDDFFRLFLDPTQTYSCAYFERDDMTLQEAQIAKIDLALGKLNLEPGMTLLDIGCGWGATMRRAIEKYDVNVVGLTLSENQAGHVQKMFDQMDTPRSRRVLLEGWEKFDEPVDRIVSIGAFEHFGHQRYHHFFEVTHRTLPADGKMLLHTIVRPTFKEGREKGLTLTHELVHFTKFILAEIFPGGWLPSIPTVHEYAEKVGFRVTAVQSLQLHYARTLDMWATALEANKDQAIAIQSQTVYDRYMKYLTGCAKLFRQGYTDVDQFTLEK

Answer 1

I was able to extract the desired text with the following code. It has no dependencies except urllib2, and in my case it works like a charm.

import urllib2

# Proxy credentials are redacted; the 'password' key is the one read on the next line
httpProxy = {'username': '------', 'password': '-------', 'host': '------', 'port': '-----'}
proxyHandler = urllib2.ProxyHandler({'http': 'http://'+httpProxy['username']+':'+httpProxy['password']+'@'+httpProxy['host']+':'+httpProxy['port']})
proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)



response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')

html = response.readlines()

f = open("/home/zebrafish/Desktop/output.txt",'w')


for l in html:
    if "|Rv0470c|" in l:
        l =  l.split("</small>")[0].split("<TR><TD><small style=font-family:courier>")[1]
        l = l.split("<br />")
        ttl =  l[:1]
        seq =  "".join(l[1:])
        f.write("".join(ttl))
        f.write(seq)
f.close()
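
The splitting logic above can be tried without network access or a proxy. The following is a minimal sketch in Python 3 syntax, not the answerer's exact code: `extract_record` and `SAMPLE_LINES` are hypothetical names, and the sample line is made up to match the markers used in the answer (the real page markup may differ).

```python
# Hypothetical sample line, modeled on the markers the answer splits on.
# The real page may be laid out differently.
SAMPLE_LINES = [
    "<html><body><table>\n",
    "<TR><TD><small style=font-family:courier>"
    "M. tuberculosis H37Rv|Rv0470c|pcaA<br />MSVQLTPHFG<br />NVQAHYDLSD"
    "</small></TD></TR>\n",
    "</table></body></html>\n",
]

START = "<TR><TD><small style=font-family:courier>"
END = "</small>"


def extract_record(lines, gene_tag):
    """Return (title, sequence) from the first line holding gene_tag, else None."""
    for line in lines:
        # Skip lines that lack the tag or the opening marker, so the
        # [1] index below cannot raise IndexError.
        if gene_tag not in line or START not in line:
            continue
        body = line.split(END)[0].split(START)[1]
        parts = body.split("<br />")
        title = parts[0].strip()
        # The sequence is wrapped over several <br /> separated pieces.
        sequence = "".join(piece.strip() for piece in parts[1:])
        return title, sequence
    return None


if __name__ == "__main__":
    print(extract_record(SAMPLE_LINES, "|Rv0470c|"))
```

Checking for the opening marker before indexing keeps the loop from failing on a line that mentions the gene tag in some other context.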

Answer 2

I'm not quite sure what exactly you are asking for as a whole, but here is my ad hoc take on your problem (actually similar to yours), which does retrieve the part of the HTML you requested. Maybe you can get some ideas from it. (Adjust for Python 2.)

import requests
from bs4 import BeautifulSoup

url = 'http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c'
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('tr'):
    if "|Rv0470c|" in n.text:
        nt = n.text
        # str.replace returns a new string, so the result must be assigned
        nt = nt.replace('\n', '\t')
        nt=nt.split('\t')
        nt = [x for x in nt if "|Rv0470c|" in x][0].strip()  
        print (nt.lstrip('>'))
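
If installing BeautifulSoup and lxml is not an option, the same row-based idea can be sketched with the standard library's `html.parser` alone. This is an illustrative Python 3 sketch under assumptions: `RowTextParser`, `find_record`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is invented to resemble the row the question targets.

```python
from html.parser import HTMLParser

# Hypothetical markup resembling the target row; the real page may differ.
SAMPLE_HTML = (
    "<table><TR><TD><small style=font-family:courier>"
    "&gt;M. tuberculosis H37Rv|Rv0470c|pcaA<br />MSVQLTPHFG<br />NVQAHYDLSD"
    "</small></TD></TR></table>"
)


class RowTextParser(HTMLParser):
    """Collect the non-empty text chunks found inside each <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []        # one list of text chunks per <tr>
        self._current = None  # chunks of the row being parsed, if any

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TR> matches "tr".
        if tag == "tr":
            self._current = []
            self.rows.append(self._current)

    def handle_endtag(self, tag):
        if tag == "tr":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self._current.append(text)


def find_record(html, gene_tag):
    """Return (title, sequence) for the first row chunk containing gene_tag."""
    parser = RowTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    for chunks in parser.rows:
        for i, chunk in enumerate(chunks):
            if gene_tag in chunk:
                # Drop the leading FASTA-style '>' and join the wrapped sequence.
                return chunk.lstrip(">"), "".join(chunks[i + 1:])
    return None


if __name__ == "__main__":
    print(find_record(SAMPLE_HTML, "|Rv0470c|"))
```

Because the tags themselves split the text into chunks, no newline or tab replacement is needed before separating the title from the sequence.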


Answer 1
0 2016-03-07 10:22:18

Answer 2
0 2016-03-07 14:54:10
