
How to extract text from HTML with BeautifulSoup?

I am using the following code to fetch an HTML page and pull out the data I need with BeautifulSoup. Everything looks fine so far, but I have hit a wall and am stuck.

What I need to do is extract the value 9h7a2m from this line:

D: string-1.string2 15030 9h7a2m string3

The result I am getting is this:

<p>D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string</p>
<p><span id="more-1203"></span></p>
<p>D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string<br/>
<p>pinging test is positive but no works</p>
<p>how much time are online?</p>
<p><input aria-required="true" id="author" name="author" size="22" tabindex="1" type="text" value=""/>
<label for="author"><small>Name (required)</small></label></p>
<p><input aria-required="true" id="email" name="email" size="22" tabindex="2" type="text" value=""/>
<label for="email"><small>Mail (will not be published) (required)</small></label></p>
<p><input id="url" name="url" size="22" tabindex="3" type="text" value=""/>
<label for="url"><small>Website</small></label></p>
<p><textarea cols="100%" id="comment" name="comment" rows="10" tabindex="4"></textarea></p>
<p><input id="submit" name="submit" tabindex="5" type="submit" value="Submit Comment"/>
<input id="comment_post_ID" name="comment_post_ID" type="hidden" value="41"/>
<input id="comment_parent" name="comment_parent" type="hidden" value="0"/>
</p>
<p style="display: none;"><input id="akismet_comment_nonce" name="akismet_comment_nonce" type="hidden" value="1709964457"/></p>
<p style="display: none;"><input id="ak_js" name="ak_js" type="hidden" value="99"/></p>

In the end I need to save it to a text file.

My code:

import mechanize
from bs4 import BeautifulSoup

# mechanize: fetch the page
mech = mechanize.Browser()
mech.set_handle_robots(True)
mech.set_handle_refresh(True)
mech.addheaders = [('User-agent', 'Firefox')]
url = 'http://example.com/'
response = mech.open(url)
resp = response.read()

# beautifulsoup: parse the response
soup = BeautifulSoup(resp, 'html.parser')

# test code: print every <p> inside <div id="content">
for i in soup.find('div', {'id': 'content'}).find_all('p'):
    print(i)

Thanks in advance.

You can extract it using a regular expression:

import re
from bs4 import BeautifulSoup

data = """your html here"""

soup = BeautifulSoup(data, 'html.parser')

# find the first "p" element and take the text node before its first <br>
s = soup.find('p').br.previous_sibling
# raw string, so the backslash escapes reach the regex engine unchanged
match = re.search(r'string-1\.string2 \d+ (\w+) string3\.string', s)
print(match.group(1))

This prints 9h7a2m.
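
The snippet above only looks at the text before the first `<br/>`. As a minimal sketch of how the same pattern could be applied to every row, assuming the markup looks like the output shown in the question, the tags can be stripped with `get_text()` and the pattern applied with `findall`. `SAMPLE_HTML` and `extract_values` are hypothetical names used for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Sample markup shortened from the output shown in the question (illustrative).
SAMPLE_HTML = """
<div id="content">
<p>D: string-1.string2 15030 9h7a2m string3.string<br/>
D: string-1.string2 15030 9h7a2m string3.string</p>
<p>pinging test is positive but no works</p>
</div>
"""

# Same pattern as in the answer, compiled once and reused.
PATTERN = re.compile(r'string-1\.string2 \d+ (\w+) string3\.string')


def extract_values(html):
    """Return the captured value from every matching row in the markup."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() drops the tags, leaving the <br/>-separated rows as plain text.
    text = soup.get_text()
    return PATTERN.findall(text)


values = extract_values(SAMPLE_HTML)
print(values)  # ['9h7a2m', '9h7a2m']
```

Rows that do not fit the pattern, such as the comment paragraph, are simply skipped, so no exception handling is needed here.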


UPD (real web site):

from urllib.request import urlopen
from bs4 import BeautifulSoup

data = urlopen('your URL here')
soup = BeautifulSoup(data, 'html.parser')

entry = soup.find('div', class_="entry")

for p in entry.find_all('p'):
    for row in p.find_all(text=True):
        try:
            # the value is the second-to-last space-separated token of each row
            print(row.split(' ')[-2])
        except IndexError:
            continue
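
The question also asks for the result to be saved to a text file, which the loop above does not cover. The following is a rough sketch of one way to do that, assuming the rows have already been collected as strings (for example from `find_all(text=True)`). The helper names `second_to_last_tokens` and `save_values`, the sample rows, and the output file name are all hypothetical.

```python
import os
import tempfile


def second_to_last_tokens(rows):
    """Collect the second-to-last whitespace-separated token of each row."""
    values = []
    for row in rows:
        parts = row.split()
        # Rows with fewer than two tokens (e.g. a bare newline) are skipped.
        if len(parts) >= 2:
            values.append(parts[-2])
    return values


def save_values(values, path):
    """Write one value per line to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as handle:
        for value in values:
            handle.write(value + '\n')


# Illustrative rows, shaped like the text nodes in the question's output.
rows = [
    "D: string-1.string2 15030 9h7a2m string3.string",
    "\nD: string-1.string2 15030 9h7a2m string3.string",
    "\n",
]
values = second_to_last_tokens(rows)

# The file name is an assumption; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), 'values.txt')
save_values(values, out_path)
```

Note that taking the second-to-last token also picks up words from unrelated paragraphs (the comment "pinging test is positive but no works" would yield "no"), so a stricter filter such as the regular expression from the first snippet may be preferable on a real page.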

