Python: Extract Separated Text from HTML with BeautifulSoup
I have the following HTML repeated several times on a page (please do not judge):
<div class="container">
<div class="image">
<a href="#" title="#" class="#">
<img src="img.jpg" alt="#" class="#">
</a>
</div>
<div class="text">
<a href="#">
<h4 class="h4-class">{TITLE}</h4>
{SOME TEXT 1}<br />
<h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
{SOME TEXT 3} </a>
</div>
</div>
I would like to extract {TITLE}, {SOME TEXT 1}, {SOME TEXT 2} and {SOME TEXT 3}.
My code is as follows:
from BeautifulSoup import BeautifulSoup as bs
import urllib2

html = urllib2.urlopen('text')
soup = bs(html)
divs = soup.findAll("div", {"class": "text"})
for div in divs:
    inner_text = div.text
    strings = inner_text.split("\n")
    print strings[0]  # I want this to print just {TITLE}
When I print it out, I get a single line with all the values joined together, e.g.
{TITLE}{SOME TEXT 1}{SOME TEXT 2}{SOME TEXT 3}
Is there any way around this? What have I missed?
You can prettify (see the BeautifulSoup documentation) the div content first and then manipulate each line as needed. This will work only if all the divs with class name text have the same structure.
Code (Python 2):
from BeautifulSoup import BeautifulSoup as bs
html = '''
<div class="container">
<div class="image">
<a href="#" title="#" class="#">
<img src="img.jpg" alt="#" class="#">
</a>
</div>
<div class="text">
<a href="#">
<h4 class="h4-class">{TITLE}</h4>
{SOME TEXT 1}<br />
<h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
{SOME TEXT 3} </a>
</div>
</div>
'''
soup = bs(html)
divs = soup.findAll("div",{"class":"text"})
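for div in divs:
    # prettify() puts each tag and each text node on its own line
    pretty_div = div.prettify()
    content_list = pretty_div.split("\n")
    content_list = [s.strip() for s in content_list]
    # These indexes are positions in the prettified markup, so they
    # only hold while every div has exactly this structure
    print content_list[3]
    print content_list[5]
    print content_list[9]
    print content_list[11]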
Output:
{TITLE}
{SOME TEXT 1}
{SOME TEXT 2}
{SOME TEXT 3}
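If hard-coded line indexes feel fragile, a possible alternative is to iterate over the text nodes directly. The sketch below is not the code from the answer above: it assumes Python 3 and the newer bs4 package (BeautifulSoup 4) rather than BeautifulSoup 3, and the names SAMPLE_HTML and extract_text_parts are made up for illustration. It relies on the stripped_strings generator, which yields each non-empty text node with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, kept inline so the sketch is self-contained.
SAMPLE_HTML = """
<div class="container">
  <div class="image">
    <a href="#" title="#" class="#">
      <img src="img.jpg" alt="#" class="#">
    </a>
  </div>
  <div class="text">
    <a href="#">
      <h4 class="h4-class">{TITLE}</h4>
      {SOME TEXT 1}<br />
      <h5><img src="img.jpg" alt="#" /> {SOME TEXT 2}</h5>
      {SOME TEXT 3} </a>
  </div>
</div>
"""


def extract_text_parts(html):
    """Return one list of text fragments per <div class="text"> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields every non-blank text node, already stripped,
    # so whitespace-only nodes between tags are skipped.
    return [list(div.stripped_strings)
            for div in soup.find_all("div", class_="text")]


if __name__ == "__main__":
    for parts in extract_text_parts(SAMPLE_HTML):
        for part in parts:
            print(part)
```

Because each fragment comes from its own text node, this does not depend on how the markup is indented, although it still assumes the fragments appear in the same order in every div.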