Extract data from HTML content
I want to download some HTML pages and extract information from them. Every HTML page contains this table tag:
<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' >
<tr>
<td><h1>Dr Jhon Doe</h1></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td>
<div id="sobi2outer">
<br/>
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
</td>
</tr>
</table>
I want to access the name (Jhon), family (Doe), and tel (33727464). I used Beautiful Soup to access these span tags by id:
name=soup.find(id="sobi2Details_field_name").__str__()
family=soup.find(id="sobi2Details_field_family").__str__()
tel=soup.find(id="sobi2Details_field_tel1").__str__()
But I don't know how to extract the data inside these tags. I tried the children and content attributes, but when I use the result as a tag it returns None:
name=soup.find(id="sobi2Details_field_name")
for child in name.children:
    # process content inside
    pass
But I get this error:
'NoneType' object has no attribute 'children'
When I call str() on it, it is not None! Any ideas?
Edit: my final solution
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = soup.find(id="sobi2Details_field_name").__str__()
name = name_span.split(':')[-1]
result = re.sub('</span>', '', name)
I found several ways to do it.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(path_to_html_file))
name_span = soup.find(id="sobi2Details_field_name")
# First way: split text over ':'
# This only works because there's always a ':' before the target field
name = name_span.text.split(':')[1]
# Second way: iterate over the span strings
# The element you look for is always the last one
name = list(name_span.strings)[-1]
# Third way: iterate over 'next' elements
name = name_span.next_element.next_element.next_element # you can create a function to do that, it looks ugly :)
Let me know if it helps.
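The second approach above can be wrapped in a small helper that also guards against find() returning None, which is the situation that produces the "'NoneType' object has no attribute 'children'" error. This is only a minimal sketch: the helper name field_text and the inline SAMPLE_HTML string are made up for illustration, and the built-in "html.parser" backend is an assumption (the question does not say which parser was used).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, used as sample input.
SAMPLE_HTML = """
<table class="sobi2Details">
<tr><td>
<div id="sobi2outer">
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
</td></tr>
</table>
"""


def field_text(soup, field_id):
    """Return the value text of a field span, or None if the span is missing.

    field_text is a hypothetical helper, not part of Beautiful Soup.
    """
    span = soup.find(id=field_id)
    if span is None:  # find() returns None when nothing matches
        return None
    # stripped_strings yields each text node with whitespace trimmed;
    # the label ("name:") comes first and the value comes last.
    strings = list(span.stripped_strings)
    return strings[-1] if strings else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
name = field_text(soup, "sobi2Details_field_name")
family = field_text(soup, "sobi2Details_field_family")
tel = field_text(soup, "sobi2Details_field_tel1")
missing = field_text(soup, "no_such_id")
print(name, family, tel, missing)
```

Checking for None before touching .children or .strings makes a missing span show up as a missing value instead of an AttributeError, which helps when some downloaded pages lack a field.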
If you are familiar with XPath, use lxml with etree:
# Python 3; on Python 2 use urllib2.urlopen instead
from urllib.request import urlopen
from lxml import etree
root = etree.HTML(urlopen("myUrl").read())
print(root.xpath("//span[@id='sobi2Details_field_name']/text()")[0])
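To try the XPath approach without a network request, the same query can be run against a string. The sketch below is illustrative rather than definitive: the helper name xpath_field and the inline SAMPLE_HTML string are assumptions made for the example, and the span id is passed in as an XPath variable ($fid) through lxml's keyword arguments.

```python
from lxml import etree

# Trimmed copy of the table from the question, used as sample input.
SAMPLE_HTML = """
<table class="sobi2Details">
<tr><td>
<div id="sobi2outer">
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
</td></tr>
</table>
"""


def xpath_field(root, field_id):
    """Return the value text of a field span, or None if nothing matches.

    xpath_field is a hypothetical helper, not part of lxml.
    """
    # text() selects only the span's direct text nodes, so the text of the
    # nested label span ("name:") is not included in the result.
    texts = root.xpath("//span[@id=$fid]/text()", fid=field_id)
    return texts[0].strip() if texts else None


root = etree.HTML(SAMPLE_HTML)
name = xpath_field(root, "sobi2Details_field_name")
family = xpath_field(root, "sobi2Details_field_family")
tel = xpath_field(root, "sobi2Details_field_tel1")
missing = xpath_field(root, "no_such_id")
print(name, family, tel, missing)
```

Because xpath() returns a list, an empty result can be checked before indexing with [0], which avoids an IndexError on pages where the span is absent.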