Extracting table contents from HTML with Python and BeautifulSoup
I want to extract certain information from an HTML document. For example, it contains a table (among other tables with other contents) like this:
<table class="details">
<tr>
<th>Advisory:</th>
<td>RHBA-2013:0947-1</td>
</tr>
<tr>
<th>Type:</th>
<td>Bug Fix Advisory</td>
</tr>
<tr>
<th>Severity:</th>
<td>N/A</td>
</tr>
<tr>
<th>Issued on:</th>
<td>2013-06-13</td>
</tr>
<tr>
<th>Last updated on:</th>
<td>2013-06-13</td>
</tr>
<tr>
<th valign="top">Affected Products:</th>
<td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
</tr>
</table>
I want to extract information such as the date for "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:
from bs4 import BeautifulSoup
soup=BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
table_tag=soup.table
if table_tag['class'] == ['details']:
    print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
    a=table_tag.next_sibling
    print unicode(a)
    print table_tag.contents
This gets me the contents of the first table row, and also a listing of the contents. But the next_sibling part is not working right; I guess I am just using it wrong. Of course I could just parse the contents myself, but it seems to me that Beautiful Soup was designed to save us from doing exactly this (if I start parsing myself, I might as well parse the whole doc ...).
If someone could enlighten me on how to accomplish this, I would be grateful.
If there is a better way than BeautifulSoup, I would be interested to hear about it.
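A note on the next_sibling behaviour described in the question, as a minimal sketch rather than a definitive diagnosis: when the markup has line breaks between tags, Beautiful Soup keeps them as text nodes, so next_sibling frequently returns a whitespace string instead of the next tag. find_next_sibling() skips over those text nodes. The shortened `html` sample and the variable names below are assumptions for illustration, and the code is written for Python 3 with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (illustrative only).
html = """<table class="details">
<tr>
<th>Advisory:</th>
<td>RHBA-2013:0947-1</td>
</tr>
<tr>
<th>Type:</th>
<td>Bug Fix Advisory</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
first_row = soup.table.tr

# The node directly after the first </tr> is the line break between the
# rows, which Beautiful Soup keeps as a text node.
whitespace_node = first_row.next_sibling
print(repr(whitespace_node))

# find_next_sibling() ignores text nodes and returns the next matching tag.
second_row = first_row.find_next_sibling("tr")
print(second_row.th.get_text(), second_row.td.get_text())
```

The same applies to table_tag.next_sibling in the question: it looks at whatever follows the closing table tag, not at the rows inside the table.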
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
>>> table = soup.find('table', {'class': 'details'})
>>> th = table.find('th', text='Issued on:')
>>> th
<th>Issued on:</th>
>>> td = th.findNext('td')
>>> td
<td>2013-06-13</td>
>>> td.text
u'2013-06-13'
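If more than one field is needed, the same idea can be extended by walking every row of the table once and collecting the th/td pairs into a dictionary. The following is a sketch under assumptions, not the answerer's code: it targets Python 3, passes "html.parser" explicitly, and the `html` sample and the `details` name are made up for illustration. Recent Beautiful Soup releases also accept the spellings find_next() and string= in place of findNext() and text=.

```python
from bs4 import BeautifulSoup

# Illustrative document: one unrelated table plus a shortened "details" table.
html = """
<table class="other"><tr><th>Issued on:</th><td>unrelated</td></tr></table>
<table class="details">
<tr>
<th>Issued on:</th>
<td>2013-06-13</td>
</tr>
<tr>
<th valign="top">Affected Products:</th>
<td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the table by its class so other tables in the page are ignored.
table = soup.find("table", class_="details")

# Map each header cell's text to the text of the data cell in the same row.
details = {}
for row in table.find_all("tr"):
    header = row.find("th")
    value = row.find("td")
    if header is not None and value is not None:
        details[header.get_text(strip=True)] = value.get_text(strip=True)

print(details["Issued on:"])
```

Looking values up by header text keeps the code independent of row order, at the cost of depending on the exact label strings, including the trailing colon.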