Parsing an HTML table with BeautifulSoup into a Python dictionary
This is the HTML code that I'm trying to parse with BeautifulSoup:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
... (the number of these tags isn't fixed)
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2<a href="/link/to/bar2">Bar2</a></li>
<li>Foo3<a href="/link/to/bar3">Bar3</a></li>
<li>Some data3</li>
... (the number of these tags isn't fixed either)
</ul>
</td>
</tr>
</table>
The output I would like to get is a dictionary like this:
DICT = {
'menu1': ['Some data1','Foo1 Bar1'],
'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}
As I already mentioned in the code, the number of <li> tags is not fixed. Additionally, the menus themselves can vary (in some cases leaving just <table></table>), so e.g. it could look just like this:
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
... (the number of these tags isn't fixed)
</ul>
</td>
</tr>
</table>
I was trying to use this example, but with no success. I think it's because of the <ul> tags: I can't read the proper data from the table. Another problem for me is the variable number of menus and <li> tags. So my question is: how do I parse this particular table into a Python dictionary? I should mention that I have already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could just keep it as is.
request = c.get('http://example.com/somepage.html')
soup = bs(request.text)
and this is always the first table of the page, so I can get it with:
table = soup.find_all('table')[0]
Thank you in advance for any help.
html = """<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2<a href="/link/to/bar2">Bar2</a></li>
<li>Foo3<a href="/link/to/bar3">Bar3</a></li>
<li>Some data3</li>
</ul>
</td>
</tr>
</table>"""
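import BeautifulSoup as bs  # BeautifulSoup 3 module layout, Python 2 syntax

soup = bs.BeautifulSoup(html)
table = soup.findAll('table')[0]
results = {}
th = table.findChildren('th')  # every header cell: menu1, menu2, ...
for x in th:
    results_li = []
    # step over the text node after <th> to reach the sibling <td>
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        # y.next is the first text node inside <li>, so link text is left out
        results_li.append(y.next)
    results[x.next] = results_li
print results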
Output:
{
u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'],
u'menu1': [u'Some data1', u'Foo1']
}
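Note that the output above contains 'Foo1' rather than the 'Foo1 Bar1' asked for in the question, because only the first text node of each <li> is collected. The following is a minimal sketch of one way to get the combined text, assuming BeautifulSoup 4 (the bs4 package) on Python 3 instead of the BeautifulSoup 3 import used above. The function name table_to_dict and the SAMPLE_HTML constant are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question, used only for illustration.
SAMPLE_HTML = """
<table>
<tr>
<th width="100">menu1</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data1</li>
<li>Foo1<a href="/link/to/bar1">Bar1</a></li>
</ul>
</td>
</tr>
<tr>
<th width="100">menu2</th>
<td>
<ul class="classno1" style="margin-bottom:10;">
<li>Some data2</li>
<li>Foo2<a href="/link/to/bar2">Bar2</a></li>
</ul>
</td>
</tr>
</table>
"""


def table_to_dict(html):
    """Map each <th> text in the first table to the texts of its <li> items."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # first table on the page, or None
    result = {}
    if table is None:
        return result
    for th in table.find_all("th"):
        # find_next_sibling skips whitespace text nodes between <th> and <td>
        td = th.find_next_sibling("td")
        items = td.find_all("li") if td is not None else []
        # get_text(" ", strip=True) joins nested strings, e.g. "Foo1" + "Bar1"
        result[th.get_text(strip=True)] = [
            li.get_text(" ", strip=True) for li in items
        ]
    return result


if __name__ == "__main__":
    print(table_to_dict(SAMPLE_HTML))
```

Because the loop runs over whatever <th> cells are present, a table with one menu, several menus, or none at all (an empty <table></table> yields an empty dictionary) goes through the same code path.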