Converting an HTML table to a dictionary without losing structure

Question

I'm new to Python (and programming) and using BeautifulSoup for the first time.

I'm trying to find the best way to parse the contents of an HTML table and convert it to a dictionary, ideally in the least brittle way.

Here is an example of the HTML I'm trying to parse (I've replaced the text I'm trying to pick up with numbered keys and values).

<div class="tablename">
<table border="0" cellpadding="0" cellspacing="0" style="border: 1px solid #dddddd;  border-collapse: collapse; font-family: Arial, Helvetica, sans-serif; font-size: 14px; margin: 0; padding: 0; width: 100%">
<thead>
<tr>
<th colspan="4" style="background-color: #000; border: 1px solid #616161; color: #ffffff; font-size: 14px; font-weight: bold; line-height: 20px; padding: 14px 20px 12px 20px; text-align: left">Some text not needed</th>
</tr>
</thead>
<tbody>
<tr>
<td style="width: 20px"> </td>
<td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; width: 42.5%; vertical-align: middle">Key 1</td>
<td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">Value 1</td>
<td style="width: 20px"> </td>
</tr>
<tr>
<td> </td>
<td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">Key 2</td>
<td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">Value 2</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">Key 3</td>
<td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">Value 3</td>
<td> </td>
</tr>
<tr>

And the code I'm using:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://examplewebaddress.com')
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.tbody.text)

I could then loop over the soup.tbody.text string and split it into key-value pairs. But this doesn't seem like a good way to do it: converting the table to a string and then building that back up into a dictionary loses the structure of the table.

Is there a more direct way to parse a table with BeautifulSoup (or something more suitable) into a dictionary which I can then use?

Answer 1

The idea is to iterate over the table rows and, for each row, extract the text of the second and third cells, which become the key and the value of a dictionary entry:

soup = BeautifulSoup(html.text, 'html.parser')

# The second and third <td> of each row become one key/value pair.
result = dict([[item.get_text(strip=True) for item in row.find_all('td')[1:3]]
               for row in soup.select("div.tablename table tr")[1:]])

print(result)

For the provided sample data, it prints:

{'Key 1': 'Value 1', 'Key 2': 'Value 2', 'Key 3': 'Value 3'}
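The same idea can also be written as an explicit loop, which may be easier to follow if you're new to comprehensions. This is only a sketch: SAMPLE_HTML is a trimmed-down stand-in for the table in the question, table_to_dict is a hypothetical helper name, and the "skip rows with fewer than three td cells" check is an added assumption rather than part of the one-liner above.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (inline styles removed).
SAMPLE_HTML = """
<div class="tablename">
<table>
<thead>
<tr><th colspan="4">Some text not needed</th></tr>
</thead>
<tbody>
<tr><td> </td><td>Key 1</td><td>Value 1</td><td> </td></tr>
<tr><td> </td><td>Key 2</td><td>Value 2</td><td> </td></tr>
<tr><td> </td><td>Key 3</td><td>Value 3</td><td> </td></tr>
</tbody>
</table>
</div>
"""


def table_to_dict(html):
    """Map the second cell of each row to the third cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.select("div.tablename table tr"):
        cells = row.find_all("td")
        # The header row uses <th>, and an empty row has no cells at all;
        # both end up with fewer than three <td> elements and are skipped.
        if len(cells) < 3:
            continue
        key = cells[1].get_text(strip=True)
        value = cells[2].get_text(strip=True)
        result[key] = value
    return result


print(table_to_dict(SAMPLE_HTML))
```

Because rows without enough td cells are skipped, this version doesn't need to slice off the header row, and a stray empty tr won't raise an error when the dictionary is built.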

div.tablename table tr is a CSS selector that matches every tr element inside a table element that is itself inside a div with class="tablename". We slice the result of select() with [1:] to skip the first (header) row.
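To see what the selector matches, here is a small sketch comparing it with a variant restricted to tbody. It assumes the markup contains explicit thead and tbody tags, as the sample in the question does; SNIPPET is made-up illustration data, and the second table in it is only there to show that rows outside the div are not matched.

```python
from bs4 import BeautifulSoup

# Made-up markup: one table inside div.tablename, one unrelated table outside it.
SNIPPET = """
<div class="tablename">
<table>
<thead><tr><th colspan="4">Some text not needed</th></tr></thead>
<tbody>
<tr><td> </td><td>Key 1</td><td>Value 1</td><td> </td></tr>
<tr><td> </td><td>Key 2</td><td>Value 2</td><td> </td></tr>
</tbody>
</table>
</div>
<table>
<tbody><tr><td> </td><td>Other</td><td>Table</td><td> </td></tr></tbody>
</table>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")

# Selector from the answer: every <tr> in the div's table, header row included.
all_rows = soup.select("div.tablename table tr")

# Restricting the match to <tbody> leaves the <thead> row out,
# so no [1:] slice is needed to skip the header.
body_rows = soup.select("div.tablename tbody tr")

pairs = {}
for row in body_rows:
    cells = row.find_all("td")
    pairs[cells[1].get_text(strip=True)] = cells[2].get_text(strip=True)

print(len(all_rows), len(body_rows))
print(pairs)
```

Selecting on tbody ties the code to the table's markup rather than to the header happening to be the first row, which may be a little less brittle if the page ever gains a second header row. It won't work if the page omits tbody and the parser doesn't add it.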

Answer 1
2 2015-04-01 16:21:12
