
Convert HTML table to dictionary without losing structure

I'm new to Python (and programming) and using BeautifulSoup for the first time.

I'm trying to find the best way to parse the contents of an HTML table and convert it to a dictionary, ideally in the least brittle way.

Here is an example of the HTML I'm trying to parse (I've put numbered key and value placeholders where the text I'm trying to pick up appears).

<div class="tablename">
<table border="0" cellpadding="0" cellspacing="0" style="border: 1px solid #dddddd;  border-collapse: collapse; font-family: Arial, Helvetica, sans-serif; font-size: 14px; margin: 0; padding: 0; width: 100%">
<thead>
<tr>
<th colspan="4" style="background-color: #000; border: 1px solid #616161; color: #ffffff; font-size: 14px; font-weight: bold; line-height: 20px; padding: 14px 20px 12px 20px; text-align: left">Some text not needed</th>
</tr>
</thead>
<tbody>
<tr>
<td style="width: 20px"> </td>
<td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; width: 42.5%; vertical-align: middle">Key 1</td>
<td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">Value 1</td>
<td style="width: 20px"> </td>
</tr>
<tr>
<td> </td>
<td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">Key 2</td>
<td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">Value 2</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td style="border-bottom: 1px solid #dddddd; color: #666666; font-size: 14px; line-height: 20px; padding: 11px 20px 10px 0; text-align: left; vertical-align: middle">Key 3</td>
<td style="border-bottom: 1px solid #dddddd; color: #000; font-size: 14px; line-height: 20px; padding: 11px 0 10px 0; text-align: left; vertical-align: middle">Value 3</td>
<td> </td>
</tr>
<tr>

And the code I'm using:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://examplewebaddress.com')
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.tbody.text)

I could then loop over the soup.tbody.text string and split it into key-value pairs. But this doesn't seem like a good approach: converting the table to a string and then building a dictionary back up from it loses the table's structure.

Is there a more direct way to parse a table with BeautifulSoup (or something more suitable) into a dictionary that I can then use?

The idea is to iterate over the table rows and, for each row, extract the text of the second and third cells, which become the key and the value of the resulting dictionary:

soup = BeautifulSoup(html.text, 'html.parser')

# Skip the header row, then use the second and third cells of each row
# as the key and the value.
result = dict([[item.get_text(strip=True) for item in row.find_all('td')[1:3]]
               for row in soup.select("div.tablename table tr")[1:]])

print(result)

For the provided sample data, it prints:

{'Key 1': 'Value 1', 'Key 2': 'Value 2', 'Key 3': 'Value 3'}
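To try the same idea without a network request, here is a minimal sketch that embeds a trimmed copy of the sample markup (inline styles removed) and wraps the row loop in a function. The names `SAMPLE_HTML` and `table_to_dict` are hypothetical, introduced only for illustration. The trailing empty row is added here on purpose. Unlike the one-liner, this version skips rows that lack a key cell and a value cell; passing such a row to `dict()` would raise a `ValueError`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup; the empty <tr> at the end is added
# here only to illustrate a row without key/value cells.
SAMPLE_HTML = """
<div class="tablename">
<table>
<thead>
<tr><th colspan="4">Some text not needed</th></tr>
</thead>
<tbody>
<tr><td> </td><td>Key 1</td><td>Value 1</td><td> </td></tr>
<tr><td> </td><td>Key 2</td><td>Value 2</td><td> </td></tr>
<tr><td> </td><td>Key 3</td><td>Value 3</td><td> </td></tr>
<tr></tr>
</tbody>
</table>
</div>
"""


def table_to_dict(html):
    """Map the second <td> of each row to the third <td> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.select("div.tablename table tr"):
        cells = row.find_all("td")
        # The header row (only <th>) and empty rows have fewer than three
        # <td> cells, so they are skipped instead of raising an error.
        if len(cells) < 3:
            continue
        key = cells[1].get_text(strip=True)
        value = cells[2].get_text(strip=True)
        result[key] = value
    return result


print(table_to_dict(SAMPLE_HTML))
```

Because the header row is filtered out by its cell count, no `[1:]` slice is needed in this variant.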

div.tablename table tr is a CSS selector that matches every tr element inside a table element that is itself inside a div with class="tablename". We slice the result of select() with [1:] to skip the first (header) row.
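As a variation on the selector, and assuming the markup contains an explicit tbody element as the sample does, the header row can be excluded in the selector itself rather than by slicing. The sketch below uses a shortened, hypothetical copy of the sample (`sample_html`) to compare the two selectors. If the page has no tbody tag, this variant would match nothing with `html.parser`, which as far as I know does not insert one, so the slice-based version is the safer choice there.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the sample markup (styles removed).
sample_html = """
<div class="tablename"><table>
<thead><tr><th colspan="4">Some text not needed</th></tr></thead>
<tbody>
<tr><td> </td><td>Key 1</td><td>Value 1</td><td> </td></tr>
<tr><td> </td><td>Key 2</td><td>Value 2</td><td> </td></tr>
</tbody>
</table></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Descendant selector from the answer: matches the header row and the body rows.
all_rows = soup.select("div.tablename table tr")

# Restricting the match to <tbody> leaves out the header row in <thead>.
body_rows = soup.select("div.tablename tbody tr")

# Second cell is the key, third cell is the value.
pairs = {
    cells[1].get_text(strip=True): cells[2].get_text(strip=True)
    for cells in (row.find_all("td") for row in body_rows)
}

print(len(all_rows), len(body_rows))
print(pairs)
```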

