
Extracting data from an HTML table using Python and Beautiful Soup

<table class="softwares" border="1" cellpadding="0" width="99%">
    <thead style="background-color: #ededed">
        <tr>
            <td colspan="5"><b>Windows</b></td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><b>Type</b></td>
            <td><b>Issue</b></td>
            <td><b>Restart</b></td>
            <td><b>Severity</b></td>  
            <td><b>Impact</b></td>  
        </tr>
        <tr>
            <td>some item</td>
            <td><a href="some website">some website</a><br></td>
            <td>Yes<br></td>
            <td>Critical<br></td>
            <td>stuff<br></td>
        </tr>    
        <tr>
            <td>some item</td>
            <td><a href="some website">some website</a><br></td>
            <td>Yes<br></td>
            <td>Important<br></td>
            <td>stuff<br></td>    
        </tr>
    </tbody>
</table>

The HTML page I am trying to get data from is a local file saved on my PC, and it contains multiple tables formatted the same as this one. I'm trying to get both the title of each table (in this case "Windows") and the URLs located in the table body. I have been trying to use Beautiful Soup and Python to get the table titles and the URLs and print them in a table with the title on the left and the corresponding URLs on the right, but I have been unable to do so. Any help would be greatly appreciated.

You can use find_all to gather all td tag objects, and then apply additional logic to store the href values:

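from bs4 import BeautifulSoup as soup
# Parse the file as-is: stripping all whitespace with re.sub would also delete
# the spaces inside tags and text (e.g. "<td colspan" and "some item").
with open('filename.html') as f:
    s = soup(f, 'lxml')
# Keep each cell's text; a cell that contains a link becomes [text, href].
final_results = [[i.text, i.find('a')['href']] if i.find('a') else i.text for i in s.find_all('td')]
name = final_results[0]           # table title
header = final_results[1:6]       # the five column names
full_results = final_results[6:]  # data cells, flattened row by row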

Output:

u'Windows'
[u'Type', u'Issue', u'Restart', u'Severity', u'Impact']
[u'some item', [u'some website', 'some website'], u'Yes', u'Critical', u'stuff', u'some item', [u'some website', 'some website'], u'Yes', u'Important', u'stuff']

The results can be further combined into a list of dictionaries, one per row:

table = [dict(zip(header, full_results[i:i+5])) for i in range(0, len(full_results), 5)]

Output:

[{u'Impact': u'stuff', u'Issue': [u'some website', 'some website'], u'Type': u'some item', u'Severity': u'Critical', u'Restart': u'Yes'}, {u'Impact': u'stuff', u'Issue': [u'some website', 'some website'], u'Type': u'some item', u'Severity': u'Important', u'Restart': u'Yes'}]
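
The slicing above relies on every row having exactly five cells. As a rough alternative, here is a minimal sketch that walks the table row by row instead, so the column count is taken from the header row. The helper name rows_as_dicts, the inline sample markup, and the use of 'html.parser' (to avoid needing lxml) are my own assumptions, and it expects a tbody whose first row holds the column names, as in the question.

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the question; in practice read it from the saved file.
html = """
<table class="softwares">
  <thead><tr><td colspan="5"><b>Windows</b></td></tr></thead>
  <tbody>
    <tr><td><b>Type</b></td><td><b>Issue</b></td><td><b>Restart</b></td>
        <td><b>Severity</b></td><td><b>Impact</b></td></tr>
    <tr><td>some item</td><td><a href="some website">some website</a><br></td>
        <td>Yes<br></td><td>Critical<br></td><td>stuff<br></td></tr>
  </tbody>
</table>
"""

def rows_as_dicts(table):
    """Hypothetical helper: one dict per data row, keyed by the header row's cell text."""
    body_rows = table.find('tbody').find_all('tr')
    # The first body row holds the column names (assumption based on the question's markup).
    header = [td.get_text(strip=True) for td in body_rows[0].find_all('td')]
    records = []
    for tr in body_rows[1:]:
        cells = []
        for td in tr.find_all('td'):
            link = td.find('a')
            text = td.get_text(strip=True)
            # Cells with a link become [text, href], matching the answer above.
            cells.append([text, link['href']] if link else text)
        records.append(dict(zip(header, cells)))
    return records

soup = BeautifulSoup(html, 'html.parser')
records = rows_as_dicts(soup.find('table', class_='softwares'))
print(records)
```

Because each row is zipped against the header separately, a short or malformed row only affects its own dictionary rather than shifting every row after it.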

I'm not sure if this is what you wanted to do:

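from bs4 import BeautifulSoup
with open('filename.html') as f:
    content = f.read()  # markup of the saved page that holds the table elements
soup = BeautifulSoup(content, 'lxml')
title = soup.select_one(".softwares thead td b").text  # title of the first table
links = [item.a.get("href") for item in soup.select(".softwares tr td") if item.a]
print(title, links)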

Output:

Windows ['some website', 'some website']
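
Since the question mentions several tables with the same layout, the snippet above would return only the first title while mixing the links from all tables. Here is a minimal sketch of one way to handle that, looping over each table and printing the title on the left and each URL on the right. The function name titles_and_links, the second "Linux" table, and the example.com URLs are made up for illustration, and 'html.parser' is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Two made-up tables in the question's layout; replace with the saved file's contents.
html = """
<table class="softwares">
  <thead><tr><td colspan="5"><b>Windows</b></td></tr></thead>
  <tbody>
    <tr><td>some item</td><td><a href="https://example.com/a">a</a><br></td></tr>
    <tr><td>some item</td><td><a href="https://example.com/b">b</a><br></td></tr>
  </tbody>
</table>
<table class="softwares">
  <thead><tr><td colspan="5"><b>Linux</b></td></tr></thead>
  <tbody>
    <tr><td>some item</td><td><a href="https://example.com/c">c</a><br></td></tr>
  </tbody>
</table>
"""

def titles_and_links(markup):
    """Hypothetical helper: return [(title, [href, ...]), ...], one pair per table."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    for table in soup.select('table.softwares'):
        # The title is the text of the single cell in the table head.
        title = table.select_one('thead td').get_text(strip=True)
        # Collect only the links that belong to this table's body.
        links = [a['href'] for a in table.select('tbody a[href]')]
        pairs.append((title, links))
    return pairs

pairs = titles_and_links(html)
# Pad titles to a common width so the URLs line up in a right-hand column.
width = max(len(title) for title, _ in pairs)
for title, links in pairs:
    for url in links:
        print(f"{title:<{width}}  {url}")
```

Scoping the selectors to each table element is what keeps a title paired with its own URLs.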


 