Using BeautifulSoup to extract text information from a table in a webpage

I have a table from a webpage that I am attempting to extract the text data from.

A snippet of the HTML table looks as follows: (image not reproduced here)

You can see the table headings 'Effective Date', 'Type' and 'Note', which I would like to extract the text data from.

I have used the following code to attempt to extract the data:

content = driver.page_source
soup = BeautifulSoup(content)

for child in soup.find_all('table')[7].children:
    for td in child:
        print(td.text)

However, I am met with the error: 'str' object has no attribute 'text'

Based on this HTML layout, what is the best way to find the right table and iterate through it, i.e. go through the td elements and select the text data appropriately? (Note that under the 'Note' heading there are also br elements that may need to be iterated through.) Thanks.
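For reference, the error most likely arises because .children yields the whitespace strings between tags as well as the tags themselves, and iterating over one of those strings produces plain str characters. A minimal sketch of one way around this, assuming the target table can be recognised by its heading row: SAMPLE_HTML is a hypothetical stand-in for driver.page_source (the real markup is only shown as an image), and extract_rows is an illustrative helper name, not something from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup is not shown in the post.
SAMPLE_HTML = """
<table>
  <tr><th>Effective Date</th><th>Type</th><th>Note</th></tr>
  <tr><td>01/02/2020</td><td>Change</td><td>First line<br>Second line</td></tr>
  <tr><td>03/04/2021</td><td>Renewal</td><td>Only line</td></tr>
</table>
"""

WANTED = ["Effective Date", "Type", "Note"]


def extract_rows(html, wanted=WANTED):
    """Return the data rows (as dicts) of the first table whose
    heading row contains every name in `wanted`, or [] if none match."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
        if not all(name in headers for name in wanted):
            continue
        records = []
        for tr in rows[1:]:
            # find_all returns only Tag objects, so bare strings are never visited.
            # separator="\n" keeps text on either side of a <br> on separate lines.
            cells = [td.get_text(separator="\n", strip=True) for td in tr.find_all("td")]
            if len(cells) == len(headers):
                records.append(dict(zip(headers, cells)))
        return records
    return []


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Matching on the heading text avoids relying on a fixed index such as [7], which breaks if the page gains or loses a table. If the real page nests tables inside one another, the recursive find_all("tr") would need narrowing.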

Thanks for that. After content = driver.page_source, try and see whether pandas will pull out the tables. If there are <table> tags, it will return them as a list of DataFrames.

So:

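from io import StringIO

import pandas as pd

content = driver.page_source
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
dfs = pd.read_html(StringIO(content))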

See if that returns your table in that list.看看这是否会在该列表中返回您的表。
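To check the list without inspecting each entry by hand, the DataFrames can be filtered by their column labels. The following is a rough sketch only: pick_table is a hypothetical helper name, and the sample dfs list is built by hand as a stand-in for the real pd.read_html output, which is not shown in the post.

```python
import pandas as pd


def pick_table(dfs, wanted=("Effective Date", "Type", "Note")):
    """Return the first DataFrame whose column labels include
    every name in `wanted`, or None if no table matches."""
    for df in dfs:
        columns = {str(col).strip() for col in df.columns}
        if set(wanted) <= columns:
            return df
    return None


# Hand-built stand-in for the list returned by pd.read_html (assumed data).
dfs = [
    pd.DataFrame({"A": [1]}),
    pd.DataFrame(
        {
            "Effective Date": ["01/02/2020"],
            "Type": ["Change"],
            "Note": ["First line"],
        }
    ),
]

table = pick_table(dfs)
print(table)
```

This assumes pandas picked up the heading row as column labels. If the page has no recognisable header row, the columns may come back as integers, in which case the headings would have to be matched against the first data row instead.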
