Extracting text from a table in a web page using BeautifulSoup

Question

I have a table from a webpage that I am attempting to extract the text data from.

A snippet of the HTML table looks as follows:

You can see the table headings 'Effective Date', 'Type' and 'Note'; I would like to extract the text data under them.

I have used the following code to attempt to extract the data:

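from bs4 import BeautifulSoup

content = driver.page_source  # driver is an existing Selenium WebDriver
soup = BeautifulSoup(content)

# Walk the direct children of the eighth table on the page
for child in soup.find_all('table')[7].children:
    for td in child:
        print(td.text)  # this is the line that raises the error below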

However, I am met with the error: 'str' object has no attribute 'text'

Based on this HTML layout, what is the best way to find the right table, iterate through its td's, and select the text data appropriately? (Note that under the 'Note' heading there are also br's that may need to be iterated through.) Thanks.
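The error most likely comes from the text nodes: .children yields the whitespace strings between tags as well as the tags themselves, and looping over such a string gives single characters, which are plain str objects with no .text attribute. A minimal sketch of one way around this is to ask for the tr and td tags explicitly with find_all. The HTML below is a hypothetical stand-in, since the real page markup is not shown, and find_table_by_headings is an illustrative helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup is not shown.
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
<table>
  <tr><th>Effective Date</th><th>Type</th><th>Note</th></tr>
  <tr><td>2020-01-01</td><td>Amendment</td><td>First line<br>Second line</td></tr>
  <tr><td>2020-06-30</td><td>Renewal</td><td>Single line</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def find_table_by_headings(soup, headings):
    """Illustrative helper: first <table> whose <th> cells include every heading."""
    for table in soup.find_all("table"):
        found = {th.get_text(strip=True) for th in table.find_all("th")}
        if set(headings) <= found:
            return table
    return None


table = find_table_by_headings(soup, ["Effective Date", "Type", "Note"])

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")  # header rows hold only <th>, so they give no cells
    if cells:
        # separator="\n" keeps text on either side of a <br> on separate lines
        rows.append([td.get_text(separator="\n", strip=True) for td in cells])

for row in rows:
    print(row)
```

Selecting the table by its headings rather than by position (such as index 7) should hold up better if the page gains or loses other tables.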

Answer 1

Thanks for that. After content = driver.page_source, try pandas and see if it will pull out the tables. If there are <table> tags, pd.read_html will put them into a list of DataFrames, one per table.

So:

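from io import StringIO

import pandas as pd

content = driver.page_source
# Wrapping in StringIO avoids the literal-HTML deprecation warning in recent pandas
dfs = pd.read_html(StringIO(content))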

See if that returns your table in that list.
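If the list comes back with many tables, read_html's match argument can narrow it down to tables containing a given piece of text. The sketch below assumes the headings from the question appear as th cells; the HTML is a hypothetical stand-in for driver.page_source, not the real page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for driver.page_source; the real markup is not shown.
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
<table>
  <tr><th>Effective Date</th><th>Type</th><th>Note</th></tr>
  <tr><td>2020-01-01</td><td>Amendment</td><td>First note</td></tr>
  <tr><td>2020-06-30</td><td>Renewal</td><td>Second note</td></tr>
</table>
"""

# match= keeps only the tables that contain text matching this string or regex
dfs = pd.read_html(StringIO(html), match="Effective Date")

table = dfs[0]
print(table[["Effective Date", "Type", "Note"]])
```

Note that read_html needs an HTML parser backend installed (lxml, or bs4 together with html5lib).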

Answer 1 (score 0), 2021-01-20 15:47:06
