
Pandas read_html missing some tables

I am using pandas read_html to find all tables in a specific webpage; however, the process seems to be missing some of the tables.

Here is the webpage: https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm

and here is my simple example:

import pandas as pd

df_list = pd.read_html("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm")

print(len(df_list))

This process finds 9 of the 17 tables. How can I use this method to find all the tables?

Note: if I try this on pages for other geographical areas, I have the same problem.
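Before digging into the page itself, it can help to confirm that read_html behaves as expected on well-formed markup — if it finds every table in a clean local snippet, the missing tables on the live page point to that page's HTML rather than to pandas. This is a minimal sketch using a made-up two-table snippet (the tables and values are illustrative, not from the USPTO page):

```python
import io
import pandas as pd

# A small, well-formed snippet with two tables (hypothetical data).
html = """
<html><body>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table><tr><th>b</th></tr><tr><td>2</td></tr></table>
</body></html>
"""

# read_html accepts a file-like object; it returns one DataFrame per <table>.
dfs = pd.read_html(io.StringIO(html))
print(len(dfs))  # 2
```

Both tables are found here, which suggests the dropped tables on the real page are a parsing issue with that page's markup; pd.read_html also takes a `flavor` argument to switch parser backends, which may be worth trying.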

It seems that the pd.read_html function can't find all of the table tags. I suggest using the BeautifulSoup package for this task, together with urllib2 (renamed urllib.request in Python 3) to fetch the page. You can install BeautifulSoup via pip install beautifulsoup4.

# urllib2 is Python 2 only; on Python 3 use urllib.request instead
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

html_text = urllib.request.urlopen("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm").read()
bs_obj = BeautifulSoup(html_text, "html.parser")
tables = bs_obj.find_all('table')
dfs = list()
for table in tables:
    # parse each extracted <table> element individually
    df = pd.read_html(str(table))[0]
    dfs.append(df)

As a result, you'll have all of the tables (as DataFrames) in the dfs list.
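The same extract-each-table loop can be exercised offline, which also shows one common follow-up: stacking the per-table DataFrames into a single frame with pd.concat. The snippet below stands in for the fetched page with a hypothetical two-table HTML string (the column names and values are assumptions, not the USPTO data):

```python
import io
import pandas as pd
from bs4 import BeautifulSoup

# Local stand-in for the downloaded page (the real code fetches the USPTO URL).
html = """
<table><tr><th>Year</th><th>Count</th></tr><tr><td>2019</td><td>10</td></tr></table>
<table><tr><th>Year</th><th>Count</th></tr><tr><td>2020</td><td>12</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# One DataFrame per <table>, exactly as in the loop above.
dfs = [pd.read_html(io.StringIO(str(t)))[0] for t in soup.find_all("table")]
# Stack them into a single frame when the tables share a schema.
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)  # (2, 2)
```

Concatenation only makes sense when the tables share columns; otherwise keep them separate in the dfs list.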
