
Pandas read_html missing some tables

I am using pandas read_html to find all tables in a specific webpage; however, the process seems to be missing some of the tables.

Here is the webpage: https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm

and here is my simple example:

import pandas as pd

df_list = pd.read_html("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm")

print(len(df_list))

This process finds 9 of the 17 tables. How can I use this method to find all the tables?

Note: if I try this on pages for other geographical areas, I have the same problem.
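Before digging into the page itself, it can help to confirm that read_html behaves as expected on well-formed markup — if it finds every table in a clean local snippet, the missing tables on the live page point to that page's HTML rather than to pandas. This is a minimal sketch using a made-up two-table snippet (the tables and values are illustrative, not from the USPTO page):

```python
import io
import pandas as pd

# A small, well-formed snippet with two tables (hypothetical data).
html = """
<html><body>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table><tr><th>b</th></tr><tr><td>2</td></tr></table>
</body></html>
"""

# read_html accepts a file-like object; it returns one DataFrame per <table>.
dfs = pd.read_html(io.StringIO(html))
print(len(dfs))  # 2
```

Both tables are found here, which suggests the dropped tables on the real page are a parsing issue with that page's markup; pd.read_html also takes a `flavor` argument to switch parser backends, which may be worth trying.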

It seems that the pd.read_html function can't find all of the table tags. I suggest using the BeautifulSoup package for this task, together with urllib2 (renamed urllib.request in Python 3) to fetch the page. You can install BeautifulSoup via pip install beautifulsoup4.

# urllib2 is Python 2 only; on Python 3 use urllib.request instead
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

html_text = urllib.request.urlopen("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/mclsstc/mcls1.htm").read()
bs_obj = BeautifulSoup(html_text, "html.parser")
tables = bs_obj.find_all('table')
dfs = list()
for table in tables:
    # parse each extracted <table> element individually
    df = pd.read_html(str(table))[0]
    dfs.append(df)

As a result, you'll have all of the tables (as DataFrames) in the dfs list.
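The same extract-each-table loop can be exercised offline, which also shows one common follow-up: stacking the per-table DataFrames into a single frame with pd.concat. The snippet below stands in for the fetched page with a hypothetical two-table HTML string (the column names and values are assumptions, not the USPTO data):

```python
import io
import pandas as pd
from bs4 import BeautifulSoup

# Local stand-in for the downloaded page (the real code fetches the USPTO URL).
html = """
<table><tr><th>Year</th><th>Count</th></tr><tr><td>2019</td><td>10</td></tr></table>
<table><tr><th>Year</th><th>Count</th></tr><tr><td>2020</td><td>12</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# One DataFrame per <table>, exactly as in the loop above.
dfs = [pd.read_html(io.StringIO(str(t)))[0] for t in soup.find_all("table")]
# Stack them into a single frame when the tables share a schema.
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)  # (2, 2)
```

Concatenation only makes sense when the tables share columns; otherwise keep them separate in the dfs list.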
