
Extracting tables using pandas read_html function?

This is an unusual problem. I am trying to extract a table from a certain website (I can't share the link for security reasons). The table loads when the site is opened in a browser, but when I use Inspect Element on any value in that table, the table markup is not visible. It just shows <html>_</html> with some scripts and links inside. Initially I tried to extract the table using BeautifulSoup, but that was unsuccessful. Then I used pandas.read_html(html), but the site contains more than one table, and the output looks something like this:

[     Code                   Name  
 0    A                      John   
 1    B                      Terry
 2    C                      Kitty 


    Column 1 Column 2    Column 3
0       1   0.6173661242    8
1       2   0.7232098163    20
2       3   0.9954581943    39
3       4   0.5595425507    18
4       5   0.9644025159    20
5       6   0.3914102544    29
6       7   0.0154642132    49

....

[873 rows x 3 columns],

0\n\t\t\t\t\t\t\t\t\t  
 0                                                  0    ]

Then I tried pandas.read_html(html, match="Column 1"), which raises this error:

ValueError: No tables found matching pattern 'Column 1'

Any idea how we can use read_html to extract the tables?

When scraping data from a secure website, the site may be using JavaScript to load the tables, so the HTML you download never contains the table markup. This could be why BeautifulSoup is not returning anything.

Does the "scripts and links inside" look like JavaScript?
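One rough way to check this is to count the <table> and <script> tags in the HTML the server actually returns, before any script has run. The sketch below uses only the standard library; the `TableCounter` class, the `count_tables_and_scripts` helper, and the sample `raw_html` string are made up for illustration and are not taken from the site in question.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> and <script> start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            self.tables += 1
        elif tag == "script":
            self.scripts += 1


def count_tables_and_scripts(html):
    """Return (number of <table> tags, number of <script> tags)."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables, parser.scripts


# Hypothetical static response from a script-driven page: scripts, no tables.
raw_html = """
<html><head><script src="app.js"></script>
<link rel="stylesheet" href="style.css"></head>
<body><div id="root"></div></body></html>
"""

print(count_tables_and_scripts(raw_html))
```

If the table count is zero while the browser clearly shows a table, the table is probably being built client-side, and neither BeautifulSoup nor read_html will find it in the raw response.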

Maybe have a look at Selenium?
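A minimal sketch of that idea, assuming the table is rendered by scripts in the browser: let Selenium load the page, wait until a <table> element exists, then pass the rendered page source to pandas.read_html. The function name `read_rendered_tables`, its `wait_seconds` parameter, and the ten-second default are assumptions for illustration. Whether the real site needs a login, a longer wait, or a more specific locator cannot be known from the question.

```python
from io import StringIO


def read_rendered_tables(driver, url, wait_seconds=10):
    """Load `url` in a Selenium driver, wait for a <table>, and parse all tables.

    `driver` is expected to be a Selenium WebDriver (for example webdriver.Chrome()).
    Returns the list of DataFrames produced by pandas.read_html.
    """
    # Imported here so the third-party packages are only needed when called.
    import pandas as pd
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Block until at least one <table> is present in the rendered DOM.
    WebDriverWait(driver, wait_seconds).until(
        EC.presence_of_element_located((By.TAG_NAME, "table"))
    )
    # page_source reflects the DOM after scripts have run.
    return pd.read_html(StringIO(driver.page_source))
```

To use it, create a driver with `from selenium import webdriver; driver = webdriver.Chrome()`, call `tables = read_rendered_tables(driver, url)` with the real page address, and call `driver.quit()` when finished. Because read_html returns a list, the table wanted can then be picked by position, for example `tables[1]`.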
