
Extracting tables using pandas read_html function?

This is an unusual problem. I am trying to extract a table from a certain website (the link can't be shared for security reasons). The table loads when the site is opened in a browser, but when we use Inspect Element on any value in that table, the table markup is not visible. It just shows <html>_</html> with some scripts and links inside. Initially I tried to extract the table using BeautifulSoup, but that was unsuccessful. Then I used pandas.read_html(html), but the site contains more than one table and the output is something like this:

[     Code                   Name  
 0    A                      John   
 1    B                      Terry
 2    C                      Kitty 


    Column 1 Column 2    Column 3
0       1   0.6173661242    8
1       2   0.7232098163    20
2       3   0.9954581943    39
3       4   0.5595425507    18
4       5   0.9644025159    20
5       6   0.3914102544    29
6       7   0.0154642132    49

....

[873 rows x 3 columns],

0\n\t\t\t\t\t\t\t\t\t  
 0                                                  0    ]

Then I tried pandas.read_html(html, match="Column 1"), which raises this error:

ValueError: No tables found matching pattern 'Column 1'

Any idea how we can use read_html to extract the tables?
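Since read_html returns a list of DataFrames, one option that avoids match altogether is to pick the wanted table out of that list by its column names. The following is only a minimal sketch: pick_table is a hypothetical helper (not part of pandas), and the small DataFrames are stand-ins built by hand from the output shown above, not the real site data.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns.

    Hypothetical helper for illustration; it is not a pandas function.
    """
    wanted = set(required_columns)
    for table in tables:
        # Column labels may be integers or padded strings, so compare
        # them as stripped strings.
        labels = {str(label).strip() for label in table.columns}
        if wanted <= labels:
            return table
    raise LookupError(f"no table has columns {sorted(wanted)}")


# Stand-ins for the list that pandas.read_html(html) returned in the question.
tables = [
    pd.DataFrame({"Code": ["A", "B", "C"], "Name": ["John", "Terry", "Kitty"]}),
    pd.DataFrame(
        {
            "Column 1": [1, 2, 3],
            "Column 2": [0.6173661242, 0.7232098163, 0.9954581943],
            "Column 3": [8, 20, 39],
        }
    ),
    pd.DataFrame({0: [0]}),
]

data = pick_table(tables, ["Column 1", "Column 2", "Column 3"])
print(data.shape)
```

With the real page, tables would be the list returned by pandas.read_html(html). If the position of the table is stable, plain indexing such as tables[1] does the same job.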

When scraping data off a secure website, the site may be using JavaScript to load the tables, so the HTML you download never contains the table markup. This could be why BeautifulSoup is not returning anything.

Does the "scripts and links inside" look like JavaScript?

Maybe have a look at Selenium?
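A rough sketch of the Selenium route, under several assumptions: Chrome and a matching driver are available, an HTML parser such as lxml is installed for read_html, and a fixed pause is long enough for the scripts to build the table. The names rendered_html, scrape_tables and settle_seconds are made up for this example.

```python
import time
from io import StringIO

import pandas as pd


def rendered_html(driver, url, settle_seconds=5.0):
    """Load url in a browser driver and return the page source after a pause."""
    driver.get(url)
    # Crude fixed pause so the page's scripts can build the table; an
    # explicit wait on a table element would be more robust.
    time.sleep(settle_seconds)
    return driver.page_source


def scrape_tables(url):
    """Return every table on the rendered page as a list of DataFrames."""
    # Imported here so rendered_html works with any driver-like object.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is installed
    try:
        html = rendered_html(driver, url)
    finally:
        driver.quit()
    # read_html raises ValueError if the rendered page has no tables.
    return pd.read_html(StringIO(html))
```

Calling scrape_tables with the page URL should give the same kind of list as in the question, but built from the page as the browser renders it rather than from the raw download.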
