Extracting tables using pandas read_html function?

Question

This is an unusual problem. I am trying to extract a table from a certain website (the link can't be given for security reasons). The site displays the table when opened in a browser, but when I use Inspect Element on any value in that table, the table markup is not visible. It just shows <html>_</html> with some scripts and links inside. Initially I tried to extract the table using BeautifulSoup, but that was unsuccessful. Then I used pandas.read_html(html), but the site contains more than one table and the output is something like this:

[     Code                   Name  
 0    A                      John   
 1    B                      Terry
 2    C                      Kitty 


    Column 1 Column 2    Column 3
0       1   0.6173661242    8
1       2   0.7232098163    20
2       3   0.9954581943    39
3       4   0.5595425507    18
4       5   0.9644025159    20
5       6   0.3914102544    29
6       7   0.0154642132    49

....

[873 rows x 3 columns],

0\n\t\t\t\t\t\t\t\t\t  
 0                                                  0    ]

Then I tried pandas.read_html(html, match="Column 1"), which returns this error:

ValueError: No tables found matching pattern 'Column 1'

Any idea how we can use read_html to extract the tables?

Answer 1

When scraping data from a secure website, the site may be using JavaScript to load the tables, so the table markup never appears in the HTML source you download. This could be why BeautifulSoup is not returning anything.

Does the "scripts and links inside" look like JavaScript?

Maybe have a look at Selenium? It drives a real browser, so the page's scripts run before you read the HTML.
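
A rough sketch of that idea, in case it helps. It assumes Selenium with a working Chrome driver is installed and that pandas has an HTML parser available (lxml, or bs4 with html5lib). The helper names rendered_html and pick_table are made up for illustration, and since the real page is unknown, waiting for a <table> element is only a guess at how to tell that the data has loaded. Because read_html returns a list of DataFrames, the second helper picks one by its column names instead of relying on match.

```python
from io import StringIO

import pandas as pd


def pick_table(html, required_columns):
    """Return the first parsed table whose header contains all required columns."""
    # read_html returns a list of DataFrames, one per <table> it can parse.
    tables = pd.read_html(StringIO(html))
    for table in tables:
        # Strip whitespace so stray tabs/newlines in headers do not break the match.
        columns = {str(c).strip() for c in table.columns}
        if set(required_columns) <= columns:
            return table
    raise LookupError(f"no table with columns {required_columns!r}")


def rendered_html(url, wait_seconds=10):
    """Open the page in a real browser so its scripts run, then return the HTML."""
    # Imported here so pick_table can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: the data is ready once at least one <table> exists in the page.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With your own URL in place of the placeholder, pick_table(rendered_html("https://example.com/page"), ["Column 1", "Column 2", "Column 3"]) would be the call to try. If the table sits inside an iframe or loads later than the first <table>, the wait condition will need adjusting.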

solution1
0 2016-08-30 16:01:55
