
Incomplete data after scraping a website for data

I am doing some web scraping with Python and ran into issues extracting table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder . Below is a snapshot of the tables I am trying to scrape values from.

Here is the code I am using for the scraping.

#Import packages
import pandas as pd
import requests

#Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list,
                                   headers={'User-agent': 'Mozilla/5.0'}).text)


#printing the scraped data to screen 
print(etf_df)

# Output the read data into dataframes
frame = {}  # container for the individual tables
for i in range(0, len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])

I have several issues.

  • The tables only contain 20 entries each, while the total entries per table on the website should be 2166. How do I amend the code to pull all the values?
  • Some of the dataframes could not be properly assigned after scraping the site. For example, the output for frame[0] is not in DataFrame format, and nothing shows up for frame[0] when trying to view it as a DataFrame in the Python console. However, it seems fine when printed to the screen. Would it be better if I parsed the HTML using BeautifulSoup instead?

[screenshot of the ETF finder table]

You get only 20 rows of the table because only 20 rows are present in the HTML page by default. View the source code of the page you are trying to parse. A possible solution would be to iterate through the pagination until the end, but the pagination there is implemented with JS and is not reflected in the URL, so I don't see how you can access the next pages of the table directly.

Looks like there is a request to

http://www.etf.com/etf-finder-funds-api//-aum/100/100/1

on that page when I try to load the second group of 100 rows. But getting access to that URL might be very tricky, if it is possible at all. Maybe for this particular site you should use something like WebBrowser in C# (I don't know what the equivalent is in Python, but I'm sure Python can do it). That would let you imitate a browser and execute the JavaScript.

Edit: I've tried running the following JS code in the console on the page you provided.

jQuery.ajax({
  url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1", 
  success: function(data) {
    console.log(JSON.parse(data));
  }
});

It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL "0" is a start index and "3000" is a limit.
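
If you want to do this from Python while still letting the browser execute the JS (the WebBrowser-style approach mentioned above), here is a rough sketch using Selenium. Selenium, a Chrome driver, and the exact behaviour of the page are my own assumptions, not part of the original answer; it simply reuses the page's own jQuery to fire the same AJAX call as in the console snippet.

# Rough sketch: drive a real browser and run the AJAX call from inside the page.
# Assumes Selenium and a Chrome driver are installed (not verified here).
import json
from selenium import webdriver

driver = webdriver.Chrome()
driver.set_script_timeout(30)
driver.get("http://www.etf.com/etfanalytics/etf-finder")

# The async callback (arguments[0]) hands the response text back to Python.
script = """
var done = arguments[0];
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) { done(data); }
});
"""
raw = driver.execute_async_script(script)
rows = json.loads(raw)
print(len(rows))  # should be 2166 if the site still behaves as described
driver.quit()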

But if you try this from some other domain you will get 403 Forbidden. This is because they have a Referer header check.

Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.

As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1 , which checks the Referer header to see if you're allowed to see it.

However, getting access to that URL is not as tricky as Alex suggests.
It is in fact very easy to send custom headers using requests:

>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166

At this point, data is a list of dicts containing all the data you need; pandas has a simple way of loading it into a dataframe.
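
For example, one way to finish the job (a sketch only; the actual column names depend on the fields the API returns, which I haven't verified):

import pandas as pd
import requests

r = requests.get(
    'http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1',
    headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'},
)
# Each dict in the returned list becomes one row of the DataFrame.
etf_df = pd.DataFrame(r.json())
print(etf_df.shape)  # expect (2166, <number of fields returned by the API>)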
