
Python - How to scrape a table from a website with a dropdown of available rows

I am trying to scrape the earnings calendar table from zacks.com; the URL is below.

https://www.zacks.com/stock/research/aapl/earnings-calendar

I want to scrape all the data in the table, but the page has a dropdown list that selects 10, 25, 50 or 100 rows per page. Ideally I would scrape all 100 rows, but when I select 100 from the dropdown list the URL doesn't change. My code is below.

Note that the website blocks requests based on the user-agent, so I had to use ChromeDriver to imitate a human visiting the page. pd.read_html returns a list of all the tables on the page, and d[4] is the earnings calendar with only 10 rows (which I want to change to 100).

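from selenium import webdriver
import pandas as pd

# Start Chrome through a local chromedriver binary
driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
# Parse every <table> in the rendered HTML into a DataFrame
content = driver.page_source
d = pd.read_html(content)
d[4]  # the earnings calendar, 10 rows by default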

Could anyone guide me on this?

Thanks!


UPDATE: It looks like my last post was downvoted for not being clearly articulated and not showing past research. Maybe I am still new to posting questions on this site. I have actually found several pages, including this page, with the same issue, but the solutions didn't work for me, which is why I am posting this as a new question.

Having taken a look, this is not going to be easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.

Option one:

Use Selenium to render the page so that the JavaScript runs. You can then use the id/class of the dropdown to interact with it, and scrape the data by reading the values in the table.
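
As a rough sketch of option one, the snippet below picks "100" in the dropdown and then hands the rendered HTML to pandas. The `name` attribute in `LENGTH_SELECT_NAME` is an assumption and should be checked in the browser's inspector; `earnings_url` and `scrape_tables` are hypothetical helper names, not part of any library.

```python
from io import StringIO

# Assumption: the name attribute of the rows-per-page <select> element.
# Verify it in the browser's developer tools before relying on it.
LENGTH_SELECT_NAME = "earnings_announcements_earnings_table_length"


def earnings_url(symbol):
    """Build the earnings-calendar URL used in the question."""
    return "https://www.zacks.com/stock/research/{}/earnings-calendar".format(symbol)


def scrape_tables(driver, symbol, rows="100"):
    """Load the page, pick `rows` in the dropdown, return all tables as DataFrames."""
    # Imported here so earnings_url() can be used without Selenium or pandas installed.
    import pandas as pd
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver.get(earnings_url(symbol))
    # Wait until the JavaScript has added the dropdown to the page.
    dropdown = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.NAME, LENGTH_SELECT_NAME))
    )
    # Choose the option whose visible label is e.g. "100".
    Select(dropdown).select_by_visible_text(rows)
    # Parse every <table> in the page as it is rendered now.
    return pd.read_html(StringIO(driver.page_source))
```

With the `driver` from the question, `scrape_tables(driver, 'AAPL')` would return the same kind of list as `pd.read_html(content)`. If the table redraws asynchronously after the selection, an extra wait before reading `page_source` may be needed.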

Option two:

This is the more challenging one. Look through the requests the page makes and the responses it receives, and try to find the ones that contain the data you then see on the page. By cross-referencing these you should find a way to request the data you want directly. You may find that you need to take a key from the response to the original page request and send it as part of a second request. This approach lets you scrape the data without running a Selenium instance, so it runs more efficiently.
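
A simple case of option two is when the table data is already embedded in the page's HTML as JSON inside a script tag. The sketch below covers only that case, using the standard library. The marker string in the usage note is a made-up placeholder to be replaced with whatever the page source actually shows; `fetch_html` and `extract_json_after` are hypothetical helpers, and the site may still refuse requests that do not come from a real browser.

```python
import json
import urllib.request


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download a page while sending a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_json_after(html, marker):
    """Return the first JSON value that follows `marker` in `html`, or None."""
    position = html.find(marker)
    if position == -1:
        return None
    start = position + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while start < len(html) and html[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(html, start)
    return value
```

For example, if the page source contained `var tableData = {...};`, then `extract_json_after(fetch_html(url), "var tableData =")` would return the parsed object, which could then be passed to `pd.DataFrame`. The name `tableData` here is purely illustrative.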

My personal suggestion is to go with option one, as computer resources are cheap and developer time is expensive.

