
Web scraping JS content with Python (Yahoo Finance)

I am currently struggling with this Yahoo Finance page: https://sg.finance.yahoo.com/quote/1B0.SI/history?period1=1426780800&period2=1489939200&interval=div%7Csplit&filter=split&frequency=1mo

I need to get the date and ratio of each stock split, but when I dug into the JSON embedded in the page I could not find any of this information.

I'm using the script mentioned in "How to understand this raw HTML of Yahoo! Finance when retrieving data using Python?":

import json
import re
from pprint import pprint as pp

import requests
from bs4 import BeautifulSoup

url = 'https://sg.finance.yahoo.com/quote/1B0.SI/history?period1=1426780800&period2=1489939200&interval=div%7Csplit&filter=split&frequency=1mo'
# Name the parser explicitly; BeautifulSoup warns when it has to guess one.
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# Find the inline script that assigns the page state to root.App.main.
script = soup.find("script", string=re.compile(r"root\.App\.main")).text
# Raw strings keep \s and \{ from being treated as string escapes.
data = json.loads(re.search(r"root\.App\.main\s+=\s+(\{.*\})", script).group(1))
stores = data["context"]["dispatcher"]["stores"]
pp(stores)

Please let me know if you have any idea where I can find it.

Thanks!

My guess is that you can do this with selenium, as follows.

>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('https://sg.finance.yahoo.com/quote/1B0.SI/history?period1=1426780800&period2=1489939200&interval=div%7Csplit&filter=split&frequency=1mo')
>>> driver.get('https://sg.finance.yahoo.com/quote/1B0.SI/history?period1=1426780800&period2=1489939200&interval=div%7Csplit&filter=split&frequency=1mo')
>>> tableRows = driver.find_elements_by_xpath('//tr')
>>> len(tableRows)
5
>>> tableRows[1].text
'Date Open High Low Close Adj close* Volume'
>>> tableRows[2].text
'Oct 11, 2016 2/1 Stock split'
>>> tableRows[3].text
'Feb 25, 2016 2/1 Stock split'
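
The row text above still has to be split into a date and a ratio. A minimal sketch of one way to do that, assuming every split row follows the 'Mon DD, YYYY N/M Stock split' layout shown in the output, that the ratio parts are whole numbers, and that month names are parsed in an English locale. The names SPLIT_ROW and parse_split_row are hypothetical helpers, not part of selenium.

```python
import re
from datetime import datetime

# Pattern for rows such as 'Oct 11, 2016 2/1 Stock split'.
# The layout is assumed from the three rows printed in the session above.
SPLIT_ROW = re.compile(r"^(\w{3} \d{1,2}, \d{4}) (\d+)/(\d+) Stock split$")


def parse_split_row(text):
    """Return (date, numerator, denominator) for a split row, else None."""
    match = SPLIT_ROW.match(text.strip())
    if match is None:
        return None  # header row or any row that is not a stock split
    day = datetime.strptime(match.group(1), "%b %d, %Y").date()
    return day, int(match.group(2)), int(match.group(3))


# Row texts copied from the session above; with a live driver these would
# come from [row.text for row in tableRows].
rows = [
    "Date Open High Low Close Adj close* Volume",
    "Oct 11, 2016 2/1 Stock split",
    "Feb 25, 2016 2/1 Stock split",
]
splits = [parsed for parsed in map(parse_split_row, rows) if parsed is not None]
for day, numerator, denominator in splits:
    print(day.isoformat(), f"{numerator}/{denominator}")
```

Rows that do not match the pattern, such as the header, are skipped rather than raising, so the same function can be applied to every row in the table.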

Note in particular that I had to load the page twice, because the first load failed. The selenium documentation explains how to deal with this kind of failure. (Use a try-except rather than an assert.) The main difficulty in scraping this page is that the rendered HTML is not visible in the page source. I assumed that the desired content would be in a table, and that assumption proved correct.
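
The try-except advice could look roughly like the sketch below. It assumes the failed load surfaces as an exception raised by driver.get; if the page instead loads without the table, the check would have to look for the rows rather than catch an error. The function name get_with_retry and its attempts and delay parameters are hypothetical, not part of selenium.

```python
import time


def get_with_retry(driver, url, attempts=3, delay=2.0):
    """Call driver.get(url), retrying on failure; return the number of tries used."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            driver.get(url)
            return attempt + 1
        except Exception as error:
            # Broad on purpose for this sketch; with selenium one would likely
            # narrow this to selenium.common.exceptions.WebDriverException.
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # brief pause before the next try
    raise last_error
```

With the driver from the session above, a single call to get_with_retry(driver, url) would replace the two consecutive driver.get calls.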


 