
Beautiful Soup only returning 100 rows from Yahoo! Finance

I am just getting started with web scraping and thought I was making good progress using the script below, with Beautiful Soup parsing simple Yahoo! Finance data. The script works, but it only returns 100 rows even though I requested a full year's worth. I found some SO posts suggesting I add a data argument to the request that sets the AJAX and mobile flags to "no", but that didn't work. I also tried passing different header info, and that didn't do it either. Does Beautiful Soup have an argument I am missing that would return the full list? When I print the full HTML content from the request, the full results are in there, so I am stumped.

from datetime import datetime, timedelta
import time
import requests
from bs4 import BeautifulSoup

def format_date_int_as_str(date_datetime):
    # Convert a datetime to the Unix-epoch seconds string Yahoo expects in the URL.
    date_timetuple = date_datetime.timetuple()
    date_mktime = time.mktime(date_timetuple)
    date_int = int(date_mktime)
    date_str = str(date_int)
    return date_str

def subdomain(symbol, start, end, filter='history'):
    # Build the /quote/<symbol>/history path with period1/period2 epoch bounds.
    subdoma = "/quote/{0}/history?period1={1}&period2={2}&interval=1d&filter={3}&frequency=1d"
    return subdoma.format(symbol, start, end, filter)
 
def header_function(subdomain):
    # Browser-like request headers; "path" mirrors the history query path.
    hdrs = {"authority": "finance.yahoo.com",
            "method": "GET",
            "path": subdomain,
            "scheme": "https",
            "accept": "text/html",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "en-US,en;q=0.9",
            "cache-control": "no-cache",
            "dnt": "1",
            "pragma": "no-cache",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "same-origin",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64)"}
    return hdrs
if __name__ == '__main__':
    symbol = 'AAPL'

    # Window: from 365 days ago back to 100 days ago.
    dt_start = datetime.today() - timedelta(days=365)
    dt_end = datetime.today() - timedelta(days=100)

    start = format_date_int_as_str(dt_start)
    end = format_date_int_as_str(dt_end)

    sub = subdomain(symbol, start, end)
    header = header_function(sub)
    url = "https://finance.yahoo.com" + sub

print("\nREQUESTING: " + url + "\n" + str(dt_start) + "  to  " + str(dt_end))

# Build HTML request and content for BS
r = requests.get(url, headers=header)
c = r.content
# Yahoo's atomic-CSS class strings (note "seperatorColor" is Yahoo's own spelling).
classNameTr = "BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)"
classNameDt = "Py(10px) Ta(start) Pend(10px)"
classNameTd = "Py(10px) Pstart(10px)"
className = " Pb(10px) Ovx(a) W(100%)"

soup = BeautifulSoup(c, "html.parser")
rows = soup.find_all("tr", {"class": classNameTr})
print("LIST LENGTH: " + str(len(rows)))
# https://finance.yahoo.com/quote/AAPL/history?period1=1572244075&period2=1615447675&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
# https://finance.yahoo.com/quote/AAPL/history?period1=1572244075&period2=1615447675&interval=1d&filter=history&frequency=1d
# Why aren't we getting the full look-back list? Check whether it's in the HTML but missing from the soup. Does soup have a 100-item limit?
for stockDay in rows:
    dt = stockDay.find("td", {"class": classNameDt})
    td = stockDay.find_all("td", {"class": classNameTd})
    if len(td) == 6:
        print(dt.text + " --|-- OPEN:" + td[0].text + " --|-- HIGH:" + td[1].text
              + " --|-- LOW:" + td[2].text + " --|-- CLOSE:" + td[3].text
              + " --|-- ADJCLOSE:" + td[4].text + " --|-- VOLUME:" + td[5].text)
    else:
        print(dt.text + " --|-- Skipping -- this is the dividend row!")

The reason you are getting only 100 results is that the page loads the rest dynamically as you scroll down. Beautiful Soup only parses the HTML it is given; it cannot execute JavaScript or trigger that lazy loading, so what you are asking is not possible with Beautiful Soup alone. I would suggest looking into Selenium, which drives a real browser and can scroll the page for you.
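Here is a rough sketch of that approach (not a drop-in implementation): it assumes Chrome with a matching chromedriver is available, and it reuses the url and classNameTr values built in your script.

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)  # the same history URL built in your script

last_height = 0
while True:
    # Scroll to the bottom so Yahoo renders the next batch of rows.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # crude fixed wait; an explicit wait would be more robust
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, so the full table is rendered
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
rows = soup.find_all("tr", {"class": classNameTr})
print("LIST LENGTH: " + str(len(rows)))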

Though if you still prefer plain Beautiful Soup here, I believe you can split the period into time spans short enough that each page renders all of its rows without scrolling, scrape each span, and concatenate the results. A sketch of that idea follows.
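This sketch assumes the helpers from your script (subdomain, header_function, format_date_int_as_str) and classNameTr are in scope. The 120-day window is a guess chosen to keep each page well under 100 trading days of rows; tune it as needed.

from datetime import datetime, timedelta

import requests
from bs4 import BeautifulSoup

symbol = "AAPL"
window = timedelta(days=120)  # ~82 trading days per request
cursor = datetime.today() - timedelta(days=365)
stop = datetime.today() - timedelta(days=100)

rows = []
while cursor < stop:
    chunk_end = min(cursor + window, stop)
    # Build and fetch one short window of the history page.
    sub = subdomain(symbol,
                    format_date_int_as_str(cursor),
                    format_date_int_as_str(chunk_end))
    r = requests.get("https://finance.yahoo.com" + sub,
                     headers=header_function(sub))
    chunk_soup = BeautifulSoup(r.content, "html.parser")
    rows.extend(chunk_soup.find_all("tr", {"class": classNameTr}))
    cursor = chunk_end

print("TOTAL ROWS: " + str(len(rows)))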

More information can be found here:

scraping a website that requires you to scroll down
