How to get data from multiple links on a single website and tabulate it

This code runs and extracts multiple links to data from a single website. The website is referenced in the code; its data is spread across multiple links, which should then be tabulated into one single table.

Can you suggest what changes to make in this code in order to get the data and tabulate it without importing any further libraries?

    #import libraries
    import re 
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import urllib.request as ur
    from bs4 import BeautifulSoup

    s = ur.urlopen("https://financials.morningstar.com/ratios/r.html?t=AAPL")
    s1 = s.read()
    print(s1)

    soup = BeautifulSoup(ur.urlopen('https://financials.morningstar.com/ratios/r.html?t=AAPL'),"html.parser")
    title = soup.title
    print(title)

    text = soup.get_text()
    print(text)

    links = []
    for link in soup.find_all(attrs={'href': re.compile("http")}):
        links.append(link.get('href'))

    print(links)

The expected result should be a tabular form of the ratios, each of which can be represented as a dictionary with the year as the key and the ratio as the value.
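For illustration, the target structure might look something like this (the ratio names and numbers below are placeholders, not real figures):

    # Illustrative target structure only; ratio names and values are placeholders
    expected = [
        {"Example Ratio A": {"2017": 1.23, "2018": 1.45}},
        {"Example Ratio B": {"2017": 0.67, "2018": 0.71}},
    ]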

1) Here is one way with selenium and pandas; the code and the resulting structure are shown below. The content is JavaScript-loaded, so I think it is likely you will need additional libraries.

2) There was a call being made to this:

https://financials.morningstar.com/finan/financials/getKeyStatPart.html?&callback=jsonp1555262165867&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1555262166853

that returns JSON containing the info for the page. You might try using requests with that.
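A minimal sketch of that idea is below. Assumptions: the query parameters are just the ones captured above, the JSONP wrapper is inferred from the callback parameter, and the exact payload structure may differ; urllib.request from the question could make the same call if you want to avoid extra libraries. The selenium/pandas code for approach 1) follows after this sketch.

    import re
    import json
    import requests

    # Endpoint captured from the browser's network traffic (see URL above);
    # the callback/timestamp parameters may need adjusting or can be dropped
    url = ('https://financials.morningstar.com/finan/financials/getKeyStatPart.html'
           '?&callback=jsonp1555262165867&t=XNAS:AAPL&region=usa&culture=en-US'
           '&cur=&order=asc&_=1555262166853')
    resp = requests.get(url)

    # The callback parameter suggests a JSONP response like jsonp123({...});
    # strip the wrapper before parsing the JSON body
    match = re.search(r'\((\{.*\})\)', resp.text, re.S)
    if match:
        data = json.loads(match.group(1))
        print(list(data.keys()))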

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    import copy

    d = webdriver.Chrome()
    d.get('https://financials.morningstar.com/ratios/r.html?t=AAPL')

    # Wait until the JavaScript-rendered ratio tables are present in the DOM
    tables = WebDriverWait(d, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-profitability table")))
    results = []

    for table in tables:
        # Parse each rendered table's HTML into a DataFrame and drop empty rows
        t = pd.read_html(table.get_attribute('outerHTML'))[0].dropna()
        years = t.columns[1:]
        for row in t.itertuples(index=True, name='Pandas'):
            # row[1] is the ratio name; pair the remaining cells with the year headers
            record = {row[1]: dict(zip(years, row[2:]))}
            results.append(copy.deepcopy(record))
    print(results)

    d.quit()

You end up with all 17 rows being listed, each record pairing the years with the corresponding values.
