How to get data from multiple links on a single website's HTML and tabulate it

This code runs and returns multiple links from a single website, which is named in the code. The website draws its data from multiple links and tabulates it as a single table.

Can you suggest what changes to make to this code in order to get the data and tabulate it, without importing any further libraries?

    # Import libraries
    import re
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import urllib.request as ur
    from bs4 import BeautifulSoup

    # Fetch the raw HTML and print it
    s = ur.urlopen("https://financials.morningstar.com/ratios/r.html?t=AAPL")
    s1 = s.read()
    print(s1)

    # Parse the page and print its title and text
    soup = BeautifulSoup(ur.urlopen('https://financials.morningstar.com/ratios/r.html?t=AAPL'), "html.parser")
    title = soup.title
    print(title)

    text = soup.get_text()
    print(text)

    # Collect every href that contains "http"
    links = []
    for link in soup.find_all(attrs={'href': re.compile("http")}):
        links.append(link.get('href'))

    print(links)

The expected result is a table of the listed ratios, where each ratio can be represented as a dictionary with the year as the key and the ratio as the value.
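The target shape described above can be sketched in plain Python, with no extra libraries. The ratio names, years, and numbers below are placeholders made up for illustration, not Morningstar data, and `rows_to_dict` is a hypothetical helper name.

```python
# Placeholder input: a header of years plus one row per ratio.
# These labels and values are invented for illustration only.
years = ["2016", "2017", "2018"]
rows = [
    ["Ratio A", 1.0, 2.0, 3.0],
    ["Ratio B", 4.0, 5.0, 6.0],
]


def rows_to_dict(years, rows):
    """Map each ratio name to a {year: value} dictionary."""
    return {name: dict(zip(years, values)) for name, *values in rows}


table = rows_to_dict(years, rows)
print(table["Ratio A"]["2017"])  # 2.0
```

Each ratio then becomes a lookup by year, for example `table["Ratio B"]["2018"]`.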

1) Here is one way with selenium and pandas. You can view the final structure here. The content is loaded by JavaScript, so I think you will likely need additional libraries.

2) The page makes a call to this URL:

https://financials.morningstar.com/finan/financials/getKeyStatPart.html?&callback=jsonp1555262165867&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1555262166853

which returns JSON containing the info for the page. You might try using requests with that.
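A rough sketch of option 2 follows, using `urllib.request` (already imported in the question) rather than requests. It assumes, based only on the `callback=` parameter in the URL, that the response is JSON wrapped in a JSONP callback. The helper names `strip_jsonp` and `fetch_payload` are hypothetical, the layout of the returned JSON is not shown here, and the endpoint may no longer respond.

```python
import json
import re
import urllib.request as ur

# Endpoint quoted in the answer; it may have changed or been retired since.
URL = (
    "https://financials.morningstar.com/finan/financials/getKeyStatPart.html"
    "?&callback=jsonp1555262165867&t=XNAS:AAPL&region=usa&culture=en-US"
    "&cur=&order=asc&_=1555262166853"
)

# Matches "callbackName( ... )" with an optional trailing semicolon.
JSONP_WRAPPER = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def strip_jsonp(text):
    """Parse JSON, first removing a JSONP wrapper if one is present."""
    match = JSONP_WRAPPER.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


def fetch_payload(url=URL):
    """Download the endpoint and return the decoded JSON object."""
    with ur.urlopen(url) as response:
        return strip_jsonp(response.read().decode("utf-8"))


# Usage (requires network access and a live endpoint):
# data = fetch_payload()
# print(list(data))  # inspect the keys before deciding how to parse further
```

If the payload turns out to carry the tables as an HTML string, that string could then be handed to BeautifulSoup or `pd.read_html`; check the keys first rather than assuming a particular structure.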

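    # Approach 1: Selenium renders the JavaScript, pandas parses the tables
    from io import StringIO

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    d = webdriver.Chrome()
    try:
        d.get('https://financials.morningstar.com/ratios/r.html?t=AAPL')
        # Wait up to 10 seconds for the JavaScript-rendered tables to appear
        tables = WebDriverWait(d, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-profitability table"))
        )
        results = []
        for table in tables:
            # Newer pandas versions expect a file-like object rather than a raw HTML string
            t = pd.read_html(StringIO(table.get_attribute('outerHTML')))[0].dropna()
            years = t.columns[1:]
            for row in t.itertuples(index=True, name='Pandas'):
                # row[0] is the index, row[1] the ratio name, row[2:] the yearly values
                results.append({row[1]: dict(zip(years, row[2:]))})
        print(results)
    finally:
        # Close the browser even if the wait or the parsing fails
        d.quit()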

You end up with all 17 rows listed. The first two rows are shown here, with row 2 expanded to show the pairing of years with values.
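To go from the `results` list built by the Selenium code to a single table, one option is to merge the one-key dictionaries and hand them to pandas. This is a sketch under assumptions: the sample `results` below is a stand-in with invented labels and numbers, and `results_to_frame` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for the `results` list; labels and numbers are placeholders,
# not real Morningstar figures.
results = [
    {"Ratio A": {"2017": 1.0, "2018": 2.0}},
    {"Ratio B": {"2017": 3.0, "2018": 4.0}},
]


def results_to_frame(results):
    """Merge one-key records into a DataFrame: one row per ratio, one column per year."""
    merged = {}
    for record in results:
        # A ratio name that appears twice keeps only its last record.
        merged.update(record)
    return pd.DataFrame.from_dict(merged, orient="index")


frame = results_to_frame(results)
print(frame)
```

With `orient="index"` the outer keys become row labels and the years become columns, so `frame.loc["Ratio B", "2018"]` looks up a single value.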
