BeautifulSoup scraping - expandable header text missing

I was trying to extract data from the Yahoo Finance website using BeautifulSoup and store everything in a list. In the list, the headers of the expandable rows (Total Revenue, Operating Expense) are missing, but the figures are still there. Is there a way to include the headers in the output?

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

read_data = ur.urlopen(url).read()
soup = BeautifulSoup(read_data, 'lxml')

ls = []  # Create empty list
for l in soup.find_all('div'):
    ls.append(l.string)  # .string is None when a div has more than one child

new_ls = list(filter(None, ls))  # Drop the None entries

Current output:

 'Expand All',
 'ttm',
 '9/30/2019',
 '9/30/2018',
 '9/30/2017',
 '9/30/2016',
 '273,857,000',
 '260,174,000',
 '265,595,000',
 '229,234,000',
 '215,639,000',

Expected output:

 'Expand All',
 'ttm',
 '9/30/2019',
 '9/30/2018',
 '9/30/2017',
 '9/30/2016',
 'Total Revenue',
 '273,857,000',
 '260,174,000',
 '265,595,000',
 '229,234,000',
 '215,639,000',

Update: if I extract from the "span" elements instead, figures that are 0 are missing from the output, which creates another problem when I construct the data frame later on:

# Raw string so the backslashes reach the CSS selector unchanged
for l in soup.select(r'div.D\(tbr\)'):
    for n in l.select('span'):
        print(n.text)

I know this is kind of off topic, but it looks like you just want the data from Yahoo Finance, right? If so, there is a Python package already available that would probably be easier to work with than web scraping.

https://pypi.org/project/yahoo-finance/

You can look up a share by its ticker symbol:

from yahoo_finance import Share

apple = Share('AAPL')

You can then get a bunch of data with just the following command:

from pprint import pprint
pprint(apple.get_historical('2019-08-10', '2020-01-10'))

The following will get you all the data, and then you can filter out what you don't need:

for row in soup.select('div[data-test="fin-row"]'):     
    for r in row:
        for l in r:
            print(l.text)
    print('-------\n')

Output:

Total Revenue
273,857,000
260,174,000
265,595,000
-
215,639,000
-------

Cost of Revenue
169,277,000
161,782,000
163,756,000
-
131,376,000
-------

Gross Profit

etc.
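For what it's worth, here is a minimal offline sketch of why the row labels go missing with `.string`, and why selecting only `span` elements can drop cells. The HTML snippet and its class names (`label`, `cell`) are made up for illustration and are only an assumption about how the real page nests its elements; the actual markup may differ. It uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the label div holds a button AND a span,
# and one figure cell holds bare text with no span around it.
HTML = """
<div data-test="fin-row">
  <div class="label"><button aria-label="Total Revenue"></button><span>Total Revenue</span></div>
  <div class="cell"><span>273,857,000</span></div>
  <div class="cell"><span>260,174,000</span></div>
  <div class="cell">-</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# .string is None when a tag has more than one child, so the label is lost.
label = soup.select_one("div.label")
print(label.string)

# A div with a single child falls through to that child's string.
first_cell = soup.select_one("div.cell")
print(first_cell.string)

# Selecting spans skips any cell whose text is not wrapped in a span.
span_texts = [s.get_text() for s in soup.select('div[data-test="fin-row"] span')]
print(span_texts)

# stripped_strings walks every text node in the row, in document order,
# so both the label and the bare "-" cell are kept.
rows = [list(row.stripped_strings) for row in soup.select('div[data-test="fin-row"]')]
print(rows)
```

If the real page behaves like this snippet, collecting `stripped_strings` per row gives you one list per line item with the label first, which is usually easier to turn into a table than one flat list of every `div`.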

If you also want to get the headers programmatically, try:

head_ind = [55,58,60,62,64,66]
for i in head_ind:
    heads = f'span[data-reactid="{i}"]:not([class])'
    for head in soup.select(heads):
        print(head.text)

Output:

Breakdown
ttm
9/30/2019
9/30/2018
9/30/2017
9/30/2016
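Once you have the headers and the rows as lists, one possible way to line them up for a data frame is sketched below. The helper names `parse_figure` and `rows_to_records` are hypothetical, and the sketch assumes every figure is a whole number with thousands separators and that `-` marks a missing value. The sample rows are copied from the output above.

```python
def parse_figure(text):
    """Convert '273,857,000' to an int; treat '-' or empty text as missing."""
    cleaned = text.replace(",", "").strip()
    if cleaned in ("", "-"):
        return None
    return int(cleaned)


def rows_to_records(headers, rows):
    """Pair each row's label and figures with the column headers."""
    records = []
    for label, *figures in rows:
        record = {headers[0]: label}
        for name, figure in zip(headers[1:], figures):
            record[name] = parse_figure(figure)
        records.append(record)
    return records


headers = ["Breakdown", "ttm", "9/30/2019", "9/30/2018", "9/30/2017", "9/30/2016"]
rows = [
    ["Total Revenue", "273,857,000", "260,174,000", "265,595,000", "-", "215,639,000"],
    ["Cost of Revenue", "169,277,000", "161,782,000", "163,756,000", "-", "131,376,000"],
]

records = rows_to_records(headers, rows)
for record in records:
    print(record)
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(records)`. Because missing cells become `None` instead of disappearing, every row keeps the same number of columns.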
