BeautifulSoup scraping - expandable header text missing

Question

I was trying to extract data from the Yahoo Finance website using BeautifulSoup and store everything in a list. In the list, the headers of the expandable rows (Total Revenue, Operating Expense) are missing, but the figures are still there. Is there a way to include the headers in the output?

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

read_data = ur.urlopen(url).read()
soup = BeautifulSoup(read_data, 'lxml')

ls = []  # Create empty list
for l in soup.find_all('div'):
    ls.append(l.string)

# Drop None and empty entries
new_ls = list(filter(None, ls))

Current output:

 'Expand All',
 'ttm',
 '9/30/2019',
 '9/30/2018',
 '9/30/2017',
 '9/30/2016',
 '273,857,000',
 '260,174,000',
 '265,595,000',
 '229,234,000',
 '215,639,000',

Expected output:

 'Expand All',
 'ttm',
 '9/30/2019',
 '9/30/2018',
 '9/30/2017',
 '9/30/2016',
 'Total Revenue',
 '273,857,000',
 '260,174,000',
 '265,595,000',
 '229,234,000',
 '215,639,000',

Update: if I extract from "span" instead, the figures that are 0 are missing from the output, which creates another problem when I construct the data frame later on.

# Raw string so the backslash escapes reach the CSS selector unchanged
for l in soup.select(r'div.D\(tbr\)'):
    for n in l.select('span'):
        print(n.text)

Answer 1

I know this is kind of off topic, but it looks like you just want the data from Yahoo Finance, right? If so, there is already a Python package available that would probably be easier to work with than web scraping.

https://pypi.org/project/yahoo-finance/

You can create a share object:

from yahoo_finance import Share

apple = Share('AAPL')

You can then get a bunch of historical data with the following command:

from pprint import pprint
# Use the Share object created above (it is named apple, not yahoo)
pprint(apple.get_historical('2019-08-10', '2020-01-10'))

Answer 2

The following will get you all the data, and then you can filter out what you don't need:

for row in soup.select('div[data-test="fin-row"]'):     
    for r in row:
        for l in r:
            print(l.text)
    print('-------\n')

Output:

Total Revenue
273,857,000
260,174,000
265,595,000
-
215,639,000
-------

Cost of Revenue
169,277,000
161,782,000
163,756,000
-
131,376,000
-------

Gross Profit

etc.
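As a possible explanation of why the original loop lost the row headers, here is a minimal sketch. It runs against a small, hypothetical HTML fragment (SAMPLE_HTML is made up and much simpler than the real page), assuming the header cell holds a button plus a span while a missing figure is a bare "-" with no span. Under that assumption, `.string` returns None for the header cell because the tag has more than one child, whereas `stripped_strings` collects every text fragment in the row.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup imitating one expandable row; the real page differs.
SAMPLE_HTML = """
<div data-test="fin-row">
  <div class="D(tbr)">
    <div class="D(tbc)"><button aria-label="Total Revenue"></button><span>Total Revenue</span></div>
    <div class="D(tbc)"><span>273,857,000</span></div>
    <div class="D(tbc)">-</div>
    <div class="D(tbc)"><span>215,639,000</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The first cell has two children (button and span), so .string is None
# and the header is later filtered out together with the other None entries.
header_cell = soup.find("div", class_="D(tbc)")
print(header_cell.string)


def row_texts(row):
    """Return every non-empty text fragment inside a row, in document order."""
    return list(row.stripped_strings)


# One list per row: the header first, then the figures, including the "-" placeholder.
rows = [row_texts(row) for row in soup.select('div[data-test="fin-row"]')]
print(rows)
```

Because each row comes back as one list with the header in front, the "-" placeholders keep their position and the columns stay aligned.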

If you also want to get the column headers programmatically, try:

head_ind = [55,58,60,62,64,66]
for i in head_ind:
    heads = f'span[data-reactid="{i}"]:not([class])'
    for head in soup.select(heads):
        print(head.text)

Output:

Breakdown
ttm
9/30/2019
9/30/2018
9/30/2017
9/30/2016
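Since the question mentions building a data frame afterwards, here is a rough sketch of how the header row and the data rows printed in this answer could be combined with pandas. The lists are typed in by hand from the outputs shown here rather than scraped, and the helper name to_number is made up for illustration. It assumes "-" should become a missing value.

```python
import pandas as pd

# Values copied by hand from the outputs in this answer; a real run would scrape them.
headers = ["Breakdown", "ttm", "9/30/2019", "9/30/2018", "9/30/2017", "9/30/2016"]
rows = [
    ["Total Revenue", "273,857,000", "260,174,000", "265,595,000", "-", "215,639,000"],
    ["Cost of Revenue", "169,277,000", "161,782,000", "163,756,000", "-", "131,376,000"],
]


def to_number(text):
    """Turn '273,857,000' into 273857000.0 and the '-' placeholder into NaN."""
    if text == "-":
        return float("nan")
    return float(text.replace(",", ""))


# The first column holds the row labels, so it becomes the index.
df = pd.DataFrame(rows, columns=headers).set_index("Breakdown")

# Convert every remaining cell from a formatted string to a float.
df = df.apply(lambda col: col.map(to_number))
print(df)
```

Keeping the "-" cells as NaN instead of dropping them means every row still has one value per header column.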

Answer 1
2 2020-08-11 02:27:58

Answer 2
0 accepted 2020-08-11 02:50:57
