BeautifulSoup scraping - expandable header text missing
I was trying to extract data from a Yahoo Finance page using BeautifulSoup and store everything in a list. In the list, the headers of the expandable rows (Total Revenue, Operating Expense) are missing, but the figures are still there. Is there a way to include the headers in the output?
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur
url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'
read_data = ur.urlopen(url).read()
soup = BeautifulSoup(read_data, 'lxml')
ls = []  # Create empty list
for l in soup.find_all('div'):
    ls.append(l.string)
new_ls = list(filter(None, ls))
Current output:
'Expand All',
'ttm',
'9/30/2019',
'9/30/2018',
'9/30/2017',
'9/30/2016',
'273,857,000',
'260,174,000',
'265,595,000',
'229,234,000',
'215,639,000',
Expected output:
'Expand All',
'ttm',
'9/30/2019',
'9/30/2018',
'9/30/2017',
'9/30/2016',
'Total Revenue',
'273,857,000',
'260,174,000',
'265,595,000',
'229,234,000',
'215,639,000',
Update: if I extract from "span" instead, figures that are 0 are missing from the output, which creates another problem when I construct the data frame later on.
for l in soup.select(r'div.D\(tbr\)'):
    for n in l.select('span'):
        print(n.text)
I know this is kind of off topic, but it looks like you just want the data from Yahoo Finance, right? If so, there is a Python package already available that would probably be easier to work with than web scraping.
https://pypi.org/project/yahoo-finance/
You can create a share object:
from yahoo_finance import Share
apple = Share('AAPL')
And then get a bunch of data with the following command:
from pprint import pprint
pprint(apple.get_historical('2019-08-10', '2020-01-10'))
The following will get you all the data, and then you can filter out what you don't need:
for row in soup.select('div[data-test="fin-row"]'):
    for r in row:
        for l in r:
            print(l.text)
    print('-------\n')
Output:
Total Revenue
273,857,000
260,174,000
265,595,000
-
215,639,000
-------
Cost of Revenue
169,277,000
161,782,000
163,756,000
-
131,376,000
-------
Gross Profit
etc.
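One likely reason the labels went missing in the question's loop: Tag.string returns None whenever a tag has more than one child, while .text (or get_text()) joins the text of all descendants, which is what the loop above relies on. The sketch below shows the difference on a small hand-written HTML fragment. That fragment is an assumption made for illustration only; it is not Yahoo's real markup, and the class names in it are made up.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one expandable row; NOT Yahoo's actual markup.
html = """
<div data-test="fin-row">
  <div class="label"><button aria-label="expand"></button><span>Total Revenue</span></div>
  <div class="value"><span>273,857,000</span></div>
  <div class="value">-</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
label_div, value_div, dash_div = soup.select('div[data-test="fin-row"] > div')

print(label_div.string)      # None: the div has two children (button and span)
print(label_div.get_text())  # Total Revenue
print(value_div.string)      # 273,857,000: a single chain of children, so .string works
print(dash_div.get_text())   # -

# Selecting only <span> elements skips the bare "-" cell.
span_texts = [s.get_text() for s in soup.select('div[data-test="fin-row"] span')]

# Reading each cell div with get_text() keeps every column.
cell_texts = [d.get_text(strip=True)
              for d in soup.select('div[data-test="fin-row"] > div')]
print(cell_texts)
```

If the real page behaves like this fragment, that would also explain the update in the question: cells that hold no span (the empty or zero figures) drop out when only span elements are selected.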
If you also want to get the headers programmatically, try:
head_ind = [55,58,60,62,64,66]
for i in head_ind:
    heads = f'span[data-reactid="{i}"]:not([class])'
    for head in soup.select(heads):
        print(head.text)
Output:
Breakdown
ttm
9/30/2019
9/30/2018
9/30/2017
9/30/2016
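To get from these printed values to something a data frame can consume, the header list and each row can be paired up column by column. The sketch below is plain Python and assumes the texts have already been collected into lists; the helper names parse_amount and build_records are hypothetical, and the sample values are copied from the outputs above.

```python
def parse_amount(text):
    """Convert '273,857,000' to an int; treat '-' (no value) as None."""
    text = text.strip()
    if text == "-":
        return None
    return int(text.replace(",", ""))


def build_records(headers, rows):
    """Pair each row's cells with the column headers."""
    records = []
    for cells in rows:
        if len(cells) != len(headers):
            raise ValueError(f"expected {len(headers)} cells, got {len(cells)}")
        # The first column holds the row label, the rest hold figures.
        record = {headers[0]: cells[0]}
        for name, cell in zip(headers[1:], cells[1:]):
            record[name] = parse_amount(cell)
        records.append(record)
    return records


headers = ["Breakdown", "ttm", "9/30/2019", "9/30/2018", "9/30/2017", "9/30/2016"]
rows = [
    ["Total Revenue", "273,857,000", "260,174,000", "265,595,000", "-", "215,639,000"],
    ["Cost of Revenue", "169,277,000", "161,782,000", "163,756,000", "-", "131,376,000"],
]

records = build_records(headers, rows)
print(records[0]["ttm"])        # 273857000
print(records[0]["9/30/2017"])  # None
```

With pandas installed, pd.DataFrame(records) should turn this list of dicts into a table, with the None entries showing up as missing values.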