I am trying to extract the total assets value from 10-K filings. The problem is that the HTML format varies from one company to another.
Take Apple's 10-K as an example: total assets appears in a table that has a balance sheet header, and typical terms such as cash and inventories appear in some rows of that table. In the last row, there is a total of assets: 290,479 for 2015 and 231,839 for 2014. I want the number for 2015, which is 290,479. I have not been able to find a way that
1) finds the relevant table, identified by specific headings (like balance sheet) and words in its rows (cash, ...)
2) gets the value in the row that contains the words total assets and belongs to the greater year (2015 in this example).
My attempt so far:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
# find every text node matching "Total assets" and print its enclosing outer table
for tag in soup.find_all(text=re.compile(r'Total\sassets')):
    print(tag.findParent('table').findParent('table'))
Using lxml or html.parser instead of xml, I can get
title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 >
column 2 > $
column 3 > 290,479
column 4 >
column 5 >
column 6 > $
column 7 > 231,839
column 8 >
using this code:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # or "lxml"

# get all `b` to find title
all_b = soup.find_all('b')
for item in all_b:
    # check text in every `b`
    title = item.get_text(strip=True)
    if title == 'CONSOLIDATED BALANCE SHEETS':
        print('title >', title)
        # get first `table` after `b`
        table = item.parent.findNext('table')
        # all rows in table
        all_tr = table.find_all('tr')
        for tr in all_tr:
            # all columns in row
            all_td = tr.find_all('td')
            # text in first column
            text = all_td[0].get_text(strip=True)
            if text == 'Total assets':
                print('row >', text)
                for i, td in enumerate(all_td):
                    print('column', i, '>', td.get_text(strip=True))
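The printed columns still mix empty cells, `$` signs and the two amounts, so one more step is needed to end up with the 2015 number. Below is a minimal sketch of that step, using only the standard library. The helper names `parse_amounts` and `latest_amount` are hypothetical, and the sketch assumes you already know which years the amount columns belong to, in column order (for example by reading them from the table header). Other filings may lay out their cells differently, so treat it as a starting point rather than a general solution.

```python
import re


def parse_amounts(cells):
    """Return the integer amounts found in a row's cell texts, in column order."""
    amounts = []
    for text in cells:
        cleaned = text.strip()
        # Keep only cells that are a number with optional thousands separators,
        # optionally wrapped in parentheses (assumed here to mean a negative value).
        match = re.fullmatch(r'\(?(\d[\d,]*)\)?', cleaned)
        if match is None:
            continue
        value = int(match.group(1).replace(',', ''))
        amounts.append(-value if cleaned.startswith('(') else value)
    return amounts


def latest_amount(years, cells):
    """Return the amount that belongs to the greatest year.

    `years` is assumed to list the fiscal years of the amount columns,
    in the same left-to-right order as the amounts appear in `cells`.
    """
    amounts = parse_amounts(cells)
    if len(years) != len(amounts):
        raise ValueError('number of years does not match number of amounts')
    year, amount = max(zip(years, amounts), key=lambda pair: pair[0])
    return amount


# Cell texts exactly as printed for the "Total assets" row above
cells = ['Total assets', '', '$', '290,479', '', '', '$', '231,839', '']
print(latest_amount([2015, 2014], cells))  # 290479
```

In the loop above, `cells` would be `[td.get_text(strip=True) for td in all_td]` for the row whose first column is `Total assets`.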