How to get a value from a text document that has an unstructured table
I am trying to get the total assets values from 10-K text filings. The problem is that the HTML format varies from one company to another.
Take the Apple 10-K as an example: total assets is in a table under a balance sheet header, and typical terms like cash, inventories, ... appear in some rows of that table. The last row gives the sum of assets: 290,479 for 2015 and 231,839 for 2014. I want to get the number for 2015, i.e. 290,479. I have not been able to find a way that
1) finds the relevant table that has some specific headings (like balance sheet) and words in rows (cash, ...)
2) gets the value in the row that contains the words total assets and belongs to the greater year (2015 in our example).
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
# print the table that encloses each "Total assets" text node
for tag in soup.find_all(text=re.compile(r'Total\sassets')):
    print(tag.findParent('table').findParent('table'))
Using lxml or html.parser instead of xml, I can get
title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 >
column 2 > $
column 3 > 290,479
column 4 >
column 5 >
column 6 > $
column 7 > 231,839
column 8 >
using this code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # or "lxml"

# get all `b` to find title
all_b = soup.find_all('b')
for item in all_b:
    # check text in every `b`
    title = item.get_text(strip=True)
    if title == 'CONSOLIDATED BALANCE SHEETS':
        print('title >', title)
        # get first `table` after `b`
        table = item.parent.find_next('table')
        # all rows in table
        all_tr = table.find_all('tr')
        for tr in all_tr:
            # all columns in row
            all_td = tr.find_all('td')
            # skip rows without any `td` (avoids IndexError below)
            if not all_td:
                continue
            # text in first column
            text = all_td[0].get_text(strip=True)
            if text == 'Total assets':
                print('row >', text)
                for i, td in enumerate(all_td):
                    print('column', i, '>', td.get_text(strip=True))
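The two steps from the question (pick the table by the words in its rows, then take the total assets value for the greater year) could be sketched roughly as follows. This is only a sketch under assumptions, not a general 10-K parser: `find_table`, `latest_year_value`, and `SAMPLE_HTML` are hypothetical names, the sample table is a made-up miniature with placeholder figures (only the two total assets values come from the question), and it assumes the first row that mentions years is the column header. It matches on row words rather than on the title, because the title may sit outside the table, as in the code above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature filing so the sketch runs offline.
# Only the "Total assets" figures come from the question; the rest are placeholders.
SAMPLE_HTML = """
<p><b>CONSOLIDATED STATEMENTS OF OPERATIONS</b></p>
<table>
<tr><td>Net sales</td><td>$</td><td>5,000</td><td>$</td><td>4,000</td></tr>
</table>
<p><b>CONSOLIDATED BALANCE SHEETS</b></p>
<table>
<tr><td></td><td></td><td>2015</td><td></td><td>2014</td></tr>
<tr><td>Cash and cash equivalents</td><td>$</td><td>1,000</td><td>$</td><td>900</td></tr>
<tr><td>Inventories</td><td></td><td>200</td><td></td><td>100</td></tr>
<tr><td>Total assets</td><td>$</td><td>290,479</td><td>$</td><td>231,839</td></tr>
</table>
"""

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")   # four-digit year such as 2015
NUMBER = re.compile(r"^\d[\d,]*$")         # plain amount such as 290,479


def find_table(soup, keywords):
    """Return the first table whose text contains every keyword, else None."""
    for table in soup.find_all("table"):
        text = table.get_text(" ", strip=True).lower()
        if all(keyword.lower() in text for keyword in keywords):
            return table
    return None


def latest_year_value(table, label):
    """Return the amount in the `label` row that belongs to the greatest year.

    Falls back to the first amount in the row when the number of amounts
    does not match the number of years found in the header row.
    """
    years = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["td", "th"])]
        if not cells:
            continue
        if not years:
            # Assumption: the first row that mentions years is the header.
            found = [int(y) for cell in cells for y in YEAR.findall(cell)]
            if found:
                years = found
                continue
        if cells[0] == label:
            amounts = [int(c.replace(",", "")) for c in cells[1:] if NUMBER.match(c)]
            if not amounts:
                return None
            if len(amounts) == len(years):
                return amounts[years.index(max(years))]
            return amounts[0]
    return None


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    table = find_table(soup, ["cash", "inventories", "total assets"])
    print(latest_year_value(table, "Total assets"))
```

On the sample this prints 290479. For a real filing the keyword list, the exact row label, and the header detection would likely need adjusting per company, since the layouts differ.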