How to get values from a text document with unstructured tables

Question

I am trying to get the total assets values from 10-K text filings. The problem is that the HTML format varies from one company to another.

Take Apple's 10-K as an example: total assets is in a table that has a balance sheet header, and typical terms like cash, inventories, ... appear in some rows of that table. The last row gives the sum of assets: 290,479 for 2015 and 231,839 for 2014. I want the 2015 number, 290,479. I have not been able to find a way that

1) finds the relevant table that has some specific headings (like balance sheet) and words in rows (cash, ...)

2) gets the value in the row that contains the words total assets and belongs to the later year (2015 in our example).

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
# for each "Total assets" match, print the table that encloses its own table
for tag in soup.find_all(text=re.compile(r'Total\sassets')):
    print(tag.findParent('table').findParent('table'))

Answer 1

Using lxml or html.parser instead of xml, I can get

title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 > 
column 2 > $
column 3 > 290,479
column 4 > 
column 5 > 
column 6 > $
column 7 > 231,839
column 8 >

using this code:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # 'lxml' also works

# get all `b` to find title
all_b = soup.find_all('b')
for item in all_b:
    # check text in every `b`
    title = item.get_text(strip=True)
    if title == 'CONSOLIDATED BALANCE SHEETS':
        print('title >', title)
        # get first `table` after `b`
        table = item.parent.findNext('table')
        # all rows in table
        all_tr = table.find_all('tr')
        for tr in all_tr:
            # all columns in row
            all_td = tr.find_all('td')
            # skip rows without any `td` cells to avoid an IndexError
            if not all_td:
                continue
            # text in first column
            text = all_td[0].get_text(strip=True)
            if text == 'Total assets':
                print('row >', text)
                for i, td in enumerate(all_td):
                    print('column', i, '>', td.get_text(strip=True))
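
The code above prints the cells but stops short of the two things the question asks for: recognising the table by the words in it, and pulling out the 2015 number. A minimal sketch of both steps follows. The helper names `looks_like_balance_sheet` and `parse_amounts` are hypothetical (they are not part of the original answer or of BeautifulSoup), and the sketch assumes the most recent year is the first numeric column, as in the Apple output shown above. Other filings may order the columns differently.

```python
import re

# Hypothetical helpers, for illustration only.

def looks_like_balance_sheet(table_text, keywords=('cash', 'inventories', 'total assets')):
    """Return True if every keyword occurs in the table's text (case-insensitive)."""
    lowered = table_text.lower()
    return all(word in lowered for word in keywords)


def parse_amounts(cells):
    """Return the numeric cells of one table row as ints, in column order.

    Cells such as '', '$' or 'Total assets' are skipped. A leading '('
    is treated as a negative sign, a common convention in financial tables.
    """
    amounts = []
    for cell in cells:
        text = cell.strip()
        match = re.fullmatch(r'\(?(\d[\d,]*)\)?', text)
        if match is None:
            continue
        value = int(match.group(1).replace(',', ''))
        amounts.append(-value if text.startswith('(') else value)
    return amounts


# Cell texts copied from the "Total assets" row printed above.
row = ['Total assets', '', '$', '290,479', '', '', '$', '231,839', '']
amounts = parse_amounts(row)
print(amounts[0])  # first numeric column, assumed to be the most recent year
```

Inside the loop from the answer, `parse_amounts([td.get_text(strip=True) for td in all_td])` would produce the same list, and `looks_like_balance_sheet(table.get_text(' ', strip=True))` could replace the exact title comparison for filings whose heading is worded differently. The keyword list is a guess and would need tuning per company.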

Answer 1: 0, accepted, 2017-12-10 19:25:11
