
Extract specific column from HTML

import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'

# read_html returns a list of DataFrames, one per <table> on the page
df = pd.read_html(url, parse_dates=[0])
df1 = df[0]
df2 = df[1]
df3 = df[2]
df4 = df[3]

This is my code, and I can see each table like this:

0   1   2   3   4   5   6   7   8   9   ... 35  36  37  38  39  40  41  42  43  44
0   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   ... NaN NaN NaN NaN {1} NaN NaN NaN 205713029.83    NaN
4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
87  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88  Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
90  NaN NaN Class   Class   NaN NaN BeginningNote Balance   BeginningNote Balance   NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91  {46}    NaN NaN Class C NaN NaN NaN    NaN NaN ... {46}    NaN NaN NaN    NaN NaN NaN NaN NaN

However, my project needs to extract specific columns (identified by their line numbers in the report):

Defaulted Receivables: Line 4
Ending Tranche Balance (all tranches): Line 19
Regular Principal Collections: Line 22
Recoveries: Line 23
Prepayments: Line 24
Interest Collections: Line 25 + Line 26 + Line 27
Ending Reserve Account Balance: Line 63
Ending Pool Balance: Line 79
60 Day Delinquencies: Line 84
90 Day Delinquencies: Line 85
90+ Day Delinquencies: Line 86 + Line 87

So how can I get specific columns from df? Or how can I make my df more readable?

You can try this example to extract the specified rows from the HTML:

import requests
from bs4 import BeautifulSoup


def get_row(soup, n):
    # Select the cells of every <tr> whose text contains the marker "{n}".
    # ':-soup-contains' replaces the deprecated ':contains' (needs soupsieve >= 2.1).
    cells = soup.select('tr:-soup-contains("{' + str(n) + '}") td')
    # Keep only the non-empty cell texts.
    return [td.get_text(strip=True) for td in cells if td.get_text(strip=True)]

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

row_numbers = [4, 19, 22, 23, 24, 25, 26, 27, 63, 79, 84, 85, 86, 87]

for n in row_numbers:
    print(get_row(soup, n))

Prints:

['{4}Defaulted Receivables', '{4}', '1,310,326.05']
['{19}End of period Note Balance', '{19}', '—', '—', '—', '—', '—', '103,359,894.20', '48,960,000.00', '152,319,894.20']
['{22}Principal Payments Received', '{22}', '8,508,993.67']
['{23}Liquidation Proceeds', '{23}', '1,417,885.33']
['{24}Principal on Repurchased Receivables', '{24}', '136,546.52']
['{25}Interest on Repurchased Receivables', '{25}', '7,927.83']
['{26}Interest collected on Receivables', '{26}', '2,584,253.82']
['{27}Other amounts received', '{27}', '27,116.85']
['{63}End of period Reserve Account balance', '{63}', '12,240,151.27']
['{79}Principal Balance of the Receivables', '{79}', '1,224,015,127.29', '205,713,029.83', '195,904,816.03']
['{84}31-60days', '{84}', '1,059', '12,688,115.93', '6.48', '%']
['{85}61-90days', '{85}', '397', '4,772,733.21', '2.44', '%']
['{86}91-120days', '{86}', '142', '1,628,631.34', '0.83', '%']
['{87}121 + days delinquent', '{87}', '—', '—', '0.00', '%']
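
The rows printed above are still lists of strings. Here is a minimal sketch of turning them into numbers and combining them, using Interest Collections (Line 25 + Line 26 + Line 27 in the question) as the example. The helper `parse_amount` and the hard-coded `rows` dict are illustrative assumptions, with values copied from the output above. It also assumes that the first cell after the `{n}` marker holds the amount of interest, which may not hold for every row.

```python
from decimal import Decimal

# Rows copied from the printed output above, keyed by line number.
rows = {
    25: ['{25}Interest on Repurchased Receivables', '{25}', '7,927.83'],
    26: ['{26}Interest collected on Receivables', '{26}', '2,584,253.82'],
    27: ['{27}Other amounts received', '{27}', '27,116.85'],
}


def parse_amount(text):
    """Hypothetical helper: '1,234.56' -> Decimal('1234.56'); a dash or blank -> 0."""
    cleaned = text.replace(',', '').strip()
    if cleaned in ('—', '-', ''):
        return Decimal('0')
    return Decimal(cleaned)


# Assumption: index 2 (the first cell after the '{n}' marker) holds the amount.
interest_collections = sum(
    (parse_amount(rows[n][2]) for n in (25, 26, 27)),
    Decimal('0'),
)
print(interest_collections)
```

With the three values shown, this prints 2619298.50. Decimal is used rather than float so that the cents add up exactly.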

Three options come to mind:

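  1. dropna
df[1].dropna(axis=0, how='all')

This drops every row in which all elements are NaN.

  2. Indexing with iloc
i = [1, 3, 5]
df[1].iloc[i]

If I know the positions of the rows I want, I can pull them out with iloc.

  3. notna and loc
df[1].loc[df[1][0].notna()]

This selects only the rows where column 0 is not NaN. Similarly, loc can be used to match specific strings in a column.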
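
The three options, plus the string match mentioned at the end, can be sketched on a small stand-in table. The frame `table` below is made up to resemble the `read_html` output (all-NaN spacer rows, a label in column 0, a `{n}` marker in column 1). The real column positions in the filing may differ, so treat the column indices as assumptions.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for one table returned by pd.read_html.
table = pd.DataFrame([
    [np.nan, np.nan, np.nan],
    ['Defaulted Receivables', '{4}', '1,310,326.05'],
    [np.nan, np.nan, np.nan],
    ['Principal Payments Received', '{22}', '8,508,993.67'],
    ['Liquidation Proceeds', '{23}', '1,417,885.33'],
])

# Option 1: drop rows where every cell is NaN.
non_empty = table.dropna(axis=0, how='all')

# Option 2: pick rows by integer position.
by_position = table.iloc[[1, 3]]

# Option 3: keep rows whose column 0 is not NaN.
labelled = table.loc[table[0].notna()]

# String match with loc: keep rows whose marker column is one of the wanted lines.
wanted = ['{4}', '{23}']
selected = table.loc[table[1].isin(wanted)]

# Strip thousands separators and convert the amount column to numbers.
amounts = pd.to_numeric(selected[2].str.replace(',', '', regex=False))
print(selected.assign(amount=amounts))
```

Matching on the `{n}` marker rather than on row positions keeps the selection stable if blank spacer rows shift between filings.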

