简体   繁体   中英

Extract specific column from HTML

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'

df = pd.read_html(url, parse_dates=[0])
df1=df[0]
df2=df[1]
df3=df[2]
df4=df[3]

This is my code and I can see every table like this

0   1   2   3   4   5   6   7   8   9   ... 35  36  37  38  39  40  41  42  43  44
0   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   ... NaN NaN NaN NaN {1} NaN NaN NaN 205713029.83    NaN
4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
87  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88  Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
90  NaN NaN Class   Class   NaN NaN BeginningNote Balance   BeginningNote Balance   NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91  {46}    NaN NaN Class C NaN NaN NaN    NaN NaN ... {46}    NaN NaN NaN    NaN NaN NaN NaN NaN

However, my project need to extract the specific columns:

Defaulted Receivables: Line 4
Ending Tranche Balance (all tranches): Line 19
Regular Principal Collections: Line 22
Recoveries: Line 23
Prepayments: Line 24
Interest Collections: Line 25 + Line 26 + Line 27
Ending Reserve Account Balance: Line 63
Ending Pool Balance: Line 79
60 Day Delinquencies: Line 84
90 Day Delinquencies: Line 85
90+ Day Delinquencies: Line 86 + Line 87

So how can I take specific column from df? or how can I make my df more readible?

You can try this example to extract specified rows from the HTML:

import requests
from bs4 import BeautifulSoup


def get_row(soup, n):
    return [td.get_text(strip=True) for td in soup.select('tr:contains("{' + str(n) + '}") td') if td.get_text(strip=True)]

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

row_numbers = [4, 19, 22, 23, 24, 25, 26, 27, 63, 79, 84, 85, 86, 87]

for n in row_numbers:
    print(get_row(soup, n))

Prints:

['{4}Defaulted Receivables', '{4}', '1,310,326.05']
['{19}End of period Note Balance', '{19}', '—', '—', '—', '—', '—', '103,359,894.20', '48,960,000.00', '152,319,894.20']
['{22}Principal Payments Received', '{22}', '8,508,993.67']
['{23}Liquidation Proceeds', '{23}', '1,417,885.33']
['{24}Principal on Repurchased Receivables', '{24}', '136,546.52']
['{25}Interest on Repurchased Receivables', '{25}', '7,927.83']
['{26}Interest collected on Receivables', '{26}', '2,584,253.82']
['{27}Other amounts received', '{27}', '27,116.85']
['{63}End of period Reserve Account balance', '{63}', '12,240,151.27']
['{79}Principal Balance of the Receivables', '{79}', '1,224,015,127.29', '205,713,029.83', '195,904,816.03']
['{84}31-60days', '{84}', '1,059', '12,688,115.93', '6.48', '%']
['{85}61-90days', '{85}', '397', '4,772,733.21', '2.44', '%']
['{86}91-120days', '{86}', '142', '1,628,631.34', '0.83', '%']
['{87}121 + days delinquent', '{87}', '—', '—', '0.00', '%']

Three options come to mind:

  1. pd.dropna()
df[1].dropna(axis=0,how='all')

This will drop all rows where all elements are NaN.

  1. indexing and iloc
i = [1,3,5]
df[1].iloc[i]

If I know the position of my desired rows then I can pull them out with iloc

  1. pd.isnull and loc
df[1].loc[pd.isnull(df[1][0])==False]

This will select only rows that aren't NaN within column 0. Likewise, loc can be used to match to specific strings within a column.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM