Extract specific column from HTML

Question

import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'

# read_html returns a list of DataFrames, one per <table> found in the page
df = pd.read_html(url, parse_dates=[0])
df1=df[0]
df2=df[1]
df3=df[2]
df4=df[3]

This is my code, and each table comes out looking like this:

0   1   2   3   4   5   6   7   8   9   ... 35  36  37  38  39  40  41  42  43  44
0   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    I. PRINCIPAL BALANCE CALCULATION    ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   {1} Beginning of period aggregate Principal Ba...   ... NaN NaN NaN NaN {1} NaN NaN NaN 205713029.83    NaN
4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
87  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88  Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   Class C Accrued Note Interest   ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
90  NaN NaN Class   Class   NaN NaN BeginningNote Balance   BeginningNote Balance   NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91  {46}    NaN NaN Class C NaN NaN NaN    NaN NaN ... {46}    NaN NaN NaN    NaN NaN NaN NaN NaN

However, my project needs to extract the following specific columns:

Defaulted Receivables: Line 4
Ending Tranche Balance (all tranches): Line 19
Regular Principal Collections: Line 22
Recoveries: Line 23
Prepayments: Line 24
Interest Collections: Line 25 + Line 26 + Line 27
Ending Reserve Account Balance: Line 63
Ending Pool Balance: Line 79
60 Day Delinquencies: Line 84
90 Day Delinquencies: Line 85
90+ Day Delinquencies: Line 86 + Line 87

So how can I extract specific columns from df, or how can I make my df more readable?

Answer 1

You can try this example to extract specified rows from the HTML:

import requests
from bs4 import BeautifulSoup


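def get_row(soup, n):
    # Find the <tr> whose text contains the "{n}" marker and return its non-empty cells.
    # ':-soup-contains' is the current selector name; older soupsieve releases spelled it ':contains'.
    selector = 'tr:-soup-contains("{' + str(n) + '}") td'
    return [td.get_text(strip=True) for td in soup.select(selector) if td.get_text(strip=True)]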

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

row_numbers = [4, 19, 22, 23, 24, 25, 26, 27, 63, 79, 84, 85, 86, 87]

for n in row_numbers:
    print(get_row(soup, n))

Prints:

['{4}Defaulted Receivables', '{4}', '1,310,326.05']
['{19}End of period Note Balance', '{19}', '—', '—', '—', '—', '—', '103,359,894.20', '48,960,000.00', '152,319,894.20']
['{22}Principal Payments Received', '{22}', '8,508,993.67']
['{23}Liquidation Proceeds', '{23}', '1,417,885.33']
['{24}Principal on Repurchased Receivables', '{24}', '136,546.52']
['{25}Interest on Repurchased Receivables', '{25}', '7,927.83']
['{26}Interest collected on Receivables', '{26}', '2,584,253.82']
['{27}Other amounts received', '{27}', '27,116.85']
['{63}End of period Reserve Account balance', '{63}', '12,240,151.27']
['{79}Principal Balance of the Receivables', '{79}', '1,224,015,127.29', '205,713,029.83', '195,904,816.03']
['{84}31-60days', '{84}', '1,059', '12,688,115.93', '6.48', '%']
['{85}61-90days', '{85}', '397', '4,772,733.21', '2.44', '%']
['{86}91-120days', '{86}', '142', '1,628,631.34', '0.83', '%']
['{87}121 + days delinquent', '{87}', '—', '—', '0.00', '%']
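
The question asks for Interest Collections as Line 25 + Line 26 + Line 27, so the strings printed above still have to be turned into numbers. Here is a minimal sketch, assuming rows shaped like the output above (label, marker, then values). The names `to_number` and `values` are hypothetical helpers, and the rows are copied from the printed output rather than fetched from the URL.

```python
from decimal import Decimal, InvalidOperation

# Rows copied from the printed output above, keyed by report line number.
rows = {
    25: ['{25}Interest on Repurchased Receivables', '{25}', '7,927.83'],
    26: ['{26}Interest collected on Receivables', '{26}', '2,584,253.82'],
    27: ['{27}Other amounts received', '{27}', '27,116.85'],
    87: ['{87}121 + days delinquent', '{87}', '\u2014', '\u2014', '0.00', '%'],
}

def to_number(text):
    """Turn '1,310,326.05' into a Decimal; return None for dashes, '%', etc."""
    try:
        return Decimal(text.replace(',', ''))
    except InvalidOperation:
        return None

def values(row):
    """Numeric cells of a row, skipping the label and the '{n}' marker."""
    return [n for n in map(to_number, row[2:]) if n is not None]

# Interest Collections = Line 25 + Line 26 + Line 27
interest_collections = sum(values(rows[n])[0] for n in (25, 26, 27))
print(interest_collections)  # 2619298.50
```

Decimal is used here instead of float so that the cents add up exactly. Which value to take from multi-value rows such as line 19 or 79 depends on the report layout and is not decided by this sketch.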

Answer 2

Three options come to mind:

DataFrame.dropna()

df[1].dropna(axis=0,how='all')

This will drop all rows where all elements are NaN.
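
The frame in the question is also very wide, because most columns are empty or repeat a neighbouring cell (probably a side effect of merged cells in the HTML). The sketch below applies dropna on both axes and then removes repeated columns. It runs on a small hand-made frame whose values only mimic the shape of the real table, so treat it as an illustration rather than a tested recipe for the SEC page.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for one parsed table: empty rows, an empty column,
# and a label repeated across adjacent columns.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    ['{4} Defaulted Receivables', '{4} Defaulted Receivables', np.nan, '1310326.05'],
    [np.nan, np.nan, np.nan, np.nan],
    ['{22} Principal Payments Received', '{22} Principal Payments Received', np.nan, '8508993.67'],
])

clean = (
    raw.dropna(axis=0, how='all')   # drop rows that contain no values at all
       .dropna(axis=1, how='all')   # drop columns that contain no values at all
)

# Drop columns that duplicate an earlier column; transposing lets
# drop_duplicates compare columns as if they were rows.
clean = clean.T.drop_duplicates().T.reset_index(drop=True)
print(clean)
```

The surviving columns keep their original labels (0 and 3 here), which makes it easy to trace a value back to the raw table.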

indexing and iloc

i = [1,3,5]
df[1].iloc[i]

If you know the positions of the rows you want, you can pull them out with iloc.

notna and loc

df[1].loc[df[1][0].notna()]

This will select only rows that aren't NaN within column 0. Likewise, loc can be used to match to specific strings within a column.
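
The string matching mentioned above could look like the following sketch. It assumes column 0 holds labels that start with the `{n}` marker, as in the question's output. The sample frame and the `wanted` list are illustrative only.

```python
import pandas as pd

# Small stand-in for a parsed table: labels in column 0, values in column 1.
table = pd.DataFrame({
    0: ['{4} Defaulted Receivables', None,
        '{22} Principal Payments Received', '{23} Liquidation Proceeds'],
    1: ['1,310,326.05', None, '8,508,993.67', '1,417,885.33'],
})

wanted = [4, 23]  # report line numbers to keep

# Pull the number between the braces at the start of each label;
# rows without a marker become NaN and never match.
line_no = table[0].str.extract(r'^\{(\d+)\}', expand=False).astype(float)

selected = table.loc[line_no.isin(wanted)]
print(selected)
```

Matching on the extracted number rather than on a substring avoids selecting line 19 when searching for line 1.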


solution1
1 ACCEPTED 2020-09-26 06:30:16

solution2
0 2020-09-26 05:33:55
