import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
# read_html returns a list with one DataFrame per <table> on the page
df = pd.read_html(url, parse_dates=[0])
df1 = df[0]
df2 = df[1]
df3 = df[2]
df4 = df[3]
This is my code, and each table looks like this when printed:
0 1 2 3 4 5 6 7 8 9 ... 35 36 37 38 39 40 41 42 43 44
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION I. PRINCIPAL BALANCE CALCULATION ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... {1} Beginning of period aggregate Principal Ba... ... NaN NaN NaN NaN {1} NaN NaN NaN 205713029.83 NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
87 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88 Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest Class C Accrued Note Interest ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
90 NaN NaN Class Class NaN NaN BeginningNote Balance BeginningNote Balance NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91 {46} NaN NaN Class C NaN NaN NaN NaN NaN ... {46} NaN NaN NaN NaN NaN NaN NaN NaN
However, my project needs to extract the following specific line items:
Defaulted Receivables: Line 4
Ending Tranche Balance (all tranches): Line 19
Regular Principal Collections: Line 22
Recoveries: Line 23
Prepayments: Line 24
Interest Collections: Line 25 + Line 26 + Line 27
Ending Reserve Account Balance: Line 63
Ending Pool Balance: Line 79
60 Day Delinquencies: Line 84
90 Day Delinquencies: Line 85
90+ Day Delinquencies: Line 86 + Line 87
So how can I select specific rows (line items) from df, or how can I make my df more readable?
You can try this example to extract specified rows from the HTML:
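import requests
from bs4 import BeautifulSoup

def get_row(soup, n):
    # Select every <td> in the row whose text contains the marker "{n}" and keep the non-empty cells.
    # ':-soup-contains' is the current name of this selector; older soupsieve versions call it ':contains'.
    selector = 'tr:-soup-contains("{' + str(n) + '}") td'
    return [td.get_text(strip=True) for td in soup.select(selector) if td.get_text(strip=True)]

url = 'https://www.sec.gov/Archives/edgar/data/1383094/000095013120003579/d33910dex991.htm'
# Note: sec.gov may reject requests that do not send a descriptive User-Agent header.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

row_numbers = [4, 19, 22, 23, 24, 25, 26, 27, 63, 79, 84, 85, 86, 87]
for n in row_numbers:
    print(get_row(soup, n))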
Prints:
['{4}Defaulted Receivables', '{4}', '1,310,326.05']
['{19}End of period Note Balance', '{19}', '—', '—', '—', '—', '—', '103,359,894.20', '48,960,000.00', '152,319,894.20']
['{22}Principal Payments Received', '{22}', '8,508,993.67']
['{23}Liquidation Proceeds', '{23}', '1,417,885.33']
['{24}Principal on Repurchased Receivables', '{24}', '136,546.52']
['{25}Interest on Repurchased Receivables', '{25}', '7,927.83']
['{26}Interest collected on Receivables', '{26}', '2,584,253.82']
['{27}Other amounts received', '{27}', '27,116.85']
['{63}End of period Reserve Account balance', '{63}', '12,240,151.27']
['{79}Principal Balance of the Receivables', '{79}', '1,224,015,127.29', '205,713,029.83', '195,904,816.03']
['{84}31-60days', '{84}', '1,059', '12,688,115.93', '6.48', '%']
['{85}61-90days', '{85}', '397', '4,772,733.21', '2.44', '%']
['{86}91-120days', '{86}', '142', '1,628,631.34', '0.83', '%']
['{87}121 + days delinquent', '{87}', '—', '—', '0.00', '%']
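The printed rows are still lists of strings, so combined items such as Interest Collections (Line 25 + Line 26 + Line 27) need a conversion step first. Below is a minimal sketch of that step. The names `to_number` and `rows` are hypothetical, the sample rows are copied from the output above, and treating a dash as zero is an assumption.

```python
def to_number(text):
    """Convert a string such as '1,310,326.05' to a float.

    A dash or an empty cell is treated as 0.0 (an assumption, not something the filing states).
    """
    text = text.strip()
    if text in {"—", "-", ""}:
        return 0.0
    return float(text.replace(",", ""))

# Sample rows copied from the printed output: [label, marker, amount]
rows = {
    25: ['{25}Interest on Repurchased Receivables', '{25}', '7,927.83'],
    26: ['{26}Interest collected on Receivables', '{26}', '2,584,253.82'],
    27: ['{27}Other amounts received', '{27}', '27,116.85'],
}

# In these three rows the last cell holds the amount.
interest_collections = sum(to_number(rows[n][-1]) for n in (25, 26, 27))
print(round(interest_collections, 2))
```

Taking the last cell only works for rows shaped like these. In the delinquency rows ({84} to {87}) the last cell is '%', so the amount has to be picked by a different index.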
Three options come to mind:
df[1].dropna(axis=0,how='all')
This will drop all rows where all elements are NaN.
i = [1,3,5]
df[1].iloc[i]
If you know the positions of the desired rows, you can pull them out with iloc.
df[1].loc[df[1][0].notna()]
This will select only rows that aren't NaN within column 0. Likewise, loc can be used to match to specific strings within a column.
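The three options can be tried on a small stand-in table before running them against the real page. This is only a sketch: `table` is a hypothetical miniature of one parsed table, with labels in column 0 and values in column 1, and the string match on the `{4}` marker is one possible way to use loc for specific lines.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of one parsed table (not the real filing layout).
table = pd.DataFrame({
    0: [np.nan, "{4} Defaulted Receivables", np.nan, "{63} End of period Reserve Account balance"],
    1: [np.nan, "1,310,326.05", np.nan, "12,240,151.27"],
})

# Option 1: drop rows in which every element is NaN.
non_empty = table.dropna(axis=0, how="all")

# Option 2: pick rows by their integer position.
by_position = table.iloc[[1, 3]]

# Option 3: keep rows that have a label in column 0,
# then match a specific line marker as a literal string (regex=False).
labelled = table.loc[table[0].notna()]
line_4 = labelled.loc[labelled[0].str.contains("{4}", regex=False)]

print(line_4)
```

Matching on the marker text avoids hard-coding row positions, which may shift between filings.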