Hello stackoverflowers,
I am hoping you can help me understand an issue I am having with a nested dictionary. I scraped some tables out of an Excel file: ['Table 5', 'Table 8', 'Table 40'].
What I got from the code I used was a nested dictionary, which I am not sure how to handle. These are the real pains of being a beginner, I guess. My aim is to transform the values into data frames using the keys (e.g. Table 5). The original table:
Example of dataframe:
d = {0: ['TB','VT','BT','CI','CH','CL','RT','RU','PV','PV','PV','PV','PV','RH','PV','PV','PV','PV','NaN','NaN','TB','VT','BT','CI','CH','CL','RT','RU','PV','PV'],
1: ['Table 1','BRAND. SUMMARY','Base: Floating Base (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','brand1','brand2','brand3','brand4','NPS','','NaN','Row1','Row2','Row3','NaN','NaN','Table 5','Brands Title 1','Base: All (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','Brand1','Brand2'],
2: ['NaN','NaN','NaN','(TOTAL)','Discrete monthly banner','Sept (a)','100','997','0.31','0.31','0.31','0.31','0.31','NaN','0.62','0.64','0.61','0.6','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Total','19479','19608','0.75','0.75'],
3: ['NaN','NaN','NaN','NaN','NaN','Oct (b)','1090','1100','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','TOTAL','Discrete monthly banner','Sept (a)','1000','1000','0.8','0.8'],
4: ['NaN','NaN','NaN','NaN','NaN','Nov (c)','3164','3191','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Oct (b)','1000','1000','0.8','0.8'],
5: ['NaN','NaN','NaN','NaN','NaN','Dec (d)','992','3999','0.31','0.31','0.31','0.31','0.31','NaN','0.51','0.47','0.67','0.61','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Nov (c)','1000','1000','0.8','0.8']}
When I print the table values and keys this is returned:
Line 174 should be my column header.
This is the code I used to scrape the tables from Excel:
ws = pd.read_excel(r'C:\Users\Tables.xlsx', sheet_name= "Percents", header = None, usecols="B:XFD")
table_names = ["Table 5", "Table 8", "Table 9", "Table 40"]
groups = ws[1].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:24] for k,g in ws.groupby(groups)}
# The comprehension above also returned tables I did not ask for, so I filter again by table name
filtered_d = dict((k, tables[k]) for k in table_names if k in tables)
I tried to adapt this code to return my values, but when I remove orient="index" or set orient="columns", I get an error. I am thinking a for loop could do the trick.
df = pd.DataFrame.from_dict({(i,j): filtered_d[i][j]
for i in filtered_d.keys()
for j in filtered_d[i].keys()}, orient="index")
How do I solve this while keeping the current table format and transforming each value into a dataframe?
Thank you in advance for whatever advice you can give me.
I'm not entirely certain what output you want, but with the supplied example we can have a shot. Is this what you are after?
import pandas as pd
df = pd.DataFrame({0: ['TB','VT','BT','CI','CH','CL','RT','RU','PV','PV','PV','PV','PV','RH','PV','PV','PV','PV','NaN','NaN','TB','VT','BT','CI','CH','CL','RT','RU','PV','PV'],
1: ['Table 1','BRAND. SUMMARY','Base: Floating Base (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','brand1','brand2','brand3','brand4','NPS','','NaN','Row1','Row2','Row3','NaN','NaN','Table 5','Brands Title 1','Base: All (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','Brand1','Brand2'],
2: ['NaN','NaN','NaN','(TOTAL)','Discrete monthly banner','Sept (a)','100','997','0.31','0.31','0.31','0.31','0.31','NaN','0.62','0.64','0.61','0.6','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Total','19479','19608','0.75','0.75'],
3: ['NaN','NaN','NaN','NaN','NaN','Oct (b)','1090','1100','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','TOTAL','Discrete monthly banner','Sept (a)','1000','1000','0.8','0.8'],
4: ['NaN','NaN','NaN','NaN','NaN','Nov (c)','3164','3191','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Oct (b)','1000','1000','0.8','0.8'],
5: ['NaN','NaN','NaN','NaN','NaN','Dec (d)','992','3999','0.31','0.31','0.31','0.31','0.31','NaN','0.51','0.47','0.67','0.61','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Nov (c)','1000','1000','0.8','0.8']})
tbl = df.drop(range(5), axis=0).drop(0, axis=1)
print(tbl)
Or perhaps you want to name the rows and columns appropriately:
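# Row labels: first column of tbl, below the header row
index = tbl.iloc[1:, 0]
# Column labels: header row of tbl, to the right of the label column
columns = tbl.iloc[0, 1:]
# Values: everything below the header row and right of the label column
data = tbl.iloc[1:, 1:]
# Pass a plain array so pandas assigns the new labels instead of reindexing (which would give all NaN)
tbl2 = pd.DataFrame(data.to_numpy(), index=index, columns=columns)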
In any case hopefully you can coerce it to the right format.
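If the goal is one labelled data frame per table name, the same idea can probably be applied in a loop. Note that each value of `filtered_d` is already a DataFrame, since `g.iloc[1:24]` returns one, so nothing needs to go through `from_dict`. The following is only a sketch on made-up toy data: the helper name `promote_header` and the `header_pos` argument are hypothetical, and it assumes the header row sits at a known position inside each slice.

```python
import pandas as pd

def promote_header(block, header_pos):
    """Use row `header_pos` (positional) as column labels and the first
    column as row labels; rows above the header row are dropped."""
    header = block.iloc[header_pos, 1:].to_numpy()
    body = block.iloc[header_pos + 1:]
    return pd.DataFrame(
        body.iloc[:, 1:].to_numpy(),       # plain array, so no reindexing happens
        index=body.iloc[:, 0].to_numpy(),  # first column becomes the row labels
        columns=header,
    )

# Toy stand-in for the sheet (assumption: table names are in the first column)
ws = pd.DataFrame({
    1: ["Table 1", "Base", "brand1", "Table 5", "NaN", "Base", "Brand1", "Brand2"],
    2: ["NaN", "100", "0.31", "NaN", "Sept (a)", "1000", "0.8", "0.75"],
    3: ["NaN", "1090", "0.31", "NaN", "Oct (b)", "1000", "0.8", "0.75"],
})
table_names = ["Table 5"]

# Same splitting logic as in the question: a new group starts at each table name
groups = ws[1].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for _, g in ws.groupby(groups)}
filtered_d = {k: tables[k] for k in table_names if k in tables}

# One labelled DataFrame per requested table
frames = {name: promote_header(block, header_pos=0)
          for name, block in filtered_d.items()}
print(frames["Table 5"])
```

In the real sheet the header is unlikely to be the first row of each slice, so `header_pos` would need adjusting (or looking up per table) to match where the month labels actually sit.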