
How to extract values from a nested Dictionary as Pandas DataFrames

Hello stackoverflowers,

I am hoping you can help me understand an issue I am having with a nested dictionary. I extracted some tables from an Excel file: ['Table 5', 'Table 8', 'Table 40']. The code I used returned a nested dictionary, which I am not sure how to handle. These are the real pains of being a beginner, I guess. My aim is to transform the values into DataFrames using the keys (e.g. Table 5). The original table: [image]

Example of dataframe:

d = {0: ['TB','VT','BT','CI','CH','CL','RT','RU','PV','PV','PV','PV','PV','RH','PV','PV','PV','PV','NaN','NaN','TB','VT','BT','CI','CH','CL','RT','RU','PV','PV'], 
     1: ['Table 1','BRAND. SUMMARY','Base: Floating Base (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','brand1','brand2','brand3','brand4','NPS','','NaN','Row1','Row2','Row3','NaN','NaN','Table 5','Brands Title 1','Base: All (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','Brand1','Brand2'],
     2: ['NaN','NaN','NaN','(TOTAL)','Discrete monthly banner','Sept (a)','100','997','0.31','0.31','0.31','0.31','0.31','NaN','0.62','0.64','0.61','0.6','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Total','19479','19608','0.75','0.75'],
     3: ['NaN','NaN','NaN','NaN','NaN','Oct (b)','1090','1100','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','TOTAL','Discrete monthly banner','Sept (a)','1000','1000','0.8','0.8'],
     4: ['NaN','NaN','NaN','NaN','NaN','Nov (c)','3164','3191','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Oct (b)','1000','1000','0.8','0.8'],
     5: ['NaN','NaN','NaN','NaN','NaN','Dec (d)','992','3999','0.31','0.31','0.31','0.31','0.31','NaN','0.51','0.47','0.67','0.61','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Nov (c)','1000','1000','0.8','0.8']}

When I print the table values and keys this is returned:

[image]

Line 174 should be my column header.

This is the code I used to scrape the tables from Excel:

import pandas as pd

ws = pd.read_excel(r'C:\Users\Tables.xlsx', sheet_name="Percents", header=None, usecols="B:XFD")

table_names = ["Table 5", "Table 8", "Table 9", "Table 40"]
# Running count of title rows: every row gets the number of the last wanted title above it
groups = ws[1].isin(table_names).cumsum()
# Key each group by its first cell and keep the 23 rows below the title row
tables = {g.iloc[0, 0]: g.iloc[1:24] for _, g in ws.groupby(groups)}
# The dict above also contains the rows that precede the first wanted table,
# so filter again on the table names
filtered_d = {k: tables[k] for k in table_names if k in tables}
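
To make the grouping step easier to follow, here is a minimal sketch on a toy frame. The data is invented for illustration and is far smaller than the real sheet; the column labels 1 and 2 are an assumption meant to mirror what the code above indexes with ws[1]. It suggests that each value in the resulting dictionary is already a DataFrame.

```python
import pandas as pd

# Toy stand-in for the sheet (invented data): column 1 holds titles and row labels.
ws = pd.DataFrame({
    1: ["Table 1", "Base", "brand1", "Table 5", "Base", "Brand1", "Brand2"],
    2: [None, "100", "0.31", None, "19479", "0.75", "0.75"],
})
table_names = ["Table 5"]

# isin() is True only on wanted title rows, so the running sum steps up by one
# at each of them and tags every following row with the same group number.
groups = ws[1].isin(table_names).cumsum()

# Rows above the first wanted title land in group 0, which is why an unwanted
# key ("Table 1" here) shows up and a second filtering pass is needed.
tables = {g.iloc[0, 0]: g.iloc[1:] for _, g in ws.groupby(groups)}
filtered = {k: tables[k] for k in table_names if k in tables}

print(type(filtered["Table 5"]))
print(filtered["Table 5"])
```

If this matches the real data, filtered_d["Table 5"] can be used directly as a DataFrame without going through pd.DataFrame.from_dict.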

I tried to adapt the code below to return my values, but when I remove orient="index" or set orient="columns" I get an error. I think a for loop could do the trick.

df = pd.DataFrame.from_dict({(i,j): filtered_d[i][j] 
                           for i in filtered_d.keys() 
                           for j in filtered_d[i].keys()}, orient="index")

How do I solve this while keeping the current table format and transforming each value into a DataFrame?

Thank you in advance for whatever advice you can give me.

I'm not entirely certain what output you want, but with the supplied example we can have a shot. Is this what you are after?

import pandas as pd
df = pd.DataFrame({0: ['TB','VT','BT','CI','CH','CL','RT','RU','PV','PV','PV','PV','PV','RH','PV','PV','PV','PV','NaN','NaN','TB','VT','BT','CI','CH','CL','RT','RU','PV','PV'], 
     1: ['Table 1','BRAND. SUMMARY','Base: Floating Base (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','brand1','brand2','brand3','brand4','NPS','','NaN','Row1','Row2','Row3','NaN','NaN','Table 5','Brands Title 1','Base: All (TOTAL) (18-59)','NaN','NaN','NaN','Base','Unweighted row','Brand1','Brand2'],
     2: ['NaN','NaN','NaN','(TOTAL)','Discrete monthly banner','Sept (a)','100','997','0.31','0.31','0.31','0.31','0.31','NaN','0.62','0.64','0.61','0.6','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Total','19479','19608','0.75','0.75'],
     3: ['NaN','NaN','NaN','NaN','NaN','Oct (b)','1090','1100','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','TOTAL','Discrete monthly banner','Sept (a)','1000','1000','0.8','0.8'],
     4: ['NaN','NaN','NaN','NaN','NaN','Nov (c)','3164','3191','0.31','0.31','0.31','0.31','0.31','NaN','0.64','0.67','0.64','0.64','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Oct (b)','1000','1000','0.8','0.8'],
     5: ['NaN','NaN','NaN','NaN','NaN','Dec (d)','992','3999','0.31','0.31','0.31','0.31','0.31','NaN','0.51','0.47','0.67','0.61','NaN','NaN','NaN','NaN','NaN','NaN','NaN','Nov (c)','1000','1000','0.8','0.8']})
tbl = df.drop(range(5), axis=0).drop(0, axis=1)
print(tbl)

Or perhaps you want to name the rows and columns appropriately:

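index = tbl.iloc[1:, 0]    # row labels, skipping the header row
columns = tbl.iloc[0, 1:]  # column labels, skipping the label column
data = df.drop(range(6), axis=0).drop(range(2), axis=1)
# Pass raw values: giving pd.DataFrame a DataFrame plus new labels would
# reindex on those labels and return all NaN
tbl2 = pd.DataFrame(data.to_numpy(), index=index.to_numpy(), columns=columns.to_numpy())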

In any case hopefully you can coerce it to the right format.
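
If the same reshaping is wanted for every table in the dictionary, a loop (or dict comprehension) over the items may be enough. The sketch below is only an illustration: promote_header is a hypothetical helper, the sample table is invented, and header_pos=1 is an assumption about where the header row sits, which will likely differ in the real sheet.

```python
import pandas as pd

def promote_header(frame, header_pos):
    """Hypothetical helper: use the row at header_pos as column labels
    and the first column as row labels; keep only the rows below it."""
    header = frame.iloc[header_pos, 1:].to_numpy()
    body = frame.iloc[header_pos + 1:]
    return pd.DataFrame(body.iloc[:, 1:].to_numpy(),
                        index=body.iloc[:, 0].to_numpy(),
                        columns=header)

# Invented stand-in for filtered_d: one small raw table per key.
filtered_d = {
    "Table 5": pd.DataFrame({
        1: ["Brands Title 1", None, "Base", "Brand1"],
        2: [None, "Sept (a)", "1000", "0.8"],
        3: [None, "Oct (b)", "1000", "0.8"],
    }),
}

# header_pos=1 is an assumption about this toy data, not about the real file.
frames = {name: promote_header(t, header_pos=1) for name, t in filtered_d.items()}
print(frames["Table 5"])
```

Each entry of frames is then an ordinary DataFrame that can be looked up by table name, for example frames["Table 5"].loc["Brand1", "Oct (b)"].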
