简体   繁体   中英

Extract specific values from pandas profiling to a table

I'm using pandas profiling to make HTML reports of some dataframes with 150+ attributes. I'd like to extract some of the information and arrange it in a simple table. Specifically, I need the number of missing data in each variable 'n_missing', disctinct values and their corresponding percentages 'p_missing'.

Something I've done, but I'm struggling to get there is this:

df1 = pd.read_excel('df.xlsx')
profile = df1.profile_report(title="Dataset Profiling Report")
profile.to_file('dataset_report.html')                            #HTML report


profset = profile.description_set          #Extracting the info from the profile object
print(profset.keys())

OUT[]: dict_keys(['analysis', 'table', 'variables', 'scatter', 'correlations', 'missing', 'messages', 'package', 'sample', 'duplicates'])

attributes = profset["variables"]
print(attributes.keys())
OUT: dict_keys(['Attribute 1', 'Attribute 2', 'Attribute 3', 'Attribute 4'...]) #All my columns or attributes.

I thought the "missing" key of the profile.description_set keys would be the one, but i get this:

missing = profset["missing"]
print(missing.keys())
dict_keys(['bar', 'matrix', 'heatmap', 'dendrogram'])

And none of these seem to be the right one.

I've been going through pandas profiling documentation and searching online, and only found a single possible example which led me to the previous code I've wrote.

The end goal is a table to export to excel that looks like this (ignore the actual values):

Attributes    n_Missing    p_missing    n_disctinct   p_disctinct
Attribute X          23         0.23              2           0.5
Attribute Y          50         0.50             50           1.0
Attribute Z           0         0.00            100           1.0
...

Ok, so after some grinding I answered my own question, and since there seems to be little action around pandas profiling, I'll write share it here and write the code that worked for me.

Its very self explanatory, let me know if you have any doubts.

Full code with dummy dataframe for full reproduction is shown.

Create dataframe with some data in it

dfa = pd.DataFrame(np.random.rand(20,3), columns=['attribute X', 'attribute Y', 'attribute Z'])
dfa.loc[dfa['attribute X']<0.5,'attribute X'] = np.nan
dfa.loc[dfa['attribute X']<0.75,'attribute X'] = 0.6
dfa.loc[dfa['attribute X']>0.75,'attribute X'] = "value X"
dfa 

Out[346]: 
   attribute X  attribute Y  attribute Z
0          NaN     0.929178     0.439837
1          NaN     0.620489     0.146956
2          0.6     0.971534     0.048539
3      value X     0.113160     0.344989
4      value X     0.766421     0.105712
5      value X     0.706522     0.800705
6      value X     0.451648     0.763452
7          NaN     0.110620     0.976297
8          NaN     0.032442     0.650167
9          0.6     0.532029     0.028842
10         0.6     0.316751     0.010858
11     value X     0.096364     0.381514
12         0.6     0.897551     0.426949
13     value X     0.178309     0.974560
14         NaN     0.461955     0.573013
15     value X     0.446666     0.919223
16         0.6     0.526870     0.751822
17     value X     0.709627     0.334106
18         0.6     0.780467     0.940426
19     value X     0.056930     0.846433

Create the profile report and html report

profile_a = dfa.profile_report()
profile_a.to_file('dfa.html')

Access the keys of the data contained in the report

profset = profile_a.description_set
print(profset.keys())

Access the attributes/variables/columns data analyzed in the report

attributes = profset["variables"]
print(attributes.keys())

Create a list of the attributes analyzed to iterate over

attributes_keys = attributes.keys()

We can consult the available data for each attribute by accessing the keys of the attribute, then select whatever we want

attributes['attribute X'].keys()

Out[]: dict_keys(['n_distinct', 'p_distinct', 'is_unique', 'n_unique', 'p_unique', 'type', 'hashable', 'value_counts_without_nan', 'n_missing', 'n', 'p_missing', 'count', 'memory_size'])


To make the table of the data I decided to extract. I chose whatever needed from the previuos list, then iterated. The following for loop is not efficient, but very clear for anyone.

list_attribute = []
list_missing = []
list_p_missing = [] #'p_missing'
list_disctinct = [] #'n_distinct'
list_p_disctinct = [] #'p_distinct'


for i in attributes_keys:
    list_attribute.append(i)
    x = attributes[i]
    list_missing.append(x['n_missing'])
    list_p_missing.append(x['p_missing']) 
    list_disctinct.append(x['n_distinct']) 
    list_p_disctinct.append(x['p_distinct']) 
    
    
df_missing = pd.DataFrame(columns=('Attribute','Missing'))
df_missing['Attribute'] = list_attribute
df_missing['Missing'] = list_missing
df_missing['Percentage Missing'] = list_p_missing
df_missing['Disctinct values'] = list_disctinct
df_missing['Percentage Disctinct'] = list_p_disctinct 
df_missing['Percentage Disctinct'] = df_missing['Percentage Disctinct']*100
df_missing['Percentage Missing'] = df_missing['Percentage Missing']*100

Ready


print(df_missing)

Out[345]: 
     Attribute  Missing  ...  Disctinct values  Percentage Disctinct
0  attribute X        0  ...                 2             13.333333
1  attribute Y       20  ...                20            100.000000
2  attribute Z       20  ...                20            100.000000

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM