
Pandas Data Frame Summary Table

How can I build a summary of a data frame in pandas by stacking the results of individual operations into one table?

For example, I used the following code:

df = pd.DataFrame(wb)

# Get list with headers
header1 = list(df)
count = df.count()

NaNs = df.isnull().sum()
sum = df.sum(0)
mean = df.mean()
median = df.median()
min = df.min()
max = df.max()
standardeviation = df.std()
nints = df.dtypes

But I can only print them as individual results. I get something like this for each calculation:

Unnamed: 0                  60
region                      50
IV_bins                     60
N                           60
meanbase                    60
cash                        60
dtype: int64

Finally, I tried creating an empty list, summarytable = [], at the beginning and then calling summarytable.append(count) and so on for each calculation, but I couldn't get it to work. What I am looking for is a table like this, which I believe also involves transposing the results:

          A    B 
Count     100  98
NANs      5    7
Mean      10   12.5
Median    14   8
...
Nints     95   96
NStr      5    2

One last thing to take into account: I noticed that for some calculations, like sum(), it doesn't make sense to include string columns, and when I print the results a string column such as region is simply missing. This is the result of print(sum):

Unnamed: 0                                                               1830
IV_bins                     [0,2.31e+06](2.31e+06,5.7e+06](5.7e+06,1.07e+0...
N                                                                     3680163
meanbase                                                              3.46248
cash                                                              9.00091e+09

You may find DataFrame.agg() useful; with it you can essentially build a customized .describe() output. Here's an example to get you started:

import pandas as pd
import numpy as np

df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
                    'numeric': [1, 2, 3],
                    'numeric2': [1.1, 2.5, 50.],
                    'categorical': pd.Categorical(['d','e','f'])
                  })


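def nullcounts(ser):
    # Number of missing values in a single column (Series)
    return ser.isnull().sum()


def custom_describe(frame, func=None, numeric_only=True, **kwargs):
    # Build the default list inside the function to avoid a mutable default
    if func is None:
        func = [nullcounts, 'sum', 'mean', 'median', 'max']
    if numeric_only:
        # Keep only numeric columns so that sum/mean/median are meaningful
        frame = frame.select_dtypes(include=np.number)
    return frame.agg(func, **kwargs)


custom_describe(df)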

            numeric   numeric2
nullcounts      0.0   0.000000
sum             6.0  53.600000
mean            2.0  17.866667
median          2.0   2.500000
max             3.0  50.000000
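
To get closer to the table in the question, with one row per statistic and one column per original column, the individual Series can also be collected into a dict and transposed. What follows is a minimal sketch, not a definitive implementation: the helper name summary_table, the row labels, and the toy data are all assumptions made for illustration, and none of them are part of pandas.

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration), with one missing value per dtype
df = pd.DataFrame({
    'object': ['a', 'b', None],
    'numeric': [1, 2, 3],
    'numeric2': [1.1, np.nan, 50.0],
})


def summary_table(frame):
    """Stack per-column statistics into one table (hypothetical helper)."""
    numeric = frame.select_dtypes(include=np.number)
    stats = {
        'Count': frame.count(),          # non-null values per column
        'NaNs': frame.isnull().sum(),    # missing values per column
        'Mean': numeric.mean(),          # numeric columns only
        'Median': numeric.median(),
        'Min': numeric.min(),
        'Max': numeric.max(),
        'Std': numeric.std(),
        'Dtype': frame.dtypes.astype(str),
    }
    # Each Series becomes a column aligned on the original column names.
    # reindex restores the original column order, and .T turns the
    # statistics into rows. Non-numeric columns get NaN for the
    # numeric-only statistics.
    return pd.DataFrame(stats).reindex(frame.columns).T


summary = summary_table(df)
print(summary)
```

Because string columns are excluded before the numeric statistics are computed, they show NaN in those rows instead of disappearing or being concatenated.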

There seems to be a library that does exactly that: pandas-summary. For each column, it gives you the count, min, max, std, mean, variance, count of all values, count of unique values, missing values, type of column, and much more.
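
If installing an extra package is not an option, the built-in describe() method mentioned earlier can also report on non-numeric columns when called with include='all'. This is a small sketch on assumed toy data, shown only for comparison; it does not demonstrate the pandas-summary API.

```python
import pandas as pd

# Toy data (assumed for illustration)
df = pd.DataFrame({
    'object': ['a', 'b', 'c'],
    'numeric': [1, 2, 3],
})

# include='all' reports on every column, not only the numeric ones.
# Statistics that do not apply to a column's dtype are filled with NaN.
overview = df.describe(include='all')
print(overview)
```

The result already has statistics as rows and the original columns as columns, so no transposing is needed.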

