
Pandas describe() behaviour for numeric dtypes

The output of DataFrame.describe() depends on the column's dtype.

When used on a numeric dtype, it will return the following output:

f.ID.describe()

count       7583.000000
mean      704013.191613
std      1192979.985253
min        10575.000000
25%        10575.000000
50%        10864.000000
75%      2084161.000000
max      6422339.000000

This makes sense in most cases, except when the column contains numeric data that isn't supposed to be aggregated — for example, an ID.

In this case, the following output would be more appropriate:

count      7583
unique       68
top       10864
freq       3390

Above is the output you'd get for an object dtype. For an ID column, the uniqueness and size of the column are more useful properties than the mean or the distribution.

As far as I can see, the only way to do this for a numeric dtype is to first cast it to an object dtype, eg

f.ID.astype(str).describe()

A datatype conversion is likely to introduce a performance penalty (more noticeable with large datasets, I imagine). That's why I was wondering if there's any other way to modify the describe() behaviour, other than changing the datatype (on the fly or when creating the DataFrame).
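One way to avoid the cast entirely is to assemble the object-style summary yourself from value_counts(). This is just a sketch — describe_as_id is a helper name I'm making up here, not a pandas function:

```python
import pandas as pd

# Hypothetical helper: build the object-dtype-style summary
# (count / unique / top / freq) for a numeric column directly,
# without casting the whole column to str or object first.
def describe_as_id(s: pd.Series) -> pd.Series:
    vc = s.value_counts()  # frequencies, sorted descending
    return pd.Series({
        "count": s.count(),      # non-NA values
        "unique": s.nunique(),   # distinct values
        "top": vc.index[0],      # most frequent value
        "freq": vc.iloc[0],      # its frequency
    })

ids = pd.Series([10575, 10575, 10864, 10864, 10864, 2084161])
print(describe_as_id(ids))
```

Whether this is actually faster than astype(str).describe() would need profiling; it mainly saves building a throwaway object-dtype copy of the column.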

I'd be somewhat inclined to just do as you did and convert to strings on the fly to get your desired output. I don't think the performance penalty is going to be very severe and doubt you are going to be using describe() often enough for that to matter anyway.

That said, it is worth thinking about how you want to store a number that is really an identifier rather than a value or measurement. If it's a unique ID (such as an American-style social security number), you'd just store it as an integer. If it's not unique, then it may make sense to store it as a categorical column. The less unique the values (i.e. the more repetition), the more you gain from storing them as categorical.

Here's a short example where ID can have values from 1 to 4.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'id_int': np.random.randint(1, 5, 20)})
>>> df['id_cat'] = df.id_int.astype('category')

>>> df.dtypes

id_int       int64
id_cat    category

>>> df.memory_usage()

id_int    160
id_cat     52

As you can see, the categorical version of the ID uses about a third of the memory (the savings here depend on the amount of repetition, of course).
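To make the repetition point concrete, here's a small sketch (variable names mine) comparing a low-cardinality column, where categorical wins, with an all-unique column, where the categories array itself costs as much as the original data and categorical actually uses more memory:

```python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)

# Few distinct values: int8 codes replace int64 values.
low = pd.Series(rng.integers(1, 5, n))
# All distinct values: codes PLUS a full categories array.
high = pd.Series(np.arange(n))

print(low.memory_usage(deep=True),
      low.astype('category').memory_usage(deep=True))
print(high.memory_usage(deep=True),
      high.astype('category').memory_usage(deep=True))
```

With 4 distinct values the categorical column is a fraction of the int64 size; with 10,000 distinct values it is larger, since the 10,000-entry categories index is stored alongside the codes.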

And if you describe() it, the column will be treated like a string/object column:

>>> df.id_cat.describe()

count     20
unique     4
top        1
freq       8
