After binning a column of a dataframe, how to make a new dataframe to count the number of elements in each bin?

Say I have a dataframe, df :

>>> df

Age    Score
19     1
20     2
24     3
19     2
24     3
24     1
24     3
20     1
19     1
20     3
22     2
22     1

I want to construct a new dataframe that bins Age and stores the total number of elements in each of the bins in different Score columns:

Age       Score 1   Score 2     Score 3
19-21     2         4           3
22-24     2         2           9

This is my way of doing it, which I feel is highly convoluted (meaning, it shouldn't be this difficult):

import numpy as np
import pandas as pd

data = pd.DataFrame(columns=['Age', 'Score'])
data['Age'] = [19,20,24,19,24,24,24,20,19,20,22,22]
data['Score'] = [1,2,3,2,3,1,3,1,1,3,2,1]

_, bins = np.histogram(data['Age'], 2)

labels = ['{}-{}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])] #dynamically create labels
labels[0] = '{}-{}'.format(bins[0], bins[1])

df = pd.DataFrame(columns=['Score', labels[0], labels[1]])
df['Score'] = data.Score.unique()
for i in labels:
    df[i] = np.zeros(3)


for i in range(len(data)):
    for j in range(len(labels)):
        m1, m2 = labels[j].split('-') # lower & upper bounds of the age interval
        if ((float(data['Age'][i])>float(m1)) & (float(data['Age'][i])<float(m2))): # find the age group in which each age lies
            if data['Score'][i]==1:
                index = 0
            elif data['Score'][i]==2:
                index = 1
            elif data['Score'][i]==3:
                index = 2

            df[labels[j]][index] += 1

df.sort_values('Score', inplace=True)
df.set_index('Score', inplace=True)
print(df)

This produces

             19.0-21.5      22.5-24.0
Score                      
1            2.0            2.0
2            4.0            2.0
3            3.0            9.0

Is there a better, cleaner, more efficient way of achieving this?

IIUC, I think you can try one of these:

1. If you already know the bins:

df['Age'] = np.where(df['Age']<=21,'19-21','22-24')
df.groupby(['Age'])['Score'].value_counts().unstack()

2. If you know the number of bins you need:

df.Age = pd.cut(df.Age, bins=2,include_lowest=True)
df.groupby(['Age'])['Score'].value_counts().unstack()

3. Jon Clements' idea from the comments:

pd.crosstab(pd.cut(df.Age, [19, 21, 24],include_lowest=True), df.Score)

All three produce the same counts (only the index labels differ, depending on how the bins were defined). With option 3 the output is:

Score           1   2   3
Age         
(18.999, 21.0]  3   2   1
(21.0, 24.0]    2   1   3
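
If the row and column names should look like the table in the question, the crosstab idea can be sketched as below. This is only a sketch: the bin edges [18, 21, 24], the labels "19-21" / "22-24" and the names age_group and counts are my own choices for illustration, not something from the answers.

```python
import pandas as pd

data = pd.DataFrame({
    "Age":   [19, 20, 24, 19, 24, 24, 24, 20, 19, 20, 22, 22],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3, 2, 1],
})

# Intervals (18, 21] and (21, 24]; the labels are assumed here so the
# rows read "19-21" and "22-24" like the table in the question.
age_group = pd.cut(data["Age"], bins=[18, 21, 24], labels=["19-21", "22-24"])

# Rows: age group, columns: score value, cells: number of rows.
counts = pd.crosstab(age_group, data["Score"])
counts.columns = [f"Score {c}" for c in counts.columns]
print(counts)
```

Note that this counts rows per cell, so the 19-21 row comes out as 3, 2, 1 rather than the numbers in the question's table.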
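
Another approach: one-hot encode Score with pd.get_dummies, then sum the indicator columns within each Age bin:

cats = ['1', '2', '3']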
bins = [0, 1, 2, 3]
data = data[['Age']].join(pd.get_dummies(pd.cut(data.Score, bins, labels=cats)))
data['bins'] = pd.cut(data['Age'], bins=[19,21,24], include_lowest=True)
data.groupby('bins').sum() 

                Age  1  2  3
bins
(18.999, 21.0]  117  3  2  1
(21.0, 24.0]    140  2  1  3

You can drop or rename the bins and Age columns. The bin edges may also need some tweaking to get the interval inclusions right.
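
One way the tidy-up could look, as a rough sketch rather than the answerer's exact method: group only the indicator columns so the summed Age column never appears, then relabel the rows. The column prefix "Score ", the row labels and the names dummies, age_bins and tidy are assumptions made for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "Age":   [19, 20, 24, 19, 24, 24, 24, 20, 19, 20, 22, 22],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3, 2, 1],
})

# One 0/1 indicator column per score value: "Score 1", "Score 2", "Score 3".
dummies = pd.get_dummies(data["Score"], prefix="Score", prefix_sep=" ", dtype=int)

# include_lowest=True makes the first interval include age 19.
age_bins = pd.cut(data["Age"], bins=[19, 21, 24], include_lowest=True)

# Sum the indicators per age bin; Age itself is not part of the frame being summed.
tidy = dummies.groupby(age_bins, observed=False).sum()

# Assumed, hand-written row labels in place of the interval notation.
tidy.index = ["19-21", "22-24"]
print(tidy)
```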

I'm not entirely sure what result you want (are you multiplying the counts by the score...?) but this might help:

>>> data['age_binned'] = pd.cut(data['Age'], [18,21,24])
>>> data.groupby(['age_binned', 'Score'])['Age'].nunique().unstack()

Score       1  2  3
age_binned         
(18, 21]    2  2  1
(21, 24]    2  1  1

I assumed you wanted the number of unique elements. If you just want the total number of elements, use .count() instead of .nunique().
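
To make the difference concrete, here is a small sketch that puts the three readings side by side: total rows, distinct ages, and the row count multiplied by the score value. The variable names n_rows, n_unique and weighted are just illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Age":   [19, 20, 24, 19, 24, 24, 24, 20, 19, 20, 22, 22],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3, 2, 1],
})
data["age_binned"] = pd.cut(data["Age"], [18, 21, 24])

grouped = data.groupby(["age_binned", "Score"], observed=False)["Age"]

n_rows = grouped.count().unstack()      # total number of rows per cell
n_unique = grouped.nunique().unstack()  # number of distinct ages per cell

# Row count times the score value (the column labels 1, 2, 3),
# broadcast across each row.
weighted = n_rows * n_rows.columns.to_numpy()

print(n_rows, n_unique, weighted, sep="\n\n")
```

The weighted version is the closest of the three to the table in the question, although its first cell is 3 here where the question shows 2.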
