
Count the number of occurrences in a pandas pivot table

I have a DataFrame that looks something like this (about 5 million rows and around 250 different treaty numbers; the Id and Treaty Number values are both strings):

      Id   Name    Treaty Number
 0  Id88   Jack              x12
 1  Id87   John              x33
 2  Id88    Jim              x22
 3  Id11   Hans              x12
 4  Id12   Ivan              x33
 5  Id88   Sara              x22
 6  Id11    Max              x12
 7  Id11  Peter              x33

I would like to find all duplicated IDs and, for each of them, the count of every treaty number belonging to that ID.

Ideally, it would look like this:

           Sum  
   Id88      3    x12: 1, x22:2, ....
   Id11      3    x12: 2, x33:1,...

Right now I have the following code:

    import pandas as pd
    import numpy as np

    data = np.array([
    ['Id88', 'Jack', 'x12'], 
    ['Id87', 'John', 'x33'], 
    ['Id88', 'Jim', 'x22'],
    ['Id11', 'Hans', 'x12'],
    ['Id12', 'Ivan', 'x33'],
    ['Id88', 'Sara', 'x22'],
    ['Id11', 'Max', 'x12'],
    ['Id11', 'Peter', 'x33'],
    ])
    columns=['Id', 'Name', 'Treaty Number']

    df = pd.DataFrame(data= data, columns = columns)

    dublicateIDs = df[df.duplicated(subset=['Id'],keep=False )]

    pivotIDs = dublicateIDs.pivot_table(index=['Id'], aggfunc='size')
    pivotIDs = pivotIDs.sort_values(ascending=False)

    pivotTreaty = dublicateIDs.pivot_table(index=['Id'], columns = 'Treaty Number', aggfunc='size', 
    fill_value=0)

    concatDF = [pivotIDs, pivotTreaty]
    pivotIDsCombine = pd.concat(concatDF, axis=1, sort=False)
    columnNames = pivotIDsCombine.columns.tolist()
    columnNames[0] = 'Sum'
    pivotIDsCombine.columns = columnNames
    print(pivotIDsCombine)

It produces the following result:

         Sum  x12  x22  x33
 Id88      3    1    2    0
 Id11      3    2    0    1

Because of the large number of rows (5 million) and treaty numbers (250), and because each ID has only a few treaties, I end up with a huge table that is mostly NaNs (or zeros).

Is there an easy way to reach the desired format using a pivot table, or should I loop over every column/row and count the occurrences manually?

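This should help you out:

    # Helper column so that each (Id, Treaty Number) pair can be counted
    df['temp'] = 1
    df1 = df.groupby(['Id', 'Treaty Number'])['temp'].count().reset_index()
    # Spread the treaty numbers into columns; the columns become ('temp', <treaty>)
    df1 = df1.pivot_table(index='Id', columns='Treaty Number')
    # Drop the outer 'temp' level and the leftover columns name
    df1.columns = df1.columns.droplevel()
    df1.columns.name = None
    # Pairs that never occur come out as NaN; treat them as zero
    df1.fillna(0, inplace=True)
    df1['Sum'] = df1.sum(axis=1)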
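
If the wide table itself is the problem, one possible alternative is to skip the pivot and keep the counts in long format, so that only the (Id, Treaty Number) pairs that actually occur are stored. The following is a minimal sketch on the sample data from the question, not a tested solution for the 5 million row case; the names `duplicates`, `counts`, `Pair`, `Treaties` and `summary` are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [
        ["Id88", "Jack", "x12"],
        ["Id87", "John", "x33"],
        ["Id88", "Jim", "x22"],
        ["Id11", "Hans", "x12"],
        ["Id12", "Ivan", "x33"],
        ["Id88", "Sara", "x22"],
        ["Id11", "Max", "x12"],
        ["Id11", "Peter", "x33"],
    ],
    columns=["Id", "Name", "Treaty Number"],
)

# Keep only rows whose Id occurs more than once
duplicates = df[df.duplicated(subset="Id", keep=False)]

# Long format: one row per (Id, Treaty Number) pair that actually occurs,
# so no cells are created for combinations that never appear
counts = (
    duplicates.groupby(["Id", "Treaty Number"])
    .size()
    .reset_index(name="Count")
)

# Collapse each Id to its total and a compact "treaty: count" string
counts["Pair"] = counts["Treaty Number"] + ": " + counts["Count"].astype(str)
summary = counts.groupby("Id").agg(
    Sum=("Count", "sum"),
    Treaties=("Pair", ", ".join),
)
summary = summary.sort_values("Sum", ascending=False)
print(summary)
```

Here `counts` is the sparse representation (one row per existing pair), and `summary` is only a display-oriented view of it; for further computation it is probably more convenient to work with `counts` directly.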


 