I have a DataFrame that looks something like this (about 5 million rows and around 250 different treaty numbers, both stored as strings):
     Id   Name  Treaty Number
0  Id88   Jack            x12
1  Id87   John            x33
2  Id88    Jim            x22
3  Id11   Hans            x12
4  Id12   Ivan            x33
5  Id88   Sara            x22
6  Id11    Max            x12
7  Id11  Peter            x33
I would like to find all the duplicate IDs and, for each of them, the count of every treaty number belonging to that ID.
Ideally, it would look like this:
      Sum
Id88    3  x12: 1, x22: 2, ...
Id11    3  x12: 2, x33: 1, ...
Right now I have the following code:
import pandas as pd
import numpy as np
data = np.array([
['Id88', 'Jack', 'x12'],
['Id87', 'John', 'x33'],
['Id88', 'Jim', 'x22'],
['Id11', 'Hans', 'x12'],
['Id12', 'Ivan', 'x33'],
['Id88', 'Sara', 'x22'],
['Id11', 'Max', 'x12'],
['Id11', 'Peter', 'x33'],
])
columns=['Id', 'Name', 'Treaty Number']
df = pd.DataFrame(data= data, columns = columns)
dublicateIDs = df[df.duplicated(subset=['Id'],keep=False )]
pivotIDs = dublicateIDs.pivot_table(index=['Id'], aggfunc='size')
pivotIDs = pivotIDs.sort_values(ascending=False)
pivotTreaty = dublicateIDs.pivot_table(index=['Id'], columns='Treaty Number', aggfunc='size',
                                       fill_value=0)
concatDF = [pivotIDs, pivotTreaty]
pivotIDsCombine = pd.concat(concatDF, axis=1, sort=False)
columnNames = pivotIDsCombine.columns.tolist()
columnNames[0] = 'Sum'
pivotIDsCombine.columns = columnNames
print(pivotIDsCombine)
And the following result:
      Sum  x12  x22  x33
Id88    3    1    2    0
Id11    3    2    0    1
Because of the large number of rows (5m) and treaty numbers (250), and because each ID has only a small number of treaties, I end up with a huge table full of NaNs (or zeros).
Is there an easy way to reach the desired format using a pivot table, or should I loop over every column/row and count the number of occurrences manually?
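This should help you out (note that it counts every Id, not only the duplicated ones; filter with df.duplicated first if you need that):
# Helper column to count on
df['temp'] = 1
# Count rows per (Id, Treaty Number) pair
df1 = df.groupby(['Id', 'Treaty Number'])['temp'].count().reset_index()
# Spread the treaty numbers into columns, one row per Id
df1 = df1.pivot_table(index='Id', columns='Treaty Number')
# Drop the leftover 'temp' level and the name of the column index
df1.columns = df1.columns.droplevel()
df1.columns.name = None
# Pairs that never occur become 0
df1.fillna(0, inplace=True)
# Total number of rows per Id
df1['Sum'] = df1.sum(axis=1)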
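If the wide, mostly empty table is itself the problem, one option is to skip the pivot and stay in long format, where only the (Id, Treaty Number) pairs that actually occur take up space. The following is a minimal sketch of that idea rather than a tuned solution for 5m rows. It assumes the column names from the question; `Treaties`, `n` and `pair` are names made up here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Id88", "Jack", "x12"],
        ["Id87", "John", "x33"],
        ["Id88", "Jim", "x22"],
        ["Id11", "Hans", "x12"],
        ["Id12", "Ivan", "x33"],
        ["Id88", "Sara", "x22"],
        ["Id11", "Max", "x12"],
        ["Id11", "Peter", "x33"],
    ],
    columns=["Id", "Name", "Treaty Number"],
)

# Keep only rows whose Id occurs more than once.
dup = df[df.duplicated(subset="Id", keep=False)]

# Long format: one entry per (Id, Treaty Number) pair that actually occurs,
# so no cells are spent on combinations with a count of zero.
counts = dup.groupby(["Id", "Treaty Number"]).size()

# Collapse each Id's counts into one compact string such as "x12: 1, x22: 2".
# "n" and "pair" are illustrative helper column names.
treaties = (
    counts.reset_index(name="n")
    .assign(pair=lambda d: d["Treaty Number"] + ": " + d["n"].astype(str))
    .groupby("Id")["pair"]
    .agg(", ".join)
)

# Combine the per-Id total with the compact treaty summary.
result = pd.DataFrame(
    {
        "Sum": counts.groupby(level="Id").sum(),
        "Treaties": treaties,
    }
).sort_values("Sum", ascending=False)

print(result)
```

On the sample data this gives a `Sum` of 3 for both Id88 and Id11, with `Treaties` of "x12: 1, x22: 2" and "x12: 2, x33: 1" respectively. If you need the numbers for further computation rather than for display, the intermediate `counts` Series is probably the more useful object, since it keeps the counts as integers.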