Efficient way to populate a pandas DataFrame based on conditions from another DataFrame

The Data

I've got a dataframe that has rank scores for a given ID:

>>> ranks
  ID  rank
0  A     6
1  B     9
2  C     6
3  D     1
4  E     1
5  F     2

I would like to turn this into a square matrix with each ID as both an index and a column, based on several conditions: if the rank of the ID on the index is better (numerically lower) than the rank of the ID in the column, set the cell to 1; if it is worse (numerically higher), set it to 0; if the two ranks are equal, set it to 0.5; and if the index ID is the same as the column ID, set it to np.nan. This is better described by looking at my desired matrix:

Desired Result

>>> mtrx
     A    B    C    D    E    F
A  NaN  1.0  0.5  0.0  0.0  0.0
B  0.0  NaN  0.0  0.0  0.0  0.0
C  0.5  1.0  NaN  0.0  0.0  0.0
D  1.0  1.0  1.0  NaN  0.5  1.0
E  1.0  1.0  1.0  0.5  NaN  1.0
F  1.0  1.0  1.0  0.0  0.0  NaN

What I've Done (works, but is slow)

The following loop works, but it is slow on larger dataframes. If someone can point me toward a nicer, more pythonic/pandorable way to achieve this, I'd love some help:

# Make an empty matrix as a dataframe
mtrx = pd.DataFrame(np.zeros((len(IDs), len(IDs))), index=IDs, columns = IDs)

# Populate it via for loop
for i in IDs:
    for j in IDs:
        i_rank = ranks.loc[ranks['ID'] == i].iloc[0]['rank']
        j_rank = ranks.loc[ranks['ID'] == j].iloc[0]['rank']
        if i == j:
            mtrx.loc[i, j] = np.nan
        elif i_rank < j_rank:
            mtrx.loc[i, j] = 1.
        elif i_rank == j_rank:
            mtrx.loc[i, j] = 0.5

Code to reproduce this toy example

import pandas as pd
import numpy as np
np.random.seed(1)
IDs = list('ABCDEF')
ranks = pd.DataFrame({'ID':IDs, 'rank':np.random.randint(1,10,len(IDs))})

NumPy approach

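s = ranks['rank'].values
# s[:, None] is a column (row IDs), s is a row (column IDs); broadcasting compares all pairs
# 1.0 where the row ID's rank is numerically lower than the column ID's rank
s1 = (s > s[:, None]).astype(float)
# Tied ranks get 0.5
s1[s == s[:, None]] = 0.5
# Same ID on both axes gets NaN (a list-of-arrays index is no longer read as a tuple by newer NumPy)
np.fill_diagonal(s1, np.nan)
pd.DataFrame(s1, index=ranks.ID, columns=ranks.ID)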


Out[843]: 
ID    A    B    C    D    E    F
ID                              
A   NaN  1.0  0.5  0.0  0.0  0.0
B   0.0  NaN  0.0  0.0  0.0  0.0
C   0.5  1.0  NaN  0.0  0.0  0.0
D   1.0  1.0  1.0  NaN  0.5  1.0
E   1.0  1.0  1.0  0.5  NaN  1.0
F   1.0  1.0  1.0  0.0  0.0  NaN
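
The same broadcasting idea can also be written without boolean masks. The following is only a sketch, not part of the original answer: it assumes the ranks are numeric and hardcodes the toy rank values from the question instead of regenerating them. The names diff and scores are illustrative.

```python
import numpy as np
import pandas as pd

# Rank table copied from the question (lower number = better rank)
ranks = pd.DataFrame({'ID': list('ABCDEF'), 'rank': [6, 9, 6, 1, 1, 2]})

s = ranks['rank'].to_numpy()
# diff[i, j] = rank of column ID j minus rank of row ID i
diff = s[None, :] - s[:, None]
# np.sign gives +1 / 0 / -1, so (sign + 1) / 2 maps to 1.0 / 0.5 / 0.0
scores = (np.sign(diff) + 1) / 2
# An ID compared with itself is undefined in the desired result
np.fill_diagonal(scores, np.nan)

mtrx = pd.DataFrame(scores, index=ranks['ID'], columns=ranks['ID'])
print(mtrx)
```

Because the division by 2 already produces a float array, NaN can be written to the diagonal without any extra dtype conversion.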

pandas approach

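# Cross join every ID with every ID via a constant key
s = ranks.assign(key=1).merge(ranks.assign(key=1), on='key')
# Use float from the start so that 0.5 and NaN can be assigned without changing the dtype
s['New'] = (s['rank_x'] < s['rank_y']).astype(float)
s.loc[s['rank_x'] == s['rank_y'], 'New'] = 0.5
s.loc[s['ID_x'] == s['ID_y'], 'New'] = np.nan

s.set_index(['ID_x', 'ID_y']).New.unstack()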
Out[854]: 
ID_y    A    B    C    D    E    F
ID_x                              
A     NaN  1.0  0.5  0.0  0.0  0.0
B     0.0  NaN  0.0  0.0  0.0  0.0
C     0.5  1.0  NaN  0.0  0.0  0.0
D     1.0  1.0  1.0  NaN  0.5  1.0
E     1.0  1.0  1.0  0.5  NaN  1.0
F     1.0  1.0  1.0  0.0  0.0  NaN
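
A variant of the merge approach, given here only as a sketch: it assumes pandas 1.2 or newer, where merge accepts how='cross', and it uses np.select to apply the conditions in priority order. The suffixes _row and _col and the column name score are illustrative choices, not names from the original answer.

```python
import numpy as np
import pandas as pd

# Rank table copied from the question (lower number = better rank)
ranks = pd.DataFrame({'ID': list('ABCDEF'), 'rank': [6, 9, 6, 1, 1, 2]})

# Every (row ID, column ID) pair; how='cross' is assumed to be available (pandas >= 1.2)
pairs = ranks.merge(ranks, how='cross', suffixes=('_row', '_col'))

# Conditions are checked in order, so the same-ID test wins over the tie test
pairs['score'] = np.select(
    [
        pairs['ID_row'] == pairs['ID_col'],
        pairs['rank_row'] < pairs['rank_col'],
        pairs['rank_row'] == pairs['rank_col'],
    ],
    [np.nan, 1.0, 0.5],
    default=0.0,
)

# Reshape the long table of pairs into the square matrix
mtrx = pairs.pivot(index='ID_row', columns='ID_col', values='score')
print(mtrx)
```

Like the merge above, this builds a table with one row per pair of IDs, so for large inputs the broadcasting version is likely to use less memory.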
