
Pandas crosstab on dataframe index

I have a dataframe that stores transaction logs. Every log entry has its own activity hash and the respective user ID, e.g.:

ID                  UserID
999974708546523127  AU896
999974708546523127  ZZ999
999974708546520000  ZZ999

I use crosstab to create a correlation matrix that compares the users' activity hashes against each other. That way I can measure how similar their behaviour is:

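import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
Data = pd.read_csv('path.csv',
        sep=';', names=['ID', 'UserID', 'Info1', 'Info2'], on_bad_lines='skip',
        encoding='latin-1', dtype='category')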

df = pd.crosstab(Data.UserID, Data.ID)

However, as I have ~5 million rows and the ID activity hashes are so complex, the computation takes far too long or doesn't complete at all. Using dtype='category' has already reduced the reading time of the CSV file significantly.

Expected output (correlation matrix):

Index  AU896  ZZ999
AU896    1     0.5
ZZ999   0.5     1
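
The question does not name the similarity measure behind this table. One reading that reproduces the 0.5 for the three sample rows is Jaccard similarity (shared hashes divided by the union of hashes); that interpretation is an assumption, and the variable names below are illustrative only. A minimal sketch:

```python
import pandas as pd

# The three sample rows from the question.
logs = pd.DataFrame({
    "ID": ["999974708546523127", "999974708546523127", "999974708546520000"],
    "UserID": ["AU896", "ZZ999", "ZZ999"],
})

# Binary user-by-hash indicator matrix (1 if the user ever produced that hash).
indicator = (pd.crosstab(logs["UserID"], logs["ID"]) > 0).astype(int)

# shared[i, j] = number of hashes that users i and j have in common.
shared = indicator @ indicator.T

# Jaccard similarity (assumed measure): |A & B| / |A | B|,
# where |A | B| = |A| + |B| - |A & B|.
sizes = indicator.sum(axis=1).to_numpy()
union = sizes[:, None] + sizes[None, :] - shared.to_numpy()
jaccard = shared / union
print(jaccard)
```

Here AU896 has one hash, ZZ999 has two, and they share one, so the off-diagonal value is 1 / 2 = 0.5.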

I cannot change either the hash or the UserID to reduce memory usage.

This operation takes 6 and 3 seconds for Info1 and Info2, respectively.

Maybe there is a more efficient operation to do this with pandas or even with dask?

Thank you for your help!

I'm not exactly sure about the use case, as you did not show what to do with the info1 or info2 column, so I am giving a general example.

import pandas as pd
import io

data_string = '''ID,UserID,info1
999974708546523127,AU896,35
999974708546523127,ZZ999,45
999974708546520000,ZZ999,13
999974708546520000,AU896,13
999974708546523128,AU896,45
999974708546523128,ZZ999,12
999974708546520001,ZZ999,36
999974708546520001,AU896,37'''

df = pd.read_csv(io.StringIO(data_string))

# create a wide form of data from long
wide_df = df.pivot(index="ID", columns="UserID", values="info1").reset_index()
# build the correlation matrix from the wide form of data
corr_df = wide_df[["AU896", "ZZ999"]].corr()
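
For the ~5 million rows mentioned in the question, a dense user-by-hash table may not fit in memory. A possible alternative, sketched here under assumptions, is to build that table as a sparse matrix from the category codes and multiply it by its transpose. This uses scipy.sparse, which is not part of the original post, assumes the number of distinct users is small enough for a dense users-by-users result, and assumes Jaccard similarity is the intended measure. All variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for the real log file; the question reads it with dtype='category'.
logs = pd.DataFrame({
    "ID": ["999974708546523127", "999974708546523127", "999974708546520000"],
    "UserID": ["AU896", "ZZ999", "ZZ999"],
}).astype("category")

# Drop repeated (user, hash) pairs so every stored entry is exactly 1.
pairs = logs.drop_duplicates()

user_codes = pairs["UserID"].cat.codes.to_numpy()
hash_codes = pairs["ID"].cat.codes.to_numpy()
user_labels = pairs["UserID"].cat.categories
n_hashes = len(pairs["ID"].cat.categories)

# Sparse user-by-hash indicator matrix built from the integer category codes.
indicator = sparse.csr_matrix(
    (np.ones(len(pairs), dtype=np.int32), (user_codes, hash_codes)),
    shape=(len(user_labels), n_hashes),
)

# Users-by-users counts of shared hashes; only this small result is made dense.
shared = (indicator @ indicator.T).toarray()

# Jaccard similarity (assumed measure): shared / (|A| + |B| - shared).
sizes = shared.diagonal()
union = sizes[:, None] + sizes[None, :] - shared
similarity = pd.DataFrame(shared / union, index=user_labels, columns=user_labels)
print(similarity)
```

The sparse matrix stores one entry per distinct (user, hash) pair, so its memory use grows with the number of log rows and not with users times hashes.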

