I have dataset with SKU IDs and their counts, i need to feed this data into a machine learning algorithm, in a way that SKU IDs become columns and COUNTs are at the intersection of transaction id and SKU ID. Can anyone suggest how to achieve this transformation.
CURRENT DATA
TransID SKUID COUNT
1 31 1
1 32 2
1 33 1
2 31 2
2 34 -1
DESIRED DATA
TransID 31 32 33 34
1 1 2 1 0
2 2 0 0 -1
In R
, we can use either xtabs
xtabs(COUNT~., df1)
# SKUID
#TransID 31 32 33 34
# 1 1 2 1 0
# 2 2 0 0 -1
Or dcast
library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
# TransID 31 32 33 34
#1 1 1 2 1 0
#2 2 2 0 0 -1
Or spread
library(tidyr)
spread(df1, SKUID, COUNT, fill=0)
In Pandas, you can use pivot:
>>> df.pivot('TransID', 'SKUID').fillna(0)
COUNT
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1
To avoid ambiguity, it is best to explicitly label your variables:
df.pivot(index='TransID', columns='SKUID').fillna(0)
You can also perform a groupby
and then unstack SKUID
:
>>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1
In GraphLab/SFrame, the relevant commands are unstack
and unpack
.
import sframe #or import graphlab
sf = sframe.SFrame({'TransID':[1, 1, 1, 2, 2],
'SKUID':[31, 32, 33, 31, 34],
'COUNT': [1, 2, 1, 2, -1]})
sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
out = sf2.unpack('dict_counts', column_name_prefix='')
The missing values can be filled by column:
for c in out.column_names():
out[c] = out[c].fillna(0)
out.print_rows()
+---------+----+----+----+----+
| TransID | 31 | 32 | 33 | 34 |
+---------+----+----+----+----+
| 1 | 1 | 2 | 1 | 0 |
| 2 | 2 | 0 | 0 | -1 |
+---------+----+----+----+----+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.