Data transformation for machine learning

Question

I have a dataset of SKU IDs and their counts. I need to feed this data into a machine learning algorithm in such a way that the SKU IDs become columns and the COUNT values sit at the intersection of transaction ID and SKU ID. Can anyone suggest how to achieve this transformation?

CURRENT DATA

TransID     SKUID      COUNT
1           31         1  
1           32         2 
1           33         1  
2           31         2  
2           34         -1

DESIRED DATA

TransID     31      32      33      34
1           1       2       1       0
2           2       0       0       -1

Answer 1

In R, we can use xtabs:

xtabs(COUNT~., df1)
#         SKUID
#TransID 31 32 33 34
#     1  1  2  1  0
#     2  2  0  0 -1

Or dcast

library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
#  TransID 31 32 33 34
#1       1  1  2  1  0
#2       2  2  0  0 -1

Or spread from tidyr (superseded by pivot_wider as of tidyr 1.0.0, though still available):

library(tidyr)
spread(df1, SKUID, COUNT, fill=0)

Answer 2

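In pandas, you can use pivot:

>>> df.pivot(index='TransID', columns='SKUID').fillna(0)
        COUNT         
SKUID      31 32 33 34
TransID               
1           1  2  1  0
2           2  0  0 -1

Recent pandas releases (2.0 and later) accept these arguments by keyword only, so the positional form df.pivot('TransID', 'SKUID') no longer works. Passing values as well gives flat column labels without the extra COUNT level:

df.pivot(index='TransID', columns='SKUID', values='COUNT').fillna(0)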
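For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and pivots it. The names wide and X are only illustrative, and the final conversion to a NumPy array assumes your learning library wants a plain 2-D array, which may not be true in your case.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "TransID": [1, 1, 1, 2, 2],
    "SKUID": [31, 32, 33, 31, 34],
    "COUNT": [1, 2, 1, 2, -1],
})

# One row per TransID, one column per SKUID, COUNT in the cells.
wide = df.pivot(index="TransID", columns="SKUID", values="COUNT")

# Missing combinations come back as NaN, which forces a float dtype;
# fill them with 0 and cast back to integers.
wide = wide.fillna(0).astype(int)

# Assumption: the downstream model accepts a plain 2-D feature array.
X = wide.to_numpy()
print(wide)
```

The cast back to int is optional; it only matters if you want the counts to stay integers after the NaN fill.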

You can also perform a groupby and then unstack SKUID. Unlike pivot, which raises an error on duplicates, this sums repeated (TransID, SKUID) pairs:

>>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
SKUID    31  32  33  34
TransID                
1         1   2   1   0
2         2   0   0  -1
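If the same SKU can appear more than once within a transaction, the aggregation step matters. The sketch below adds one hypothetical repeated row (it is not in the question's data) and shows two ways to sum duplicates while filling the gaps with 0: pivot_table with aggfunc and fill_value, and the groupby route with unstack's fill_value argument.

```python
import pandas as pd

# Data from the question plus one hypothetical repeated row:
# SKU 31 appears twice in transaction 1 (counts 1 and 3).
df = pd.DataFrame({
    "TransID": [1, 1, 1, 1, 2, 2],
    "SKUID": [31, 31, 32, 33, 31, 34],
    "COUNT": [1, 3, 2, 1, 2, -1],
})

# pivot_table aggregates duplicates and fills missing cells in one call.
wide = df.pivot_table(index="TransID", columns="SKUID",
                      values="COUNT", aggfunc="sum", fill_value=0)

# Same result via groupby; fill_value avoids a separate fillna step.
alt = (df.groupby(["TransID", "SKUID"])["COUNT"]
         .sum()
         .unstack("SKUID", fill_value=0))

print(wide)
```

Whether summing is the right way to combine duplicates depends on what COUNT means in your data; swap aggfunc for something else if it is not.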

Answer 3

In GraphLab/SFrame, the relevant commands are unstack and unpack.

import sframe  # or: import graphlab as sframe
sf = sframe.SFrame({'TransID':[1, 1, 1, 2, 2],
                    'SKUID':[31, 32, 33, 31, 34],
                    'COUNT': [1, 2, 1, 2, -1]})

sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
out = sf2.unpack('dict_counts', column_name_prefix='')

The missing values can be filled by column:

for c in out.column_names():
    out[c] = out[c].fillna(0)

out.print_rows()

+---------+----+----+----+----+
| TransID | 31 | 32 | 33 | 34 |
+---------+----+----+----+----+
|    1    | 1  | 2  | 1  | 0  |
|    2    | 2  | 0  | 0  | -1 |
+---------+----+----+----+----+

3 answers

solution1
4 2016-04-23 04:47:18

solution2
3 ACCEPTED 2016-04-23 06:09:21

solution3
2 2016-04-26 17:57:04
