
Data transformation for machine learning

I have a dataset with SKU IDs and their counts. I need to feed this data into a machine learning algorithm in such a way that the SKU IDs become columns and the COUNT values sit at the intersection of transaction ID and SKU ID. Can anyone suggest how to achieve this transformation?

CURRENT DATA

TransID     SKUID      COUNT
1           31         1  
1           32         2 
1           33         1  
2           31         2  
2           34         -1  

DESIRED DATA

TransID     31      32      33      34
1           1       2       1       0
2           2       0       0       -1

In R, we can use xtabs

xtabs(COUNT~., df1)
#         SKUID
#TransID 31 32 33 34
#     1  1  2  1  0
#     2  2  0  0 -1

Or dcast

library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
#  TransID 31 32 33 34
#1       1  1  2  1  0
#2       2  2  0  0 -1

Or spread

library(tidyr)
spread(df1, SKUID, COUNT, fill=0)

In pandas, you can use pivot:

>>> df.pivot('TransID', 'SKUID').fillna(0)
        COUNT         
SKUID      31 32 33 34
TransID               
1           1  2  1  0
2           2  0  0 -1

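To avoid ambiguity, it is best to name the arguments explicitly (pandas 2.0 and later accept only keyword arguments for pivot, so the positional call above raises a TypeError there):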

df.pivot(index='TransID', columns='SKUID').fillna(0)
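
For reference, here is a minimal self-contained sketch of the pivot approach using the sample data from the question. Passing values='COUNT' and casting back to int are additions that the answer above does not use: the first gives flat SKUID column labels instead of a (COUNT, SKUID) MultiIndex, and the second undoes the float upcast caused by the NaN placeholders.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "TransID": [1, 1, 1, 2, 2],
    "SKUID": [31, 32, 33, 31, 34],
    "COUNT": [1, 2, 1, 2, -1],
})

# Naming `values` keeps the column labels flat (just the SKU IDs).
wide = (
    df.pivot(index="TransID", columns="SKUID", values="COUNT")
      .fillna(0)     # SKUs absent from a transaction become 0
      .astype(int)   # the NaN placeholders upcast to float; cast back
)
print(wide)
```

If the transaction ID is needed as an ordinary column rather than the index, wide.reset_index() should do that.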

You can also perform a groupby and then unstack SKUID:

>>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
SKUID    31  32  33  34
TransID                
1         1   2   1   0
2         2   0   0  -1
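
One practical difference between the two pandas approaches is that pivot raises a ValueError when the same (TransID, SKUID) pair occurs more than once, while the groupby version sums the duplicates. The sketch below illustrates this on made-up data (the dup frame is hypothetical and not from the question) and also shows pivot_table as an alternative that should give the same result.

```python
import pandas as pd

# Hypothetical data: SKU 31 appears twice in transaction 1.
dup = pd.DataFrame({
    "TransID": [1, 1, 1, 2],
    "SKUID": [31, 31, 32, 34],
    "COUNT": [1, 3, 2, -1],
})

# groupby + sum collapses the duplicates; fill_value avoids NaN entirely.
by_group = (
    dup.groupby(["TransID", "SKUID"])["COUNT"]
       .sum()
       .unstack("SKUID", fill_value=0)
)

# pivot_table aggregates and reshapes in a single call.
by_table = dup.pivot_table(
    index="TransID", columns="SKUID", values="COUNT",
    aggfunc="sum", fill_value=0,
)

print(by_group)
print(by_table)
```

Whether summing is the right way to combine duplicate rows depends on what COUNT means in your data, so treat aggfunc="sum" as an assumption.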

In GraphLab/SFrame, the relevant commands are unstack and unpack.

import sframe  #or import graphlab
sf = sframe.SFrame({'TransID':[1, 1, 1, 2, 2],
                    'SKUID':[31, 32, 33, 31, 34],
                    'COUNT': [1, 2, 1, 2, -1]})

sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
out = sf2.unpack('dict_counts', column_name_prefix='')

The missing values can be filled by column:

for c in out.column_names():
    out[c] = out[c].fillna(0)

out.print_rows()

+---------+----+----+----+----+
| TransID | 31 | 32 | 33 | 34 |
+---------+----+----+----+----+
|    1    | 1  | 2  | 1  | 0  |
|    2    | 2  | 0  | 0  | -1 |
+---------+----+----+----+----+
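
To make the two steps easier to follow, here is a rough plain-Python sketch of the same idea: first collect one {SKUID: COUNT} dictionary per transaction (the role unstack plays above), then expand each dictionary into one column per SKU, filling gaps with 0 (the role of unpack plus fillna). It only illustrates the logic and is not how SFrame implements these methods; the variable names are made up for the example.

```python
from collections import defaultdict

# (TransID, SKUID, COUNT) rows copied from the question.
rows = [(1, 31, 1), (1, 32, 2), (1, 33, 1), (2, 31, 2), (2, 34, -1)]

# Step 1 (analogous to unstack): one {SKUID: COUNT} dict per transaction.
dict_counts = defaultdict(dict)
for trans_id, sku_id, count in rows:
    dict_counts[trans_id][sku_id] = count

# Step 2 (analogous to unpack + fillna): one column per SKU, 0 where missing.
sku_ids = sorted({sku for counts in dict_counts.values() for sku in counts})
wide = {
    trans_id: [counts.get(sku, 0) for sku in sku_ids]
    for trans_id, counts in dict_counts.items()
}

print(sku_ids)  # column order
print(wide)     # TransID -> row of counts
```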
