
Data transformation for machine learning

I have a dataset with SKU IDs and their counts. I need to feed this data into a machine learning algorithm in such a way that the SKU IDs become columns and each COUNT sits at the intersection of its transaction ID and SKU ID. Can anyone suggest how to achieve this transformation? The first table below is the input and the second is the desired output.


TransID     SKUID      COUNT
1           31         1  
1           32         2 
1           33         1  
2           31         2  
2           34         -1  


TransID      31      32      33      34
  1          1        2      1       0
  2          2        0      0       -1  

In R, we can use xtabs:

xtabs(COUNT~., df1)
#         SKUID
#TransID 31 32 33 34
#     1  1  2  1  0
#     2  2  0  0 -1

Or dcast from reshape2:

library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
#  TransID 31 32 33 34
#1       1  1  2  1  0
#2       2  2  0  0 -1

Or spread from tidyr (superseded by pivot_wider in newer tidyr releases):

library(tidyr)
spread(df1, SKUID, COUNT, fill=0)

In pandas, you can use pivot. Pass the arguments by keyword (recent pandas versions require it) and name the value column explicitly, so the result has a single flat level of column labels:

>>> df.pivot(index='TransID', columns='SKUID', values='COUNT').fillna(0).astype(int)
SKUID    31  32  33  34
TransID
1         1   2   1   0
2         2   0   0  -1
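
For a self-contained version of the pivot approach, here is a minimal sketch that rebuilds the question's sample data and then converts the wide table to a NumPy array. The names `wide` and `X` are illustrative only, and whether the learning library wants a DataFrame or a plain array is an assumption.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "TransID": [1, 1, 1, 2, 2],
    "SKUID": [31, 32, 33, 31, 34],
    "COUNT": [1, 2, 1, 2, -1],
})

# One row per transaction, one column per SKU; absent pairs become 0.
wide = (
    df.pivot(index="TransID", columns="SKUID", values="COUNT")
      .fillna(0)
      .astype(int)
)

# Plain 2-D feature array (assumption: the estimator expects NumPy input).
X = wide.to_numpy()
```

Keeping `wide` around is useful because its index and columns record which row is which transaction and which column is which SKU.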

You can also group by both keys, sum, and then unstack SKUID. Passing fill_value=0 to unstack keeps the counts as integers:

>>> df.groupby(['TransID', 'SKUID'])['COUNT'].sum().unstack('SKUID', fill_value=0)
SKUID    31  32  33  34
TransID
1         1   2   1   0
2         2   0   0  -1
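
One case the sample data does not exercise is a (TransID, SKUID) pair that appears more than once. pivot raises a ValueError for such input, while summing within groups combines the rows. As a hedged sketch of that aggregate-then-reshape idea in a single call, pivot_table with aggfunc="sum" can be used. The duplicated rows and the name `df_dup` below are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data: the pair (TransID=1, SKUID=31) appears twice.
df_dup = pd.DataFrame({
    "TransID": [1, 1, 1, 2],
    "SKUID": [31, 31, 32, 31],
    "COUNT": [1, 3, 2, 2],
})

# Sum counts for repeated pairs, then spread SKUs into columns;
# pairs that never occur are filled with 0.
wide = df_dup.pivot_table(
    index="TransID",
    columns="SKUID",
    values="COUNT",
    aggfunc="sum",
    fill_value=0,
)

# The two rows for (1, 31) are combined: 1 + 3 = 4.
total_1_31 = wide.loc[1, 31]
```

Whether summing is the right way to combine repeated rows depends on what COUNT means in your data; another aggfunc may fit better.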

In GraphLab/SFrame, the relevant commands are unstack and unpack.

import sframe  # or: import graphlab as sframe
sf = sframe.SFrame({'TransID':[1, 1, 1, 2, 2],
                    'SKUID':[31, 32, 33, 31, 34],
                    'COUNT': [1, 2, 1, 2, -1]})

# Collect each transaction's SKUID -> COUNT pairs into one dictionary column
sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
# Expand the dictionary into one column per SKUID
out = sf2.unpack('dict_counts', column_name_prefix='')

The missing values can be filled by column:

for c in out.column_names():
    out[c] = out[c].fillna(0)


| TransID | 31 | 32 | 33 | 34 |
|    1    | 1  | 2  | 1  | 0  |
|    2    | 2  | 0  | 0  | -1 |
