简体   繁体   中英

Creating New Data Frame in R

I have data in this format in R

customer_key    item_key    units
2669699            16865    1.00
2669699            16866    1.00
2669699            46963    2.00
2685256            55271    1.00
2685256            43458    1.00
2685256            54977    1.00
2685256             2533    1.00
2685256            55011    1.00
2685256            44785    2.00

but I want to get the unique head_key as column and I want my the other variables name be the unique values in item_key and their value would be the units like this

customer_key       '16865'   '16866'  '46963'  '55271'   '43458'   '54977'    '2533'
    2669699          1.00     1.00     1.00     0.00      0.00      0.00       0.00 
    2685256          0.00     0.00     0.00     1.00      1.00      1.00       2.00

Please help me transform my data for cluster analysis

That's just a simple dcast task. Assuming df is your data set

library(reshape2)
dcast(df, customer_key ~ item_key , value.var = "units", fill = 0)
#   customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 1      2669699    0     1     1     0     0     2     0     0     0
# 2      2685256    1     0     0     1     2     0     1     1     1

Here is one way.

library(tidyr)

spread(mydf,item_key, units, fill = 0)

#  customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
#1      2669699    0     1     1     0     0     2     0     0     0
#2      2685256    1     0     0     1     2     0     1     1     1

Since the packages have been covered (+1 to you all), here are a couple of base solutions to join the party:

xtabs :

xtabs(units ~ customer_key + item_key, df)
#             item_key
# customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
#      2669699    0     1     1     0     0     2     0     0     0
#      2685256    1     0     0     1     2     0     1     1     1

reshape

reshape(df, direction = "wide", idvar = "customer_key", timevar = "item_key")
#   customer_key units.16865 units.16866 units.46963 units.55271
# 1      2669699           1           1           2          NA
# 4      2685256          NA          NA          NA           1
#   units.43458 units.54977 units.2533 units.55011 units.44785
# 1          NA          NA         NA          NA          NA
# 4           1           1          1           1           2
library(dplyr); library(tidyr)
df2 <- df %>% arrange(item_key) %>% spread(item_key, units, fill=0)
df2
#   customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 1      2669699    0     1     1     0     0     2     0     0     0
# 2      2685256    1     0     0     1     2     0     1     1     1

Data

df <- structure(list(customer_key = c(2669699L, 2669699L, 2669699L, 
2685256L, 2685256L, 2685256L, 2685256L, 2685256L, 2685256L), 
    item_key = c(16865L, 16866L, 46963L, 55271L, 43458L, 54977L, 
    2533L, 55011L, 44785L), units = c(1, 1, 2, 1, 1, 1, 1, 1, 
    2)), .Names = c("customer_key", "item_key", "units"), class = "data.frame", row.names = c(NA, 
-9L))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM