Transform a Dataframe to a Binary Matrix in R

I have a data frame with two columns, CustomerID and StockCode, recording the stock codes each customer bought over a period of time. The same customer can appear in several rows, because he may have bought the same item more than once or bought different items during that period. The sample data looks as follows:

CustomerID  StockCode
12346       23166
12347       16008
12347       17021
12347       20665
12347       20719
12347       20719
12347       20719
12347       20719
12347       20780
12347       20782
12347       20966
12347       21035

I need to reshape the data frame in R so that every distinct stock code appears as a column, without repetition, and each row corresponds to one distinct CustomerID. I need two versions of the result:

  1. Each cell holds a numeric 1 if the customer has at least one purchase of that stock code, and 0 otherwise.

  2. Each cell holds the number of times the customer bought that stock code, or 0 if there is no matching purchase.

Both versions are easily produced with dplyr and tidyr::pivot_wider.

Data

example <- data.frame(CustomerID = c(12346, 12347, 12347, 12347, 12347, 12347), 
                      StockCode = c(23166, 16008, 17021, 20665, 20719, 20719)
)

Code for Part (1)

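library(dplyr)  # provides %>%, distinct() and mutate()

A <- example %>%
  distinct() %>%                 # drop repeated customer/stock code pairs
  mutate(Test = 1) %>%           # flag every remaining pair with 1
  tidyr::pivot_wider(values_from = Test, names_from = StockCode) %>%
  replace(is.na(.), 0)           # pairs that never occur become 0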

Output for Part (1)

# A tibble: 2 x 6
  CustomerID `23166` `16008` `17021`
       <dbl>   <dbl>   <dbl>   <dbl>
1      12346       1       0       0
2      12347       0       1       1
# ... with 2 more variables:
#   `20665` <dbl>, `20719` <dbl>
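
As a variation on the code for Part (1), the zeros can be filled in during the reshape itself rather than in a separate replace() step. This is only a minimal sketch: it assumes a reasonably recent tidyr in which the `values_fill` argument of pivot_wider accepts a single scalar, and `A2` is a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answer
example <- data.frame(
  CustomerID = c(12346, 12347, 12347, 12347, 12347, 12347),
  StockCode  = c(23166, 16008, 17021, 20665, 20719, 20719)
)

A2 <- example %>%
  distinct() %>%          # one row per customer/stock code pair
  mutate(Test = 1) %>%    # marker value for "bought at least once"
  pivot_wider(
    names_from  = StockCode,
    values_from = Test,
    values_fill = 0       # assumption: scalar fill is supported by the installed tidyr
  )
```

The result should have the same shape as A above, one row per customer and one column per distinct stock code.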

Code for Part (2)

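library(dplyr)  # provides %>%, group_by_all() and count()

B <- example %>%
  group_by_all() %>%             # group by CustomerID and StockCode
  count() %>%                    # n = number of rows for each pair
  tidyr::pivot_wider(values_from = n, names_from = StockCode) %>%
  replace(is.na(.), 0)           # pairs that never occur become 0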

Output for Part (2)

> B
# A tibble: 2 x 6
# Groups:   CustomerID [2]
  CustomerID `23166` `16008` `17021`
       <dbl>   <int>   <int>   <int>
1      12346       1       0       0
2      12347       0       1       1
# ... with 2 more variables:
#   `20665` <int>, `20719` <int>
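
If installing extra packages is not an option, both parts can probably also be covered in base R with table(). This is a sketch of an alternative technique, not the approach used in the answer above; the names `counts`, `binary` and `counts_df` are made up for this illustration.

```r
# Same example data as in the answer
example <- data.frame(
  CustomerID = c(12346, 12347, 12347, 12347, 12347, 12347),
  StockCode  = c(23166, 16008, 17021, 20665, 20719, 20719)
)

# Part (2): cross-tabulate customers against stock codes; cells are counts
counts <- table(example$CustomerID, example$StockCode)

# Part (1): turn every positive count into 1 and leave zeros as 0
binary <- (counts > 0) * 1

# Optional: a wide data frame with customers as row names
counts_df <- as.data.frame.matrix(counts)
```

With the example data, counts["12347", "20719"] is 2, because that pair occurs twice, while the same cell in binary is 1. Note that the customer IDs end up as row names rather than as a CustomerID column.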
