One-hot encoding a dirty column in R dplyr

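I have a column like the one below. Each value in the column starts and ends with "," and the individual numbers are separated by ",,".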

col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,

How can I convert this column into the following:

col1_9 col1_101 col1_102 col1_200 col1_201
1      1        0        1        1
0      1        1        0        1
1      1        1        1        1
1      1        2        1        0

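One option could be:

  • First, use str_sub from stringr to remove the leading and trailing ","
  • One-hot encode the column using mtabulate (from qdapTools) on the result of strsplit, with ",," as the separator
  • Sort the columns by the numeric value of their names
  • Finally, use paste0 to prefix the column names with "col1_"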

The result looks like this:

df <- read.table(text = "col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,", header = TRUE)

library(stringr)
library(qdapTools)
df$col1 <- str_sub(df$col1, 2, -2)
df <- mtabulate(strsplit(df$col1, ",,"))
df <- df[, order(as.numeric(names(df)))]
names(df) <- paste0("col1_", names(df))
df
#>   col1_9 col1_101 col1_102 col1_200 col1_201
#> 1      1        1        0        1        1
#> 2      0        1        1        0        1
#> 3      1        1        1        1        1
#> 4      1        1        2        1        0

Created on 2022-07-21 by the reprex package (v2.0.1)
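To make the steps above easier to follow, here is a minimal sketch that applies the first two of them to a single value from the question and then counts the pieces with base R's table. The variable names x, trimmed, parts and counts are only illustrative, and table stands in for mtabulate here so that the counting is visible for one row.

```r
library(stringr)

# One raw value from the question (the fourth row).
x <- ",101,,200,,9,,102,,102,"

# Step 1: drop the first and the last character (the outer commas).
trimmed <- str_sub(x, 2, -2)

# Step 2: split on the double-comma separator.
parts <- strsplit(trimmed, ",,")[[1]]

# Count how often each code occurs in this row.
counts <- table(parts)
counts
```

Because 102 occurs twice in this row, its count is 2 rather than 1, which is why the fourth row of the result has a 2 under col1_102.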

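With dplyr and tidyr, starting from the original df:

library(dplyr)
library(tidyr)

df %>%
  mutate(rowid = row_number(), value = 1) %>%   # row id plus a counter to sum
  separate_rows(col1) %>%                       # one row per value
  filter(nzchar(col1)) %>%                      # drop empty strings left by the outer commas
  pivot_wider(id_cols = rowid, names_from = col1, values_from = value,
              values_fn = sum, names_prefix = 'col1_',
              values_fill = 0)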

  # A tibble: 4 x 6
  rowid col1_101 col1_9 col1_201 col1_200 col1_102
  <int>    <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
1     1        1      1        1        1        0
2     2        1      0        1        0        1
3     3        1      1        1        1        1
4     4        1      1        0        1        2
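The output above keeps the rowid column and lists the new columns in order of first appearance, not in the numeric order asked for in the question. One possible way to get that order, sketched here under the assumption that every value is an integer, is to convert col1 to integer and sort by it before pivoting, then restore the row order and drop rowid. The name res is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(col1 = c(",101,,9,,201,,200,",
                          ",201,,101,,102,",
                          ",9,,101,,102,,200,,201,",
                          ",101,,200,,9,,102,,102,"))

res <- df %>%
  mutate(rowid = row_number(), value = 1) %>%
  separate_rows(col1, sep = ",+") %>%
  filter(nzchar(col1)) %>%
  mutate(col1 = as.integer(col1)) %>%  # assumption: all values are integers
  arrange(col1) %>%                    # new columns are created in this order
  pivot_wider(id_cols = rowid, names_from = col1, values_from = value,
              values_fn = sum, names_prefix = "col1_", values_fill = 0) %>%
  arrange(rowid) %>%                   # restore the original row order
  select(-rowid)

res
```

Sorting before the pivot is used because the names_sort argument of pivot_wider would sort the names as text, which puts col1_9 last.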

In base R:

# strip the outer commas, split on runs of commas, and name each element by its row number
a <- setNames(strsplit(trimws(df$col1, whitespace = ','), ',+'), seq(nrow(df)))
as.data.frame.matrix(t(table(stack(a))))

  101 102 200 201 9
1   1   0   1   1 1
2   1   1   0   1 0
3   1   1   1   1 1
4   1   2   1   0 1
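The base R result above has its columns sorted as text (so 9 comes last) and has no "col1_" prefix. As a sketch, the same order and paste0 steps used in the first answer can be applied to it; the name res is only illustrative, and the whitespace argument of trimws assumes R 3.6.0 or later.

```r
df <- data.frame(col1 = c(",101,,9,,201,,200,",
                          ",201,,101,,102,",
                          ",9,,101,,102,,200,,201,",
                          ",101,,200,,9,,102,,102,"))

# Strip the outer commas, split on runs of commas, name each element by row number.
a <- setNames(strsplit(trimws(df$col1, whitespace = ","), ",+"),
              seq_len(nrow(df)))

# Cross-tabulate row number against value to get one count column per value.
res <- as.data.frame.matrix(t(table(stack(a))))

# Reorder the columns numerically, then add the prefix.
res <- res[, order(as.numeric(names(res)))]
names(res) <- paste0("col1_", names(res))
res
```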
