One-hot encoding a dirty column in R with dplyr
I have a column like this. Each entry starts and ends with "," and the values inside it are separated by ",,".
col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,
How can I convert this column into the following?
col1_9 col1_101 col1_102 col1_200 col1_201
1 1 0 1 1
0 1 1 0 1
1 1 1 1 1
1 1 2 1 0
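One option could be:
- str_sub from stringr to remove the leading and trailing ","
- mtabulate (from qdapTools) and strsplit to encode the column, with ",," as the separator
- paste0 to give the columns the "col1_" prefix
The result is as follows: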
df <- read.table(text = "col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,", header = TRUE)
library(stringr)
library(qdapTools)
df$col1 <- str_sub(df$col1, 2, -2)
df <- mtabulate(strsplit(df$col1, ",,"))
df <- df[, order(as.numeric(names(df)))]
names(df) <- paste0("col1_", names(df))
df
#> col1_9 col1_101 col1_102 col1_200 col1_201
#> 1 1 1 0 1 1
#> 2 0 1 1 0 1
#> 3 1 1 1 1 1
#> 4 1 1 2 1 0
Created on 2022-07-21 by the reprex package (v2.0.1)
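The same three steps (trim, split, tabulate) can also be wrapped in a function so they are easy to reuse on another column. The following is only a minimal sketch in base R, not the answer's own code: `count_encode` and its `prefix` argument are hypothetical names introduced for illustration, and it assumes every value in the column is numeric.

```r
# Hypothetical helper (not from any package): counts each value per row.
# Assumes all values are numeric so the columns can be ordered numerically.
count_encode <- function(x, prefix = "col1_") {
  # Drop the leading and trailing comma, then split on the ",," separator
  parts <- strsplit(gsub("^,|,$", "", x), ",,", fixed = TRUE)
  # Collect the distinct values and sort them numerically
  lv <- sort(unique(as.numeric(unlist(parts))))
  # For each row, count how often each value occurs
  counts <- t(vapply(
    parts,
    function(p) as.integer(table(factor(p, levels = as.character(lv)))),
    integer(length(lv))
  ))
  out <- as.data.frame(counts)
  names(out) <- paste0(prefix, lv)
  out
}

x <- c(",101,,9,,201,,200,",
       ",201,,101,,102,",
       ",9,,101,,102,,200,,201,",
       ",101,,200,,9,,102,,102,")
res <- count_encode(x)
res
```

Because the factor levels are fixed up front, a value that is missing from a row still gets a 0 in its column, and a repeated value (102 in the last row) is counted rather than collapsed to 1.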
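With dplyr and tidyr:
library(dplyr)
library(tidyr)

# df is the original data frame, with the raw col1 column
df %>%
  mutate(rowid = row_number(), value = 1) %>%
  separate_rows(col1) %>%    # the default sep splits on the commas
  filter(nzchar(col1)) %>%   # drop the empty pieces left by the outer commas
  pivot_wider(id_cols = rowid, names_from = col1,
              values_from = value, values_fn = sum,
              names_prefix = 'col1_', values_fill = 0)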
# A tibble: 4 x 6
rowid col1_101 col1_9 col1_201 col1_200 col1_102
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 0
2 2 1 0 1 0 1
3 3 1 1 1 1 1
4 4 1 1 0 1 2
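The tibble keeps the helper rowid column and orders the new columns by first appearance, whereas the question asks for them in numeric order. One way to get there is to reorder after the fact. This is a small sketch in base R, where `res` is a hypothetical stand-in for the pivoted result, rebuilt by hand so the snippet runs on its own:

```r
# `res` is a hypothetical stand-in for the pivot_wider() output shown above
res <- data.frame(
  rowid    = 1:4,
  col1_101 = c(1, 1, 1, 1),
  col1_9   = c(1, 0, 1, 1),
  col1_201 = c(1, 1, 1, 0),
  col1_200 = c(1, 0, 1, 1),
  col1_102 = c(0, 1, 1, 2)
)

# Every column except the row identifier
val_cols <- setdiff(names(res), "rowid")
# Strip the prefix and order by the number behind it
ord <- order(as.numeric(sub("^col1_", "", val_cols)))
res_sorted <- res[, val_cols[ord]]
res_sorted
```

Sorting on the numeric part matters here: a plain alphabetical sort of the names would put col1_9 last rather than first.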
In base R (again starting from the original df):
a <- setNames(strsplit(trimws(df$col1, whitespace = ','), ',+'), seq(nrow(df)))
as.data.frame.matrix(t(table(stack(a))))
101 102 200 201 9
1 1 0 1 1 1
2 1 1 0 1 0
3 1 1 1 1 1
4 1 2 1 0 1
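The base R one-liner packs several calls together, so it may help to unpack it into its intermediate steps. The sketch below does that on the first two rows only. The variable names (`trimmed`, `parts`, `long`, `wide`) are illustrative, and the `whitespace` argument of `trimws` assumes R 3.6.0 or later.

```r
x <- c(",101,,9,,201,,200,", ",201,,101,,102,")

# 1. Strip the leading and trailing commas
trimmed <- trimws(x, whitespace = ",")
# 2. Split on runs of one or more commas
parts <- strsplit(trimmed, ",+")
# 3. Name each element by its row index so stack() can record it
names(parts) <- seq_along(parts)
# 4. Long format: one row per value, with the row index in `ind`
long <- stack(parts)
# 5. Cross-tabulate values against row index, then put rows first
wide <- as.data.frame.matrix(t(table(long)))
wide
```

The column names come from factor levels of character strings, which is why they are sorted as text (with 9 last) and carry no "col1_" prefix.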