Recoding factor variables with multiple levels into dummy variables?

Question

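I am working with a dataset of more than 230 variables, about 60 of which are categorical variables with more than six levels (no preference ordering is possible, e.g. color).

I am looking for a function that can recode these variables without my having to do it by hand, which would take a lot of work and time and carries a high risk of mistakes!

I can use R and Python, so feel free to suggest the most efficient function for the job.

Suppose I have a dataset called df, and the set of factor columns is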

clm=(clm1, clm2,clm3,....,clm60)

All of these factors have many levels:

(min=2, max=not important [may be 10, 30 or 100...etc])

Thanks a lot for your help!

Answer 1

Here is a short example using model.matrix to get you started:

df <- data.frame(
    clm1 = gl(2, 6, 12, c("clm1.levelA", "clm1.levelB")),
    clm2 = gl(3, 4, 12, c("clm2.levelA", "clm2.levelB", "clm2.levelC")));
#          clm1        clm2
#1  clm1.levelA clm2.levelA
#2  clm1.levelA clm2.levelA
#3  clm1.levelA clm2.levelA
#4  clm1.levelA clm2.levelA
#5  clm1.levelA clm2.levelB
#6  clm1.levelA clm2.levelB
#7  clm1.levelB clm2.levelB
#8  clm1.levelB clm2.levelB
#9  clm1.levelB clm2.levelC
#10 clm1.levelB clm2.levelC
#11 clm1.levelB clm2.levelC
#12 clm1.levelB clm2.levelC



as.data.frame.matrix(model.matrix(rep(0, nrow(df)) ~ 0 + clm1 + clm2, df));
#   clm1clm1.levelA clm1clm1.levelB clm2clm2.levelB clm2clm2.levelC
#1                1               0               0               0
#2                1               0               0               0
#3                1               0               0               0
#4                1               0               0               0
#5                1               0               1               0
#6                1               0               1               0
#7                0               1               1               0
#8                0               1               1               0
#9                0               1               0               1
#10               0               1               0               1
#11               0               1               0               1
#12               0               1               0               1

Answer 2

With pandas in Python 3, you can do the following:

import pandas as pd
df = pd.DataFrame({'clm1': ['clm1a', 'clm1b', 'clm1c'], 'clm2': ['clm2a', 'clm2b', 'clm2c']})
pd.get_dummies(df)

For more examples, see the documentation.
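For a frame like the one in the question, where only some of the columns are categorical, a minimal sketch could select those columns by dtype and pass them through `columns=`. The toy frame and its column names below are made up for illustration, and the sketch assumes the categorical columns have the `category` dtype (the pandas counterpart of an R factor):

```python
import pandas as pd

# Hypothetical toy frame: one numeric column plus two categorical columns.
df = pd.DataFrame({
    "num": [1.5, 2.0, 3.5],
    "clm1": pd.Categorical(["a", "b", "a"]),
    "clm2": pd.Categorical(["x", "y", "z"]),
})

# Pick the categorical columns by dtype instead of listing all 60 by hand.
cat_cols = df.select_dtypes(include="category").columns.tolist()

# One 0/1 column per level; the numeric column passes through untouched.
full = pd.get_dummies(df, columns=cat_cols, dtype=int)

# drop_first=True drops the first level of every encoded column,
# which is the usual choice when the dummies feed a linear model.
reduced = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

If the categorical columns are stored as plain strings instead, converting them first with `astype("category")` keeps the dtype-based selection working.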

Answer 3

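In R, the problem with the model.matrix approach proposed by @Maurits Evers is that the function drops the first level of every factor except the first one. Sometimes that is what you want, but sometimes it is not (it depends on the problem, as @Maurits Evers emphasized).

There are several functions scattered across different packages that do this (e.g. caret; see here for a few examples).

I use the following function, which was inspired by this Stack Overflow answer by @Jaap: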

#' 
#' Transform factors from a data.frame into dummy variables (one hot encoding)
#' 
#' This function will transform all factors into dummy variables with one column
#' for each level of the factor (unlike the contrasts matrices that will drop the first
#' level). The factors with only two levels will have only one column (0/1 on the second 
#' level). The ordered factors and logicals are transformed into numeric.
#' The numeric and text vectors will remain untouched.
#'

make_dummies <- function(df){

    # function to create dummy variables for one factor only
    dummy <- function(fac, name = "") {

        if(is.factor(fac) && !is.ordered(fac)) {
            l <- levels(fac)
            # indicator matrix: one column per level, 1 where the value equals that level
            res <- outer(fac, l, function(fac, l) 1L * (fac == l))
            colnames(res) <- paste0(name, l)
            # two-level factors keep only the column for the second level
            if(length(l) == 2) {res <- res[,-1, drop = FALSE]}
        } else if(is.ordered(fac) || is.logical(fac)) {
            res <- as.numeric(fac)
        } else {
            res <- fac
        }
        return(res)
    }

    # Apply this function to all columns
    res <- (lapply(df, dummy))
    # change the names of the cases with only one column
    for(i in seq_along(res)){
        if(is.matrix(res[[i]]) && ncol(res[[i]]) == 1){
            colnames(res[[i]]) <- paste0(names(res)[i], ".", colnames(res[[i]]))
        }
    }
    res <- as.data.frame(res)
    return(res)
}

Example:

df <- data.frame(num = round(rnorm(12),1),
                 sex = factor(c("Male", "Female")),
                 color = factor(c("black", "red", "yellow")),
                 fac2 = factor(1:4),
                 fac3 = factor("A"),
                 size =  factor(c("small", "middle", "big"),
                                levels = c("small", "middle", "big"), ordered = TRUE),
                 logi = c(TRUE, FALSE))
print(df)
#>     num    sex  color fac2 fac3   size  logi
#> 1   0.0   Male  black    1    A  small  TRUE
#> 2  -1.0 Female    red    2    A middle FALSE
#> 3   1.3   Male yellow    3    A    big  TRUE
#> 4   1.4 Female  black    4    A  small FALSE
#> 5  -0.9   Male    red    1    A middle  TRUE
#> 6   0.1 Female yellow    2    A    big FALSE
#> 7   1.4   Male  black    3    A  small  TRUE
#> 8   0.1 Female    red    4    A middle FALSE
#> 9   1.6   Male yellow    1    A    big  TRUE
#> 10  1.1 Female  black    2    A  small FALSE
#> 11  0.2   Male    red    3    A middle  TRUE
#> 12  0.3 Female yellow    4    A    big FALSE
make_dummies(df)
#>     num sex.Male color.black color.red color.yellow fac2.1 fac2.2 fac2.3
#> 1   0.0        1           1         0            0      1      0      0
#> 2  -1.0        0           0         1            0      0      1      0
#> 3   1.3        1           0         0            1      0      0      1
#> 4   1.4        0           1         0            0      0      0      0
#> 5  -0.9        1           0         1            0      1      0      0
#> 6   0.1        0           0         0            1      0      1      0
#> 7   1.4        1           1         0            0      0      0      1
#> 8   0.1        0           0         1            0      0      0      0
#> 9   1.6        1           0         0            1      1      0      0
#> 10  1.1        0           1         0            0      0      1      0
#> 11  0.2        1           0         1            0      0      0      1
#> 12  0.3        0           0         0            1      0      0      0
#>    fac2.4 fac3.A size logi
#> 1       0      1    1    1
#> 2       0      1    2    0
#> 3       0      1    3    1
#> 4       1      1    1    0
#> 5       0      1    2    1
#> 6       0      1    3    0
#> 7       0      1    1    1
#> 8       1      1    2    0
#> 9       0      1    3    1
#> 10      0      1    1    0
#> 11      0      1    2    1
#> 12      1      1    3    0

Created on 2018-03-19 by the reprex package (v0.2.0).
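For readers working in Python, here is a rough pandas analogue of the same rules. It is only a sketch and not the answer author's code: the name `make_dummies` is reused for convenience, and it assumes factors are stored with the `category` dtype. Unordered categoricals get one column per level (only the second level's column when there are exactly two), ordered categoricals become 1-based integer codes, booleans become 0/1, and everything else is left untouched.

```python
import pandas as pd


def make_dummies(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pandas counterpart of the R helper (illustrative only)."""
    parts = []
    for name, col in df.items():
        if isinstance(col.dtype, pd.CategoricalDtype):
            if col.cat.ordered:
                # Ordered factor -> 1-based integer codes, like as.numeric() in R.
                # Note: missing values have code -1 and would therefore become 0.
                parts.append((col.cat.codes + 1).rename(name))
            else:
                dummies = pd.get_dummies(col, prefix=name, prefix_sep=".", dtype=int)
                if dummies.shape[1] == 2:
                    # Two-level factor: keep only the indicator for the second level.
                    dummies = dummies.iloc[:, 1:]
                parts.append(dummies)
        elif pd.api.types.is_bool_dtype(col):
            # Logical column -> 0/1.
            parts.append(col.astype(int))
        else:
            # Numeric and text columns pass through unchanged.
            parts.append(col)
    return pd.concat(parts, axis=1)


# Small made-up frame mirroring the structure of the R example.
demo = pd.DataFrame({
    "num": [0.5, 1.0, 1.5, 2.0],
    "sex": pd.Categorical(["Male", "Female", "Male", "Female"]),
    "color": pd.Categorical(["black", "red", "yellow", "black"]),
    "size": pd.Categorical(["small", "big", "middle", "small"],
                           categories=["small", "middle", "big"], ordered=True),
    "logi": [True, False, True, False],
})
print(make_dummies(demo))
```

As in the R version, the column names use a `.` separator (`color.black`, `sex.Male`), which is set here through `prefix_sep`.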

3 answers

Answer 1
3 2018-03-19 10:32:48

Answer 2
0 2018-03-19 11:03:22

Answer 3
0 2018-03-19 13:13:10