All levels of a factor in a model matrix in R

Question

I have a data.frame consisting of numeric and factor variables, as shown below.

testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
           stringsAsFactors=TRUE)  # needed on R >= 4.0.0, where strings are no longer converted to factors by default

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

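As expected when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet, so I am not worried about multicollinearity.

Is there a way to have model.matrix create the dummy for every level of the factor?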

Answer 1

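(Trying to redeem myself...) In response to Jared's comment on @Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. We can therefore use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided: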

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

Which slots nicely into @fabians answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
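
As a minimal sketch of the same idea, the factor columns can be picked by type instead of by position (4:5), and the result can be checked. The names factorCols, fullContrasts, fourthBlock and fifthBlock are illustrative and not part of the original answer; the seed is arbitrary.

```r
# Recreate the example data. stringsAsFactors = TRUE is set explicitly because
# R >= 4.0.0 no longer converts strings to factors by default.
set.seed(1)  # arbitrary seed, only for reproducibility
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Find the factor columns by type rather than by position
factorCols <- names(testFrame)[vapply(testFrame, is.factor, logical(1))]

# One identity ("full") contrast matrix per factor, as a named list
fullContrasts <- lapply(testFrame[factorCols], contrasts, contrasts = FALSE)

X <- model.matrix(~ ., data = testFrame, contrasts.arg = fullContrasts)

# Each factor now contributes one column per level, so every row has
# exactly one 1 inside each factor's block of columns
fourthBlock <- X[, paste0("Fourth", levels(testFrame$Fourth))]
fifthBlock  <- X[, paste0("Fifth", levels(testFrame$Fifth))]

print(dim(X))
print(colnames(X))
```

The matrix still contains the "(Intercept)" column; if the downstream fitting function adds its own intercept, that column can be dropped with X[, -1].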

Answer 2

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=FALSE), 
                Fifth=contrasts(testFrame$Fifth, contrasts=FALSE)))

or, with a little less typing and without the proper names:

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))

Answer 3

caret implements a nice function, dummyVars, that achieves this in 2 lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"

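The nicest point here is that you get the original data frame plus the dummy variables, with the original variables used for the transformation excluded.

More info: http://amunategui.github.io/dummyVar-Walkthrough/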

Answer 4

dummyVars from the caret package could also be used. http://caret.r-forge.r-project.org/preprocess.html

Answer 5

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

This yields the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0

Answer 6

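OK. Just reading the above and putting it all together. Suppose you want the matrix, e.g. 'X.factors', that multiplies by your coefficient vector to get your linear predictor. There are still a couple of extra steps: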

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
                                             contrasts, contrasts = FALSE))

(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)
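
One way to sidestep the single-factor problem noted above is drop = FALSE, which keeps the subset a data frame even when only one column is selected. The sketch below uses a small made-up data frame; the names X, isFac and facCols and the data values are illustrative only.

```r
# Hypothetical data with a single factor column
X <- data.frame(num = c(1.5, 2.5, 3.5, 4.5),
                grp = factor(c("a", "b", "a", "c")))

isFac <- vapply(X, is.factor, logical(1))

# Without drop = FALSE, X[, isFac] would collapse to a bare factor here
facCols <- X[, isFac, drop = FALSE]

X.factors <- model.matrix(~ ., data = X,
                          contrasts.arg = lapply(facCols, contrasts,
                                                 contrasts = FALSE))
print(colnames(X.factors))
```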

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the reference level of each factor (the entries marked with ** above):

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))

Answer 7

Using the R package "CatEncoders":

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

Answer 8

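I am currently learning the lasso model and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will kill our time, as suggested by the authors of glmnet).

Just sharing a tidy way of coding this that gives the same result as @fabians' and @Gavin's answers. Meanwhile, @asdf123 introduced another package, library('CatEncoders'), as well.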

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)

Answer 9

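You can use the tidyverse to achieve this without specifying each column manually.

The trick is to make a "long" data frame.

Then munge a few things, and spread it back to wide to create the indicator/dummy variables.

Code: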

library(tidyverse)

## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)

testFrame %>%
    ## pivot to "long" format
    gather(feature, value, -id) %>%
    ## add indicator value
    mutate(indicator=1) %>%
    ## create feature name that unites a feature and its value
    unite(feature, value, col="feature_value", sep="_") %>%
    ## convert to wide format, filling missing values with zero
    spread(feature_value, indicator, fill=0)

The output:

   id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1   1            1           0             0          0           0       0       0       0
2   2            0           1             0          0           0       0       0       0
3   3            0           0             1          0           0       0       0       0
4   4            0           0             0          1           0       0       0       0
5   5            0           0             0          0           1       0       0       0
6   6            1           0             0          0           0       0       0       0
7   7            0           1             0          0           0       0       1       0
8   8            0           0             1          0           0       1       0       0
9   9            0           0             0          1           0       0       0       0
10 10            0           0             0          0           1       0       0       0
11 11            1           0             0          0           0       0       0       0
12 12            0           1             0          0           0       0       0       0
...

Answer 10

model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward.
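
A quick check of what removing the intercept does may be useful here. In base R, when a formula has no intercept, only the first factor in the formula is expanded to all of its levels; later factors still drop a reference level. The sketch below recreates the example data (the seed and the name noIntercept are illustrative) and prints the resulting column names.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # strings are not factors by default in R >= 4.0.0
)

noIntercept <- model.matrix(~ First + Second + Third + Fourth + Fifth + 0,
                            data = testFrame)

# Fourth keeps all four levels, but Fifth still loses its first level (Edward)
print(colnames(noIntercept))
```

So this approach gives every level only when the data contain a single factor; with several factors, a contrasts.arg list with contrasts = FALSE covers all of them.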

Answer 11

A stats package answer:

new_tr <- model.matrix(~.+0,data = testFrame)

Adding +0 (or -1) to a model formula in R (e.g., in lm()) suppresses the intercept.


Answer 12

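I wrote a package called ModelMatrixModel to improve the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns a class containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned class also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.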

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                        Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
                   Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
                   Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0

#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data     
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2))) 
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0

Answer scores and timestamps

Answer 1: score 69, 2010-12-31 09:26:23
Answer 2: score 53, accepted, 2010-12-30 09:38:21
Answer 3: score 18, 2016-12-28 18:08:50
Answer 4: score 11, 2013-03-14 02:29:10
Answer 5: score 3, 2019-02-16 09:43:12
Answer 6: score 2, 2014-07-24 18:05:51
Answer 7: score 2, 2016-09-14 01:56:17
Answer 8: score 2, 2017-01-15 17:59:29
Answer 9: score 1, 2020-03-27 00:22:31
Answer 10: score 0, 2015-09-04 08:05:07
Answer 11: score 0, 2019-07-27 18:42:03
Answer 12: score 0, 2021-08-11 17:02:33
