Collapsing multiple columns containing the same variable into one column
My data looks like this:
ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
A 1 0 0 0
A 1 0 0 0
B 0 1 0 0
C 0 0 0 1
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0
E 0 1 0 0
E 0 0 1 0
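Diagnosis_1 through Diagnosis_4 are all binary, indicating the presence (1) or absence (0) of a diagnosis. What I would like to do is create a data frame that looks like this: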
ID Diagnosis
A 1
A 1
A 1
B 2
C 4
C 2
D 4
E 3
E 2
E 3
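No matter how many times I read the documentation for reshape / reshape2 / tidyr, I just can't wrap my head around how to apply them.
I can solve my problem with dplyr's mutate, but that is a time-consuming, roundabout way to reach my goal.
EDIT: Edited the data to represent my actual data frame more realistically.
Try matrix multiplication: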
nc <- ncol(DF)
data.frame(ID = DF$ID, Diagnosis = as.matrix(DF[-1]) %*% seq(nc-1))
giving:
ID Diagnosis
1 A 1
2 B 2
3 C 2
4 D 4
5 E 3
Note: We used this as the input:
Lines <- "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
B 0 1 0 0
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0"
DF <- read.table(text = Lines, header = TRUE)
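To see why the matrix multiplication above works: when each row contains exactly one 1, multiplying the 0/1 matrix by the vector 1, 2, ..., k leaves only the position of that 1. Below is a minimal sketch on the question's updated data with repeated IDs. The names Lines2, DF2, m and res are illustrative only, and the stopifnot() guard is an added assumption check, not part of the original answer.

```r
# The question's updated data (repeated IDs), read the same way as above.
Lines2 <- "ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4
A 1 0 0 0
A 1 0 0 0
A 1 0 0 0
B 0 1 0 0
C 0 0 0 1
C 0 1 0 0
D 0 0 0 1
E 0 0 1 0
E 0 1 0 0
E 0 0 1 0"
DF2 <- read.table(text = Lines2, header = TRUE)

m <- as.matrix(DF2[-1])

# The trick relies on exactly one 1 per row; fail loudly otherwise.
stopifnot(all(rowSums(m) == 1))

# Row i times c(1, 2, 3, 4) is the column position of that row's single 1.
res <- data.frame(ID = DF2$ID,
                  Diagnosis = as.vector(m %*% seq_len(ncol(m))))
res$Diagnosis
# [1] 1 1 1 2 4 2 4 3 2 3
```

Because the computation is row-wise, repeated IDs need no special handling and the original row order is preserved.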
You could try max.col to get the column index of the maximum in each row.
data.frame(ID=df1$ID, Diagnosis=max.col(df1[-1]))
# ID Diagnosis
#1 A 1
#2 B 2
#3 C 2
#4 D 4
#5 E 3
Or another option to get the index is
unname(which(t(df1[-1])!=0, arr.ind=TRUE)[,1])
#[1] 1 2 2 4 3
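One caveat with max.col: its default ties.method is "random", so a row with several 1s, or with none at all, gets an arbitrary index. The sketch below uses a small made-up matrix m (not from the question) to show ties.method = "first" together with a guard that marks such rows as NA. The guard is an assumption about how you might want to treat them.

```r
# Hypothetical matrix: row 1 is clean, row 2 has two 1s, row 3 has none.
m <- rbind(c(1, 0, 0, 0),
           c(0, 1, 1, 0),
           c(0, 0, 0, 0))

# "first" makes the result deterministic: ties resolve to the leftmost column.
idx <- max.col(m, ties.method = "first")
idx
# [1] 1 2 1

# Rows without exactly one 1 have no well-defined diagnosis; flag them.
idx[rowSums(m) != 1] <- NA
idx
# [1]  1 NA NA
```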
Benchmarks:
set.seed(25)
m1 <- matrix(sample(0:1, 1e8*5, replace=TRUE, prob=c(0.9, 0.1)), ncol=5)
m2 <- m1[rowSums(m1)==1,]
dim(m2)
#[1] 32812201 5
set.seed(395)
df1 <- data.frame(ID= sample(LETTERS, nrow(m2), replace=TRUE), m2,
stringsAsFactors=FALSE)
colnames(df1)[-1] <- paste('X', 1:5, sep="_")
library(dplyr)  # dplyr and tidyr are needed for ananda()
library(tidyr)
Grothendieck <- function() {
  nc <- ncol(df1)
  data.frame(ID = df1$ID,
             Diagnosis = as.matrix(df1[-1]) %*% seq(nc - 1))
}
akrun <- function() {
  data.frame(ID = df1$ID,
             Diagnosis = max.col(df1[-1], 'first'))
}
ananda <- function() {
  df1 %>%
    gather(var, val, -ID) %>%
    separate(var, into = c("var", "value")) %>%
    filter(val == 1) %>%
    select(ID, value)
}
system.time(akrun())
# user system elapsed
# 3.690 0.396 4.085
system.time(Grothendieck())
# user system elapsed
# 3.121 0.459 3.581
Trying the dplyr solution on a smaller subset of 'df1' (1e6 rows):
df1 <- df1[1:1e6,]
system.time(ananda())
# user system elapsed
# 6.279 0.177 6.454
Using microbenchmark,
library(microbenchmark)
microbenchmark(akrun(), Grothendieck(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun() 1.019108 1.252443 1.084306 1.180743 1.16463 0.6928535 20 a
#Grothendieck() 1.000000 1.000000 1.000000 1.000000 1.00000 1.0000000 20 a
Data (the small df1 used in the max.col example):
df1 <- structure(list(ID = c("A", "B", "C", "D", "E"),
    Diagnosis_1 = c(1L, 0L, 0L, 0L, 0L),
    Diagnosis_2 = c(0L, 1L, 1L, 0L, 0L),
    Diagnosis_3 = c(0L, 0L, 0L, 0L, 1L),
    Diagnosis_4 = c(0L, 0L, 0L, 1L, 0L)),
    .Names = c("ID", "Diagnosis_1", "Diagnosis_2", "Diagnosis_3", "Diagnosis_4"),
    class = "data.frame", row.names = c(NA, -5L))
Since you mention "reshape2", "tidyr", and related tools, here are some options to consider:
## Using "tidyr" and "dplyr"
library(dplyr)
library(tidyr)
df1 %>%
gather(var, val, -ID) %>%
separate(var, into = c("var", "value")) %>%
filter(val == 1) %>%
select(ID, value)
# ID value
# 1 A 1
# 2 B 2
# 3 C 2
# 4 E 3
# 5 D 4
## Getting half-way there with "melt" from "reshape2"
library(reshape2)
melt(replace(df1, df1 == 0, NA), id.vars = "ID", na.rm = TRUE)
# ID variable value
# 1 A Diagnosis_1 1
# 7 B Diagnosis_2 1
# 8 C Diagnosis_2 1
# 15 E Diagnosis_3 1
# 19 D Diagnosis_4 1
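The melt output above is only half-way because the diagnosis number is still embedded in the variable column. One way to finish, sketched here in base R, is to strip the "Diagnosis_" prefix and convert the remainder to an integer. The data frame long below is typed in by hand to mirror the melted result, so the snippet runs without reshape2. The names long and Diagnosis are illustrative.

```r
# Hand-built copy of the melted result shown above (rows where value == 1).
long <- data.frame(
  ID       = c("A", "B", "C", "E", "D"),
  variable = c("Diagnosis_1", "Diagnosis_2", "Diagnosis_2",
               "Diagnosis_3", "Diagnosis_4"),
  value    = 1L
)

# Drop the prefix so only the diagnosis number remains, then convert it.
long$Diagnosis <- as.integer(sub("^Diagnosis_", "", long$variable))

long[c("ID", "Diagnosis")]
#   ID Diagnosis
# 1  A         1
# 2  B         2
# 3  C         2
# 4  E         3
# 5  D         4
```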
Given your update, you would just need to add a secondary ID:
library(dplyr)
library(tidyr)
mydf %>%
group_by(ID) %>%
mutate(ID2 = row_number()) %>%
gather(var, val, Diagnosis_1:Diagnosis_4) %>%
separate(var, into = c("var", "value")) %>%
filter(val == 1) %>%
arrange(ID, ID2)
# Source: local data frame [10 x 5]
#
# ID ID2 var value val
# 1 A 1 Diagnosis 1 1
# 2 A 2 Diagnosis 1 1
# 3 A 3 Diagnosis 1 1
# 4 B 1 Diagnosis 2 1
# 5 C 1 Diagnosis 4 1
# 6 C 2 Diagnosis 2 1
# 7 D 1 Diagnosis 4 1
# 8 E 1 Diagnosis 3 1
# 9 E 2 Diagnosis 2 1
# 10 E 3 Diagnosis 3 1
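The secondary ID matters because reshaping to long form groups rows by diagnosis column, so the original order of the repeated rows within each ID would otherwise be lost. As a rough illustration of what group_by(ID) followed by row_number() computes, here is a base R equivalent using ave(). The vector names are illustrative and the IDs are copied from the question.

```r
# IDs in the order they appear in the question's updated data.
ID <- c("A", "A", "A", "B", "C", "C", "D", "E", "E", "E")

# Number the rows 1, 2, ... within each ID, keeping the original order.
ID2 <- ave(seq_along(ID), ID, FUN = seq_along)
ID2
# [1] 1 2 3 1 1 2 1 1 2 3
```

Sorting by ID and then ID2 after reshaping restores the row order of the input.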