How to reshape wide continuous data into long categorical data?
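My data is in the following wide format, with one row per SUBJECT_ID. Each row gives the total number of times variables X and Y were observed, followed by several metadata columns such as SUBJECT_BIRTHYEAR and SUBJECT_HOMETOWN: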
variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown
I would like to convert it to the following long format, in which each SUBJECT_ID has one row per observation of variables X and Y:
VARIABLE SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
X A 1950 Townsville
X A 1950 Townsville
Y A 1950 Townsville
X B 1951 Villestown
Y B 1951 Villestown
Y B 1951 Villestown
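What is specific to my question is how to convert a count of n observations of a continuous variable into n rows of categorical data.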
Try the following.
Data
df <- read.table(text="variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown", header=TRUE)
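Solution
library(tidyverse)
result <- df %>%
  # tidyr >= 1.0.0 syntax; older versions used nest(variableX, variableY, .key = "VARIABLE")
  nest(VARIABLE = c(variableX, variableY)) %>%
  mutate(VARIABLE = map(VARIABLE, function(i) {
    # i is a one-row tibble of counts; repeat each variable name by its count
    vec <- unlist(i)
    rep(gsub("variable", "", names(vec)), times = vec)
  })) %>%
  unnest(VARIABLE)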
# A tibble: 6 x 4
# SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN VARIABLE
# <fctr> <int> <fctr> <chr>
# 1 A 1950 Townsville X
# 2 A 1950 Townsville X
# 3 A 1950 Townsville Y
# 4 B 1951 Villestown X
# 5 B 1951 Villestown Y
# 6 B 1951 Villestown Y
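With tidyr 1.0.0 or later, the same expansion can probably be written more directly with pivot_longer() and uncount(). The following is only a sketch under that version assumption; the intermediate column name n and the object name result_uncount are chosen here for illustration.

```r
library(tidyr)
library(dplyr)

df <- read.table(text = "variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown", header = TRUE)

result_uncount <- df %>%
  # one row per (subject, variable) pair; the count goes into the helper column n
  pivot_longer(c(variableX, variableY),
               names_to = "VARIABLE",
               names_prefix = "variable",
               values_to = "n") %>%
  # repeat each row n times; uncount() drops n afterwards by default
  uncount(n)

result_uncount
```

uncount() is the inverse of count(), so rows whose count is 0 simply disappear from the result.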
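The question asks to reverse a call to dcast() that used length() as the aggregation function to reshape the data from long to wide format.
This can be achieved by calling melt() together with a few additional transformations: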
library(data.table)
# reshape wide back to long format
long <- melt(setDT(wide), measure.vars = c("variableX", "variableY"))[
  # undo munging of variable names
  , variable := stringr::str_replace(variable, "^variable", "")][]
# undo effect of aggregation by length()
result <- long[long[, rep(.I, value)]][
  # beautify result
  order(SUBJECT_ID), !"value"]
result
   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variable
1:          A              1950       Townsville        X
2:          A              1950       Townsville        X
3:          A              1950       Townsville        Y
4:          B              1951       Villestown        X
5:          B              1951       Villestown        Y
6:          B              1951       Villestown        Y
.I is a special symbol that holds the row location, i.e., the row index.
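To see in isolation what rep(.I, value) does, here is a small sketch on a made-up table. The object names toy, idx and expanded are illustrative only and do not appear in the answer's data.

```r
library(data.table)

# Toy stand-in for the melted table: 'value' is the number of copies wanted per row
toy <- data.table(id = c("A", "B", "C"), value = c(2L, 0L, 3L))

# .I is 1, 2, 3 here; rep() repeats each row index 'value' times
idx <- toy[, rep(.I, value)]
idx
# 1 1 3 3 3

# subsetting by the repeated indices duplicates the rows accordingly
expanded <- toy[idx]
expanded
```

Row B has a count of 0, so its index never appears in idx and the row is dropped from expanded.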
To demonstrate that this is indeed the inverse operation, result can be reshaped again to reproduce wide:
dcast(result, ... ~ paste0("variable", variable), length, value.var = "variable")
   SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN variableX variableY
1:          A              1950       Townsville         2         1
2:          B              1951       Villestown         1         2
Data
library(data.table)
# wide must be created before running the melt() call above
wide <- fread("variableX variableY SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
2 1 A 1950 Townsville
1 2 B 1951 Villestown")
Here is an option using base R:
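# df is the data frame read in with read.table() above; columns 1:2 hold the counts
counts <- t(df[1:2])  # 2 x n matrix, one column per subject
# row(counts) indexes the variable names, so this also works for more than two subjects
res <- cbind(VARIABLE = rep(substr(names(df)[1:2], 9, 9)[row(counts)], counts),
             df[rep(seq_len(nrow(df)), rowSums(df[1:2])), -(1:2)])
row.names(res) <- NULL
res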
# VARIABLE SUBJECT_ID SUBJECT_BIRTHYEAR SUBJECT_HOMETOWN
#1 X A 1950 Townsville
#2 X A 1950 Townsville
#3 Y A 1950 Townsville
#4 X B 1951 Villestown
#5 Y B 1951 Villestown
#6 Y B 1951 Villestown