[英]How to create new dataset based on existing dataset
我有以下數據集:
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
我想在 R 中創建以下數據集:
Individual sequence_special_drug special_drug prior_special_drug prior_traditional_drug during_special_drug
A 1 C none A, B none
A 2 D C none B, Z, Z
A 3 E C, D A, B, B, Z, Z, A none
有沒有快速的方法來做到這一點? 我有很多人,但這些都是可能的情況。 special_drug
由序列號標識; 帶有“NA”的人是traditional_drug
。
prior_special_drug
will contain any special_drug previously identified, so for the first special_drug C there is no previous special_drug, for the second special_drug D, there is one previous special_drug that is C, and for the third special_drug there are two previous special_drugs C and D.
prior_traditional_drug
是相同的,但將包含已在 sequence_special_drug 中標識為 NA 的任何內容。 所以對於第一個 special_drug (C),兩個prior_traditional_drugs 是A 和B。對於第三個special_drug,prior_traditional_drugs 是A、B、B、Z、Z、A。
during_special_drug
將包含在 special_drug 管理過程中引用的每個傳統葯物。 這可以通過重復 sequence_special_drug(例如 2 -> NA NA -> 2 -> NA -> 2)在數據集中識別,因此 B、Z、Z。
編輯- 對於 2 個人:
dat <- read.table(
text = "
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
B 1 D
B NA B
B NA Z
B 1 D
B NA Z
B 1 D
B NA A
B 2 E",
header = TRUE)
我希望:
-prior_traditional_drug 下的錯誤“無”第 3 行 -
Individual sequence_special_drug special_drug prior_special_drug prior_traditional_drug during_special_drug
A 1 C none A, B none
A 2 D C none B, Z, Z
A 3 E C, D A, B, B, Z, Z, A none
B 1 D none none B, Z, Z
B 2 E D B, Z, Z, A none
-prior_traditional_drug 下的右“A,B”第 3 行 -
Individual sequence_special_drug special_drug prior_special_drug prior_traditional_drug during_special_drug
A 1 C none A, B none
A 2 D C A, B B, Z, Z
A 3 E C, D A, B, B, Z, Z, A none
B 1 D none none B, Z, Z
B 2 E D B, Z, Z, A none
我自己的數據集的錯誤消息
> special_drug <- example_data %>%
+ nest_by(individual) %>%
+ mutate(
+ spec_drug = list(get_all_drugs(data))
+ ) %>%
+ unnest(spec_drug) %>%
+ select(-data) %>%
+ ungroup()
`summarise()` has grouped output by 'sequence_special_drug'. You can override using the `.groups` argument.
Error: Problem with `mutate()` input `spec_drug`.
x Problem with `mutate()` input `flag3`.
x `false` must be a list, not a character vector.
ℹ Input `flag3` is `if_else(flag1 == 1, list(character(0)), flag3)`.
ℹ Input `spec_drug` is `list(get_all_drugs(data))`.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
Error in is_rlang_error(parent) :
argument "parent" is missing, with no default
我自己的數據集更像這樣:
example_data <- read.table(
text = "
individual sequence_special_drug all_drugs
77779 NA Name1
77779 1 Name2
77779 1 Name2
77779 1 Name2
77779 2 Name3
4444 NA Name1
4444 1 Name4
4444 2 Name3
4444 3 Name7",
header = TRUE)
但下面的數據集也會生成相同的錯誤消息:
example_data <- read.table(
text = "
individual sequence_special_drug all_drugs
A NA A
A 1 C
A 2 D
A 2 D
A 2 D
A 3 E
B NA B
B 1 D
B 2 E
B 3 F",
header = TRUE)
這是我僅針對此特定問題的不雅解決方案,但給您提示可能很有用。
library(data.table)
dt <- fread(
"
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
"
)
df <- unique(na.omit(dt))
setnames(df,"all_drugs","special_drug")
df
#> individual sequence_special_drug special_drug
#> 1: A 1 C
#> 2: A 2 D
#> 3: A 3 E
## add row ideantifier in dt
dt[,rd:=rowid(individual)]
## create prior_special_drug
df[,prior_special_drug:=shift(special_drug)]
df[3,4] <- df[special_drug < "E", paste(special_drug,collapse = ", ")]
df
#> individual sequence_special_drug special_drug prior_special_drug
#> 1: A 1 C <NA>
#> 2: A 2 D C
#> 3: A 3 E C, D
special.drug = df$special_drug
special.drug
#> [1] "C" "D" "E"
posi <- c(
dt[,first(.I[all_drugs==special.drug[1]])], #first position of C
dt[,first(.I[all_drugs==special.drug[2]])], #first position of D
dt[,last(.I[all_drugs==special.drug[2]])], #last position of D
dt[,last(.I[all_drugs==special.drug[3]])] #last position of E
)
posi
#> [1] 3 4 9 11
# dt[is.na(sequence_special_drug) & rd < posi[1], all_drugs]
# dt[is.na(sequence_special_drug) & rd %between% posi[2:3], all_drugs]
# dt[is.na(sequence_special_drug) & rd < posi[4], all_drugs]
drug <- c(
paste(dt[is.na(sequence_special_drug) & rd < posi[1], all_drugs],collapse = ", "),
paste(dt[is.na(sequence_special_drug) & rd %between% posi[2:3], all_drugs],collapse = ", "),
paste(dt[is.na(sequence_special_drug) & rd < posi[4], all_drugs],collapse = ", ")
)
drug
#> [1] "A, B" "B, Z, Z" "A, B, B, Z, Z, A"
## create prior_traditional_drug and during_special_drug
df[,prior_traditional_drug := drug]
df[,prior_traditional_drug := ifelse(special_drug == "D",NA,prior_traditional_drug)]
df[,during_special_drug := drug]
df[,during_special_drug := ifelse(special_drug %in% c("C","E"),NA,during_special_drug)]
## replace NA with "none" in df
for (jj in 1:ncol(df))
set(df,
i = which(is.na(df[[jj]])),
j = jj,
v = "none"
)
df
#> individual sequence_special_drug special_drug prior_special_drug
#> 1: A 1 C none
#> 2: A 2 D C
#> 3: A 3 E C, D
#> prior_traditional_drug during_special_drug
#> 1: A, B none
#> 2: none B, Z, Z
#> 3: A, B, B, Z, Z, A none
由代表 package (v2.0.0) 於 2021 年 6 月 6 日創建
這是我使用{tidyverse}
的建議。 我寫了一個 function 來獲取每一列,然后將它們放在一起get_all_drugs()
。 然后,我逐個通過嵌套數據運行 function,如下例所示。
library(tidyverse)
example_data <- read.table(
text = "
individual sequence_special_drug all_drugs
A NA A
A NA B
A 1 C
A 2 D
A NA B
A NA Z
A 2 D
A NA Z
A 2 D
A NA A
A 3 E
B 1 D
B NA B
B NA Z
B 1 D
B NA Z
B 1 D
B NA A
B 2 E",
header = TRUE)
get_special_drugs <- function(.data) {
.data %>%
filter(sequence_special_drug != 0) %>%
distinct() %>%
select(sequence_special_drug, special_drug = all_drugs) %>%
mutate(prior_special_drug = as.list(accumulate(special_drug, c))) %>%
rowwise() %>%
mutate(prior_special_drug = list(
prior_special_drug[prior_special_drug != special_drug]
)) %>%
ungroup()
}
fix_drug_sequence <- function(.data) {
.data %>%
mutate(
seq_drug = replace_na(sequence_special_drug, 0),
flag = if_else(seq_drug == 0 & seq_drug != lead(seq_drug),
lead(seq_drug),
seq_drug),
flag = if_else(flag == 0 & flag != lead(flag),
lead(flag),
flag)
) %>%
select(-sequence_special_drug) %>%
rename(sequence_special_drug = flag)
}
get_prior_traditional_drug <- function(...) {
fix_drug_sequence(...) %>%
group_by(sequence_special_drug) %>%
mutate(
flag1 = max(seq_drug == sequence_special_drug & row_number() == 1),
) %>%
group_by(sequence_special_drug, flag1) %>%
summarise(
flag2 = list(all_drugs[seq_drug == 0])
) %>%
ungroup() %>%
mutate(
flag3 = as.list(accumulate(flag2, append)),
flag3 = if_else(flag1 == 1, lag(flag3), flag3)
) %>%
select(sequence_special_drug, prior_traditional_drug = flag3)
}
get_during_special_drugs <- function(...) {
fix_drug_sequence(...) %>%
group_by(sequence_special_drug) %>%
mutate(
flag = cumsum(seq_drug == sequence_special_drug)
) %>%
filter(flag > 0) %>%
summarise(
during_special_drug = list(all_drugs[seq_drug == 0])
)
}
get_all_drugs <- function(.data) {
spec_drug <- get_special_drugs(.data)
prior_traditional <- get_prior_traditional_drug(.data)
during_spec <- get_during_special_drugs(.data)
list(spec_drug, prior_traditional, during_spec) %>%
reduce(left_join, by = "sequence_special_drug")
}
special_drug <- example_data %>%
nest_by(individual) %>%
mutate(
spec_drug = list(get_all_drugs(data))
) %>%
unnest(spec_drug) %>%
select(-data) %>%
ungroup()
special_drug
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.