How to transform/resample/interpolate data for normalising variable length within a tidy dataset with multiple grouping variables in R?
I am aiming to normalise the length of vectors for averaging within a tidy dataset. Using approx seems to be the way to go, but I can't make it work efficiently in the tidyverse. One issue is probably related to resizing within a data frame. Here's a reproducible example:
# create reproducible dataset
library(dplyr)
library(tidyr)
library(ggplot2)

i = 80   # length of sub_event 1
I = 110  # length of sub_event 2
id = rep("AA", I+i)
event = rep("event1", I+i)
sub_event = NA
sub_event[1:i] = 1
sub_event[(i+1):(i+I)] = 2
sub_event = as.factor(sub_event)
y1 = sin(seq(0, 5*pi, length.out = i))
y2 = sin(seq(0, 5*pi, length.out = I))
y3 = cos(seq(0, 5*pi, length.out = i))
y4 = cos(seq(0, 5*pi, length.out = I))
var1 = c(y1,y2)
var2 = c(y3,y4)
df1 <- data.frame(id, event, sub_event,var1, var2)
df2 <- df1
df2$event = "event2"
df <- rbind(df1, df2)
temp <- df
temp$id = "BB"
df <- rbind(df, temp)
# create a "time" vector for sub_event
df <- df %>%
group_by(id, event, sub_event) %>%
mutate(sub_event_time = seq_along(var1)) %>%
select(id, event, sub_event, sub_event_time, everything()) %>%
ungroup()
Plot var1:
# plot
ggplot(df,
aes(x=sub_event_time, y=var1, colour = sub_event)) +
geom_point() +
geom_path() +
facet_wrap(id~event)
I want to transform/resample the data so that, within each id and event, the length of var1 for each sub_event equals the length of the longest sub_event. For instance: the length of var1 for event 1 sub event 1 should equal the length of var1 for event 1 sub event 2 (which is the longest). Here's an attempt:
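The base-R building block here is approx, which linearly interpolates a series onto a new grid of any length. A minimal sketch (the lengths 80 and 110 mirror the example's i and I):

```r
# approx() resamples a numeric vector to any target length; when given a
# single vector, x defaults to the integer positions seq_along(short)
short <- sin(seq(0, 5 * pi, length.out = 80))
stretched <- approx(short, n = 110)$y
length(stretched)  # 110
```

The goal is to apply exactly this per group, with n taken from the longest sub_event of the same event.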
# attempt for var1 only
aim.df <- df %>%
ungroup() %>%
select(-var2) %>%
group_by(id, event) %>%
mutate(max_sub_event_time = max(sub_event_time)) %>%
mutate(var1 = approx(var1, n = max_sub_event_time)$y)
This returns the following error:
Error in mutate_impl(.data, dots) :
Column `var1` must be length 190 (the group size) or one, not 110
In addition: Warning messages:
1: In if (n <= 0) stop("'approx' requires n >= 1") :
the condition has length > 1 and only the first element will be used
2: In seq.int(x[1L], x[nx], length.out = n) :
first element used of 'length.out' argument
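The two problems behind these messages can be reproduced in isolation: mutate() must return one value per row of its group, but the (id, event) group has 80 + 110 = 190 rows while approx() yields only 110 values; and passing the whole max_sub_event_time column as n hands approx() a vector where it expects a single number, which triggers the warnings. A minimal sketch:

```r
# the (id, event) group holds both sub_events: 80 + 110 = 190 rows
y <- c(sin(seq(0, 5 * pi, length.out = 80)),
       sin(seq(0, 5 * pi, length.out = 110)))
length(y)                       # 190, the group size mutate() expects
length(approx(y, n = 110)$y)    # 110, hence the length error
```

Using first(max_sub_event_time) and returning the result as a list column sidesteps both issues.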
Any ideas?
steps...
- group_by(id, event, sub_event) and drop sub_event_time, since it will be irrelevant once you add observations
- summarise the result of the approx function as a list column (you will have to convert var1 and max_sub_event_time to appropriate input for approx)
- unnest the resulting list column
- group_by(id, event, sub_event) again and add a new sub_event_time
code...
library(dplyr)
library(tidyr)
df %>%
ungroup() %>%
select(-var2) %>%
group_by(id, event) %>%
# longest sub_event length within each (id, event)
mutate(max_sub_event_time = max(sub_event_time)) %>%
group_by(id, event, sub_event) %>%
select(-sub_event_time) %>%
# resample each sub_event's var1 to that length; list column keeps
# summarise() happy despite the changed group size
summarise(var1_int = list(approx(as.numeric(var1), n = first(max_sub_event_time))$y)) %>%
unnest(cols = var1_int) %>%  # plain unnest() before tidyr 1.0
group_by(id, event, sub_event) %>%
# rebuild the time index for the resampled series
mutate(sub_event_time = row_number())
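The summarise-into-a-list-column-then-unnest pattern above can be sanity-checked on a toy tibble (the column names grp, val, and n_max below are illustrative, not from the original data):

```r
library(dplyr)
library(tidyr)

# two groups of unequal length; resample both to the longest length
toy <- tibble(
  grp = rep(c("a", "b"), times = c(4, 6)),
  val = c(1:4, seq(1, 4, length.out = 6))
)
out <- toy %>%
  mutate(n_max = max(table(grp))) %>%   # longest group has 6 rows
  group_by(grp) %>%
  summarise(val = list(approx(val, n = first(n_max))$y), .groups = "drop") %>%
  unnest(cols = val)
table(out$grp)  # both groups now have 6 rows
```

After the unnest, every group has the same number of observations, so per-time-point averaging across groups becomes straightforward.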