Calculate Z-Score for every N rows in a DF (R)
Hi, I have a df in which the columns are variables and the rows are times. Each cell is a count.
Var_1 Var_2 Var_3
Time_1 5 4 5
Time_2 4 19 4
Time_3 2 2 87
This df has a lot of rows (> 30,000).
How can I calculate Z scores for every 20 rows? Thanks in advance! <3 <3
Here is an answer that uses dplyr::summarise() to calculate means and standard deviations, then merges them with the original data and uses mutate() to calculate the z-scores. We'll illustrate the single variable case, but it can be extended to handle multiple variables.
Given the ambiguity of the original question, we assume the Time column is structured in groups of 20, which allows us to use it as the main grouping variable for the solution. That is, there are 20 observations at Time-1, another 20 at Time-2, etc.
If the requirement is to create groups of 20 rows based on consecutive row identifiers, the solution can easily be modified to add a grouping variable to represent sets of 20 rows.
# simulate some data
y <- rpois(20000, 3) # simulate counts
# 1000 distinct time labels, each occurring 20 times
TimeVal <- paste0("Time-", rep(1:1000, 20))
data <- data.frame(TimeVal, y, stringsAsFactors = FALSE)
library(dplyr)
# per-group mean and sd, joined back to the rows, then z-scores
result <- data %>%
  group_by(TimeVal) %>%
  summarise(ybar = mean(y), stDev = sd(y)) %>%
  full_join(data, ., by = "TimeVal") %>%
  mutate(zScore = (y - ybar) / stDev)
head(result)
...and the output:
> head(result)
TimeVal y ybar stDev zScore
1 Time-1 6 3.45 1.276302 1.99795938
2 Time-2 2 2.95 1.700619 -0.55862010
3 Time-3 2 3.20 1.908430 -0.62878909
4 Time-4 3 3.10 1.916686 -0.05217339
5 Time-5 2 3.10 1.447321 -0.76002513
6 Time-6 2 3.30 1.809333 -0.71849700
>
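If the groups should instead be consecutive blocks of 20 rows, a grouping variable can be derived from the row number. The following is only a minimal sketch of that idea: the names df, grp and block_size are illustrative and not part of the original answer, and it computes the z-scores directly inside a grouped mutate() rather than going through summarise() and a join.

```r
library(dplyr)

set.seed(1)                           # arbitrary seed, only for this sketch
df <- data.frame(y = rpois(100, 3))   # hypothetical data: 100 rows of counts

block_size <- 20                      # hypothetical name for the N in "every N rows"

result <- df %>%
  # rows 1-20 get grp 1, rows 21-40 get grp 2, and so on
  mutate(grp = (row_number() - 1) %/% block_size + 1) %>%
  group_by(grp) %>%
  # z-score of each value relative to its own block of rows
  mutate(zScore = (y - mean(y)) / sd(y)) %>%
  ungroup()

head(result)
```

If the number of rows is not a multiple of block_size, the last block is simply shorter. A block whose values are all identical has a standard deviation of zero, so its z-scores come out as NaN.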
To solve for multiple columns in the original input data frame, first we create a long form tidy data frame with tidyr::pivot_longer(), calculate means and standard deviations, merge them with the narrow data and calculate z-scores.
Converting the input data to a long form tidy data frame allows us to use the original column names in a dplyr::group_by(), eliminating a lot of code that would otherwise be required to calculate the z-scores for each column in the original data.
library(tidyr)
set.seed(95014) # set seed to make results reproducible
y2 <- rpois(20000, 8)
y3 <- rpois(20000, 15)
data <- data.frame(TimeVal, y, y2, y3, stringsAsFactors = FALSE)
# convert to narrow format tidy, calculate means, sds, and zScores
longData <- data %>%
  group_by(TimeVal) %>%
  pivot_longer(-TimeVal,
               names_to = "variable",
               values_to = "value")
result <- longData %>%
  group_by(TimeVal, variable) %>%
  summarise(avg = mean(value), stDev = sd(value)) %>%
  full_join(longData, ., by = c("TimeVal", "variable")) %>%
  mutate(zScore = (value - avg) / stDev)
head(result)
...and the output:
> head(result)
# A tibble: 6 x 6
# Groups: TimeVal [2]
TimeVal variable value avg stDev zScore
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 Time-1 y 6 3.45 1.28 2.00
2 Time-1 y2 13 8.7 2.23 1.93
3 Time-1 y3 20 16.4 5.25 0.686
4 Time-2 y 2 2.95 1.70 -0.559
5 Time-2 y2 6 8.2 2.89 -0.760
6 Time-2 y3 12 14.8 3.34 -0.852
>
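For comparison, here is a sketch that keeps the data in the wide layout from the question and standardises every variable column within consecutive blocks of 20 rows. It is an alternative to the long-format approach above, not the method used in this answer. It assumes dplyr 1.0.0 or later (for across()), and the object names wide, grp and result_wide, as well as the simulated values, are made up for illustration.

```r
library(dplyr)

set.seed(2)                       # arbitrary seed, only for this sketch
# hypothetical wide data shaped like the question: one column per variable
wide <- data.frame(
  Var_1 = rpois(60, 3),
  Var_2 = rpois(60, 8),
  Var_3 = rpois(60, 15)
)

result_wide <- wide %>%
  # consecutive blocks of 20 rows: 1-20, 21-40, 41-60
  mutate(grp = (row_number() - 1) %/% 20 + 1) %>%
  group_by(grp) %>%
  # add one z-score column per Var_ column, computed within each block
  mutate(across(starts_with("Var_"),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "{.col}_z")) %>%
  ungroup()

head(result_wide)
```

This keeps one row per time point, which may be more convenient if the result has to line up with the original df. The long format is easier to work with when the variables need to be summarised or plotted together.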