Diagonals to rows in data.frame
I have a matrix-like data frame with an additional column denoting time. It contains information on the number of enrolled students in a given school, from grade 5 (column A) to grade 9 (column E).
time A B C D E
1 13 1842 1844 1689 1776 1716
2 14 1898 1785 1807 1617 1679
3 15 2065 1865 1748 1731 1590
4 16 2215 1994 1811 1708 1703
5 17 2174 2122 1903 1765 1699
I need to trace the size of each cohort over time, meaning that I need row-wise information on how many fifth graders from each starting year remained in the school in grades 6 through 9. For example, for the cohort that began fifth grade in 2013, I want to know how many remained in sixth grade in 2014, and so on.
Expected output
This is what I would like to end up with:
start.time point.A point.B point.C point.D point.E
1 13 1842 1785 1748 1708 1699
2 14 1898 1865 1811 1765 NA
3 15 2065 1994 1903 NA NA
4 16 2215 2122 NA NA NA
5 17 2174 NA NA NA NA
I have looked at diag() from base R, but I could only get the data from the main diagonal. Ideally, I'd like to accomplish this using dplyr syntax and the pipe.
Data
structure(list(time = 13:17, A = c(1842, 1898, 2065, 2215, 2174), B = c(1844, 1785, 1865, 1994, 2122), C = c(1689, 1807, 1748, 1811, 1903), D = c(1776, 1617, 1731, 1708, 1765), E = c(1716, 1679, 1590, 1703, 1699)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), vars = "time", drop = TRUE, indices = list(
0L, 1L, 2L, 3L, 4L), group_sizes = c(1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
time = 13:17), class = "data.frame", row.names = c(NA, -5L), vars = "time", drop = TRUE, .Names = "time"), .Names = c("time", "A", "B", "C", "D", "E"))
Convert the input DF, except for its first column, to a matrix mat. Since row(mat) - col(mat) is constant along each diagonal, split mat by that value and convert each piece to a ts series, giving the list L. We use the ts class because such series can later be combined with cbind even if they have different lengths. The diagonals for which row(mat) - col(mat) >= 0 are the only ones we want, so pick those, cbind them together and transpose the result. Then replace all columns of DF except the first with that. No packages are used.
mat <- as.matrix(DF[-1])
L <- lapply(split(mat, row(mat) - col(mat)), ts)
replace(DF, -1, t(do.call("cbind", L[as.numeric(names(L)) >= 0])))
giving:
time A B C D E
1 13 1842 1785 1748 1708 1699
2 14 1898 1865 1811 1765 NA
3 15 2065 1994 1903 NA NA
4 16 2215 2122 NA NA NA
5 17 2174 NA NA NA NA
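To make the indexing idea above concrete, here is a minimal base R sketch on a small 3 x 3 matrix. The matrix values and the names m, idx and diags are made up for illustration and are not part of the original answer.

```r
# Small 3 x 3 example matrix (arbitrary values, filled column by column)
m <- matrix(1:9, nrow = 3)

# row(m) - col(m) takes one value per diagonal: 0 on the main diagonal,
# positive below it and negative above it
idx <- row(m) - col(m)
print(idx)

# Splitting by that index collects the elements of each diagonal
diags <- split(m, idx)
print(diags[["0"]])  # elements on the main diagonal
print(diags[["1"]])  # elements on the first diagonal below the main one
```

Only the pieces whose name is 0 or greater correspond to cohorts that start within the observed years, which is why the answer filters on as.numeric(names(L)) >= 0.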
Since you mentioned dplyr in your question, you could use dplyr::lead to shift the values of columns B to E by 1, 2, etc. positions respectively, and then bind the result with columns time and A from your original data as follows:
library(tidyverse)
bind_cols(df[, 1:2], map2_df(.x = df[, c(3:ncol(df))],
.y = seq_along(df[, 3:ncol(df)]),
.f = ~dplyr::lead(x = .x, n = .y)))
# A tibble: 5 x 6
# Groups: time [5]
# time A B C D E
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 13 1842 1785 1748 1708 1699
#2 14 1898 1865 1811 1765 NA
#3 15 2065 1994 1903 NA NA
#4 16 2215 2122 NA NA NA
#5 17 2174 NA NA NA NA
Note that your data is grouped by time the way you provided it.
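As a rough illustration of the shifting step, the sketch below reproduces it in base R for two columns. The helper shift_up is hypothetical (it is not a dplyr function) and is only assumed to mimic what dplyr::lead(x, n) does with its default settings.

```r
# Hypothetical helper: move every value n positions towards the start
# and pad the end with NA
shift_up <- function(x, n) c(x, rep(NA, n))[seq_along(x) + n]

B <- c(1844, 1785, 1865, 1994, 2122)  # column B from the question
C <- c(1689, 1807, 1748, 1811, 1903)  # column C from the question

shifted_B <- shift_up(B, 1)  # grade 6, one year after the start year
shifted_C <- shift_up(C, 2)  # grade 7, two years after the start year
print(shifted_B)
print(shifted_C)
```

Each column is shifted by its distance from column A, so the values of one cohort end up in the same row.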
With some grouping and arranging and row_number(), we can do this with dplyr and tidyr, and we don't lose values.
It looks a bit messy, but here I create a 2-dimensional index where the second dimension is inverted. When these index positions are summed, we get a matching value for diagonal rows.
library(dplyr)
library(tidyr)
library(ggplot2)

data %>%
ungroup() %>%
mutate(row = row_number()) %>%
gather(class, stud, A:E) %>%
arrange(row, desc(class)) %>%
group_by(row) %>%
mutate(time_left = row_number()) %>%
ungroup() %>%
transmute(time, class, stud, start_year = time_left + row - 1) %>%
ggplot(aes(time, stud, color = factor(start_year))) +
geom_line() +
geom_point()
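The index used in the pipeline above can be checked with a small base R sketch. The names n_rows, n_cols, reversed_col and cohort_idx are introduced here for illustration only; the sketch assumes five years and five grades, as in the question's data.

```r
n_rows <- 5  # five years in the question's data
n_cols <- 5  # five grades, A to E

# Position within each row after sorting the grades in descending order:
# E gets 1, D gets 2, ..., A gets 5
reversed_col <- n_cols:1

# Row index plus reversed column index, minus 1
cohort_idx <- outer(seq_len(n_rows), reversed_col, "+") - 1
dimnames(cohort_idx) <- list(13:17, LETTERS[1:5])
print(cohort_idx)
```

Cells that belong to the same cohort share one value of this index, for example column A in year 14 and column B in year 15. The value is a label for the cohort rather than a calendar year.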
Replace the mirrored upper triangle of "d" with the values from the lower triangle.
m <- as.matrix(d[-1])
d[-1] <- NA
d[-1][upper.tri(m, diag = TRUE)[ , ncol(m):1]] <- m[lower.tri(m, diag = TRUE)]
# time A B C D E
# 1 13 1842 1785 1748 1708 1699
# 2 14 1898 1865 1811 1765 NA
# 3 15 2065 1994 1903 NA NA
# 4 16 2215 2122 NA NA NA
# 5 17 2174 NA NA NA NA
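A minimal sketch of the same triangle trick on a plain 3 x 3 matrix may help to see which cells are read and which are written. The names lower, target and out are illustrative and do not appear in the answer.

```r
m <- matrix(1:9, nrow = 3)  # arbitrary values, filled column by column

# TRUE on and below the main diagonal: the cells that are read
lower <- lower.tri(m, diag = TRUE)

# Upper triangle with its columns reversed: TRUE in the top-left corner,
# the cells that are written
target <- upper.tri(m, diag = TRUE)[, ncol(m):1]

out <- matrix(NA_integer_, nrow = 3, ncol = 3)
out[target] <- m[lower]
print(out)
```

Both masks are traversed column by column and select the same number of cells in each column, so every diagonal of m ends up as a row of out.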