Diagonals to rows in data.frame

I have a matrix-like data frame with an additional column denoting time. It contains the number of students enrolled in a given school, from grade 5 (column A) to grade 9 (column E).

  time    A    B    C    D    E
1   13 1842 1844 1689 1776 1716
2   14 1898 1785 1807 1617 1679
3   15 2065 1865 1748 1731 1590
4   16 2215 1994 1811 1708 1703
5   17 2174 2122 1903 1765 1699

I need to trace the size of each cohort over time, meaning that I need row-wise information on how many fifth graders from each starting year remained in the school in grades 6 through 9. For example, for the cohort that began fifth grade in 2013, I want to know how many remained in sixth grade in 2014, and so on.

Expected output

This is what I would like to end up with:

  start.time point.A point.B point.C point.D point.E
1         13    1842    1785    1748    1708    1699
2         14    1898    1865    1811    1765      NA
3         15    2065    1994    1903      NA      NA
4         16    2215    2122      NA      NA      NA
5         17    2174      NA      NA      NA      NA

I have looked at diag() from base R, but I could only get the data from the main diagonal. Ideally, I'd like to accomplish this using dplyr syntax and the pipe.

Data

structure(list(time = 13:17, A = c(1842, 1898, 2065, 2215, 2174), B = c(1844, 1785, 1865, 1994, 2122), C = c(1689, 1807, 1748, 1811, 1903), D = c(1776, 1617, 1731, 1708, 1765), E = c(1716, 1679, 1590, 1703, 1699)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), vars = "time", drop = TRUE, indices = list(
0L, 1L, 2L, 3L, 4L), group_sizes = c(1L, 1L, 1L, 1L, 1L), biggest_group_size = 1L, labels = structure(list(
time = 13:17), class = "data.frame", row.names = c(NA, -5L), vars = "time", drop = TRUE, .Names = "time"), .Names = c("time", "A", "B", "C", "D", "E"))

Convert the input DF, except for its first column, to a matrix mat. Since row(mat) - col(mat) is constant along each diagonal, split mat by that quantity, which gives a list L of ts class series. We use the ts class because such series can later be combined with cbind even if they have different lengths. The diagonals for which row(mat) - col(mat) >= 0 are the only ones we want, so pick those, cbind them together and transpose the result. Then replace all columns of DF except the first with that. No packages are used.

mat <- as.matrix(DF[-1])
L <- lapply(split(mat, row(mat) - col(mat)), ts)
replace(DF, -1, t(do.call("cbind", L[as.numeric(names(L)) >= 0])))

giving:

  time    A    B    C    D    E
1   13 1842 1785 1748 1708 1699
2   14 1898 1865 1811 1765   NA
3   15 2065 1994 1903   NA   NA
4   16 2215 2122   NA   NA   NA
5   17 2174   NA   NA   NA   NA
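
To see why splitting on row(mat) - col(mat) isolates the diagonals, here is a minimal sketch on a toy 3 x 3 matrix. The matrix m and its values are made up for illustration and are not the enrolment data.

```r
# Toy 3 x 3 matrix (hypothetical values, filled column by column)
m <- matrix(1:9, nrow = 3)

# row(m) - col(m) is 0 on the main diagonal, positive below it, negative above it
idx <- row(m) - col(m)

# split() groups the cells by that index: one list element per diagonal
diags <- split(m, idx)

diags[["0"]]  # main diagonal
diags[["1"]]  # first diagonal below the main one
```

The list names are the index values as strings, which is why the answer filters with as.numeric(names(L)) >= 0 to keep the main diagonal and everything below it.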

Since you mentioned dplyr in your question, you could use dplyr::lead to shift the values of columns B to E by 1, 2, etc. respectively, and then bind the result with columns time and A from your original data, as follows:

library(tidyverse)
bind_cols(df[, 1:2], map2_df(.x = df[, c(3:ncol(df))],
                             .y = seq_along(df[, 3:ncol(df)]), 
                             .f = ~dplyr::lead(x = .x, n = .y)))
#  A tibble: 5 x 6
#  Groups:   time [5]
#   time     A     B     C     D     E
#  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1    13  1842  1785  1748  1708  1699
#2    14  1898  1865  1811  1765    NA
#3    15  2065  1994  1903    NA    NA
#4    16  2215  2122    NA    NA    NA
#5    17  2174    NA    NA    NA    NA

Note that your data is grouped by time the way you provided it.
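
For readers who want to see the shifting idea without any packages, here is a rough base R sketch of the same approach. It rebuilds the question's data as a plain, ungrouped data frame; shift_up() is a hypothetical helper written for this example (it is not part of dplyr) and stands in for dplyr::lead.

```r
# Enrolment data from the question, as a plain (ungrouped) data frame
df <- data.frame(
  time = 13:17,
  A = c(1842, 1898, 2065, 2215, 2174),
  B = c(1844, 1785, 1865, 1994, 2122),
  C = c(1689, 1807, 1748, 1811, 1903),
  D = c(1776, 1617, 1731, 1708, 1765),
  E = c(1716, 1679, 1590, 1703, 1699)
)

# Hypothetical helper: drop the first n values and pad the end with NA
# (assumes 1 <= n <= length(x))
shift_up <- function(x, n) c(x[-seq_len(n)], rep(NA, n))

# Shift column B by 1, C by 2, D by 3 and E by 4; time and A stay as they are
shifted <- df
shifted[3:6] <- Map(shift_up, df[3:6], 1:4)
shifted
```

Each row of shifted then follows one cohort from grade 5 (column A) onwards, with NA where the data ends.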

With some grouping, arranging and row_number(), we can do this with dplyr and tidyr, and we don't lose values.

It looks a bit messy, but here I create a two-dimensional index in which the second dimension is inverted. When these index positions are summed, all cells on the same diagonal get the same value.

library(tidyverse)  # provides dplyr, tidyr and ggplot2

data %>% 
  ungroup() %>% 
  mutate(row = row_number()) %>% 
  gather(class, stud, A:E) %>% 
  arrange(row, desc(class)) %>% 
  group_by(row) %>% 
  mutate(time_left = row_number()) %>% 
  ungroup() %>% 
  transmute(time, class, stud, start_year = time_left + row - 1) %>% 
  ggplot(aes(time, stud, color = factor(start_year))) +
  geom_line() +
  geom_point()
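
The index arithmetic can be checked on its own with a small base R sketch. The 5 x 5 grid below only mirrors the shape of the data (5 years by 5 grades); the names grid, time_left and cohort are chosen for this illustration.

```r
# Empty 5 x 5 grid: rows are years (13..17), columns are grades A..E
grid <- matrix(0, nrow = 5, ncol = 5)

# Inverted column index, as produced by arrange(row, desc(class)):
# E gets 1, D gets 2, ..., A gets 5
time_left <- ncol(grid) - col(grid) + 1

# Same formula as start_year in the pipeline above
cohort <- time_left + row(grid) - 1
cohort
```

Cells that share a value of cohort lie on the same diagonal, so grouping or colouring by it follows one cohort through the grades. Because every cell gets an index, cohorts that entered before the first recorded year are kept as well, which is what "we don't lose values" refers to.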


Replace the mirrored upper triangle of "d" with the values from the lower triangle.

m <- as.matrix(d[-1])
d[-1] <- NA
d[-1][upper.tri(m, diag = TRUE)[ , ncol(m):1]] <- m[lower.tri(m, diag = TRUE)]

#   time    A    B    C    D    E
# 1   13 1842 1785 1748 1708 1699
# 2   14 1898 1865 1811 1765   NA
# 3   15 2065 1994 1903   NA   NA
# 4   16 2215 2122   NA   NA   NA
# 5   17 2174   NA   NA   NA   NA
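
To illustrate how the two logical masks line up, here is a small sketch on a toy 3 x 3 matrix with made-up values. The names src, dst and out are chosen for this example only.

```r
# Toy 3 x 3 matrix (hypothetical values, filled column by column)
m <- matrix(1:9, nrow = 3)

# Cells to read: the lower triangle including the diagonal (row >= col)
src <- lower.tri(m, diag = TRUE)

# Cells to write: the upper triangle with its columns reversed,
# which is the upper-left triangle of the matrix
dst <- upper.tri(m, diag = TRUE)[, ncol(m):1]

# Both masks are traversed column by column, so column k of the lower
# triangle lands at the top of column k of the result
out <- matrix(NA_integer_, nrow = 3, ncol = 3)
out[dst] <- m[src]
out
```

Row 1 of out holds the main diagonal of m, row 2 the diagonal below it, and so on. This relies on the matrix being square, as it is here (5 years by 5 grades); otherwise the two masks would select different numbers of cells.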
