Simplify a data frame by year and calculate percent change

Question

I have two questions: What resources do you recommend reading to improve data manipulation skills? I've been working with larger datasets and have been struggling to adapt. I feel like I'm hitting a brick wall and don't know where to look (many online resources get too complicated without building a foundation).

For example, I am trying to solve this issue. I have a df with millions of rows and I am trying to simplify it and analyze a trend. I have a dput example below. I am trying to isolate each ID and grab the row with the minimum value for each year. (Some IDs have years that are not available for others.) After simplifying the data, I am trying to add a percent change column. Given that this is a 20+ year time series, I am OK with ignoring months at this point, as one year's minimum value compared to another year's minimum should yield a reasonable percent change.

Thanks!

Input:

structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), Date = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 5L, 
6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 15L, 16L), .Label = c("2/21/2009", 
"2/22/2009", "2/23/2009", "2/24/2009", "2/25/2009", "2/26/2009", 
"3/2/2011", "3/3/2011", "3/4/2011", "3/5/2010", "3/5/2011", "3/6/2010", 
"3/6/2011", "3/7/2010", "3/7/2011", "3/8/2011"), class = "factor"), 
    Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 
    2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2009L, 2009L, 
    2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
    2011L), Value = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
    20, 21, 22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)), .Names = c("ID", 
"Date", "Year", "Value"), class = "data.frame", row.names = c(NA, 
-26L))

Expected output:

structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"b"), class = "factor"), Date = structure(c(1L, 4L, 5L, 2L, 4L, 
3L), .Label = c("2/21/2009", "2/25/2009", "3/2/2011", "3/5/2010", 
"3/6/2011"), class = "factor"), Year = c(2009L, 2010L, 2011L, 
2009L, 2010L, 2011L), Value = c(10, 16, 5, 6, 8, 10), Percent.Increase = c(NA, 
0.6, -0.6875, NA, 0.333333333, 0.25)), .Names = c("ID", "Date", 
"Year", "Value", "Percent.Increase"), class = "data.frame", row.names = c(NA, 
-6L))

Answer 1

After grouping by 'ID' and 'Year', we slice the row with the minimum 'Value' within each group. Then, grouped by 'ID', we create 'Percent.Increase' by subtracting the lag of 'Value' from 'Value' and dividing by the lag of 'Value'.

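library(dplyr)

res <- df1 %>%
  group_by(ID, Year) %>%
  slice(which.min(Value)) %>%   # keep the row with the smallest Value per ID and Year
  group_by(ID) %>%
  # change relative to the previous year's minimum; NA for each ID's first year
  mutate(Percent.Increase = (Value - lag(Value)) / lag(Value))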
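As a cross-check of the logic above, here is a minimal base R sketch (not the dplyr method itself) run on a 12-row subset of the question's data. The names `ord` and `yearly` are made up for illustration, and `Date` is held as character rather than factor to keep the example short.

```r
# Subset of the question's data: two rows per ID and Year
df1 <- data.frame(
  ID    = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  Date  = c("2/21/2009", "2/22/2009", "3/5/2010", "3/6/2010", "3/2/2011", "3/6/2011",
            "2/25/2009", "2/26/2009", "3/5/2010", "3/6/2010", "3/2/2011", "3/3/2011"),
  Year  = c(2009L, 2009L, 2010L, 2010L, 2011L, 2011L,
            2009L, 2009L, 2010L, 2010L, 2011L, 2011L),
  Value = c(10, 11, 16, 17, 19, 5, 6, 7, 8, 8, 10, 11),
  stringsAsFactors = FALSE
)

# Sort so that the first row of each (ID, Year) group holds its minimum Value
ord <- df1[order(df1$ID, df1$Year, df1$Value), ]

# Keep only the first row per (ID, Year)
yearly <- ord[!duplicated(ord[c("ID", "Year")]), ]
rownames(yearly) <- NULL

# Within each ID: (current minimum - previous minimum) / previous minimum
yearly$Percent.Increase <- ave(
  yearly$Value, yearly$ID,
  FUN = function(v) c(NA, diff(v) / head(v, -1))
)

print(yearly)
```

For ID "a" the yearly minimums are 10, 16 and 5, so the changes are (16 - 10) / 10 = 0.6 and (5 - 16) / 16 = -0.6875, matching the expected output in the question. Note that the change is always relative to the previous year present for that ID, so a gap in the years is not treated specially.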

Answer 2

Until a HAVING clause is implemented in data.table, this seems to be a pretty efficient way:

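library(data.table)
dt <- as.data.table(df1)   # assumes the question's data is in df1, as in Answer 1
# Row index of each (ID, Year) minimum; keyby (passed positionally as `,,` in the
# original) also sorts by ID and Year, so shift() below sees the years in order
dt[dt[, .I[which.min(Value)], keyby = .(ID, Year)]$V1
   ][, Percent_Increase := {
       tmp <- shift(Value)   # previous year's minimum within the ID
       (Value - tmp) / tmp
   }, by = .(ID)]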
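To make the two steps of that expression easier to follow, here is a small sketch that splits them apart on a made-up five-row table. The toy values and the names `idx` and `yearly` are assumptions for illustration, not part of the answer.

```r
library(data.table)

# Toy table, deliberately not sorted by Year
dt <- data.table(
  ID    = c("a", "a", "a", "b", "b"),
  Year  = c(2010L, 2009L, 2009L, 2009L, 2009L),
  Value = c(16, 11, 10, 7, 6)
)

# Step 1: .I holds row numbers of dt, so .I[which.min(Value)] is the row
# number of the minimum within each (ID, Year) group. keyby sorts the
# result by ID and Year.
idx <- dt[, .I[which.min(Value)], keyby = .(ID, Year)]
print(idx$V1)   # 3 1 5

# Step 2: subset dt by those row numbers, then compute the change
# relative to the previous row within each ID.
yearly <- dt[idx$V1]
yearly[, Percent_Increase := (Value - shift(Value)) / shift(Value), by = ID]
print(yearly)
```

Because `keyby` orders the indices by ID and Year, the subset comes out in year order even though the toy table is not, which is what lets `shift()` pick up the previous year's minimum.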

Check timing on 5e7 rows.

library(dplyr)
library(data.table)
N = 5e7
set.seed(1)
df = data.frame(ID = sample(2L, N, TRUE), 
                Date = sample(16L, N, TRUE), 
                Year = sample(2009:2011, N, TRUE), 
                Value = sample(N/10, N, TRUE))
dt = as.data.table(df)
system.time(
    res <- df %>%
        group_by(ID, Year) %>%
        slice(which.min(Value)) %>% 
        group_by(ID) %>%
        mutate(Percent_Increase = (Value-lag(Value))/lag(Value))    
)
#   user  system elapsed 
#  1.676   2.176   3.847
system.time(
    r <- dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1,
            ][, Percent_Increase := {
                tmp <- shift(Value)
                (Value-tmp)/tmp
            }, .(ID)]
)
#   user  system elapsed 
#  0.940   0.460   1.334
all.equal(r, as.data.table(res), ignore.col.order = TRUE, check.attributes = FALSE, ignore.row.order = TRUE)
#[1] TRUE

Answer 1
3, accepted, 2016-07-09 18:54:11

Answer 2
2, 2016-07-09 20:15:07
