简体   繁体   English

按年份简化数据框并计算百分比变化

[英]Simplify Dataframe by Year and Calculate Percent Change

I have two questions: What resources do you recommend reading to improve data manipulation capabilities? 我有两个问题:您建议阅读哪些资源以提高数据处理能力? I've been working with larger datasets and have been struggling to adapt--I feel like I'm hitting a brick wall and don't know where to look (many online resources get too complicated without building foundation). 我一直在处理较大的数据集,并且一直在努力适应-我感觉好像撞到了墙,不知道在哪里看(许多在线资源如果没有基础就太复杂了)。

For example, I am trying to solve this issue. 例如,我正在尝试解决此问题。 I have a df with millions of rows and I am trying to simplify it and analyze a trend. 我有一个包含数百万行的df,我正在尝试简化它并分析趋势。 I have a dput example. 我有一个dput示例。 I am trying to isolate each ID and grab the minimum value for a given year. 我试图隔离每个ID并获取给定年份的最小值。 (Some IDs have years not available for others). (某些ID的年份对其他人不可用)。 After simplifying that data, I am trying to add a percent change column. 在简化该数据之后,我尝试添加一个百分比变化列。 Given this is a 20+ year time series, I am ok with ignoring months at this point, as minimum value for a year compared to a minimum to another year should yield a reasonable percent change. 鉴于这是一个20年以上的时间序列,因此我现在可以忽略几个月,因为一年的最小值与另一年的最小值相比应该会产生合理的百分比变化。

Thanks! 谢谢!

Input: 输入:

structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), Date = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 5L, 
6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 15L, 16L), .Label = c("2/21/2009", 
"2/22/2009", "2/23/2009", "2/24/2009", "2/25/2009", "2/26/2009", 
"3/2/2011", "3/3/2011", "3/4/2011", "3/5/2010", "3/5/2011", "3/6/2010", 
"3/6/2011", "3/7/2010", "3/7/2011", "3/8/2011"), class = "factor"), 
    Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 
    2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2009L, 2009L, 
    2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
    2011L), Value = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
    20, 21, 22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)), .Names = c("ID", 
"Date", "Year", "Value"), class = "data.frame", row.names = c(NA, 
-26L))

Expected output: 预期产量:

structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"b"), class = "factor"), Date = structure(c(1L, 4L, 5L, 2L, 4L, 
3L), .Label = c("2/21/2009", "2/25/2009", "3/2/2011", "3/5/2010", 
"3/6/2011"), class = "factor"), Year = c(2009L, 2010L, 2011L, 
2009L, 2010L, 2011L), Value = c(10, 16, 5, 6, 8, 10), Percent.Increase = c(NA, 
0.6, -0.6875, NA, 0.333333333, 0.25)), .Names = c("ID", "Date", 
"Year", "Value", "Percent.Increase"), class = "data.frame", row.names = c(NA, 
-6L))

After grouping by 'ID', 'Year', we slice the min "Value" rows within each group, then grouped by 'ID', we create the 'Percent.Increase' by subtracting the 'Value' from lag of 'Value' and dividing by the lag of 'Value'. 在按“ ID”,“年份”分组后,我们将每个组中的min “值”行slice ,然后按“ ID”分组,通过从“值”的lag中减去“值”来创建“ Percent.Increase”。然后除以“价值”的lag

res <-  df1 %>%
         group_by(ID, Year) %>%
         slice(which.min(Value)) %>% 
         group_by(ID) %>%
         mutate(Percent.Increase = (Value-lag(Value))/lag(Value))

Until HAVING clause is implemented in data.table, this seems to be pretty efficient way: 在data.table中实现HAVING子句之前,这似乎是一种非常有效的方法:

dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1
   ][, Percent_Increase := {
       tmp <- shift(Value)
       (Value-tmp)/tmp
   }, .(ID)]

Check timing on 5e7. 检查5e7上的计时。

library(dplyr)
library(data.table)
N = 5e7
set.seed(1)
df = data.frame(ID = sample(2L, N, TRUE), 
                Date = sample(16L, N, TRUE), 
                Year = sample(2009:2011, N, TRUE), 
                Value = sample(N/10, N, TRUE))
dt = as.data.table(df)
system.time(
    res <- df %>%
        group_by(ID, Year) %>%
        slice(which.min(Value)) %>% 
        group_by(ID) %>%
        mutate(Percent_Increase = (Value-lag(Value))/lag(Value))    
)
#   user  system elapsed 
#  1.676   2.176   3.847
system.time(
    r <- dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1,
            ][, Percent_Increase := {
                tmp <- shift(Value)
                (Value-tmp)/tmp
            }, .(ID)]
)
#   user  system elapsed 
#  0.940   0.460   1.334
all.equal(r, as.data.table(res), ignore.col.order = TRUE, check.attributes = FALSE, ignore.row.order = TRUE)
#[1] TRUE

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM