Cumulative sum and data organisation

I have about 40000 values for rainfall data from different samples, which will constantly be updated. The csv file is organised like this:

NAME;       YEAR;   ID;     VALUE
Sample1;    1998;   354;    45
Sample1;    1999;   354;    23
Sample1;    2000;   354;    66
Sample1;    2001;   354;    98
Sample1;    2002;   354;    36
Sample1;    2003;   354;    59
Sample1;    2004;   354;    64
Sample1;    2005;   354;    23
Sample1;    2006;   354;    69
Sample1;    2007;   354;    94
Sample1;    2008;   354;    24
Sample2;    1964;   1342;    7
Sample2;    1965;   1342;   24
Sample3;    2002;   859;    90
Sample3;    2003;   859;    93
Sample3;    2004;   859;    53
Sample3;    2005;   859;    98 

What I'd like to do with an R script is the following: create a new column in which VALUE is summed cumulatively within each group of samples (the cumulative sum of the rainfall data), restarting at the first value of each new sample. For Sample1 this gives a column like CUM_RAINFALL in the example below: 45 for the first row, then 45+23 = 68, then 68+66 = 134, then 134+98 = 232, and so on until the end of Sample1. At Sample2 the sum should start over from Sample2's first value, and likewise for Sample3 and so on.

NAME;       YEAR;   ID;     VALUE;   CUM_RAINFALL
Sample1;    1998;   354;    45;       45
Sample1;    1999;   354;    23;       68
Sample1;    2000;   354;    66;      134
Sample1;    2001;   354;    98;      232
Sample1;    2002;   354;    36;      268
Sample1;    2003;   354;    59;      327
Sample1;    2004;   354;    64;      391
Sample1;    2005;   354;    23;      414
Sample1;    2006;   354;    69;      483
Sample1;    2007;   354;    94;      577
Sample1;    2008;   354;    24;      601
Sample2;    1964;   1342;    7;      7
Sample2;    1965;   1342;   24;      31
Sample3;    2002;   859;    90;      90
Sample3;    2003;   859;    93;      183
Sample3;    2004;   859;    53;      236
Sample3;    2005;   859;    98;      334

From this I would like to write a new file containing all rows of samples that have more than 3 values (in the given example Sample2 wouldn't be written to the file, because it contains only 2 values).

Is there an easy way to do this in R? Any help is appreciated! Under the following link you'll find a csv with the data: https://dl.dropboxusercontent.com/u/16277659/sample.cs

Here's a solution using the data.table package, assuming your data is stored in dat:

require(data.table)
ans = setDT(dat)[, crain := cumsum(VALUE[.N > 3L]), by=NAME][!is.na(crain)]
  • setDT converts the data.frame to a data.table.
  • Then we group by NAME and, for each group, calculate the cumulative sum of VALUE only if the number of observations in that group (.N, a built-in special variable) is greater than 3. The result is assigned to the new column crain by reference.
  • Since cumsum is not computed for groups with 3 or fewer observations, those groups have NA in crain. We exploit that to subset the desired result.

Now you can use write.table(.) on ans, as shown in the other answers.

Note: This answer assumes that the VALUE column of your data set contains no NA values. cumsum propagates NA, so the final !is.na(crain) step would otherwise also drop rows from groups you want to keep.
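To make the note above concrete, here is a minimal base R sketch of how cumsum behaves when a value is missing. The vector is made up for illustration, and replacing NA with 0 is only one possible choice; whether a missing rainfall reading should count as 0 is an assumption you would have to check against your data.

```r
# Hypothetical rainfall values with one missing observation.
value <- c(45, NA, 66, 98)

# cumsum() propagates NA: every element from the first NA onward is NA.
cumsum(value)

# One possible workaround (an assumption about what is wanted):
# treat missing rainfall as 0 so the running total carries on.
filled  <- ifelse(is.na(value), 0, value)
running <- cumsum(filled)
running
```

If missing values should stay visible instead, it may be safer to filter groups by size first and compute the cumulative sum afterwards, rather than relying on NA as a marker.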

40k observations should do fine in base R.

d$CUMRAIN <- unlist(by(d$VALUE, d$NAME, cumsum), use.names = FALSE)
d
#       NAME YEAR   ID VALUE CUMRAIN
# 1  Sample1 1998  354    45      45
# 2  Sample1 1999  354    23      68
# 3  Sample1 2000  354    66     134
# 4  Sample1 2001  354    98     232
# 5  Sample1 2002  354    36     268
# 6  Sample1 2003  354    59     327
# 7  Sample1 2004  354    64     391
# 8  Sample1 2005  354    23     414
# 9  Sample1 2006  354    69     483
# 10 Sample1 2007  354    94     577
# 11 Sample1 2008  354    24     601
# 12 Sample2 1964 1342     7       7
# 13 Sample2 1965 1342    24      31
# 14 Sample3 2002  859    90      90
# 15 Sample3 2003  859    93     183
# 16 Sample3 2004  859    53     236
# 17 Sample3 2005  859    98     334

I use by here, but here are some other ways to calculate the cumsum by factor level:

mapply(cumsum, with(d, split(VALUE, NAME)))
sapply(unname(split(d$VALUE, d$NAME)), cumsum)
unsplit(lapply(split(d$VALUE, d$NAME), cumsum), d$NAME)  # lapply always returns a list; sapply would return a matrix if all groups had the same size

The last one is probably the most favorable, since it drops the factor names and returns the values in the original row order.
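One caveat worth illustrating: the unlist(by(...)) version relies on the rows already being grouped by NAME in factor-level order, as they are in the sample file. As a sketch of an alternative that was not part of the original answer, base R's ave() applies a function within each group and returns the results in the original row positions. The toy data below is hypothetical and deliberately unsorted.

```r
# Toy data (hypothetical), deliberately NOT sorted by NAME.
d <- data.frame(
  NAME  = c("Sample3", "Sample1", "Sample3", "Sample1"),
  VALUE = c(90, 45, 93, 23)
)

# ave() applies FUN within each level of NAME and puts the results
# back in the original row positions.
d$CUMRAIN <- ave(d$VALUE, d$NAME, FUN = cumsum)
d
```

With this input, unlist(by(d$VALUE, d$NAME, cumsum)) would return the Sample1 totals first and so would not line up with the rows, which is why sorting by NAME (and YEAR) beforehand matters for that approach.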

There's also plyr:

library(plyr)
ddply(d, .(NAME), mutate, CUMSUM = cumsum(VALUE))     

To subset for samples with more than three observations, you can use a simple table:

t <- table(d$NAME)
ss <- d[d$NAME %in% names(t)[t > 3], ]

Then write it to a file with

write.table(ss, "filename", sep = ";")
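Putting the base R steps together, the following is a rough end-to-end sketch under a few assumptions: the input text is a shortened, made-up copy of the question's layout, the output path is a temporary placeholder, and ave() is used for the cumulative sum instead of the by() call above. By default write.table also writes row names and quotes strings, so row.names = FALSE and quote = FALSE are added here to keep the output close to the input format.

```r
# Hypothetical input in the same layout as the question (padded, ';'-separated).
txt <- "NAME;       YEAR;   ID;     VALUE
Sample1;    1998;   354;    45
Sample1;    1999;   354;    23
Sample1;    2000;   354;    66
Sample1;    2001;   354;    98
Sample2;    1964;   1342;    7
Sample2;    1965;   1342;   24"

# strip.white = TRUE removes the padding around each field.
d <- read.table(text = txt, sep = ";", header = TRUE,
                strip.white = TRUE, stringsAsFactors = FALSE)

# Running total per sample, kept in the original row order.
d$CUMRAIN <- ave(d$VALUE, d$NAME, FUN = cumsum)

# Keep only samples with more than 3 observations.
counts <- table(d$NAME)
ss <- d[d$NAME %in% names(counts)[counts > 3], ]

# row.names = FALSE and quote = FALSE keep the output close to the input layout.
out <- tempfile(fileext = ".csv")  # placeholder path
write.table(ss, out, sep = ";", row.names = FALSE, quote = FALSE)
readLines(out)
```

For the real file you would replace text = txt with the path to your csv and out with the file name you want to write.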

Here's another approach using dplyr:

library(dplyr)

data %>%                                   # your data frame
  group_by(NAME) %>%                       # the grouping variable. could add more variables if necessary
  filter(n() > 3) %>%                      # n()  calculates the number of rows per group and then only those with more than 3 are filtered (selected)
  mutate(CUMRAIN = cumsum(VALUE)) %>%      # add a new column "CUMRAIN"
  write.table(., "test.csv", sep = ";")    # write the subset to a file. The "." indicates that it uses the output of the previous operations piped by %>%   

The operations are "piped" together using the %>% operator.

Update: as noted in @Arun's answer, it's not necessary to calculate the cumulative rain for samples with 3 or fewer observations, so we can apply the filter operation first (before mutate) to keep only the samples with more than 3 observations, and compute the cumulative rain afterwards.
