Cumulative sum and data organisation
I have about 40000 values of rainfall data from different samples, and the data will constantly be updated. The csv file is organised like this:
NAME; YEAR; ID; VALUE
Sample1; 1998; 354; 45
Sample1; 1999; 354; 23
Sample1; 2000; 354; 66
Sample1; 2001; 354; 98
Sample1; 2002; 354; 36
Sample1; 2003; 354; 59
Sample1; 2004; 354; 64
Sample1; 2005; 354; 23
Sample1; 2006; 354; 69
Sample1; 2007; 354; 94
Sample1; 2008; 354; 24
Sample2; 1964; 1342; 7
Sample2; 1965; 1342; 24
Sample3; 2002; 859; 90
Sample3; 2003; 859; 93
Sample3; 2004; 859; 53
Sample3; 2005; 859; 98
What I'd like to do with an R script is the following: create a new column, CUM_RAINFALL, holding the cumulative sum of VALUE within each sample. For Sample1 that gives 45, then 45 + 23 = 68, then 68 + 66 = 134, then 134 + 98 = 232, and so on until the end of Sample1. At the first row of Sample2 the sum should restart from that row's value, and likewise for Sample3 and every following sample:
NAME; YEAR; ID; VALUE; CUM_RAINFALL
Sample1; 1998; 354; 45; 45
Sample1; 1999; 354; 23; 68
Sample1; 2000; 354; 66; 134
Sample1; 2001; 354; 98; 232
Sample1; 2002; 354; 36; 268
Sample1; 2003; 354; 59; 327
Sample1; 2004; 354; 64; 391
Sample1; 2005; 354; 23; 414
Sample1; 2006; 354; 69; 483
Sample1; 2007; 354; 94; 577
Sample1; 2008; 354; 24; 601
Sample2; 1964; 1342; 7; 7
Sample2; 1965; 1342; 24; 31
Sample3; 2002; 859; 90; 90
Sample3; 2003; 859; 93; 183
Sample3; 2004; 859; 53; 236
Sample3; 2005; 859; 98; 334
From this I would like to write a new file containing only the samples that have more than 3 values (in the given example Sample2 wouldn't be written to the file, because it has only 2 values).
Is there an easy way to do this in R? Any help is appreciated! Under the following link you'll find a csv with the data: https://dl.dropboxusercontent.com/u/16277659/sample.cs
Here's a solution using the data.table package, assuming your data is stored in dat:
require(data.table)
ans = setDT(dat)[, crain := cumsum(VALUE[.N > 3L]), by=NAME][!is.na(crain)]
setDT converts the data.frame to a data.table by reference. We then group by NAME and, for each group, calculate the cumulative sum of VALUE only if the number of observations in that group (.N, a built-in special variable) is greater than 3. The values are assigned to the new column crain by reference. Since cumsum is not computed for groups with 3 or fewer observations, those groups get NA in crain. We exploit that to subset the desired result. Now you can use write.table(.) on ans, as shown in the other answers.
Note: This answer assumes, of course, that the VALUE column of your data set does not contain NA values.
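To see why the note matters: cumsum propagates NA, so a missing VALUE would turn the rest of that group's running total into NA, and the !is.na(crain) filter would then drop rows from a group that should have been kept. The following is a minimal base R sketch on a made-up vector (not the question's data). Treating a missing measurement as 0 is only one possible choice and is an assumption here, not something the answer prescribes.

```r
# Toy vector with a missing value (hypothetical, not from the question)
x <- c(45, NA, 66)

# NA propagates: every element from the first NA onward becomes NA
cum_raw <- cumsum(x)

# One possible workaround (an assumption): count missing rainfall as 0
x_filled <- ifelse(is.na(x), 0, x)
cum_filled <- cumsum(x_filled)
```

Whether zero-filling is appropriate depends on what a missing value means in your data; dropping those rows before summing is another option.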
40k observations should do fine in base R.
d$CUMRAIN <- unlist(by(d$VALUE, d$NAME, cumsum), use.names = FALSE)
d
# NAME YEAR ID VALUE CUMRAIN
# 1 Sample1 1998 354 45 45
# 2 Sample1 1999 354 23 68
# 3 Sample1 2000 354 66 134
# 4 Sample1 2001 354 98 232
# 5 Sample1 2002 354 36 268
# 6 Sample1 2003 354 59 327
# 7 Sample1 2004 354 64 391
# 8 Sample1 2005 354 23 414
# 9 Sample1 2006 354 69 483
# 10 Sample1 2007 354 94 577
# 11 Sample1 2008 354 24 601
# 12 Sample2 1964 1342 7 7
# 13 Sample2 1965 1342 24 31
# 14 Sample3 2002 859 90 90
# 15 Sample3 2003 859 93 183
# 16 Sample3 2004 859 53 236
# 17 Sample3 2005 859 98 334
I use by here, but here are some other ways to calculate the cumsum by factor level:
mapply(cumsum, with(d, split(VALUE, NAME)))
sapply(unname(split(d$VALUE, d$NAME)), cumsum)
unsplit(lapply(split(d$VALUE, d$NAME), cumsum), d$NAME)  # lapply always returns a list, even if all groups have the same length
The last one is probably the most convenient, since it drops the factor names.
There's also
library(plyr)
ddply(d, .(NAME), mutate, CUMSUM = cumsum(VALUE))
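One caveat about the unlist(by(...)) and sapply(split(...)) variants: they return the sums grouped by factor level, so assigning the result back as a column only lines up when the rows are already sorted by NAME, as they are in the sample. If that cannot be guaranteed, base R's ave() is an alternative that keeps the original row order. A small sketch on made-up, deliberately unsorted data (d2 and its values are hypothetical):

```r
# Toy data with rows deliberately out of group order (hypothetical)
d2 <- data.frame(
  NAME  = c("B", "A", "B", "A"),
  VALUE = c(1, 10, 2, 20)
)

# ave() applies cumsum within each NAME and returns the results
# in the original row order, so the column stays aligned
d2$CUMRAIN <- ave(d2$VALUE, d2$NAME, FUN = cumsum)
```

Here the two "B" rows accumulate to 1 and 3 and the two "A" rows to 10 and 30, each staying on its own row.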
To subset for more than three observations, you can use a simple table:
t <- table(d$NAME)
ss <- d[d$NAME %in% names(t)[t > 3], ]
Then write it to a file with
write.table(ss, "filename", sep = ";")
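Putting the base R steps together, the whole task might look roughly like the sketch below. It reads a shortened copy of the question's data from an inline string, so that it runs on its own. The temporary output path is a placeholder, and row.names = FALSE and quote = FALSE are assumptions about the desired file layout rather than part of the answer above.

```r
# A few rows in the question's layout (shortened; "; " separates fields)
txt <- "NAME; YEAR; ID; VALUE
Sample2; 1964; 1342; 7
Sample2; 1965; 1342; 24
Sample3; 2002; 859; 90
Sample3; 2003; 859; 93
Sample3; 2004; 859; 53
Sample3; 2005; 859; 98"

# strip.white = TRUE removes the blank that follows each semicolon
d <- read.table(text = txt, sep = ";", header = TRUE,
                strip.white = TRUE, stringsAsFactors = FALSE)

# Keep samples with more than 3 rows, then add the running total
n <- table(d$NAME)
ss <- d[d$NAME %in% names(n)[n > 3], ]
ss$CUM_RAINFALL <- ave(ss$VALUE, ss$NAME, FUN = cumsum)

# Write to a temporary file (placeholder path) using ";" as separator
out <- tempfile(fileext = ".csv")
write.table(ss, out, sep = ";", row.names = FALSE, quote = FALSE)
```

For the real data you would pass the csv file name to read.table instead of text = txt. Filtering before summing avoids computing totals for samples that are discarded anyway.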
Here's another approach using dplyr:
library(dplyr)
data %>%                                   # your data frame
  group_by(NAME) %>%                       # the grouping variable; more variables can be added if necessary
  filter(n() > 3) %>%                      # n() is the number of rows per group; keep only groups with more than 3
  mutate(CUMRAIN = cumsum(VALUE)) %>%      # add a new column "CUMRAIN"
  write.table(., "test.csv", sep = ";")    # write the subset to a file; "." is the output piped in by %>%
The operations are "piped" together using the %>% operator.
Update: as noted in @Arun's answer, it's not necessary to calculate the cumulative rain for samples with 3 or fewer observations, so we can apply the filter first (before mutate) to keep only samples containing more than 3 observations, and then compute the cumulative rain.