Compute data on one column based on aggregate results from another column
I would like to use data.table to calculate a summary statistic, and then based on that result, calculate a statistic on a second column.
Here is an example using the airquality data.
(pretend it came this way)
library(data.table)
dt = as.data.table(airquality)
dt[ , Season:=ifelse(Month>7, 'Fall', 'Summer')]
Some months have high wind
## The range of monthly Wind values
dt[ , list(MinWind=min(Wind), MaxWind=max(Wind)),
by=c('Season', 'Month')]
---- R OUTPUT:
Season Month MinWind MaxWind
1: Summer 5 5.7 20.1
2: Summer 6 1.7 20.7
3: Summer 7 4.1 14.9
4: Fall 8 2.3 15.5
5: Fall 9 2.8 16.6
Can I do this in one step?
## Add a column to indicate if it was a high wind month
dt[, HighWind:=any(Wind>20), by=Month]
## Aggregate based on both HighWind and Season
dt[, list(AveSolarR=mean(Solar.R, na.rm=TRUE)), by=c("HighWind","Season")]
---- R OUTPUT:
HighWind Season AveSolarR
1: TRUE Summer 185.9649
2: FALSE Summer 216.4839
3: FALSE Fall 169.5690
Why not combine both into one list?
dt[, list(HighWind=any(Wind>20), AveSolarR=mean(Solar.R, na.rm=TRUE)), by=Month]
Month HighWind AveSolarR
1: 5 TRUE 181.2963
2: 6 TRUE 190.1667
3: 7 FALSE 216.4839
4: 8 FALSE 171.8571
5: 9 FALSE 167.4333
For the modified problem, you need to do the HighWind calculation in the by argument, but I think that makes it more convoluted.
dt[, list(AveSolarR=mean(Solar.R, na.rm=TRUE)),
by=list(HighWind=Month %in% Month[Wind>20], Season)]
HighWind Season AveSolarR
1: TRUE Summer 185.9649
2: FALSE Summer 216.4839
3: FALSE Fall 169.5690
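If the by expression feels too dense, one way to read it is to pull the "which months were windy" step out into its own variable and then compare the one-step result against the two-step version from the question. This is only a sketch: the names windy_months, one_step, two_step and tmp are made up for illustration and do not appear in the original posts.

```r
library(data.table)

dt <- as.data.table(airquality)
dt[, Season := ifelse(Month > 7, "Fall", "Summer")]

# Months that contain at least one Wind reading above 20.
# (windy_months is an illustrative name, not from the original post.)
windy_months <- dt[Wind > 20, unique(Month)]

# One-step version: the HighWind flag is computed inside `by`,
# so no extra column is added to dt.
one_step <- dt[, list(AveSolarR = mean(Solar.R, na.rm = TRUE)),
               by = list(HighWind = Month %in% windy_months, Season)]

# Two-step version from the question, run on a copy so dt stays unchanged.
tmp <- copy(dt)
tmp[, HighWind := any(Wind > 20), by = Month]
two_step <- tmp[, list(AveSolarR = mean(Solar.R, na.rm = TRUE)),
                by = c("HighWind", "Season")]

# The two results are expected to contain the same groups and means.
print(all.equal(one_step, two_step))
```

The trade-off is the usual one: the two-step form leaves a reusable HighWind column on the table, while the one-step form keeps dt untouched at the cost of a busier by argument.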