Using a for loop in R to combine column values based on preceding column conditions
I am working with a big dataset that contains multiple observations for each gene, on varying dates and with varying expression levels. Data used
I would like to sum all the 'expression' column values if the gene is the same AND the date is the same.
The output should be something like this (each gene should have one observation per date, i.e. the sum of all the expression levels of that gene on that date): The_desired_output
I have tried writing a for loop, but I am relatively new to R and am having trouble building a data frame from the loop. An alternative solution might be better.
Thanks a lot!
How big is "big"? If you really have a large dataset, you are much better off with data.table.
Here is an example with 10 million rows.
# made up example: YOU should provide this
#
set.seed(1) # for reproducible example
df <- data.frame(gene=sample(1:1e6, 1e7, replace=TRUE),
                 expression=rpois(1e7, 5),
                 date=sample(43000:44000, 1e7, replace=TRUE))
##
#
library(tictoc) # for timing functions
library(dplyr)
library(data.table)
##
#
tic()
result.1 <- df %>% group_by(gene, date) %>% summarise(expression = sum(expression))
toc()
## 40.83 sec elapsed
##
#
tic()
result.2 <- setDT(df)[, .(expression=sum(expression)), keyby=.(gene, date)]
toc()
## 3.03 sec elapsed
So data.table is about 13 times faster in this example.
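If the dataset is small, or you just want to check the logic on a few rows first, base R can do the same grouped sum without a loop or extra packages. The following is only a minimal sketch: the `toy` data frame and its values are made up for illustration, and the column names `gene`, `date` and `expression` are assumed to match yours.

```r
# Toy data (hypothetical values, not taken from the question)
toy <- data.frame(
  gene       = c("A", "A", "B", "B", "B"),
  date       = c(43000, 43000, 43000, 43001, 43001),
  expression = c(2, 3, 1, 4, 6)
)

# Sum expression within each gene/date combination:
# one output row per distinct (gene, date) pair
result <- aggregate(expression ~ gene + date, data = toy, FUN = sum)

# Sort by gene, then date, for readability
result <- result[order(result$gene, result$date), ]
print(result)
```

Here gene A on date 43000 collapses to a single row with expression 5, and gene B on date 43001 to a single row with expression 10. On millions of rows this is likely to be much slower than data.table, so treat it as a way to verify the result rather than as the production approach.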