Combining column values based on preceding column conditions using a for loop in R

Question

I am working with a big dataset that contains multiple observations per gene, taken on varying dates and with varying expression levels. [image: Data used]

I would like to sum all the 'expression' column values if:

They belong to the same gene (column 'gene' i = column 'gene' i+1)

AND

They are measured on the same date (column 'date' i = column 'date' i+1)

The output should look something like this (each gene should have one observation per date, i.e. the sum of all the expression levels of that gene on that date): [image: The_desired_output]

I have tried writing a for loop, but I am relatively new to R and am having trouble building a data frame from the loop's results. An alternative solution might be better.

Thanks a lot!

Answer 1

How big is "big"? If you really have a large dataset, you are much better off with data.table.

Here is an example with 10 million rows.

#   made up example: YOU should provide this
#
set.seed(1)    # for reproducible example
df <- data.frame(gene=sample(1:1e6, 1e7, replace=TRUE), 
                 expression=rpois(1e7, 5), 
                 date=sample(43000:44000, 1e7, replace=TRUE))
##
#
library(tictoc)       # for timing functions
library(dplyr)
library(data.table)
##
#
tic()
result.1 <- df %>% group_by(gene, date) %>% summarise(expression = sum(expression))
toc()
## 40.83 sec elapsed
##
#
tic()
result.2 <- setDT(df)[, .(expression=sum(expression)), keyby=.(gene, date)]
toc()
## 3.03 sec elapsed

So data.table is about 13 times faster in this example (3.03 sec versus 40.83 sec).
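
To make the grouping itself concrete, here is a minimal sketch on a tiny made-up data frame. The name toy and all of its values are invented for illustration and are not taken from the question's data. It uses base R's aggregate() rather than dplyr or data.table, only so that it runs without installing anything; the operation is the same sum per gene/date combination, with no for loop needed.

```r
# Toy data, invented for illustration only (not the question's dataset)
toy <- data.frame(
  gene       = c("A", "A", "A", "B", "B"),
  date       = as.Date(c("2022-01-01", "2022-01-01", "2022-01-02",
                         "2022-01-01", "2022-01-01")),
  expression = c(1, 2, 5, 3, 4)
)

# Sum 'expression' within each gene/date combination
toy_sum <- aggregate(expression ~ gene + date, data = toy, FUN = sum)

# Sort by gene, then date, so each gene's rows sit together
toy_sum <- toy_sum[order(toy_sum$gene, toy_sum$date), ]
print(toy_sum)
```

Each gene ends up with one row per date. On real data of the size discussed above, the data.table form, setDT(toy)[, .(expression = sum(expression)), keyby = .(gene, date)], follows the same pattern as result.2 and should be the faster choice.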
