R calculate summary dataframe from dataframe with multiple columns of information
I have a dataframe with multiple columns of information, for example:
df <- data.frame(chr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), Gene=c("Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Sad", "Sad"), site = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123), value=c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13))
> df
chr Gene site value
1 chr1 Happy 100 20
2 chr1 Happy 120 25
3 chr1 Happy 130 21
4 chr1 Happy 300 30
5 chr1 Happy 2000 -80
6 chr1 Happy 2300 31
7 chr1 Happy 2342 -79
8 chr1 Happy 2451 -90
9 chr2 Sad 120 10
10 chr2 Sad 123 13
I would like to create a summary dataframe that calculates, for each Gene, how many clustered regions there are. I consider a cluster to be any run of rows in which the difference in site between successive rows is no greater than 1,000 (my data is sorted by chr and site). To start, I created a new column that holds the distance between sites in successive rows:
df$Distance <- c(1001, diff(df$site, lag=1, differences=1))
> df
chr Gene site value Distance
1 chr1 Happy 100 20 1001
2 chr1 Happy 120 25 20
3 chr1 Happy 130 21 10
4 chr1 Happy 300 30 170
5 chr1 Happy 2000 -80 1700
6 chr1 Happy 2300 31 300
7 chr1 Happy 2342 -79 42
8 chr1 Happy 2451 -90 109
9 chr2 Sad 120 10 -2331
10 chr2 Sad 123 13 3
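One thing to note about the Distance column above: diff() runs over the whole column, so the first Sad row (row 9) gets -2331, the distance back to the last Happy site. A minimal sketch of a per-gene version, assuming the distance should restart at each gene and that the first row of each gene should carry the same 1001 sentinel, using base R's ave():

```r
# Reduced copy of the example data (only the columns needed here)
df <- data.frame(
  Gene = c(rep("Happy", 8), rep("Sad", 2)),
  site = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123)
)

# Distance to the previous site, computed separately within each gene;
# the first row of every gene gets the sentinel value 1001
df$Distance <- ave(df$site, df$Gene, FUN = function(s) c(1001, diff(s)))
df$Distance
# [1] 1001   20   10  170 1700  300   42  109 1001    3
```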
I would like to create a summary table with one row per gene that counts how many clusters within that gene have a positive average value and how many have a negative one. For the example above, the table would look like:
Gene PositiveClusters NegativeClusters
1 Happy 1 1
2 Sad 1 0
Here's a data.table solution, though I have a feeling there's a more efficient way...
library(data.table)
setDT(df)[,cluster:=c(0,cumsum(diff(site)>1000)),by=Gene]
df[,mean:=mean(value),by=list(Gene,cluster)]
df[,list(pos=length(unique(cluster[mean>=0])),
neg=length(unique(cluster[mean<0]))),by=Gene]
# Gene pos neg
# 1: Happy 1 1
# 2: Sad 1 0
So this converts df to a data.table and adds a column cluster based on cumsum(diff(site)>1000), grouped by Gene. This is a very typical pattern for generating grouping variables.
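To see why cumsum(diff(...) > threshold) yields a grouping variable, here is a small sketch on a plain vector, using the site values of the Happy gene from the example (the variable name new_cluster is only illustrative):

```r
# Site positions for one gene, already sorted
site <- c(100, 120, 130, 300, 2000, 2300, 2342, 2451)

# TRUE wherever the gap to the previous site exceeds 1000
new_cluster <- diff(site) > 1000

# The running count of breaks is the cluster id; the leading 0 labels the first row
cluster <- c(0, cumsum(new_cluster))
cluster
# [1] 0 0 0 0 1 1 1 1
```

Only the jump from 300 to 2000 exceeds 1000, so the id increases exactly once.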
Then we add a column mean which is mean(value) grouped by both Gene and cluster.
Then we create a new data.table that counts, for each Gene, the clusters whose mean is positive (>= 0) and the clusters whose mean is negative (< 0).
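For comparison, the same three steps can be sketched in base R without data.table. This is only an illustration under the same assumptions as above (rows sorted by site within each gene, a mean of exactly 0 counted as positive); the names cl and summary_df are made up for the example:

```r
df <- data.frame(
  Gene  = c(rep("Happy", 8), rep("Sad", 2)),
  site  = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123),
  value = c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13)
)

# Step 1: cluster id within each gene (a gap above 1000 starts a new cluster)
df$cluster <- ave(df$site, df$Gene,
                  FUN = function(s) c(0, cumsum(diff(s) > 1000)))

# Step 2: one row per (Gene, cluster) holding the mean value of that cluster
cl <- aggregate(value ~ Gene + cluster, data = df, FUN = mean)

# Step 3: per gene, count clusters with non-negative and with negative means
pos <- tapply(cl$value >= 0, cl$Gene, sum)
neg <- tapply(cl$value < 0, cl$Gene, sum)
summary_df <- data.frame(Gene = names(pos),
                         PositiveClusters = as.vector(pos),
                         NegativeClusters = as.vector(neg))
summary_df
#    Gene PositiveClusters NegativeClusters
# 1 Happy                1                1
# 2   Sad                1                0
```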