R: calculating a summary dataframe from a dataframe with multiple columns of information

Question

I have a dataframe with multiple columns of information, for example:

df <- data.frame(chr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), Gene=c("Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Sad", "Sad"), site = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123), value=c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13))

> df
    chr  Gene site value
1  chr1 Happy  100    20
2  chr1 Happy  120    25
3  chr1 Happy  130    21
4  chr1 Happy  300    30
5  chr1 Happy 2000   -80
6  chr1 Happy 2300    31
7  chr1 Happy 2342   -79
8  chr1 Happy 2451   -90
9  chr2   Sad  120    10
10 chr2   Sad  123    13

I would like to create a summary dataframe that calculates, for each Gene, how many clustered regions there are. I consider a cluster to be any run of rows where the difference in site number between successive rows is no greater than 1,000 (my data is sorted by chr and site). To start, I created a new column to calculate the distance between sites in successive rows using:

df$Distance <- c(1001, diff(df$site, lag=1, differences=1))

> df
    chr  Gene site value Distance
1  chr1 Happy  100    20     1001
2  chr1 Happy  120    25       20
3  chr1 Happy  130    21       10
4  chr1 Happy  300    30      170
5  chr1 Happy 2000   -80     1700
6  chr1 Happy 2300    31      300
7  chr1 Happy 2342   -79       42
8  chr1 Happy 2451   -90      109
9  chr2   Sad  120    10    -2331
10 chr2   Sad  123    13        3

I would like to create a summary table with one row per gene that counts how many clusters within that gene have a positive average value and how many have a negative average value. In the above example the table would look like:

   Gene PositiveClusters NegativeClusters
1 Happy                1                1
2   Sad                1                0

Answer 1

Here's a data.table solution, but I have a feeling there's a more efficient way...

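library(data.table)
# Convert df by reference and number the clusters within each gene:
# a new cluster starts wherever the gap to the previous site exceeds 1000
setDT(df)[, cluster := c(0, cumsum(diff(site) > 1000)), by = Gene]
# Mean of value for every (Gene, cluster) pair
df[, mean := mean(value), by = list(Gene, cluster)]
# Per gene, count the distinct clusters with non-negative and negative means
df[, list(pos = length(unique(cluster[mean >= 0])),
          neg = length(unique(cluster[mean < 0]))), by = Gene]
#     Gene pos neg
# 1: Happy   1   1
# 2:   Sad   1   0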

So this converts df to a data.table and adds a column cluster based on cumsum(diff(site)>1000), grouped by Gene. This is a very typical pattern for generating grouping variables.
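To make the grouping pattern concrete, here is a minimal base R sketch of it applied to the site values of the "Happy" rows from the question. The intermediate names gaps and new_cluster are only illustrative and do not appear in the answer's code.

```r
# Site positions of the "Happy" gene, copied from the question's example data
site <- c(100, 120, 130, 300, 2000, 2300, 2342, 2451)

# Gap between each site and the one before it (one element shorter than site)
gaps <- diff(site)

# TRUE wherever the gap exceeds 1000, i.e. where a new cluster begins
new_cluster <- gaps > 1000

# Running count of cluster starts; the leading 0 labels the first row
cluster <- c(0, cumsum(new_cluster))
cluster
# [1] 0 0 0 0 1 1 1 1
```

Each TRUE in new_cluster bumps the running total by one, so every row after a large gap gets a new cluster id.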

Then we add a column mean, which is mean(value) grouped by both Gene and cluster.

Then we create a new data.table that has the counts of each cluster type, with mean either positive (>= 0) or negative (< 0), grouped by Gene.
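For comparison, the same three steps can be sketched in base R without data.table. This is only one possible translation, not part of the original answer. The names cl and summary_df are made up for illustration, and it assumes, as the question states, that the rows are already sorted by gene and site.

```r
# Example data from the question
df <- data.frame(
  chr   = c(rep("chr1", 8), rep("chr2", 2)),
  Gene  = c(rep("Happy", 8), rep("Sad", 2)),
  site  = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123),
  value = c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13),
  stringsAsFactors = FALSE
)

# Step 1: cluster id within each gene (a new cluster starts after a gap > 1000)
df$cluster <- ave(df$site, df$Gene,
                  FUN = function(s) c(0, cumsum(diff(s) > 1000)))

# Step 2: one row per (Gene, cluster) holding the mean of value
cl <- aggregate(value ~ Gene + cluster, data = df, FUN = mean)

# Step 3: per gene, count clusters with non-negative and negative means
pos <- tapply(cl$value >= 0, cl$Gene, sum)
neg <- tapply(cl$value < 0, cl$Gene, sum)
summary_df <- data.frame(
  Gene             = names(pos),
  PositiveClusters = as.vector(pos),
  NegativeClusters = as.vector(neg)
)
summary_df
#    Gene PositiveClusters NegativeClusters
# 1 Happy                1                1
# 2   Sad                1                0
```

As in the answer, a cluster whose mean is exactly 0 is counted as positive here.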
