R: calculate a summary dataframe from a dataframe with multiple columns of information

I have a dataframe with multiple columns of information, for example:

df <- data.frame(chr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2"), Gene=c("Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Happy", "Sad", "Sad"), site = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123), value=c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13))

> df
    chr  Gene site value
1  chr1 Happy  100    20
2  chr1 Happy  120    25
3  chr1 Happy  130    21
4  chr1 Happy  300    30
5  chr1 Happy 2000   -80
6  chr1 Happy 2300    31
7  chr1 Happy 2342   -79
8  chr1 Happy 2451   -90
9  chr2   Sad  120    10
10 chr2   Sad  123    13

I would like to create a summary dataframe that counts, for each Gene, how many clustered regions there are. I consider a cluster to be any run of rows in which the difference between successive site values is no greater than 1,000 (my data is sorted by chr and site). To start, I created a new column holding the distance between sites in successive rows:

df$Distance <- c(1001, diff(df$site, lag=1, differences=1))

> df
    chr  Gene site value Distance
1  chr1 Happy  100    20     1001
2  chr1 Happy  120    25       20
3  chr1 Happy  130    21       10
4  chr1 Happy  300    30      170
5  chr1 Happy 2000   -80     1700
6  chr1 Happy 2300    31      300
7  chr1 Happy 2342   -79       42
8  chr1 Happy 2451   -90      109
9  chr2   Sad  120    10    -2331
10 chr2   Sad  123    13        3
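
Note that the Distance column above is computed over the whole dataframe, so row 9 (the first Sad row) is measured against the last Happy row and comes out as -2331. A minimal sketch of a per-gene variant, assuming base R's ave() is acceptable here and that NA is a reasonable marker for the first row of each gene (both are assumptions, not part of the original question):

```r
# Same example data as above
df <- data.frame(
  chr   = c(rep("chr1", 8), rep("chr2", 2)),
  Gene  = c(rep("Happy", 8), rep("Sad", 2)),
  site  = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123),
  value = c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13)
)

# Distance to the previous site, computed separately within each Gene.
# The first row of each gene has no predecessor, so it gets NA (an assumed convention).
df$Distance <- ave(df$site, df$Gene, FUN = function(s) c(NA, diff(s)))

df$Distance
```

With this variant, distances are never measured across two different genes.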

I would like to create a summary table with one row per gene that counts how many clusters within that gene have a positive average value and how many have a negative average value. For the example above, the table would look like:

   Gene PositiveClusters NegativeClusters
1 Happy                1                1
2   Sad                1                0

Here's a data.table solution, though I have a feeling there's a more efficient way...

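library(data.table)
# Cluster id per Gene: increment whenever the gap to the previous site exceeds 1000
setDT(df)[,cluster:=c(0,cumsum(diff(site)>1000)),by=Gene]
# Mean value of each (Gene, cluster) pair, repeated on every row of that cluster
df[,mean:=mean(value),by=list(Gene,cluster)]
# Per Gene, count the distinct clusters with non-negative and with negative mean
df[,list(pos=length(unique(cluster[mean>=0])),
         neg=length(unique(cluster[mean<0]))),by=Gene]
#     Gene pos neg
# 1: Happy   1   1
# 2:   Sad   1   0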

So this converts df to a data.table and adds a column cluster based on cumsum(diff(site)>1000), grouped by Gene. This is a very typical pattern for generating grouping variables.
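
To see what that pattern does, here is a small sketch on the site values of the Happy gene alone (the variable names gaps and cluster are just for illustration):

```r
# Site positions for one gene, already sorted
site <- c(100, 120, 130, 300, 2000, 2300, 2342, 2451)

# TRUE wherever the jump from the previous site exceeds 1000
gaps <- diff(site) > 1000

# Running count of such jumps; the leading 0 gives the first row its cluster id
cluster <- c(0, cumsum(gaps))

cluster
# 0 0 0 0 1 1 1 1
```

Each TRUE in gaps starts a new cluster, so the id increases by one at every gap larger than 1000.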

Then we add a column mean, which is mean(value) grouped by both Gene and cluster.

Then we create a new data.table that, for each Gene, counts the clusters whose mean is positive (>= 0) and those whose mean is negative (< 0).
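
For comparison, the same three steps can be sketched in base R without data.table. This is only one possible equivalent, not the approach used above; the names means and summary_df are illustrative, and the column names follow the table requested in the question:

```r
df <- data.frame(
  chr   = c(rep("chr1", 8), rep("chr2", 2)),
  Gene  = c(rep("Happy", 8), rep("Sad", 2)),
  site  = c(100, 120, 130, 300, 2000, 2300, 2342, 2451, 120, 123),
  value = c(20, 25, 21, 30, -80, 31, -79, -90, 10, 13)
)

# Step 1: cluster id within each Gene (new cluster when the gap exceeds 1000)
df$cluster <- ave(df$site, df$Gene,
                  FUN = function(s) c(0, cumsum(diff(s) > 1000)))

# Step 2: one row per (Gene, cluster) with the mean value
means <- aggregate(value ~ Gene + cluster, data = df, FUN = mean)

# Step 3: per Gene, count clusters with non-negative and with negative mean
pos <- tapply(means$value >= 0, means$Gene, sum)
neg <- tapply(means$value < 0,  means$Gene, sum)

summary_df <- data.frame(
  Gene             = names(pos),
  PositiveClusters = as.vector(pos),
  NegativeClusters = as.vector(neg)
)

summary_df
#    Gene PositiveClusters NegativeClusters
# 1 Happy                1                1
# 2   Sad                1                0
```

As in the data.table version, a cluster whose mean is exactly 0 is counted as positive.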
