How do I aggregate a dataframe and sum the values of a column by repeated rows in R?

I'm attempting to aggregate a dataframe to remove repeated rows, but I need to sum the values of the count column across each set of repeated rows and use that sum as the new count for the row. I have the following dataframe:

  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

I need to get to this:

   count    freq    cdr3nt    cdr3aa   v       d     j     VEnd  DStart   DEnd   JStart
   8562    0.048822 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    778    0.003332 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

Instead, I'm getting this:

      count    freq    cdr3nt    cdr3aa   v       d     j     VEnd  DStart   DEnd   JStart
        3    0.601110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
        2    0.506717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
        1    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

Here's the piece of code that's not working right:

  agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart +
                               DEnd + JStart, data = final_df,
                             FUN = length), freq <- count / sum(count))

  agg_df1 <- select(agg_df, count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart)

Instead of adding up the "count" values of the corresponding repeated rows so that I can recalculate the frequency, it's basically counting the number of times each row is repeated. Any thoughts are greatly appreciated. Thanks.

Using FUN = length makes the output count column hold the number of rows in each by group, not the total of their counts. Instead, use FUN = sum to add up the input count column within each group.
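To see the difference in isolation, here is a minimal sketch on a toy data frame. The names toy and key and the values are hypothetical and are not taken from the question's data.

```r
# Toy data (hypothetical): two rows share the key "A", one row has key "B".
toy <- data.frame(key = c("A", "A", "B"), count = c(10, 5, 7))

# FUN = length counts how many rows fall into each group.
by_length <- aggregate(count ~ key, data = toy, FUN = length)

# FUN = sum adds up the count values within each group.
by_sum <- aggregate(count ~ key, data = toy, FUN = sum)

print(by_length)  # count column holds row counts per key
print(by_sum)     # count column holds summed counts per key
```

Applied to the data from the question, the same change looks like this: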

textFile <- "  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31"

final_df <- read.table(text = textFile,
                   header = TRUE)

# original code had FUN = length, which returned the number of rows per
# combination of by groups 
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + 
          DEnd +   JStart, data = final_df, FUN = sum), freq <- count/sum(count))
agg_df

...and the output:

> agg_df
   cdr3nt cdr3aa        v     d       j VEnd DStart DEnd JStart count       freq
1 TGCGCCA  CASMG TRBV10-2 TRBD1 TRBJ1-1    9     15   19     20   778 0.08062176
2 TGTGCCA  CASSE  TRBV6-1 TRBD1 TRBJ2-6   13     18   22     24  8562 0.88725389
3 TGCAGTG  CSARD TRBV20-1 TRBD1 TRBJ1-5   15     17   23     31   310 0.03212435
> 

We can confirm the accuracy of the freq column as follows:

> # confirm accuracy 
> agg_df$count / sum(agg_df$count)
[1] 0.08062176 0.88725389 0.03212435
> 
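The aggregated result above lists the grouping columns first and is not sorted by count, whereas the layout requested in the question starts with count and freq. The following is one possible base R sketch for getting that layout, not part of the original answer. It assumes every column other than count and freq should act as a grouping key, so it uses the formula count ~ . after dropping the old freq column. The names keys, col_order and result are introduced here for illustration only.

```r
# Rebuild the question's data so the example is self-contained.
textFile <- "  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31"
final_df <- read.table(text = textFile, header = TRUE)

# Drop the old freq column: it differs between otherwise identical rows,
# so it must not take part in the grouping.
keys <- final_df[, names(final_df) != "freq"]

# count ~ . groups by every remaining column and sums count per group.
agg_df <- aggregate(count ~ ., data = keys, FUN = sum)

# Recompute the frequency from the summed counts.
agg_df$freq <- agg_df$count / sum(agg_df$count)

# Put count and freq first, then sort by descending count.
col_order <- c("count", "freq", setdiff(names(agg_df), c("count", "freq")))
result <- agg_df[order(-agg_df$count), col_order]
rownames(result) <- NULL
print(result)
```

Note that the recomputed freq values sum to 1 over the aggregated rows, so they will not match the freq figures shown in the question's desired output, which appear to come from a larger dataset.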
