How to aggregate a dataframe and sum a column's values across repeated rows in R

Question

I'm trying to aggregate a dataframe to remove repeated rows, but I need to sum the values of the count column and use the total as the new count for each distinct row. I have the following dataframe:

  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

I need to get to this:

   count    freq    cdr3nt    cdr3aa   v       d     j     VEnd  DStart   DEnd   JStart
   8562    0.048822 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    778    0.003332 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

Instead, I'm getting this:

      count    freq    cdr3nt    cdr3aa   v       d     j     VEnd  DStart   DEnd   JStart
        3    0.601110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
        2    0.506717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
        1    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31

Here's the piece of code that isn't working right:

  agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + 
             DEnd +   JStart, data = final_df, 
                FUN = length), freq <- count/sum(count))


  agg_df1 <-select(agg_df, count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart)

Instead of adding up the "count" values of the corresponding repeated rows so that I can recalculate the frequency, it's counting the number of times each row is repeated. Any thoughts are greatly appreciated. Thanks.

Answer 1

Using FUN = length makes the output value of count the number of rows in each of the by groups. Instead, use FUN = sum to calculate the sum of the input count column.

textFile <- "  count        freq  cdr3nt cdr3aa         v      d       j  VEnd  DStart   DEnd   JStart
   3154    0.036110 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2800    0.038394 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
   2608    0.033014 TGTGCCA  CASSE   TRBV6-1  TRBD1 TRBJ2-6  13    18       22     24
    412    0.004717 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    366    0.005015 TGCGCCA  CASMG   TRBV10-2 TRBD1 TRBJ1-1   9    15       19     20
    310    0.004250 TGCAGTG  CSARD   TRBV20-1 TRBD1 TRBJ1-5  15    17       23     31"

final_df <- read.table(text = textFile,
                   header = TRUE)

# original code had FUN = length, which returned the number of rows per
# combination of by groups 
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart + 
          DEnd +   JStart, data = final_df, FUN = sum), freq <- count/sum(count))
agg_df

...and the output:

> agg_df
   cdr3nt cdr3aa        v     d       j VEnd DStart DEnd JStart count       freq
1 TGCGCCA  CASMG TRBV10-2 TRBD1 TRBJ1-1    9     15   19     20   778 0.08062176
2 TGTGCCA  CASSE  TRBV6-1 TRBD1 TRBJ2-6   13     18   22     24  8562 0.88725389
3 TGCAGTG  CSARD TRBV20-1 TRBD1 TRBJ1-5   15     17   23     31   310 0.03212435
>

We can confirm the accuracy of the freq column as follows:

> # confirm accuracy 
> agg_df$count / sum(agg_df$count)
[1] 0.08062176 0.88725389 0.03212435
>
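
As a side-by-side illustration of the point above, here is a minimal base R sketch that runs both FUN = length and FUN = sum on the question's data, then moves count and freq to the front and sorts by descending count to match the layout the question asked for. The names key_cols and n_rows are introduced here for illustration only and are not part of the original answer; the count ~ . formula is just a shorthand for listing every grouping column.

```r
# Rebuild the example data from the question so the sketch is self-contained.
final_df <- read.table(header = TRUE, text = "count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3154 0.036110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2800 0.038394 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2608 0.033014 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
412 0.004717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
366 0.005015 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31")

# Grouping columns (hypothetical helper name, not from the original post).
key_cols <- c("cdr3nt", "cdr3aa", "v", "d", "j",
              "VEnd", "DStart", "DEnd", "JStart")
grouped_input <- final_df[c("count", key_cols)]

# FUN = length returns the number of rows in each group ...
n_rows <- aggregate(count ~ ., data = grouped_input, FUN = length)

# ... while FUN = sum adds up the count column within each group.
agg_df <- aggregate(count ~ ., data = grouped_input, FUN = sum)

# Recompute the frequency from the summed counts.
agg_df$freq <- agg_df$count / sum(agg_df$count)

# Put count and freq first and sort by descending count.
agg_df1 <- agg_df[order(-agg_df$count), c("count", "freq", key_cols)]
rownames(agg_df1) <- NULL
agg_df1
```

The final indexing step plays the role of the select() call in the question without needing any package beyond base R.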

Answer 1
1, accepted, 2020-05-14 21:23:02
