How do I aggregate a dataframe and sum the values of a column by repeated rows in R?
I'm attempting to aggregate a dataframe to remove repeated rows, but I need to sum the values of a count column and use that total as the new count for each unique row. I have the following dataframe:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3154 0.036110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2800 0.038394 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2608 0.033014 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
412 0.004717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
366 0.005015 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31
I need to get to this:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
8562 0.048822 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
778 0.003332 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31
Instead, I'm getting this:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3 0.601110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2 0.506717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
1 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31
Here's the piece of code that's not working right:
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart +
                           DEnd + JStart, data = final_df,
                           FUN = length), freq <- count/sum(count))
agg_df1 <- select(agg_df, count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart)
Instead of adding up the "count" values of the corresponding repeated rows, so that I can recalculate the frequency, it's basically counting the number of times each particular row is repeated. Any thoughts are greatly appreciated. Thanks.
Using FUN = length makes the output value of count the number of rows in each of the by groups, because aggregate() applies FUN to the count values within each group and length() only counts them. Instead, use FUN = sum to calculate the sum of the input count column for each group.
textFile <- " count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
3154 0.036110 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2800 0.038394 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
2608 0.033014 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24
412 0.004717 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
366 0.005015 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20
310 0.004250 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31"
final_df <- read.table(text = textFile,
                       header = TRUE)
# original code had FUN = length, which returned the number of rows per
# combination of by groups
agg_df <- within(aggregate(count ~ cdr3nt + cdr3aa + v + d + j + VEnd + DStart +
                           DEnd + JStart, data = final_df, FUN = sum), freq <- count/sum(count))
agg_df
...and the output:
> agg_df
cdr3nt cdr3aa v d j VEnd DStart DEnd JStart count freq
1 TGCGCCA CASMG TRBV10-2 TRBD1 TRBJ1-1 9 15 19 20 778 0.08062176
2 TGTGCCA CASSE TRBV6-1 TRBD1 TRBJ2-6 13 18 22 24 8562 0.88725389
3 TGCAGTG CSARD TRBV20-1 TRBD1 TRBJ1-5 15 17 23 31 310 0.03212435
>
We can confirm the accuracy of the freq column as follows:
> # confirm accuracy
> agg_df$count / sum(agg_df$count)
[1] 0.08062176 0.88725389 0.03212435
>
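The question also wanted count and freq as the leading columns, with the largest count first. A minimal base R sketch of that last step follows. It assumes a cut-down copy of the data with only two grouping columns for brevity, and the helper name lead_cols is purely illustrative. The same pattern should carry over to the full set of grouping columns.

```r
# Toy input mirroring the data above (only two grouping columns kept for brevity)
final_df <- data.frame(
  count  = c(3154, 2800, 2608, 412, 366, 310),
  cdr3nt = c("TGTGCCA", "TGTGCCA", "TGTGCCA", "TGCGCCA", "TGCGCCA", "TGCAGTG"),
  cdr3aa = c("CASSE", "CASSE", "CASSE", "CASMG", "CASMG", "CSARD")
)

# Sum count within each group, then recompute freq from the summed counts
agg_df <- aggregate(count ~ cdr3nt + cdr3aa, data = final_df, FUN = sum)
agg_df$freq <- agg_df$count / sum(agg_df$count)

# Put count and freq first, keep the remaining columns in their existing order,
# and sort rows by descending count
lead_cols <- c("count", "freq")  # illustrative helper name
agg_df1 <- agg_df[order(-agg_df$count),
                  c(lead_cols, setdiff(names(agg_df), lead_cols))]
rownames(agg_df1) <- NULL
agg_df1
```

Base R indexing is used here so the sketch runs without extra packages. The select() call from the question should produce the same column order if dplyr is loaded.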