How to merge rows that have the same information in all columns except one?
I have a large data frame that looks something like this:
A 1 2 3 4 ...
B 1 2 3 4 ...
C 1 2 3 4 ...
D 5 2 1 4 ...
E 3 2 3 9 ...
F 0 0 2 2 ...
G 0 0 2 2 ...
As you can see, some rows are duplicates of each other if you disregard the first column. I would like to merge these rows to generate something like this:
A;B;C 1 2 3 4 ...
D 5 2 1 4 ...
E 3 2 3 9 ...
F;G 0 0 2 2 ...
I could write a for loop that iterates over the rows, but that would be neither pretty nor efficient. I am pretty certain there is a better way to do this.
I thought I could:
1. slice df so I have all columns except the first one: slice <- df[, 2:ncol(df)]
2. get a data frame with all "duplicate" rows via dups <- df[duplicated(slice)]
3. get another data frame with the "unique" rows via uniq <- df[unique(slice)]
4. merge them using everything but the first column: merge(uniq, dups, by... )
Except that won't work, since unique doesn't return indices but a whole data frame, which means I cannot index df with the corresponding rows from slice.
Any suggestions?
EDIT: I should clarify that A, B, C, ... are not row names but an actual column of the data frame, with entries stored as character strings.
There are several functions that would do this. All of them are common aggregation functions: aggregate, tapply, by, ..., and, of course, the popular "data.table" and "dplyr" sets of functions.
Here's aggregate:
aggregate(V1 ~ ., mydf, toString)
# V2 V3 V4 V5 V6 V1
# 1 0 0 2 2 ... F, G
# 2 5 2 1 4 ... D
# 3 1 2 3 4 ... A, B, C
# 4 3 2 3 9 ... E
Other options (as indicated in the opening paragraph):
library(data.table)
as.data.table(mydf)[, toString(V1), by = eval(setdiff(names(mydf), "V1"))]
library(dplyr)
mydf %>%
  group_by(V2, V3, V4, V5, V6) %>%
  summarise(V1 = toString(V1))
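tapply is named above but not shown. A minimal base R sketch of that route follows. It assumes the first column is called V1, as in the aggregate output, and rebuilds the example data without the elided "..." columns. The key and merged names are illustrative only.

```r
# Hypothetical reconstruction of the example data (the elided "..." columns are left out)
mydf <- data.frame(
  V1 = c("A", "B", "C", "D", "E", "F", "G"),
  V2 = c(1, 1, 1, 5, 3, 0, 0),
  V3 = c(2, 2, 2, 2, 2, 0, 0),
  V4 = c(3, 3, 3, 1, 3, 2, 2),
  V5 = c(4, 4, 4, 4, 9, 2, 2),
  stringsAsFactors = FALSE
)

# Build one key per row from every column except V1
key <- do.call(paste, c(mydf[-1], sep = "\r"))
# Fix the level order to first appearance so groups line up with the rows kept below
key <- factor(key, levels = unique(key))

# Keep the first row of each group, then overwrite V1 with the collapsed labels
merged <- mydf[!duplicated(key), ]
merged$V1 <- as.vector(tapply(mydf$V1, key, paste, collapse = ";"))
rownames(merged) <- NULL
print(merged)
```

Unlike aggregate, this keeps the groups in order of first appearance and leaves V1 as the first column.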
Instead of toString, you can use the classic paste(., collapse = ";") approach, which gives you more flexibility over the final output.
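As a rough illustration of the paste variant, the sketch below produces the semicolon-separated labels asked for in the question. The data frame is a hypothetical reconstruction of the example (the elided "..." columns are omitted), and the column names V1 to V5 are assumed from the aggregate output above.

```r
# Hypothetical reconstruction of the example data (the elided "..." columns are left out)
mydf <- data.frame(
  V1 = c("A", "B", "C", "D", "E", "F", "G"),
  V2 = c(1, 1, 1, 5, 3, 0, 0),
  V3 = c(2, 2, 2, 2, 2, 0, 0),
  V4 = c(3, 3, 3, 1, 3, 2, 2),
  V5 = c(4, 4, 4, 4, 9, 2, 2),
  stringsAsFactors = FALSE
)

# Group by every column except V1; extra arguments such as collapse are passed on to paste
merged <- aggregate(V1 ~ ., data = mydf, FUN = paste, collapse = ";")

# aggregate puts the grouping columns first, so restore the original column order
merged <- merged[, names(mydf)]
print(merged)
```

Any other separator can be substituted for ";" in the collapse argument.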