
How to merge rows that have the same information in all columns except one?

I have a large data frame that looks something like this:

A  1  2  3  4  ...
B  1  2  3  4  ...
C  1  2  3  4  ...
D  5  2  1  4  ...
E  3  2  3  9  ...
F  0  0  2  2  ...
G  0  0  2  2  ...

As you can see, some rows are duplicates of each other if you disregard the first column. I would like to combine/merge these rows to generate something like this:

A;B;C  1  2  3  4  ...
D      5  2  1  4  ...
E      3  2  3  9  ...
F;G    0  0  2  2  ...

I could write a for loop, which iterates over the rows, but that would be neither pretty, nor efficient. I am pretty certain there's a better way to do this.

I thought I could:

  1. slice the df so I have all columns except the first: slice <- df[, 2:ncol(df)]
  2. get a dataframe with all "duplicate" rows: dups <- df[duplicated(slice), ]
  3. get another dataframe with the "unique" rows: uniq <- df[unique(slice)]
  4. merge them using all but the first column: merge(uniq, dups, by... )

Except that won't work, since unique doesn't return indices but a whole dataframe, which means I cannot use its result to index df with the corresponding rows from slice.

Any suggestions?

EDIT: I should clarify that A, B, C, ... are not rownames but an actual column of the dataframe, stored as strings (character).

There are several functions that would do this. All of them are the common aggregation functions: aggregate, tapply, by, ..., and, of course, the popular "data.table" and "dplyr" sets of functions.

Here's aggregate:

aggregate(V1 ~ ., mydf, toString)
#   V2 V3 V4 V5  V6      V1
# 1  0  0  2  2 ...    F, G
# 2  5  2  1  4 ...       D
# 3  1  2  3  4 ... A, B, C
# 4  3  2  3  9 ...       E

Other options (as indicated in the opening paragraph):

library(data.table)
as.data.table(mydf)[, toString(V1), by = eval(setdiff(names(mydf), "V1"))]

library(dplyr)
mydf %>%
  group_by(V2, V3, V4, V5, V6) %>%
  summarise(V1 = toString(V1))

Instead of toString, you can use the classic paste(., collapse = ";") approach, which gives you more control over the final output (for example, the ";" separator asked for in the question).
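As a minimal sketch of that paste variant, here is a self-contained base R example. The data frame mydf below is a hypothetical stand-in built from the rows shown in the question (only four value columns, named V2 to V5 by assumption); the real data has more columns. Extra arguments given to aggregate, such as collapse, are passed on to FUN.

```r
# Hypothetical sample data modelled on the rows in the question.
# Column names V1..V5 are assumed; the real data frame has more columns.
mydf <- data.frame(
  V1 = c("A", "B", "C", "D", "E", "F", "G"),
  V2 = c(1, 1, 1, 5, 3, 0, 0),
  V3 = c(2, 2, 2, 2, 2, 0, 0),
  V4 = c(3, 3, 3, 1, 3, 2, 2),
  V5 = c(4, 4, 4, 4, 9, 2, 2),
  stringsAsFactors = FALSE
)

# Group by every column except V1 and join the V1 labels with ";".
# `collapse = ";"` is forwarded by aggregate() to paste().
merged <- aggregate(V1 ~ ., data = mydf, FUN = paste, collapse = ";")

# aggregate() puts the grouping columns first; move the labels back to the front.
merged <- merged[, c("V1", setdiff(names(merged), "V1"))]

print(merged)
```

The row order of the result follows the grouping columns rather than the original order of the labels, so re-sort afterwards if the original order matters.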
