简体   繁体   中英

Removing Duplicates in R

I have data as such: 意见

where a user has a user_id and a bunch of other features like city and country, and is also associated with multiple advertisers. In this format there is a copy of a row for every advertiser that the user is associated with, and the rest of the features of the user are getting copied in every row with that user_id. I want to de-duplicate the user_ids and combine all the advertiser_ids into one set, yet have all the other features stay the same. For example,

I want the 2nd and 3rd rows to be combined and have the advertiser_id column become a set of ids associated with that user, yet have all the other features stay unchanged.

Here's a data.table solution:

library(data.table)
#example data
dt = data.table(user_id = c(1,2,2,3), advertiser_id = c(1:4), other_data = c(4:1))
#   user_id advertiser_id other_data
#1:       1             1          4
#2:       2             2          3
#3:       2             3          2
#4:       3             4          1

dt[, advertiser_list := list(list(advertiser_id)), by = user_id][
     # ^^^ first collect advertisers into a list by user_id
     !duplicated(user_id)][, # now select the unique users
     advertiser_id := NULL] -> dt # finally remove the advertiser_id column
dt
#   user_id other_data advertiser_list
#1:       1          4               1
#2:       2          3             2,3
#3:       3          1               4

The duplicated() function returns a logical vector which is equal to TRUE for duplicated rows. Lets call df your dataset, you will remove all duplicated values with the following line :

df <- subset(df, duplicated(df) = TRUE)

See the R Programming wikibook if you want to learn more about that.

From your description, it sounds like you're just looking for aggregate . Consider the following:

> df = data.frame(user_id = c(1,2,2,3), 
+                 advertiser_id = c(1:4), 
+                 other_data = letters[c(1, 2, 2, 3)])
> df
  user_id advertiser_id other_data
1       1             1          a
2       2             2          b
3       2             3          b
4       3             4          c
> aggregate(advertiser_id ~ . , df, I)
  user_id other_data advertiser_id
1       1          a             1
2       2          b          2, 3
3       3          c             4

The above converts the "advertiser_id" column into a list , as can be inspected using str . This might be convenient, but might also be difficult to work with, for instance if you wanted to save your output to a csv file later on.

> str(aggregate(advertiser_id ~ . , df, I))
'data.frame':   3 obs. of  3 variables:
 $ user_id      : num  1 2 3
 $ other_data   : Factor w/ 3 levels "a","b","c": 1 2 3
 $ advertiser_id:List of 3
  ..$ 0:Class 'AsIs'  int 1
  ..$ 4:Class 'AsIs'  int [1:2] 2 3
  ..$ 8:Class 'AsIs'  int 4

A less flexible alternative is to concatenate the "advertiser_id" columns as a character string.

> aggregate(advertiser_id ~ . , df, paste, collapse = ", ")
  user_id other_data advertiser_id
1       1          a             1
2       2          b          2, 3
3       3          c             4
> str(aggregate(advertiser_id ~ . , df, paste, collapse = ", "))
'data.frame':   3 obs. of  3 variables:
 $ user_id      : num  1 2 3
 $ other_data   : Factor w/ 3 levels "a","b","c": 1 2 3
 $ advertiser_id: chr  "1" "2, 3" "4"

Both of these can also easily be done with data.table , along the lines of @eddi's answer.

If you assume that the user data in all other columns is the same, try:

Assume df is your original data.frane :

#pull add ad_id into one column for each user_id
ad = sapply(unique(df$user_id),function(x){paste(df$advertiser_id[df$user_id==x],collapse=",")}
names(ad) = unique(df$user_id)

#Drop all extra rows
df = df[!duplicated(df[,1]),]

#add a column with combined ad_id
df = cbind(df,ad[df$user_id])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM