Remove adjacent duplicates in R

Question

I have a data frame in the following format:

site_domain <- c('ebay.com','facebook.com','facebook.com','ebay.com','ebay.com','auto.com','ebay.com','facebook.com','auto.com','ebay.com','facebook.com','facebook.com','ebay.com','facebook.com','auto.com','auto.com')
id <- c(1, 1, 1,2,2,3,3,3,3,4,4,4,5,5,5,5)
file0 <- as.data.frame(cbind(site_domain,id))

I grouped by "id" to get the data:

library(dplyr)
xx <- as.data.frame(file0 %>% 
                      group_by(id) %>%
                      summarise(pages=paste(site_domain, collapse='_')))

The data looks like:

1 ebay.com_facebook.com_facebook.com
2 ebay.com_ebay.com
3 auto.com_ebay.com_facebook.com_auto.com
4 ebay.com_facebook.com_facebook.com
5 ebay.com_facebook.com_auto.com_auto.com

However, I want to remove adjacent duplicates, so I want output like:

1 ebay.com_facebook.com
2 ebay.com
3 auto.com_ebay.com_facebook.com_auto.com
4 ebay.com_facebook.com
5 ebay.com_facebook.com_auto.com

How can I achieve this?

Answer 1

We can use the values component of rle() to remove adjacent duplicates.

library(dplyr)
file0 %>% 
   group_by(id) %>%
   summarise(pages=paste(rle(as.character(site_domain))$values, collapse='_'))

#      id                                   pages
#    <fctr>                                   <chr>
#1      1                   ebay.com_facebook.com
#2      2                                ebay.com
#3      3 auto.com_ebay.com_facebook.com_auto.com
#4      4                   ebay.com_facebook.com
#5      5          ebay.com_facebook.com_auto.com
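To see what rle() contributes here, it may help to run it on a plain vector. This is a minimal base R sketch; the vector x is a made-up example, not data from the question.

```r
# Hypothetical example vector with one adjacent repeat ("b", "b")
x <- c("a", "b", "b", "a")

# rle() computes a run-length encoding: consecutive equal values form one run
r <- rle(x)

r$lengths  # length of each run
r$values   # one value per run, i.e. x with adjacent duplicates collapsed

# Collapse the run values into a single string, as in the summarise() call
collapsed <- paste(r$values, collapse = "_")
collapsed
```

The as.character() call in the answer matters when site_domain is a factor, since rle() expects an atomic vector.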

Answer 2

Here is an option with data.table:

library(data.table)
setDT(file0)[,  unique(site_domain), by= .(id, grp=rleid(site_domain))
             ][, .(site=paste(V1, collapse="_")) , id]
#   id                                    site
#1:  1                   ebay.com_facebook.com
#2:  2                                ebay.com
#3:  3 auto.com_ebay.com_facebook.com_auto.com
#4:  4                   ebay.com_facebook.com
#5:  5          ebay.com_facebook.com_auto.com

Or create a row index with .I, extract those rows, and paste by 'id':

i1 <- setDT(file0)[, .I[!duplicated(site_domain)], .(id, grp = rleid(site_domain))]$V1
file0[i1, .(site = paste(site_domain, collapse="_")), by = id]
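Both variants rely on rleid() to give every run of equal consecutive values its own group number. As a rough illustration of that idea, here is a base R stand-in for a single vector. It is a sketch under the assumption that the vector is non-empty and has no NA values; x, run_id and first_of_run are names introduced only for this example.

```r
# Hypothetical example vector; not data from the question
x <- c("a", "b", "b", "a")

# Base R stand-in for rleid(): the id increases whenever a value
# differs from the one before it
run_id <- cumsum(c(TRUE, x[-1] != x[-length(x)]))
run_id

# Keep the first element of each run, which drops adjacent duplicates only
first_of_run <- x[!duplicated(run_id)]
first_of_run
```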

Answer 3

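With the unique function (note that unique removes every repeated value within an id, not only adjacent ones, so id 3 loses its trailing auto.com compared with the desired output):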

xx <- as.data.frame(file0 %>% 
                      group_by(id) %>%
                      summarise(pages=paste(unique(site_domain), collapse='_')))

xx

#  id                          pages
#1  1          ebay.com_facebook.com
#2  2                       ebay.com
#3  3 auto.com_ebay.com_facebook.com
#4  4          ebay.com_facebook.com
#5  5 ebay.com_facebook.com_auto.com
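The difference between unique() and rle() shows up on the pages for id 3, where auto.com appears twice but not consecutively. A minimal base R comparison, with pages3 as a name introduced here for that group's values from the question:

```r
# Values of site_domain for id 3, copied from the question
pages3 <- c("auto.com", "ebay.com", "facebook.com", "auto.com")

# unique() drops every later repeat, adjacent or not
with_unique <- paste(unique(pages3), collapse = "_")

# rle()$values collapses only consecutive repeats
with_rle <- paste(rle(pages3)$values, collapse = "_")

with_unique
with_rle
```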

Answer 4

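It is easy to remove the duplicates before grouping. Note that duplicated() drops every repeated (site_domain, id) row, not only adjacent ones, so the result for id 3 differs from the desired output: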

file0 <- file0[!duplicated(file0), ]

    site_domain id
1      ebay.com  1
2  facebook.com  1
4      ebay.com  2
6      auto.com  3
7      ebay.com  3
8  facebook.com  3
10     ebay.com  4
11 facebook.com  4
13     ebay.com  5
14 facebook.com  5
15     auto.com  5

Then you can group the data by id:

  id                          pages
1  1          ebay.com_facebook.com
2  2                       ebay.com
3  3 auto.com_ebay.com_facebook.com
4  4          ebay.com_facebook.com
5  5 ebay.com_facebook.com_auto.com
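If only adjacent repeats should be dropped before grouping, one option is to compare each row with the row before it. This is a minimal base R sketch, assuming the rows are already in the intended order; same_as_prev, dedup and res are names introduced here, and the data frame is built with data.frame() rather than cbind() so that id stays numeric.

```r
# Data from the question
site_domain <- c('ebay.com','facebook.com','facebook.com','ebay.com','ebay.com',
                 'auto.com','ebay.com','facebook.com','auto.com','ebay.com',
                 'facebook.com','facebook.com','ebay.com','facebook.com',
                 'auto.com','auto.com')
id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5)
file0 <- data.frame(site_domain, id, stringsAsFactors = FALSE)

n <- nrow(file0)

# TRUE when a row has the same site_domain and id as the row just before it
same_as_prev <- c(FALSE,
                  file0$site_domain[-1] == file0$site_domain[-n] &
                    file0$id[-1] == file0$id[-n])

# Drop only those adjacent repeats, then paste the remaining domains per id
dedup <- file0[!same_as_prev, ]
res <- aggregate(site_domain ~ id, data = dedup, FUN = paste, collapse = "_")
res
```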

4 answers

Answer 1: 2, accepted, 2016-12-20 08:16:53
Answer 2: 2, 2016-12-20 09:09:15
Answer 3: 1, 2016-12-20 07:55:42
Answer 4: 1, 2016-12-20 07:56:58
