R - Efficiently find rows with nearly identical data and paste the differences into one cell

Question

Suppose I have a data frame:

 # Build the sample data with named columns (one row per person/book pair)
 Data <- data.frame(
   Name   = rep(c("Paul", "David"), times = c(4, 3)),
   Age    = rep(c(26, 35), times = c(4, 3)),
   Weight = rep(c(150, 165), times = c(4, 3)),
   School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
   Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities", "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
              "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
   Author = c("John Lee", "Charles Dickens", "Hunter Thompson", "Thomas Pynchon",
              "Aldous Huxley", "Vashista", "Anonymous"),
   stringsAsFactors = FALSE
 )

I want to compress the data so that all of the rows corresponding to the same person fit into one row, with that person's books and authors concatenated. In other words, I would like my output to be:

    Name     Age   Weight   School           Books                            Authors
    Paul     26    150      Helgason U       Intro to Smooth Manifolds        John Lee
                                             A Tale of Two Cities             Charles Dickens
                                             Fear and Loathing in Las Vegas   Hunter Thompson
                                             Gravity's Rainbow                Thomas Pynchon
    David    35    165      Turing College   Brave New World                  Aldous Huxley
                                             Vashista's Yoga                  Vashista
                                             C++ For Dummies                  Anonymous

Ideally, I would like the books to be concatenated into a single string such as "Intro to Smooth Manifolds\nA Tale of Two Cities\nFear and Loathing in Las Vegas\nGravity's Rainbow".

Originally I used a for loop, but it was too slow because my actual data is far larger than this. To give an idea of how I was looping:

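  for (i in 1:L){
    # All rows belonging to the i-th unique name
    Names = subset(Data, Data$Name == unique(Data$Name)[i])
    rows = nrow(Names)

    # Rows whose fixed columns (Cols) are duplicated, scanning forwards or backwards
    Name_Matches = which(duplicated(Names[,Cols]) | duplicated(Names[nrow(Names):1, Cols])[nrow(Names):1])
    Name_UnMtchs = setdiff(1:nrow(Names), Name_Matches)

    # Collapse the books and authors of the matching rows into one string each
    Books        = Names$Book[Name_Matches]
    New_Books    = paste(as.character(Books), collapse = "\n")
    Authors     = Names$Author[Name_Matches]
    New_Authors = paste(Authors, collapse = "\n")

    # Write one consolidated row; index by count_New so earlier rows are not overwritten
    New_Data[count_New, Cols]   = Names[Name_Matches[1], Cols]
    New_Data$Book[count_New]    = New_Books
    New_Data$Author[count_New]  = New_Authors
    count_New                   = count_New + 1
    }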

where Cols are the column indices of the entries that I know stay the same for a person (age, weight, school, name), L is the number of unique names in the data frame, count_New is a counter initialized at 1, and New_Data is an empty data frame with the same columns as Data. What function could I use to consolidate my data without a for loop of this kind?

Answer 1

This kind of thing can be done with base R, but it's probably better to use a package designed for data wrangling.

In dplyr:

library(dplyr)

Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books=paste(Book, collapse="\n"), Authors=paste(Author, collapse="\n"))

I suspect that the following is what you really want, though. Instead of pasting the book titles (and authors) into one string per person, it stores them as a vector of titles in a list column, which can then be used for further processing.

Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books=list(Book), Authors=list(Author))
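
To illustrate why keeping the titles as vectors helps with further processing, here is a minimal base R sketch that needs no extra packages. It rebuilds the sample data, attaches the books and authors as list columns, then counts and re-pastes them. Note that `split()` is used here as a stand-in for the `summarise(Books=list(Book))` call above, not the same code, and the names `people`, `n_books`, and `pasted` are made up for this example.

```r
# Sample data from the question, built with named columns.
Data <- data.frame(
  Name   = rep(c("Paul", "David"), times = c(4, 3)),
  Age    = rep(c(26, 35), times = c(4, 3)),
  Weight = rep(c(150, 165), times = c(4, 3)),
  School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson", "Thomas Pynchon",
             "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# One row per person (hypothetical name: people).
people <- unique(Data[c("Name", "Age", "Weight", "School")])
rownames(people) <- NULL

# List columns: one character vector of titles/authors per person,
# reordered to match the row order of `people`.
people$Books   <- unname(split(Data$Book, Data$Name)[people$Name])
people$Authors <- unname(split(Data$Author, Data$Name)[people$Name])

# Further processing operates on the vectors directly.
n_books <- lengths(people$Books)  # number of books per person
pasted  <- vapply(people$Books, paste, character(1), collapse = "\n")
```

The pasted string can always be recovered from the list column, as the last line shows, whereas going from the pasted string back to individual titles requires splitting on `"\n"`.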

Answer 2

Consider this base R solution (albeit not as efficient or elegant):

# OBTAIN UNIQUE PERSONS DATAFRAME
Data2 <- unique(Data[1:4])
rownames(Data2) <- NULL

# INITIALIZE THE NEW COLUMNS
Data2$Book <- NA_character_
Data2$Author <- NA_character_

# LOOP THROUGH DISTINCT PERSONS (WORKS FOR ANY NUMBER OF PERSONS)
for (p in as.character(Data2$Name)) {
  # BOOK COLUMN (PULL THIS PERSON'S BOOKS, THEN CONCATENATE)
  books <- as.character(Data$Book[Data$Name == p])
  Data2$Book[Data2$Name == p] <- paste(books, collapse="\n")

  # AUTHOR COLUMN (PULL THIS PERSON'S AUTHORS, THEN CONCATENATE)
  authors <- as.character(Data$Author[Data$Name == p])
  Data2$Author[Data2$Name == p] <- paste(authors, collapse="\n")
}
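
If the explicit loop is the concern, the same consolidation can probably be expressed with base R's `aggregate()`. The sketch below is an alternative to the loop above, not a reworking of it. It assumes the four identifying columns are constant per person, and the names `id_cols` and `Data3` are made up for this example.

```r
# Sample data from the question, built with named columns.
Data <- data.frame(
  Name   = rep(c("Paul", "David"), times = c(4, 3)),
  Age    = rep(c(26, 35), times = c(4, 3)),
  Weight = rep(c(150, 165), times = c(4, 3)),
  School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson", "Thomas Pynchon",
             "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# Columns assumed to be identical across all rows of one person.
id_cols <- c("Name", "Age", "Weight", "School")

# Group by the identifying columns and paste each remaining column
# into a single newline-separated string per group.
Data3 <- aggregate(Data[c("Book", "Author")],
                   by = Data[id_cols],
                   FUN = paste, collapse = "\n")
```

Note that `aggregate()` returns the groups sorted by the grouping columns, so the row order may differ from the order in which persons first appear in `Data`.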

2 answers

Answer 1: score 3, accepted, 2015-08-11 01:08:54

Answer 2: score 1, 2015-08-11 04:43:13
