
R - Efficiently find rows with nearly identical data, and paste the differences into one cell

Suppose I have a data frame:

 # Start from an empty data frame with named, typed columns
 Data <- data.frame(Name = character(), Age = numeric(), Weight = numeric(),
                    School = character(), Book = character(), Author = character(),
                    stringsAsFactors = FALSE)
 # Fill it row by row; list() keeps Age and Weight numeric
 Data[1,] <- list("Paul", 26, 150, "Helgason U", "Intro to Smooth Manifolds", "John Lee")
 Data[2,] <- list("Paul", 26, 150, "Helgason U", "A Tale of Two Cities", "Charles Dickens")
 Data[3,] <- list("Paul", 26, 150, "Helgason U", "Fear and Loathing in Las Vegas", "Hunter Thompson")
 Data[4,] <- list("Paul", 26, 150, "Helgason U", "Gravity's Rainbow", "Thomas Pynchon")
 Data[5,] <- list("David", 35, 165, "Turing College", "Brave New World", "Aldous Huxley")
 Data[6,] <- list("David", 35, 165, "Turing College", "Vashista's Yoga", "Vashista")
 Data[7,] <- list("David", 35, 165, "Turing College", "C++ For Dummies", "Anonymous")

and I want to compress the data so that all of the rows corresponding to the same person fit into one row, with that person's books and authors concatenated. In other words, I would like my output to be:

    Name     Age     Weight     School     Books                          Authors
    Paul     26       150     Helgason U   Intro to Smooth Manifolds      John Lee
                                           A Tale of Two Cities           Charles Dickens
                                           Fear and Loathing in Las Vegas Hunter Thompson
                                           Gravity's Rainbow              Thomas Pynchon
    David    35       165   Turing College Brave New World                Aldous Huxley
                                           Vashista's Yoga                Vashista
                                           C++ For Dummies                Anonymous

Ideally I would like the books to be concatenated as "Intro to Smooth Manifolds\nA Tale of Two Cities\nFear and Loathing in Las Vegas\nGravity's Rainbow".
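To make the target format concrete, here is a minimal sketch of that concatenation for one person's books. It only uses base R's paste() with collapse; the variable names books and joined are made up for illustration.

```r
# Titles for one person, taken from the example data above
books <- c("Intro to Smooth Manifolds",
           "A Tale of Two Cities",
           "Fear and Loathing in Las Vegas",
           "Gravity's Rainbow")

# collapse = "\n" joins all elements into ONE string, separated by newlines
joined <- paste(books, collapse = "\n")

# cat() writes the newlines as real line breaks;
# print() would show them escaped as \n inside a single quoted string
cat(joined, "\n")
```

The result is a length-one character vector, so it fits into a single data frame cell.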

Originally I used a for loop, but this was too slow since my actual data is far larger than this. To give an idea of how I was looping:

  for (i in 1:L){
    # Rows belonging to the i-th unique name
    Names = subset(Data, Data$Name == unique(Data$Name)[i])
    rows = nrow(Names)

    # Rows whose constant columns (Cols) also appear in another row of this subset
    Name_Matches = which(duplicated(Names[,Cols]) | duplicated(Names[nrow(Names):1, Cols])[nrow(Names):1])
    Name_UnMtchs = setdiff(1:nrow(Names), Name_Matches)

    Books        = Names$Book[Name_Matches]
    New_Books    = paste(as.character(Books), collapse = "\n")
    Authors     = Names$Author[Name_Matches]
    New_Authors = paste(Authors, collapse = "\n")

    # Write one consolidated row per person (index the row, not the whole column)
    New_Data[count_New, Cols]  = Names[Name_Matches[1], Cols]
    New_Data$Book[count_New]   = New_Books
    New_Data$Author[count_New] = New_Authors
    count_New                  = count_New + 1
    }

where Cols are the column indices of the entries which I know stay the same for a person (age, weight, school, name), L is the number of unique names in the data frame, count_New is a counter initialized at 1, and New_Data is an empty data frame with the same columns as Data. What function could I use that would let me consolidate my data without using a for loop of this kind?

This kind of thing can be done with base R, but it's probably better to use a package designed for data wrangling.

In dplyr:

library(dplyr)

Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books=paste(Book, collapse="\n"), Authors=paste(Author, collapse="\n"))

I suspect that this is what you really want though. Instead of pasting the book titles (and authors) into one string for each name, it stores them as a vector of titles per person (a list column), which can then be used for further processing.

Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books=list(Book), Authors=list(Author))
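As a rough illustration of what "further processing" on such a list column might look like, here is a small sketch on a three-row subset of the question's data. The names nested, paul_books, n_books and BooksText are made up for this example, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Three rows from the question's data, enough to show two groups
Data <- data.frame(
  Name   = c("Paul", "Paul", "David"),
  Age    = c(26, 26, 35),
  Weight = c(150, 150, 165),
  School = c("Helgason U", "Helgason U", "Turing College"),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities", "Brave New World"),
  Author = c("John Lee", "Charles Dickens", "Aldous Huxley"),
  stringsAsFactors = FALSE
)

nested <- Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = list(Book), Authors = list(Author), .groups = "drop")

# Each cell of a list column holds a whole character vector
paul_books <- nested$Books[[which(nested$Name == "Paul")]]

# Number of books per person, one value per row of `nested`
n_books <- lengths(nested$Books)

# The newline-separated string can still be produced on demand
nested$BooksText <- vapply(nested$Books, paste, character(1), collapse = "\n")
```

Keeping the vectors intact means nothing has to be split back apart later; the pasted string becomes a display detail rather than the storage format.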

Consider this base R solution (albeit not as efficient or elegant):

# OBTAIN UNIQUE PERSONS DATAFRAME
Data2 <- unique(Data[1:4])
rownames(Data2) <- NULL

# ADD EMPTY COLUMNS TO HOLD THE CONCATENATED VALUES
Data2$Book <- NA_character_
Data2$Author <- NA_character_

# LOOP THROUGH DISTINCT PERSONS (WORKS FOR ANY NUMBER OF PERSONS)
for (p in Data2$Name){
  # BOOK COLUMN (PULL THIS PERSON'S BOOKS, THEN CONCATENATE)
  books <- Data$Book[Data$Name==p]
  Data2$Book[Data2$Name==p] <- paste(books, collapse="\n")

  # AUTHOR COLUMN (PULL THIS PERSON'S AUTHORS, THEN CONCATENATE)
  authors <- Data$Author[Data$Name==p]
  Data2$Author[Data2$Name==p] <- paste(authors, collapse="\n")
}
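For comparison, base R can also do the grouping without an explicit loop. The following is a minimal sketch using aggregate() on a three-row subset of the question's data; it is an alternative to the loop above rather than part of the original answer, and the name combined is made up for illustration.

```r
# Three rows from the question's data, enough to show two groups
Data <- data.frame(
  Name   = c("Paul", "Paul", "David"),
  Age    = c(26, 26, 35),
  Weight = c(150, 150, 165),
  School = c("Helgason U", "Helgason U", "Turing College"),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities", "Brave New World"),
  Author = c("John Lee", "Charles Dickens", "Aldous Huxley"),
  stringsAsFactors = FALSE
)

# Group by the columns that are constant per person, then collapse
# Book and Author within each group; collapse is passed on to paste()
combined <- aggregate(
  Data[c("Book", "Author")],
  by  = Data[c("Name", "Age", "Weight", "School")],
  FUN = paste, collapse = "\n"
)
```

This yields one row per distinct combination of the grouping columns, with the same pasted strings the loop produces.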

