R - Efficiently find rows with nearly identical data, and paste the differences into one cell
Suppose I have a data frame:
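# Declare named (empty) columns so that Data$Name, Data$Book, etc. exist.
# Each row below is built with c(), which coerces everything to character.
Data <- data.frame(Name = character(), Age = character(), Weight = character(),
                   School = character(), Book = character(), Author = character(),
                   stringsAsFactors = FALSE)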
Data[1,] <- c("Paul", 26, 150, "Helgason U", "Intro to Smooth Manifolds", "John Lee")
Data[2,] <- c("Paul", 26, 150, "Helgason U", "A Tale of Two Cities", "Charles Dickens")
Data[3,] <- c("Paul", 26, 150, "Helgason U", "Fear and Loathing in Las Vegas", "Hunter Thompson")
Data[4,] <- c("Paul", 26, 150, "Helgason U", "Gravity's Rainbow", "Thomas Pynchon")
Data[5,] <- c("David", 35, 165, "Turing College", "Brave New World", "Aldous Huxley")
Data[6,] <- c("David", 35, 165, "Turing College", "Vashista's Yoga", "Vashista")
Data[7,] <- c("David", 35, 165, "Turing College", "C++ For Dummies", "Anonymous")
and I want to compress the data so that all of the rows corresponding to the same person fit into one row, with the books and authors concatenated. In other words, I would like my output to be:
Name   Age  Weight  School          Books                           Authors
Paul   26   150     Helgason U      Intro to Smooth Manifolds       John Lee
                                    A Tale of Two Cities            Charles Dickens
                                    Fear and Loathing in Las Vegas  Hunter Thompson
                                    Gravity's Rainbow               Thomas Pynchon
David  35   165     Turing College  Brave New World                 Aldous Huxley
                                    Vashista's Yoga                 Vashista
                                    C++ For Dummies                 Anonymous
Ideally, I would like the books to be concatenated as "Intro to Smooth Manifolds\nA Tale of Two Cities\nFear and Loathing in Las Vegas\nGravity's Rainbow".
Originally I used a for loop, but this was too slow since my actual data is far larger than this. To give an idea of how I was looping:
for (i in 1:L){
  # Rows belonging to the i-th unique name (the column is Name, not Names)
  Names = subset(Data, Data$Name == unique(Data$Name)[i])
  rows = nrow(Names)
  # Rows whose constant columns also appear in another row for this person
  Name_Matches = which(duplicated(Names[,Cols]) | duplicated(Names[nrow(Names):1, Cols])[nrow(Names):1])
  Name_UnMtchs = setdiff(1:nrow(Names), Name_Matches)
  Books = Names$Book[Name_Matches]
  New_Books = paste(as.character(Books), collapse = "\n")
  Authors = Names$Author[Name_Matches]
  New_Authors = paste(Authors, collapse = "\n")
  New_Data[count_New, Cols] = Names[Name_Matches[1], Cols]
  # Fill only the current row instead of overwriting the whole column
  New_Data$Book[count_New] = New_Books
  New_Data$Author[count_New] = New_Authors
  count_New = count_New + 1
}
where Cols are the column indices of the entries that I know stay the same for a person (age, weight, school, name), L is the number of unique names in the data frame, count_New is a counter initialized at 1, and New_Data is an empty data frame with the same columns as Data. What function could I use that would let me consolidate my data without using a for loop of this kind?
This kind of thing can be done with base R, but it's probably better to use a package designed for data wrangling.
In dplyr:
library(dplyr)
Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = paste(Book, collapse = "\n"),
            Authors = paste(Author, collapse = "\n"))
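To check the result end to end, here is a minimal self-contained sketch. It assumes dplyr is installed, rebuilds the sample data in a compact column-wise form, and stores the result in `res` (a name I picked for illustration). The `ungroup()` call is my addition so that the result is a plain, ungrouped table.

```r
library(dplyr)

# Compact reconstruction of the sample data from the question
Data <- data.frame(
  Name   = rep(c("Paul", "David"), times = c(4, 3)),
  Age    = rep(c(26, 35), times = c(4, 3)),
  Weight = rep(c(150, 165), times = c(4, 3)),
  School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson",
             "Thomas Pynchon", "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# One row per person; books and authors are joined with newlines
res <- Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = paste(Book, collapse = "\n"),
            Authors = paste(Author, collapse = "\n")) %>%
  ungroup()

# cat() renders the embedded newlines; print() would show them as "\n"
cat(res$Books[res$Name == "Paul"], "\n")
```

Note that printing the table directly shows the literal `\n` escape inside each cell; the line breaks only appear once the string is written out with `cat()` or exported.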
I suspect that the following is closer to what you really want, though. Instead of pasting the book titles (and authors) into one string for each name, it keeps them as a vector of titles per person (stored in a list column), which can then be used for further processing.
Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = list(Book), Authors = list(Author))
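As a rough illustration of the "further processing" that list columns allow, the sketch below counts the books per person and pulls out one person's titles as an ordinary character vector. It assumes dplyr is installed and uses a compact reconstruction of the sample data; `res`, `n_books` and `paul_books` are names I made up for the example.

```r
library(dplyr)

# Compact reconstruction of the sample data from the question
Data <- data.frame(
  Name   = rep(c("Paul", "David"), times = c(4, 3)),
  Age    = rep(c(26, 35), times = c(4, 3)),
  Weight = rep(c(150, 165), times = c(4, 3)),
  School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson",
             "Thomas Pynchon", "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# Each cell of Books/Authors holds a whole character vector
res <- Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = list(Book), Authors = list(Author)) %>%
  ungroup()

# Number of books per person, one entry per row of res
n_books <- lengths(res$Books)

# Titles for a single person, as a plain character vector
paul_books <- res$Books[[which(res$Name == "Paul")]]
print(paul_books)
```

If the newline-joined string is needed later, for display say, it can still be produced from the list column at that point with `paste(paul_books, collapse = "\n")`.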
Consider this base R solution (albeit not as efficient or elegant):
# OBTAIN UNIQUE PERSONS DATAFRAME
Data2 <- unique(Data[1:4])
rownames(Data2) <- NULL
# ADD EMPTY COLUMNS TO HOLD THE CONCATENATED VALUES
Data2$Book <- NA_character_
Data2$Author <- NA_character_
# LOOP THROUGH DISTINCT PERSONS (WORKS FOR ANY NUMBER OF PERSONS)
for (p in Data2$Name){
  # BOOK COLUMN (PULL THIS PERSON'S BOOKS, THEN CONCATENATE)
  Data2$Book[Data2$Name == p] <- paste(Data$Book[Data$Name == p], collapse = "\n")
  # AUTHOR COLUMN (PULL THIS PERSON'S AUTHORS, THEN CONCATENATE)
  Data2$Author[Data2$Name == p] <- paste(Data$Author[Data$Name == p], collapse = "\n")
}
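For what it's worth, base R can also do this without an explicit loop. The sketch below swaps in `aggregate()`, which is not what the answer above uses, and assumes the same sample data rebuilt in a compact form; `id_cols`, `val_cols` and `out` are hypothetical names chosen for the example.

```r
# Compact reconstruction of the sample data from the question
Data <- data.frame(
  Name   = rep(c("Paul", "David"), times = c(4, 3)),
  Age    = rep(c(26, 35), times = c(4, 3)),
  Weight = rep(c(150, 165), times = c(4, 3)),
  School = rep(c("Helgason U", "Turing College"), times = c(4, 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson",
             "Thomas Pynchon", "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# Columns that stay constant per person, and columns to concatenate
id_cols  <- c("Name", "Age", "Weight", "School")
val_cols <- c("Book", "Author")

# aggregate() applies paste(..., collapse = "\n") to each value column
# within every combination of the id columns
out <- aggregate(Data[val_cols], by = Data[id_cols],
                 FUN = paste, collapse = "\n")

cat(out$Book[out$Name == "David"], "\n")
```

This groups on all four constant columns rather than on Name alone, so two different people who happen to share a name would stay on separate rows.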