R Dataframe: aggregating strings within column, across rows, by group

I have what seems like a very inefficient solution to a peculiar problem. I have text data which, for various reasons, is broken across rows of a dataframe at random intervals. However, certain subsets of rows are known to belong together based on unique combinations of other variables in the dataframe. See, for example, an MWE demonstrating the structure and my initial solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy", header=TRUE, stringsAsFactors=FALSE)

master<-data.frame()
for (i in 123:max(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
#   tempset.page.1. tempset.passage.1. tempset.person.1.                concat
# 1               1                123                 A  hello my name is guy
# 2               1                124                 B        well hello guy

In this example, as in my real case, "passage" is the unique grouping variable, so it is not strictly necessary to carry the other columns along with it, although I'd like them available in my dataset.

My current estimate is that this procedure I have devised will take several hours for a dataset that is otherwise easily handled by R on my computer. Perhaps there are some efficiencies to be gained, either from other functions or packages, or from not creating and removing so many objects?

Thanks for any help here!

data.table. Here's one way:

require(data.table)
DT <- data.table(df)

DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
#    page passage person               concat
# 1:    1     123      A hello my name is guy
# 2:    1     124      B       well hello guy

Putting the extra variables (besides passage) in the by doesn't cost much, I think.
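If passage alone identifies a group, as the question states, the other columns can instead be carried along by taking their first value within each group. The following is a minimal sketch under that assumption; the toy table is rebuilt inline so the snippet runs on its own, and the name res is only illustrative.

```r
library(data.table)

# Toy data mirroring the question's example (rebuilt here so the block is self-contained)
DT <- data.table(
  page    = 1L,
  passage = rep(c(123L, 124L), times = c(5L, 3L)),
  person  = rep(c("A", "B"), times = c(5L, 3L)),
  index   = c(1:5, 1:3),
  text    = c("hello", "my", "name", "is", "guy", "well", "hello", "guy")
)

# Group by passage only; assumes page and person are constant within a passage,
# so the first value in each group is representative
res <- DT[, .(page   = page[1],
              person = person[1],
              concat = paste(text, collapse = " ")),
          by = passage]
res
```

Whether this is faster than listing all three columns in by will depend on the data, so it is worth timing both on the real dataset.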


dplyr. The analogue is:

library(dplyr)

df %>% 
  group_by(page,passage,person) %>% 
  summarise(concat=paste0(text,collapse=" "))

# Source: local data frame [2 x 4]
# Groups: page, passage, person
# 
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy

base R. One way is:

df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[, c("page", "passage", "person", "concat")])
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 6    1     124      B       well hello guy
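Note that ave() returns one value per row, which is why the result has to be deduplicated afterwards. A possible variant that builds one string per group directly uses split() and vapply(); this is only a sketch, it assumes passage is the sole grouping key, and the names pieces, concat and firsts are illustrative.

```r
# Toy data from the question (rebuilt here so the block is self-contained)
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

# One character vector of text fragments per passage
pieces <- split(df$text, df$passage)

# Collapse each group to a single string; vapply enforces a length-1 character result
concat <- vapply(pieces, paste, character(1), collapse = " ")

# Keep the first row of each passage for the identifying columns
firsts <- df[!duplicated(df$passage), c("page", "passage", "person")]

# split() names its groups by passage value, so look the strings up by name
firsts$concat <- unname(concat[as.character(firsts$passage)])
firsts
```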

Here are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)
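All of these approaches concatenate the text in whatever order the rows happen to appear. If the real data are not guaranteed to be sorted, one option is to order by passage and index first. The sketch below shuffles the toy data on purpose to illustrate this; it assumes index gives the intended word order within a passage, and df_sorted and res are illustrative names.

```r
# Toy data from the question, with rows deliberately out of order
df <- read.table(text = "page passage person index text
1 123 A 3 name
1 124 B 2 hello
1 123 A 1 hello
1 123 A 5 guy
1 124 B 3 guy
1 123 A 2 my
1 124 B 1 well
1 123 A 4 is", header = TRUE, stringsAsFactors = FALSE)

# Restore the intended order within each passage before collapsing
df_sorted <- df[order(df$passage, df$index), ]

# Same aggregate() call as above, applied to the sorted data
res <- aggregate(
    text ~ page + passage + person,
    data = df_sorted,
    FUN = paste, collapse = " "
)
res
```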

dplyr

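library(dplyr)
# group_by_() and summarize_() are deprecated in current dplyr;
# the standard verbs do the same job here
df %>% 
    group_by(page, passage, person) %>%
    summarize(text = paste(text, collapse = ' '))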

