
Is it possible to merge rows in R data.frame?

If I have the following data.frame:

> df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'))
> df
   x y
1  a d
2 b* e
3  c f

Is there a clear way to identify rows in which the df$x entry contains the character *, and then use this condition to merge the string entries of that row into the row preceding it, resulting in a data.frame like the following:

> df
     x   y
1 a b* d e
2    c   f

I assume that the first part of the problem (identifying the x values that contain *) can be done fairly easily with regular expressions. What I'm having trouble with is forcing a data.frame row to merge with the row preceding it.
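For the detection step, a minimal sketch: grepl() returns one logical value per element of x. Because * is a regex quantifier, it has to be escaped or matched literally with fixed = TRUE. The names x, has_star and has_star_fixed are illustrative only.

```r
x <- c("a", "b*", "c")

# Match a literal asterisk: either escape it in the regex ...
has_star <- grepl("\\*", x)
# ... or turn regex interpretation off entirely
has_star_fixed <- grepl("*", x, fixed = TRUE)

has_star         # FALSE  TRUE FALSE
which(has_star)  # 2
```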

One particularly tricky case is when several consecutive rows have the pattern, e.g.:

> df <- data.frame(x = c('a', 'b*', 'c*'), y = c('d', 'e', 'f'))
> df
   x y
1  a d
2 b* e
3 c* f

In this case, the resulting data.frame should look like this:

> df
        x     y
1 a b* c* d e f

The main issue I find is that after running one iteration of a loop that pastes the strings from df[2,] into df[1,], the data.frame index does not adapt to the new data.frame size:

> df
     x   y
1 a b* d e
3   c*   f

So, subsequent indexing is disrupted.
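A small sketch of what the printed index shows, assuming character columns: the number in the left margin is a row name, which keeps its old value after a deletion, while positional indexing already follows the new size. This only illustrates the labels. A loop that walks forward can still skip rows, because positions shift after each deletion.

```r
df <- data.frame(x = c("a", "b*", "c*"), y = c("d", "e", "f"),
                 stringsAsFactors = FALSE)

df <- df[-2, ]        # drop the second row
rownames(df)          # "1" "3": the labels keep their old values
df[2, "x"]            # "c*": position 2 is now the former third row

rownames(df) <- NULL  # renumber the labels as 1..nrow(df)
rownames(df)          # "1" "2"
```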

Here is an initial solution:

# Creating the data frame
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'),stringsAsFactors = FALSE)
df

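# Logical vector: TRUE for rows whose x contains *
ast <- grepl("\\*",df$x)

# Loop from the last row to the first, so that removing row i
# does not shift the rows that still have to be visited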
for(i in seq(length(ast),1,-1)){
  if(ast[i]){
    df[i-1,"x"] <- paste(df[i-1,"x"],df[i,"x"],sep=" ")
    df[i-1,"y"] <- paste(df[i-1,"y"],df[i,"y"],sep=" ")
    df <- df[-i,]
  }
}

This is only an initial solution because you still have to handle the case where the first row has a * and other situations like that. I hope it helps already.
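One possible way to guard the first-row case, as a sketch rather than a complete solution: stop the backward loop at row 2, so a * in row 1 is simply left alone. The function name merge_into_previous is hypothetical, and the sketch assumes every column is character.

```r
# Hypothetical helper; assumes all columns of df are character
merge_into_previous <- function(df) {
  ast <- grepl("*", df$x, fixed = TRUE)
  # Walk from the last row down to row 2. Row 1 has no predecessor,
  # so it is never merged, even if it contains a *.
  for (i in rev(seq_along(ast)[-1])) {
    if (ast[i]) {
      for (col in names(df)) {
        df[i - 1, col] <- paste(df[i - 1, col], df[i, col])
      }
      df <- df[-i, , drop = FALSE]
    }
  }
  rownames(df) <- NULL  # renumber the row labels
  df
}

df <- data.frame(x = c("a", "b*", "c*"), y = c("d", "e", "f"),
                 stringsAsFactors = FALSE)
res <- merge_into_previous(df)

# First row starred: nothing to merge into, so the data is returned unchanged
first_starred <- merge_into_previous(
  data.frame(x = c("a*", "b"), y = c("d", "e"), stringsAsFactors = FALSE)
)
```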

This does not actually merge the rows: for each row that has a *, it pastes in the value of the previous row, and then it drops every row that is followed by a row with a *.

library(dplyr)

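# stringsAsFactors = FALSE keeps x and y as character on R versions before 4.0
df <- data.frame(x = c('a', 'b*', 'c'), y = c('d', 'e', 'f'), stringsAsFactors = FALSE)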

df <- mutate(df, 
             Operator = grepl("\\*",x), # Check for *
             lagged.x = lag(x, n = 1),  # Get x value from 1 row ago
             lagged.y = lag(y, n = 1),  # Get y value from 1 row ago
             x = ifelse(Operator, paste(lagged.x, x),x), # if there is * paste lagged x
             y = ifelse(Operator, paste(lagged.y, y),y), # if there is * paste lagged y
             lead.Operator = lead(Operator, n = 1)       # Check if next row has a *
)

# Keep only rows whose following row has no *, plus the last row (which has no following row)
df <- filter(df, !lead.Operator | is.na(lead.Operator))

# Select just the x and y columns
df <- select(df, x, y)

Here are 3 alternatives. (For the base R one, I assumed x and y are character rather than factor. I also made your data more complicated in order to cover different scenarios.)

A slightly more complicated data set:

df <- data.frame(x = c('p','a', 'b*', 'c*', 'd', 'h*', 'j*', 'l*', 'n'), 
                 y = c('r','d', 'e', 'f', 'g', 'i', 'k', 'm', 'o'), 
                 stringsAsFactors = FALSE)

Base R

aggregate(. ~ ID, 
          transform(df, ID = cumsum(!grepl("*", x, fixed = TRUE))),
          paste, collapse = " ")
#   ID          x       y
# 1  1          p       r
# 2  2    a b* c*   d e f
# 3  3 d h* j* l* g i k m
# 4  4          n       o
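To see how the grouping key is built, here is a short sketch on the x column alone. Every row without a * starts a new group, and cumsum() turns those starts into a running group number that the starred rows inherit. The names starts_group, id and merged are illustrative only.

```r
x <- c("p", "a", "b*", "c*", "d", "h*", "j*", "l*", "n")

starts_group <- !grepl("*", x, fixed = TRUE)  # TRUE where the row has no *
id <- cumsum(starts_group)                    # running count of group starts

id
# 1 2 2 2 3 3 3 3 4

# Paste the members of each group together
merged <- as.vector(tapply(x, id, paste, collapse = " "))
merged
# "p" "a b* c*" "d h* j* l*" "n"
```

If the first row contained a *, it would get ID 0 and form a group of its own.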

data.table

library(data.table)
setDT(df)[, lapply(.SD, paste, collapse = " "), 
            by = .(ID = cumsum(!grepl("*", df[["x"]], fixed = TRUE)))]
#    ID          x       y
# 1:  1          p       r
# 2:  2    a b* c*   d e f
# 3:  3 d h* j* l* g i k m
# 4:  4          n       o

dplyr

library(dplyr)
df %>%
  group_by(ID = cumsum(!grepl("*", x, fixed = TRUE))) %>%
  # funs() is deprecated in recent dplyr releases; across() needs dplyr >= 1.0.0
  summarise(across(everything(), ~ paste(.x, collapse = " ")))

# # A tibble: 4 x 3
#      ID          x       y
#   <int>      <chr>   <chr>
# 1     1          p       r
# 2     2    a b* c*   d e f
# 3     3 d h* j* l* g i k m
# 4     4          n       o
