Conditional string concatenation within the same column in R

Question

I am new to R and have a very large, irregular column in a data frame, like this:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations 
Page one: characters
Page two: index
BOOK III: General Principles
BOOK III: General Principles
Page one: invitation

I need to concatenate this column so that it looks like this:

section
BOOK I: Introduction 
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation

Basically, the goal is to pick out the upper string based on a condition and then concatenate it with each string below it, updating the value with a regular expression, but I really don't know how to do it.

Thanks in advance.

Answer 1

Here is one method:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

x <- dplyr::mutate(x,
  isSection = stringr::str_starts(section, "Page", negate = TRUE),
  sectionNum = cumsum(isSection)
) |> 
  dplyr::group_by(sectionNum) |> 
  dplyr::mutate(newSection = dplyr::if_else(
    condition = isSection, 
    true = section, 
    false = paste(dplyr::first(section), section, sep = " / ")
  )) |>
  dplyr::ungroup()

x
#> # A tibble: 9 × 4
#>   section                      isSection sectionNum newSection                  
#>   <chr>                        <lgl>          <int> <chr>                       
#> 1 BOOK I: Introduction         TRUE               1 BOOK I: Introduction        
#> 2 Page one: presentation       FALSE              1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments    FALSE              1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations  TRUE               2 MAGAZINE II: Considerations 
#> 5 Page one: characters         FALSE              2 MAGAZINE II: Considerations…
#> 6 Page two: index              FALSE              2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE               3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE               4 BOOK III: General Principles
#> 9 Page one: invitation         FALSE              4 BOOK III: General Principle…

Created on 2022-03-25 by the reprex package (v2.0.1)

Here, we first determine whether each row is a section title or a page title and save that as TRUE or FALSE.

Then, we label the pages belonging to a section by using cumsum() (cumulative sum). When we add up TRUE and FALSE values, each TRUE (here, a section) counts as 1 and increments the cumulative sum, while each FALSE (here, a page) counts as 0 and does not, so all of the pages within a specific section receive the same value as the section title above them.

Lastly, we make a new section variable, this time using group_by() and if_else() to conditionally set the value. If isSection is TRUE, we just keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section from the group with the existing value of section, separated by " / ".
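To make the cumsum() step concrete, here is a minimal base R sketch. The vector is_section is a hand-written stand-in for the isSection column in the output above, and section_num is an illustrative name, not part of the original answer.

```r
# TRUE marks a section title row, FALSE marks a page row
# (values copied by hand from the isSection column shown above)
is_section <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

# TRUE counts as 1 and FALSE as 0, so the running total only
# increases when a new section title is reached
section_num <- cumsum(is_section)

section_num
#> [1] 1 1 1 2 2 2 3 4 4
```

Note that the repeated "BOOK III" title gets its own group (3), which is why it stays unchanged in the final result.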

Answer 2

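Using data.table:

library(data.table)
library(magrittr)  # provides the %>% pipe used below

# copy title rows (those not starting with "Page") into a new header column,
# carry each title forward over the NA rows, then prepend it to the page rows
setDT(x)[!grepl("^Page", section), header := section] %>% 
  .[, header := zoo::na.locf(header)] %>% 
  .[section != header, header := paste0(header, " / ", section)] %>% 
  .[, .(section = header)] %>% 
  .[]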

1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
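The key step in this answer is zoo::na.locf(), which fills each NA with the last non-missing value before it ("last observation carried forward"). As a rough illustration of that idea only, the following base R sketch does the same thing without zoo. It is not how zoo implements it, the variable names are made up for the example, and it assumes the first element is not NA.

```r
# A header column where only title rows are filled in
header <- c("BOOK I", NA, NA, "MAGAZINE II", NA)

# For each position, find the index of the most recent non-NA element:
# positions of NA are zeroed out, then cummax() carries the last index forward
last_seen <- cummax(seq_along(header) * !is.na(header))

# Index the original vector by those positions to fill the gaps
filled <- header[last_seen]

filled
#> [1] "BOOK I"      "BOOK I"      "BOOK I"      "MAGAZINE II" "MAGAZINE II"
```

If the vector could start with NA, the leading zeros in last_seen would silently drop elements, so that case would need separate handling.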

Answer 3

A rolling join could achieve this. In data.table:

library( data.table )

# convert x to a data.table by reference (it was created as a data.frame)
setDT(x)

# add a row column for joining by reference
x[ , row := .I ]

# pick out just the title rows. It looks like these start with either "BOOK" or "MAGAZINE"
books_magazines <- x[ grepl("^BOOK|^MAGAZINE", section),
                      .(row, book_magazine = section) ]

# join the 2 tables, using a rolling join to add the title row to subsequent rows
both_cols <- books_magazines[ x, on = .(row), roll = TRUE ]

# concatenate the 2 columns together where necessary, leave it alone if it's the title row
result <- both_cols[ , .(
    section_string = fifelse( book_magazine == section,
                              book_magazine,
                              sprintf("%s / %s", book_magazine, section) )
) ]

This gives:

> result$section_string

[1] "BOOK I: Introduction"                               
[2] "BOOK I: Introduction / Page one: presentation"      
[3] "BOOK I: Introduction / Page two: acknowledgments"   
[4] "MAGAZINE II: Considerations"                        
[5] "MAGAZINE II: Considerations / Page one: characters" 
[6] "MAGAZINE II: Considerations / Page two: index"      
[7] "BOOK III: General Principles"                       
[8] "BOOK III: General Principles"                       
[9] "BOOK III: General Principles / Page one: invitation"
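For readers unfamiliar with rolling joins, here is a small sketch of what roll = TRUE does in isolation. The tables titles and rows and their contents are invented for the illustration; only the join syntax mirrors the answer above.

```r
library(data.table)

# Title rows only: the row number where each title appears
titles <- data.table(row = c(1L, 4L), title = c("A", "B"))

# Every row number in the full table
rows <- data.table(row = 1:5)

# For each row in `rows`, look up `row` in `titles`; when there is no exact
# match, roll = TRUE carries the most recent earlier title forward
joined <- titles[rows, on = .(row), roll = TRUE]

joined$title
#> [1] "A" "A" "A" "B" "B"
```

Rows 2 and 3 have no exact match, so they inherit the title from row 1, and row 5 inherits the title from row 4.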

Answer 4

You can do:

unlist(lapply(split(x$section, cumsum(grepl('^[A-Z]{3}', x$section))), 
              function(y) {
                  if(length(y) == 1) return(y)
                  else c(y[1], paste(y[1], y[-1], sep = " / "))
                }), use.names = FALSE)
#> [1] "BOOK I: Introduction"                               
#> [2] "BOOK I: Introduction / Page one: presentation"      
#> [3] "BOOK I: Introduction / Page two: acknowledgments"   
#> [4] "MAGAZINE II: Considerations"                        
#> [5] "MAGAZINE II: Considerations / Page one: characters" 
#> [6] "MAGAZINE II: Considerations / Page two: index"      
#> [7] "BOOK III: General Principles"                       
#> [8] "BOOK III: General Principles"                       
#> [9] "BOOK III: General Principles / Page one: invitation"
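It may help to look at the intermediate list that split() produces before lapply() pastes anything together. The sketch below uses a shortened, hand-picked subset of the question's data; the names section and groups are illustrative only.

```r
section <- c("BOOK I: Introduction", "Page one: presentation",
             "MAGAZINE II: Considerations", "Page one: characters",
             "Page two: index")

# A row starts a new group when it begins with three capital letters,
# which matches "BOOK" and "MAGAZINE" but not "Page"
group_id <- cumsum(grepl("^[A-Z]{3}", section))

# One list element per group; the title is always the first entry
groups <- split(section, group_id)

lengths(groups)
#> 1 2 
#> 2 3
```

Each element of groups is then handled on its own: the first entry is kept as is and is prepended to any remaining entries.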

Answer 5

A slightly simpler data.table approach:

library(data.table)
setDT(x)

x[, g := cumsum(grepl('(BOOK|MAGAZINE)', section))]
x[, section := ifelse(seq_along(section) == 1,
    section, paste(section[1], section, sep = ' / ')), by = .(g)]
x[, g := NULL]

The output is:

> x
                                               section
1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation

5 answers

Answer 1
8 2022-03-25 12:49:28

Answer 2
3 2022-03-25 12:52:00

Answer 3
3 2022-03-25 12:52:38

Answer 4
2 (accepted) 2022-03-25 12:47:09

Answer 5
1 2022-03-25 13:12:12
