
Efficient way to split a huge string in R

I have a huge string (> 500 MB); it is actually an entire book collection in one. I have some meta information in a separate data frame, e.g. page numbers, (different) authors, and titles. I am trying to detect the title strings in my huge string and split it by title. I assume titles are unique.

The data looks like this:

mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"

# a dataframe of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c( "Lorem", "maecenas"))
mydf

  page    title
1    1    Lorem
2    2 maecenas

mygoal <- mydf  # text that comes after the title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal 

  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus

How can I split the string so that everything between the first and second title becomes the first text element, everything between the second and third title becomes the second text element, and so on, in the most efficient way?

We could use strsplit, with the titles joined into a single alternation pattern:

mygoal$text <- trimws(strsplit(mystring,
      paste(mydf$title, collapse = "|"))[[1]][-1])

Output:

> mygoal
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
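
One thing to watch with real titles: strsplit treats the pasted pattern as a regular expression, so a title containing characters such as "." or "?" would not be matched literally. The following is a minimal sketch of one way to handle that, assuming titles should always be matched as plain text. The helper escape_regex and the sample titles are hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): backslash-escape
# regex metacharacters so that a title is matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Made-up titles that contain regex metacharacters
titles <- c("Mr. Smith", "What?")
book   <- "Mr. Smith aaa What? bbb"

# Join the escaped titles into one alternation pattern and split on it
pattern <- paste(escape_regex(titles), collapse = "|")
texts   <- trimws(strsplit(book, pattern, perl = TRUE)[[1]][-1])
texts
```

Without the escaping step, "What?" would be read as "Wha" followed by an optional "t", and the question mark would be left behind in the following text.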

If you want to do the operation in a piped tidyverse way, you could try stringr::str_extract with a lookaround regex:

library(dplyr)
library(stringr)
library(glue)

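mydf |>
  # pair each title with the title that follows it ("$" = end of string for the last row)
  mutate(next_title = lead(title, default = "$")) |>
  # extract what lies after the title and before the next title;
  # the lookahead (?=...) keeps the next title out of the extracted text
  mutate(text = str_extract(mystring, glue::glue("(?<={title}\\s?)(.*)(?={next_title})"))) |>
  select(-next_title)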

Yielding:

page    title                                      text
1    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
2    2 maecenas          habitasse ultrices aenean tempus
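
One caveat for real book text, which will contain line breaks: in both ICU (used by stringr) and PCRE, "." does not match a newline by default, so a ".*" between two titles stops at the first line break. stringr's regex() has a dotall argument for this case. The sketch below shows the same idea in base R rather than stringr, using the inline (?s) flag with perl = TRUE. Because PCRE lookbehinds must be fixed-length, it uses \K in place of the lookbehind. The sample text and variable names are made up for illustration.

```r
# Made-up multi-line sample; titles are "Lorem" and "maecenas"
book       <- "Lorem ipsum\ndolor maecenas habitasse\ntempus"
title      <- "Lorem"
next_title <- "maecenas"

# Without (?s): "." cannot cross the line break, so nothing is found (-1)
plain   <- paste0(title, "\\s?\\K.*?(?=", next_title, ")")
no_flag <- as.integer(regexpr(plain, book, perl = TRUE))

# With (?s): "." also matches newlines; \K drops the title from the match
dotall    <- paste0("(?s)", plain)
with_flag <- regmatches(book, regexpr(dotall, book, perl = TRUE))

no_flag
with_flag
```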

If performance is a concern, a similar approach with data.table would be:

library(data.table)
library(stringr)
library(glue)

mydt <- setDT(mydf)

mydt[, next_title :=shift(title, fill = "$", type = "lead")][
  ,text := str_extract(..mystring, glue_data(.SD,"(?<={title}\\s?)(.*)(?={next_title})"))][,
    !("next_title")]

Resulting in:

   page    title                                      text
1:    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
2:    2 maecenas          habitasse ultrices aenean tempus

EDIT

Added options for better performance:

Generally, str_split or str_split_fixed will be faster than str_extract.

The problem with str_split is that a regex with many alternations (|) also slows things down. Another solution is to first replace every title in the string with a fixed placeholder string and then split on that placeholder. You can also speed up the splitting by using str_split_fixed and specifying in advance how many pieces to return.

# Create a named character vector for str_replace_all:
# names are the patterns (titles), values are the replacement.
# This assumes "@@" does not otherwise occur in the text.
split_at <- rep("@@", nrow(mydf))
names(split_at) <- mydf$title
mystring <- str_replace_all(mystring, split_at)

# Split on the fixed placeholder (no regex matching needed)
mydf$text <- str_split(mystring, fixed("@@ "))[[1]][-1]

# Alternative (maybe faster): define the number of pieces via nrow
mydf$text <- str_split_fixed(mystring, fixed("@@ "), n = nrow(mydf) + 1)[, -1]

# Using str_split_fixed in data.table
mydt <- setDT(mydf)
mydt[, text := str_split_fixed(mystring, fixed("@@ "), nrow(mydt) + 1)[, -1]]
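
Since the titles are assumed to be unique, another way to avoid a large alternation regex is to locate each title with a literal (fixed) search and cut the string by position. This is a different technique from the placeholder approach above, sketched here in base R on the toy data. Whether it is actually faster on a 500 MB string would need benchmarking. The variable names are illustrative, and the sketch assumes each title's first occurrence is the actual heading.

```r
# Toy data mirroring the example in the question
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
titles   <- c("Lorem", "maecenas")

# Start position of each title; fixed = TRUE searches literally, without regex
starts <- vapply(titles,
                 function(t) regexpr(t, mystring, fixed = TRUE)[1],
                 integer(1))
stopifnot(all(starts > 0))  # every title must be found

# Each text runs from just after its title to just before the next title
# (in document order); the last text runs to the end of the string
ord        <- order(starts)
text_from  <- starts + nchar(titles)
text_to    <- integer(length(titles))
text_to[ord] <- c(starts[ord][-1] - 1L, nchar(mystring))

texts <- trimws(substring(mystring, text_from, text_to))
texts
```

The result can be assigned to mydf$text in the same way as the str_split output, provided titles is taken from mydf$title in row order.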
