
Split strings into smaller ones to create new rows in a data frame (in R)

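I'm a new R user and am currently struggling with how to split the string in each row of a data frame and then create a new row with the modified string (along with modifying the original string). Below is an example, but the actual dataset is much larger.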

library(dplyr)
library(stringr)
library(tidyverse)
library(utils)

posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3), 
                "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"), 
                "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)

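I want to break up sentences that exceed a certain word count (15 for this dataset), using regular expressions to create new sentences out of the longer ones. The idea is to first try to break a sentence at a period (or other symbol); if the word count is still too long, try a comma followed by an I (or a capital letter); then try 'and' followed by a capital letter, and so on. Each time a new sentence is created, the sentence in the old row needs to be cut down to just the first part and its word count updated (I have a function for this). A new row must also be created with the same element id, the next sentence id in the sequence (if sentence_id was 1, the new sentence is 2), and the new sentence's word count. All the sentences below it then need to be renumbered to the next sentence id.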

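I have been working on this for a few days and can't figure out how to do it. I've tried unnest tokens, str_split/extract, and various dplyr combinations of filter, mutate, etc., along with Google/SO searches. Does anyone know the best way to accomplish this? dplyr is preferred, but I'm open to anything that works. Feel free to ask questions if you need any clarification!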

Edit to add the expected output data frame:

expected_output <- data.frame("element_id" = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), "sentence_id" = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6), 
                                   "sentence" = c("You know, when I grew up", "I grew up in a very religious family", "I had the same sought of troubles people have", "I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.", "I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.", "I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and", "I don't know who to tell and", "I was going to tell my friend about it but I'm not sure.", "I keep saying omg!", "it's too much"), 
                                   "sentence_wc" = c(6, 8, 8, 21, 4, 27, 6, 7, 9, 7, 13, 4, 3), stringsAsFactors=FALSE)

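Here is a tidyverse approach that lets you specify your own heuristics, which I think should be the best fit for your situation. The key is to use pmap to create a list with one element per row, which you can then split where needed using map_if. In my opinion this is hard to do with dplyr alone, because the operation adds rows, which makes rowwise difficult to use.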

The structure of split_too_long() is basically:

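  1. Use dplyr::mutate and tokenizers::count_words to get the word count of each sentence
  2. Make each row an element of a list using purrr::pmap, which accepts the data frame as a list of columns as input
  3. Use purrr::map_if to check whether the word count is greater than the desired limit
  4. If the above condition is met, use tidyr::separate_rows to split the sentence into multiple rows,
  5. then replace the word count with the new word count and drop any empty rows (created by doubled-up separators) using filter.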

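We can then apply this with different separators as we find that elements need to be split further. Here I use these patterns, corresponding to the heuristics you mention: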

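  • "[\\.\\?\\!] ?" which matches any of ., ! or ? followed by an optional space
  • ", ?(?=[:upper:])" which matches a comma and an optional space, preceding an uppercase letter
  • "and ?(?=[:upper:])" which matches and followed by an optional space, preceding an uppercase letter.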
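To see what each separator does in isolation, here is a minimal sketch in base R. It is only an illustration under a few assumptions: split_on is a hypothetical helper, strsplit(..., perl = TRUE) stands in for tidyr::separate_rows, and the uppercase class is written [[:upper:]] so that the PCRE engine accepts it. The sample strings are shortened from the question's data.

```r
# Sample text shortened from the question's data
x1 <- "Im at breaking point.I dont know what to do."
x2 <- "You know, when I grew up, I grew up in a very religious family"
x3 <- "I have so many thoughts and feelings inside and I don't know who to tell"

# Hypothetical helper: split one string on a regex and trim whitespace.
# Base strsplit() is used here instead of tidyr::separate_rows().
split_on <- function(x, pattern) {
  trimws(strsplit(x, pattern, perl = TRUE)[[1]])
}

# Sentence-ending punctuation plus an optional space
by_punct <- split_on(x1, "[.?!] ?")
# "Im at breaking point" "I dont know what to do"

# Comma and optional space, only when an uppercase letter follows
by_comma <- split_on(x2, ", ?(?=[[:upper:]])")
# "You know, when I grew up" "I grew up in a very religious family"

# "and" plus optional space, only when an uppercase letter follows
by_and <- split_on(x3, "and ?(?=[[:upper:]])")
# "I have so many thoughts and feelings inside" "I don't know who to tell"
```

The lookahead (?=...) only checks for the capital letter without consuming it, which is why "I" stays at the start of each new piece while the comma or "and" itself is dropped.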

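It splits the sentences at the same places as the expected output, although the separators themselves (the punctuation, or the word and) are consumed by the split. sentence_id can easily be added at the end with row_number, and stray leading/trailing whitespace can be removed with stringr::str_trim.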

Caveats:

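  • I wrote this with readability for exploratory analysis in mind, hence splitting into a list and binding back together every time. If you decide beforehand which separators you want, you could put it all into a single map step, which might make it faster, though I haven't profiled this on a large dataset.
  • As noted in the comments, some sentences still have more than 15 words after these splits. You will have to decide which additional symbols/regular expressions to split on to shorten them further.
  • The column names are currently hardcoded into split_too_long. If being able to specify the column names in the function call is important to you, I suggest looking at the programming with dplyr vignette (it should only take a few tweaks).
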
posts_sentences <- data.frame(
  "element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
  "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
  "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors = FALSE
)

library(tidyverse)
library(tokenizers)
split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., sentence, sep = regexp)
      ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%
    filter(wc != 0)
}

posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>%
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%
  split_too_long(", ?(?=[:upper:])", 15) %>%
  split_too_long("and ?(?=[:upper:])", 15) %>%
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups:   element_id [2]
#>    element_id sentence_id sentence                                      wc
#>         <dbl>       <int> <chr>                                      <int>
#>  1          1           1 You know, when I grew up                       6
#>  2          1           2 I grew up in a very religious family           8
#>  3          1           3 I had the same sought of troubles people ~     9
#>  4          1           4 I was excelling in alot of ways, but beca~    21
#>  5          1           5 Im at breaking point                           4
#>  6          1           6 I have no one to talk to about this and i~    29
#>  7          1           7 I dont know what to do                         6
#>  8          2           1 I feel like I’m going to explode               7
#>  9          2           2 I have so many thoughts and feelings insi~     8
#> 10          2           3 I don't know who to tell                       6
#> 11          2           4 I was going to tell my friend about it bu~    13
#> 12          2           5 I keep saying omg                              4
#> 13          2           6 it's too much                                  3

Created on 2018-05-21 by the reprex package (v0.2.0).

Edit: I have edited the whole answer to address the specific problem in more detail.

This is not entirely general, because it assumes the groups are based solely on element_id.

split_too_long <- function(str, max.words=15L, ...) {
  cuts <- stringi::stri_locate_all_words(str)[[1L]]

  # return one of these
  if (nrow(cuts) <= max.words) {
    c(str, NA_character_)
  }
  else {
    left <- substr(str, 1L, cuts[max.words, 2L])
    right <- substr(str, cuts[max.words + 1L, 1L], nchar(str))
    c(left, right)
  }
}

recursive_split <- function(not_done, done=NULL, ...) {
  left_right <- split_too_long(not_done, ...)

  # return one of these
  if (is.na(left_right[2L]))
    c(done, left_right[1L])
  else
    recursive_split(left_right[2L], done=c(done, left_right[1L]), ...)
}

collapse_split <- function(sentences, regex="[.;:] ?", ...) {
  sentences <- paste(sentences, collapse=". ")
  sentences <- unlist(strsplit(sentences, split=regex))
  # return
  unlist(lapply(sentences, recursive_split, done=NULL, ...))
}

group_fun <- function(grouped_df, ...) {
  # initialize new data frame with new number of rows
  new_df <- data.frame(sentence=collapse_split(grouped_df$sentence, ...),
                       stringsAsFactors=FALSE)
  # count words
  new_df$sentence_wc <- stringi::stri_count_words(new_df$sentence)
  # add sentence_id
  new_df$sentence_id <- 1L:nrow(new_df)
  # element_id must be equal because it is a grouping variable,
  # so take 1 to repeat it in output
  new_df$element_id <- grouped_df$element_id[1L]
  # return
  dplyr::filter(new_df, sentence_wc > 0L)
}

out <- posts_sentences %>%
  group_by(element_id) %>%
  do(group_fun(., max.words=5L, regex="[.;:!] ?"))

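This solution first splits sentences at a comma or period that comes before a capital letter. It then splits on commas and periods alone. Finally, if a sentence is still over the word limit, it is split at each capital letter.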

posts_sentences <- data.frame("element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3), 
                              "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"), 
                              "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors=FALSE)

# Create an empty data frame to hold the new rows

new_posts_sentences <- data.frame(element_id = as.numeric(),
                 sentence_id =as.numeric(), 
                 sentence = character(), 
                 sentence_wc = as.numeric(),  stringsAsFactors=FALSE) 

limit_words <- 15 # 15 for this data set

countSentences <- 0

for (sentence in posts_sentences[,3]) {

        vector <- character()

        Velement_id <- posts_sentences$element_id[countSentences + 1]

        vector <- c(vector, sentence) #To create a vector with the sentences
        vector <- vector[!vector %in% ''] #remove empty elements from vector

        ## First, split at a comma (or colon) followed by whitespace and a capital letter
        if(lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words ){

                vector <- vector[!vector %in% sentence]

                split_points <- unlist(gregexpr("[:,:]\\s[A-Z]", sentence)) # To get the character position

                ## Cut the sentence at those positions; any piece still over the word limit is split again below at each comma or period
                sentences_1 <- substring(sentence, c(1, split_points + 2), c(split_points -1, nchar(sentence)))

                for(sentence in sentences_1){

                        vector <- c(vector, sentence)
                        vector <- vector[!vector %in% '']

                        if(lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words){

                                vector <- vector[!vector %in% sentence]

                                split_points <- unlist(gregexpr("[:,:]|[:.:]", sentence))

                                sentences_2 <- substring(sentence, c(1, split_points + 1), c(split_points -1, nchar(sentence)))

                                ## If a sentence is still over the word limit, split it before each capital letter

                                for(sentence in sentences_2){

                                        vector <- c(vector, sentence)
                                        vector <- vector[!vector %in% '']

                                        if(lengths(gregexpr("[A-z]\\W+", sentence)) > limit_words){

                                                vector <- vector[!vector %in% sentence]

                                                split_points <- unlist(gregexpr("[A-Z]", sentence))

                                                sentences_3 <- substring(sentence,c(1, split_points), c(split_points -1, nchar(sentence)))

                                                vector <- c(vector, sentences_3)
                                                vector <- vector[!vector %in% '']

                                        }

                                }

                        }

                }

        }

        ## Build a data frame for each original sentence
        element_id <- rep(Velement_id, length(vector))
        sentence_id <- 1:length(vector)
        sentence_wc <- character()
        for (element in vector){sentence_wc <- c(sentence_wc, (lengths(gregexpr("[A-z]\\W+", element)))) }
        sentenceDataFrame <- data.frame(element_id, sentence_id, vector, sentence_wc)       

        ## To join it with the final dataframe
        new_posts_sentences <- rbind(new_posts_sentences, sentenceDataFrame)

        countSentences <- countSentences + 1

}

You get this data frame:

print(new_posts_sentences)

   element_id sentence_id                                           vector sentence_wc
1           1           1                         You know, when I grew up           5
2           1           2             I grew up in a very religious family           7
3           1           3    I had the same sought of troubles people have           8
4           1           4                  I was excelling in alot of ways           6
5           1           5    but because there was alot of trouble at home           8
6           1           6                     we were always moving around           4
7           1           1                             Im at breaking point           3
8           1           2      I have no one to talk to about this and if           11
9           1           3                                      I’m honest            3
10          1           4                                         I think            2
11          1           5        I’m too scared to tell anyone because if            9
12          1           6                        I do then it becomes real           5
13          1           7                           I dont know what to do           5
14          2           1                I feel like I’m going to explode.           8
15          2           1 I have so many thoughts and feelings inside and            9
16          2           2                    I don't know who to tell and            8
17          2           3      I was going to tell my friend about it but           10
18          2           4                                     I'm not sure           3
19          2           1                  I keep saying omg!it's too much           7

I hope it helps.

An alternative tidyverse solution:

library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(utils)

check_and_split <- function(element_id, sentence_id, sentence, sentence_wc,
                             word_count, attmpt){

  methods <- c("\\.", ",\\s?(?=[I])", "and\\s?(?=[A-Z])")
  df <- data.frame(element_id=element_id,
             sentence_id=sentence_id,
             sentence=sentence,
             sentence_wc=sentence_wc,
             word_count=word_count,
             attmpt=attmpt,
             stringsAsFactors = FALSE)

    if(word_count<=15 | attmpt>=3){
      return(df) #early return
    } else{
     df %>% 
        tidyr::separate_rows(sentence, sep=methods[attmpt+1]) %>% 
        mutate(word_count=str_count(sentence,'\\w+'),
               attmpt = attmpt+1)
    }
}

posts_sentences %>% 
  mutate(word_count=str_count(sentence,'\\w+'),
         attmpt=0) %>%
  pmap_dfr(check_and_split) %>% 
  pmap_dfr(check_and_split) %>% 
  pmap_dfr(check_and_split) 

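Here we create a helper function that takes a single row (broken into its elements by purrr::pmap()), reassembles it into a data frame, and checks whether the word count is over 15 and how many splitting attempts have already been made on the sentence. We then use tidyr::separate_rows() with the separator corresponding to the next attempt, update word_count and the number of attempts, and return the data frame.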

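I apply the same function three times; this could probably be wrapped in a loop (lapply/purrr::map will not work, because each pass needs the data frame as updated by the previous one).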
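As a rough sketch of what such a loop could look like, the example below applies the same three separators one after another to a plain character vector. It is a simplified stand-in, not the check_and_split function above: split_long is a hypothetical base R helper, words are counted with "\\w+", and the sample sentences are taken from the question's data.

```r
# Separators from the answer above, tried in this order
patterns <- c("\\.", ",\\s?(?=[I])", "and\\s?(?=[A-Z])")

# Hypothetical helper: one pass over a character vector, splitting only
# the sentences that are still longer than max_words
split_long <- function(sentences, pattern, max_words = 15) {
  unlist(lapply(sentences, function(s) {
    n_words <- lengths(regmatches(s, gregexpr("\\w+", s)))
    if (n_words <= max_words) return(s)
    parts <- trimws(strsplit(s, pattern, perl = TRUE)[[1]])
    parts[nzchar(parts)]  # drop empty pieces left by adjacent separators
  }))
}

sentences <- c(
  "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.",
  "I keep saying omg!it's too much"
)

# Each pass must see the result of the previous one, hence a loop
result <- sentences
for (p in patterns) {
  result <- split_long(result, p)
}
result
```

With these inputs the long sentence ends up in three pieces and the short one is left untouched. The same sequential update can also be written as a fold, for example Reduce(function(acc, p) split_long(acc, p), patterns, sentences).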

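As for the regex separators: the first is a literal . ; the second is a comma and an optional whitespace character followed by "I" (note the positive lookahead syntax); the last is "and", an optional whitespace character, and a lookahead for a capital letter.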

Hope this makes sense.

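I think the simplest way is to use the str_split() function from the stringr package (to split each block of text according to your regular expression) and then the unnest() function from the tidyr package.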

sentences_split = posts_sentences %>%
  mutate(text_split=str_split(sentence, pattern = "\\.")) %>%
  unnest(text_split) %>%

  #Count number of words in text_split
  mutate(wc_split = str_count(text_split, "\\w+")) %>%

  filter(wc_split!=0) %>%

  #Split again if text_split column has >15 words
  mutate(text_split_again = ifelse(wc_split>15, str_split(text_split, pattern = ",\\sI"), text_split)) %>%
  unnest(text_split_again) 


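Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.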

 