Extract numbers and the text that follows them, and create multiple new columns in R

Question

I have free-text data with numerous references to specific questions, and I'd like to organize it as shown below.

I'm able to create columns that note mentions of a certain topic (if the respondent references it by number), but I'd like a way to extract all of the text following each number, up to the point where another number is encountered.

Thanks in advance for any help!

library(tidyverse, warn.conflicts = FALSE)

# Data
df <- data.frame(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"))

# I can identify the mentions if a respondent specifies the number they are responding to
df <- df %>% 
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric)) %>% 
  unnest_wider(col = mention, names_sep = "_")

# Ideal output
df_ideal <- structure(list(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", 
"1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
), mention_1 = c(1, 1), mention_2 = c(4, 17), mention_3 = c(9, 
25), mention_4 = c(NA, 43), comment_1 = c("1: this is fine.", 
"1 ok this is fine."), comment_2 = c("4 this is fine too.", "17 i do not like this idea."
), comment_3 = c("9 not so good", "25 great idea"), comment_4 = c(NA, 
"43 cool idea")), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))

Created on 2021-04-18 by the reprex package (v2.0.0)

Answer 1

One option is strsplit, splitting at one or more whitespace characters ( \\s+ ) that follow a dot ( \\. , escaped because . is a regex metacharacter) and precede a digit or # (regex lookarounds). We then loop over the resulting list with lapply , use sub to remove any non-digit characters ( \\D+ ) from the start ( ^ ) of each string, rbind the list elements, and assign the result to the 'comment_' columns of the original dataset 'df'.

df[paste0('comment_', 1:3)] <- do.call(rbind, lapply(strsplit(df$comment, 
      "(?<=\\.)\\s+(?=[0-9#])", perl = TRUE), function(x) sub("^\\D+", "", x)))

Output:

df
# A tibble: 2 x 7
  comment                                                      mention_1 mention_2 mention_3 comment_1          comment_2                   comment_3    
  <chr>                                                            <dbl>     <dbl>     <dbl> <chr>              <chr>                       <chr>        
1 topic 1: this is fine. 4 this is fine too. #9 not so good            1         4         9 1: this is fine.   4 this is fine too.         9 not so good
2 1 ok this is fine. 17 i do not like this idea. 25 great idea         1        17        25 1 ok this is fine. 17 i do not like this idea. 25 great idea
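
To make the split-then-clean steps easier to follow, here is a minimal sketch on a single string, using only base R. The variable names (x, parts, cleaned) are illustrative and not part of the original answer.

```r
# One comment taken from the question's example data
x <- "topic 1: this is fine. 4 this is fine too. #9 not so good"

# Split at whitespace that follows a dot and precedes a digit or '#'
parts <- strsplit(x, "(?<=\\.)\\s+(?=[0-9#])", perl = TRUE)[[1]]

# Drop any leading non-digit characters ("topic ", "#") from each piece
cleaned <- sub("^\\D+", "", parts)
print(cleaned)
```

Because the lookbehind requires a dot, this pattern only splits where the previous segment ends with a period; segments that run together without one stay in a single piece.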

Update

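If the lengths differ (as in the updated example), we can pad each list element with NA at the end, up to the maximum length found in the list, so that all elements have equal length before the rbind. The split pattern also changes: (topic \\d+)(*SKIP)(*F) skips over the literal "topic <number>" prefix, and the split then happens at any whitespace that precedes a digit or #, without requiring a preceding dot.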

lst1 <- lapply(strsplit(df$comment, 
   "(topic \\d+)(*SKIP)(*F)|\\s+(?=[0-9#])", perl = TRUE), 
         function(x) sub("^\\D+", "", x))
mx <- max(lengths(lst1))
df[paste0('comment_', seq_len(mx))] <- do.call(rbind,
           lapply(lst1, `length<-`, mx))

Output:

df
# A tibble: 2 x 9
  comment                                                             mention_1 mention_2 mention_3 mention_4 comment_1          comment_2                 comment_3    comment_4  
  <chr>                                                                   <dbl>     <dbl>     <dbl>     <dbl> <chr>              <chr>                     <chr>        <chr>      
1 topic 1: this is fine. 4 this is fine too. #9 not so good                   1         4         9        NA 1: this is fine.   4 this is fine too.       9 not so go… <NA>       
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 co…         1        17        25        43 1 ok this is fine. 17 i do not like this id… 25 great id… 43 cool id…
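
The padding step can be looked at in isolation. The sketch below uses a small hand-written list (lst, padded, and mat are illustrative names, and the strings are typed in rather than produced by strsplit) to show how `length<-` extends shorter vectors with NA so that rbind yields a rectangular matrix.

```r
# Two already-split comments of unequal length (entered by hand for illustration)
lst <- list(
  c("1: this is fine.", "4 this is fine too.", "9 not so good"),
  c("1 ok this is fine.", "17 i do not like this idea.", "25 great idea", "43 cool idea")
)

# Longest element determines the number of columns
mx <- max(lengths(lst))

# `length<-` extends each vector to mx, filling the new slots with NA
padded <- lapply(lst, `length<-`, mx)

# Row-bind into a character matrix with one row per comment
mat <- do.call(rbind, padded)
print(mat)
```

Without the padding, rbind would recycle the shorter vector and emit a warning instead of leaving the missing cell as NA.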

Answer 2

A base R approach is simply to read the data in again:

read.table(text = gsub("(\\d+)","&\\1",df$comment), sep = "&", fill = TRUE,
           comment.char = "", header = FALSE, strip.white = TRUE, na.strings = "")[,-1]
                  V2                          V3            V4           V5
1   1: this is fine.       4 this is fine too. # 9 not so good         <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea
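
As a rough illustration of why this works, the sketch below shows the intermediate string that gsub produces for one shortened comment. The names x, marked, and fields are illustrative, and strsplit stands in for read.table only to make the fields visible.

```r
# A shortened comment from the example data
x <- "1 ok this is fine. 17 i do not like this idea."

# Put an '&' in front of every run of digits so it can act as a field separator
marked <- gsub("(\\d+)", "&\\1", x)
print(marked)

# Splitting on '&' shows the fields read.table would see; the first one is
# empty because the string starts with a separator, hence the [,-1] above
fields <- strsplit(marked, "&", fixed = TRUE)[[1]]
print(trimws(fields))
```

This relies on '&' not appearing anywhere in the comments; if it might, a different separator character would be needed.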

Answer 3

You can continue with your extract and unnest_wider approach.

library(tidyverse)


df %>%
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric), 
         new_comment = str_extract_all(comment, '\\d+.*?(?=\\d|$)')) %>%
  unnest_wider(col = new_comment, names_sep = "_") %>%
  unnest_wider(col = mention, names_sep = "_")

#                                                                    comment
#1                 topic 1: this is fine. 4 this is fine too. #9 not so good
#2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea

#  mention_1 mention_2 mention_3 mention_4       new_comment_1
#1         1         4         9        NA   1: this is fine. 
#2         1        17        25        43 1 ok this is fine. 

#                 new_comment_2  new_comment_3 new_comment_4
#1        4 this is fine too. #  9 not so good          <NA>
#2 17 i do not like this idea.  25 great idea   43 cool idea
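
The pattern '\\d+.*?(?=\\d|$)' can also be tried without the tidyverse. The following is a sketch using base R's gregexpr and regmatches in place of str_extract_all (a substitution made here for illustration, not what the answer itself uses); x and m are illustrative names.

```r
# One comment from the example data
x <- "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"

# \\d+      : the leading number
# .*?       : as little following text as possible
# (?=\\d|$) : stop just before the next digit, or at the end of the string
m <- regmatches(x, gregexpr("\\d+.*?(?=\\d|$)", x, perl = TRUE))[[1]]

# Each match keeps the whitespace before the next number, so trim it
print(trimws(m))
```

Note that a digit appearing inside a comment's text (for example "i like 2 of these") would end that segment early, since the pattern treats every digit as the start of a new mention.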

3 answers

Answer 1: 1, accepted, 2021-04-18 23:26:27
Answer 2: 1, 2021-04-19 00:28:13
Answer 3: 1, 2021-04-19 04:04:56
