Extract number and following text and create multiple new columns in R
I have free-text data with numerous references to specific questions, and I'd like to organize it as shown below.
I'm able to create columns that note mentions of a certain topic (if the respondent references it by number), but I'd like a way to extract all of the text following the number, until another number is encountered.
Thanks in advance for any help!
library(tidyverse, warn.conflicts = FALSE)
# Data
df <- data.frame(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good", "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"))
# I can identify the mentions if a respondent specifies the number they are responding to
df <- df %>%
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric)) %>%
  unnest_wider(col = mention, names_sep = "_")
# Ideal output
df_ideal <- structure(list(comment = c("topic 1: this is fine. 4 this is fine too. #9 not so good",
"1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
), mention_1 = c(1, 1), mention_2 = c(4, 17), mention_3 = c(9,
25), mention_4 = c(NA, 43), comment_1 = c("1: this is fine.",
"1 ok this is fine."), comment_2 = c("4 this is fine too.", "17 i do not like this idea."
), comment_3 = c("9 not so good", "25 great idea"), comment_4 = c(NA,
"43 cool idea")), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Created on 2021-04-18 by the reprex package (v2.0.0)
An option is strsplit: split at one or more spaces ( \\s+ ) that follow a dot ( \\. ; the . metacharacter is escaped) and precede a digit or # (regex lookarounds). Then loop over the resulting list with lapply, remove any non-digit characters ( \\D+ ) from the start ( ^ ) of each string with sub, rbind the list elements, and assign the result to the 'comment_' columns of the original dataset 'df'.
df[paste0('comment_', 1:3)] <- do.call(rbind, lapply(strsplit(df$comment,
"(?<=\\.)\\s+(?=[0-9#])", perl = TRUE), function(x) sub("^\\D+", "", x)))
Output:
df
# A tibble: 2 x 7
comment mention_1 mention_2 mention_3 comment_1 comment_2 comment_3
<chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 topic 1: this is fine. 4 this is fine too. #9 not so good 1 4 9 1: this is fine. 4 this is fine too. 9 not so good
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 1 17 25 1 ok this is fine. 17 i do not like this idea. 25 great idea
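To make the two steps above easier to follow, here is a minimal base R sketch that runs the same split and clean-up on a single string. The variable names (x, parts, cleaned) are only for illustration and are not part of the answer.

```r
# One comment from the question's data
x <- "topic 1: this is fine. 4 this is fine too. #9 not so good"

# Split on whitespace that follows a dot and precedes a digit or '#'
parts <- strsplit(x, "(?<=\\.)\\s+(?=[0-9#])", perl = TRUE)[[1]]

# Drop any leading non-digit characters ("topic ", "#") from each piece
cleaned <- sub("^\\D+", "", parts)
print(cleaned)
```

Note that this pattern only splits after a dot, so a piece such as "25 great idea 43 cool idea" (no dot before 43) would stay in one chunk.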
If the lengths differ (as in the updated example), we can pad NA at the end, based on the max of the lengths of the list, so that the list elements are equal in length before doing the rbind.
lst1 <- lapply(strsplit(df$comment,
"(topic \\d+)(*SKIP)(*F)|\\s+(?=[0-9#])", perl = TRUE),
function(x) sub("^\\D+", "", x))
mx <- max(lengths(lst1))
df[paste0('comment_', seq_len(mx))] <- do.call(rbind,
lapply(lst1, `length<-`, mx))
Output:
df
# A tibble: 2 x 9
comment mention_1 mention_2 mention_3 mention_4 comment_1 comment_2 comment_3 comment_4
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 topic 1: this is fine. 4 this is fine too. #9 not so good 1 4 9 NA 1: this is fine. 4 this is fine too. 9 not so go… <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 co… 1 17 25 43 1 ok this is fine. 17 i do not like this id… 25 great id… 43 cool id…
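The padding step can be looked at in isolation. The sketch below, using illustrative names (comments, pieces, n_pieces, padded, mat), applies the same split pattern to the question's two comments and shows how `length<-` extends the shorter vector with NA. As far as I can tell, the (topic \\d+)(*SKIP)(*F) part of the pattern is there to keep the space inside "topic 1" from being treated as a split point.

```r
comments <- c(
  "topic 1: this is fine. 4 this is fine too. #9 not so good",
  "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"
)

# Split before every number (or '#'), but skip over "topic <n>"
pieces <- strsplit(comments, "(topic \\d+)(*SKIP)(*F)|\\s+(?=[0-9#])", perl = TRUE)

# Number of pieces per comment; they differ between the two rows
n_pieces <- lengths(pieces)

# `length<-` pads a shorter vector with NA up to the requested length
padded <- lapply(pieces, `length<-`, max(n_pieces))

# All elements now have equal length, so rbind gives a rectangular matrix
mat <- do.call(rbind, padded)
print(dim(mat))
```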
A base R approach is just to read the data in again:
read.table(text = gsub("(\\d+)","&\\1",df$comment), sep = "&", fill = TRUE,
comment.char = "", header = FALSE, strip.white = TRUE, na.strings = "")[,-1]
V2 V3 V4 V5
1 1: this is fine. 4 this is fine too. # 9 not so good <NA>
2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea
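The trick in the read.table call is the gsub step, which puts an & in front of every run of digits so that & can then act as the field separator. A small sketch of that intermediate string, with illustrative names (x, marked):

```r
x <- "1 ok this is fine. 17 i do not like this idea."

# Prefix each run of digits with '&'; "\\1" re-inserts the matched number
marked <- gsub("(\\d+)", "&\\1", x)
print(marked)
```

Because every string now starts with &, the first field is always empty, which is why the answer drops the first column with [,-1]. This presumably relies on the comments not containing a literal & themselves; comments containing quote characters might also need quote = "" in read.table.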
You can continue with your extract and unnest_wider approach.
library(tidyverse)
df %>%
  mutate(mention = map(str_extract_all(comment, "[0-9]+"), as.numeric),
         new_comment = str_extract_all(comment, '\\d+.*?(?=\\d|$)')) %>%
  unnest_wider(col = new_comment, names_sep = "_") %>%
  unnest_wider(col = mention, names_sep = "_")
# comment
#1 topic 1: this is fine. 4 this is fine too. #9 not so good
#2 1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea
# mention_1 mention_2 mention_3 mention_4 new_comment_1
#1 1 4 9 NA 1: this is fine.
#2 1 17 25 43 1 ok this is fine.
# new_comment_2 new_comment_3 new_comment_4
#1 4 this is fine too. # 9 not so good <NA>
#2 17 i do not like this idea. 25 great idea 43 cool idea
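To see what the lazy pattern captures on its own, here is a base R sketch that uses regmatches and gregexpr in place of str_extract_all (a swapped-in equivalent, not the answer's code). The names x, raw and trimmed are illustrative. Each match starts at a number and runs lazily up to the next digit or the end of the string, so it can carry a trailing space that trimws removes.

```r
x <- "1 ok this is fine. 17 i do not like this idea. 25 great idea 43 cool idea"

# A number, then as little text as possible until the next digit or the end
raw <- regmatches(x, gregexpr("\\d+.*?(?=\\d|$)", x, perl = TRUE))[[1]]

# The lazy match stops right before the next digit, so strip the trailing space
trimmed <- trimws(raw)
print(trimmed)
```

One assumption worth stating: any digit inside the comment text itself would start a new piece, so this only works when numbers appear solely as topic references.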