Data frames and text mining

Question

library(stringr)
data<-data.frame(id=c(1,2,3), 
          text=c("This is (2020) text; mining exercise (1999)","Text analysis (1975) is; bit confusing (2012)","Hint (1998) on; this text (2007) analysis?"))

a <- b <- list()
mm <- data.frame(a=NA,b=NA)
for(i in 1:length(data$text)){
   a[[i]] <- lengths(strsplit(as.character(data$text[i]),";"))
   b[[i]] <- str_count(data$text[i], "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)")
}

Output I'm getting:

# mm
    a     b
1  NA     NA

Why am I not getting the corresponding values for each row of the data frame mm? The code raises no error.

Expected output:

Answer 1

After your loop completes, you have two lists, a and b, which hold your expected output:

a
[[1]]
[1] 2

[[2]]
[1] 2

[[3]]
[1] 2

But you never assign these values to your data.frame, so mm still holds the single row of NA values it was created with:

mm <- data.frame(a=unlist(a),b=unlist(b))
mm
  a b
1 2 2
2 2 2
3 2 2
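Since strsplit(), lengths() and str_count() are all vectorized, the loop and the intermediate lists may not be needed at all. The following is a minimal sketch under that assumption, reusing the data and the regular expression from the question; the name year_pattern is introduced here only for illustration, and fixed = TRUE is an added choice so that ";" is treated as a literal string.

```r
library(stringr)

data <- data.frame(
  id = c(1, 2, 3),
  text = c("This is (2020) text; mining exercise (1999)",
           "Text analysis (1975) is; bit confusing (2012)",
           "Hint (1998) on; this text (2007) analysis?"),
  stringsAsFactors = FALSE
)

# Pattern from the question: a parenthesized year of the form 19xx or 20xx
# (year_pattern is an illustrative name, not part of the original code)
year_pattern <- "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)"

# Both columns are computed for all rows at once, so mm is built directly
mm <- data.frame(
  a = lengths(strsplit(data$text, ";", fixed = TRUE)),  # pieces per row
  b = str_count(data$text, year_pattern)                # years per row
)
mm
```

For the sample data this should give the same three-row table as the unlist() version above, without having to preallocate a and b.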

Answer 2

An option with tidyverse:

library(dplyr)
library(stringr)
library(purrr)
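data %>% 
   # split each text on ";" into a list column of pieces
   transmute(out = str_split(text, ";")) %>% 
   transmute(a = lengths(out),   # number of pieces per row
       # years per row: count matches in every piece, then sum them
       b = map_int(out, ~ sum(str_count(.x, "(?<=(19|20))[0-9]{2}\\b"))))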
#  a b
#1 2 2
#2 2 2
#3 2 2

Answer 1: score 2, accepted, 2020-05-29 16:01:58

Answer 2: score 1, 2020-05-29 19:23:06
