Data frame and text mining
library(stringr)
data <- data.frame(id = c(1, 2, 3),
                   text = c("This is (2020) text; mining exercise (1999)",
                            "Text analysis (1975) is; bit confusing (2012)",
                            "Hint (1998) on; this text (2007) analysis?"))
a <- b <- list()
mm <- data.frame(a = NA, b = NA)
for (i in 1:length(data$text)) {
  a[[i]] <- lengths(strsplit(as.character(data$text[i]), ";"))
  b[[i]] <- str_count(data$text[i], "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)")
}
Output I'm getting:
# mm
a b
1 NA NA
Why am I not getting the corresponding values for each row of the data frame mm? The code runs without any error.
Expected output:
# mm
a b
1 2 2
2 2 2
3 2 2
After your loop completes, you have two lists, a and b, which hold your expected output:
a
[[1]]
[1] 2
[[2]]
[1] 2
[[3]]
[1] 2
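But you never assign these values to your data.frame. The loop only fills the lists a and b, so mm still holds the single row of NA values it was created with. Build mm from the lists after the loop: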
mm <- data.frame(a=unlist(a),b=unlist(b))
mm
a b
1 2 2
2 2 2
3 2 2
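The loop is not strictly needed here, because both strsplit() and str_count() are vectorized over a character vector. A minimal sketch, assuming the same data frame as in the question and a slightly condensed form of the year pattern, builds mm in one step:

```r
library(stringr)

data <- data.frame(
  id = c(1, 2, 3),
  text = c("This is (2020) text; mining exercise (1999)",
           "Text analysis (1975) is; bit confusing (2012)",
           "Hint (1998) on; this text (2007) analysis?")
)

# strsplit() returns one character vector per row; lengths() counts its pieces.
# str_count() counts the parenthesised 19xx/20xx years in each string.
mm <- data.frame(
  a = lengths(strsplit(as.character(data$text), ";")),
  b = str_count(data$text, "\\((19|20)[0-9]{2}\\)")
)
mm
```

This avoids the intermediate lists entirely, so there is nothing left to forget to assign.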
An option with tidyverse:
library(dplyr)
library(stringr)
library(purrr)
data %>%
  transmute(out = str_split(text, ";")) %>%
  # a: number of ";"-separated pieces; b: number of (19xx)/(20xx) years across the pieces
  transmute(a = lengths(out),
            b = map_int(out, ~ sum(str_count(.x, "\\((19|20)[0-9]{2}\\)"))))
# a b
#1 2 2
#2 2 2
#3 2 2
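If you would rather not depend on any package, the same counts can probably be obtained with base R alone. The sketch below assumes the texts from the question, stored in a plain character vector named texts (a name chosen here for illustration), and uses gregexpr() with regmatches() to collect the matched years:

```r
texts <- c("This is (2020) text; mining exercise (1999)",
           "Text analysis (1975) is; bit confusing (2012)",
           "Hint (1998) on; this text (2007) analysis?")

# gregexpr() finds every match per string; regmatches() extracts them as a list,
# with character(0) for strings that contain no match.
years <- regmatches(texts, gregexpr("\\((19|20)[0-9]{2}\\)", texts))

mm <- data.frame(
  a = lengths(strsplit(texts, ";", fixed = TRUE)),  # pieces separated by ";"
  b = lengths(years)                                # years found per string
)
mm
```

Because a string with no match yields character(0), lengths() gives 0 for it rather than NA.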