Extract sentences from texts in data frame
I have a data frame with a column "text", and in each row "text" contains several sentences (maybe only two, maybe 100 or more). I would like to search the text in every row for specific keywords. If a keyword is found in a row's text, I would like to extract the sentences that contain a keyword into a separate column, for example:
needles <- c("first", "hope", "analyze", "happy")
mydata <- data.frame(
  text = c("This is the first sentence. It is the beginning of this project",
           "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
           "And this is the last sentence. Finally my work ends. I am really happy about that.",
           "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
  findings = c("This is the first sentence.",
               "I hope this project will work fine. Then I will analyze the third sentence.",
               "I am really happy about that.",
               NA)
)
So column "text" contains the sentences I want to check for keywords, and "findings" is the result I would like to have in the end.
How can I apply a solution to all rows of the data frame? Thank you!
What about something like this:
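library(dplyr)
library(tidyr)
library(stringr)

# Return the sentences of `text` that contain `word`, wrapped in a list
# so the result can be stored in a list column.
find_sentence <- function(text, word) {
  # Split on a period followed by any single character (normally a space)
  x <- c(str_split(text, "\\..", simplify = TRUE))
  inds <- which(str_detect(x, word))
  if (length(inds) > 0) {
    list(x[inds])
  } else {
    list(NA)
  }
}

mydata %>%
  rowwise() %>%
  mutate(res = find_sentence(text, "the")) %>%
  unnest(res)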
# # A tibble: 4 × 3
# text findings res
# <chr> <chr> <chr>
# 1 This is the first sentence. It is the beginning of this project This is the first sentence. This is the fi…
# 2 This is the first sentence. It is the beginning of this project This is the first sentence. It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that. I am really happy about tha… And this is th…
This returns a new variable called res that has a separate row for each sentence containing the keyword. So, if two sentences contain the word (as in the first row of text), the text and findings columns are replicated for each of the relevant sentences in res.
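The example above searches for a single word and returns one row per matching sentence. As a minimal sketch of how the same idea might cover all of needles and give one string per row (as in the desired findings column), the following uses a hypothetical helper named extract_sentences. It assumes that sentences end with a period followed by whitespace, and it uses a shortened copy of the example texts.

```r
library(stringr)

needles <- c("first", "hope", "analyze", "happy")
text <- c(
  "This is the first sentence. It is the beginning of this project",
  "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
  "These sentences do not contain any relevant information. There is no keyword."
)

# Hypothetical helper (not part of any package): returns the matching
# sentences of one text as a single string, or NA when nothing matches.
extract_sentences <- function(txt, words) {
  # Split at whitespace that follows a period, so each sentence keeps its period
  sentences <- str_split(txt, "(?<=\\.)\\s+")[[1]]
  pattern <- str_c(words, collapse = "|")
  hits <- sentences[str_detect(sentences, pattern)]
  if (length(hits) == 0) NA_character_ else str_c(hits, collapse = " ")
}

findings <- vapply(text, extract_sentences, character(1),
                   words = needles, USE.NAMES = FALSE)
```

Because the result has exactly one element per input text, it could be assigned directly to a column, for example mydata$findings.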
With base R:
lookup <- strsplit(as.character(mydata[, 1]), "\\.")
out <- lapply(lookup, function(x) {
  logic <- grepl(paste0(needles, collapse = "|"), x)
  paste0(x[logic], collapse = ".")
})
data.frame(findings = do.call(rbind, out))
gives:
# findings
#1 This is the first sentence
#2 I hope this project will work fine. Then I will analyze the third sentence
#3 I am really happy about that
#4
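One thing to keep in mind is that a pattern built with paste0(needles, collapse = "|") matches substrings and is case-sensitive. If whole-word, case-insensitive matching is wanted instead, a possible sketch is to wrap the alternation in word boundaries. The three test sentences below are made up for illustration and are not part of the question's data.

```r
needles <- c("first", "hope", "analyze", "happy")
sentences <- c("I hope this works", "She was hopeful", "Happy to help")

# Plain alternation: matches substrings, case-sensitive
partial <- grepl(paste(needles, collapse = "|"), sentences)

# Alternation wrapped in \b word boundaries: whole words only,
# and ignore.case = TRUE also accepts "Happy"
pattern <- paste0("\\b(", paste(needles, collapse = "|"), ")\\b")
whole_word <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)
```

Here partial flags "She was hopeful" because it contains "hope", while whole_word does not.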
This uses grep and strsplit to get the matches.
# Split only the text column; t(mydata) would also split any other columns
mydata$findings <- sapply(strsplit(as.character(mydata$text), "\\. "), function(x)
  x[unlist(lapply(needles, function(y) grep(y, x)))])
text
1 This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3 And this is the last sentence. Finally my work ends. I am really happy about that.
4 These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
findings
1 This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3 I am really happy about that.
4
We can work with a nested list by splitting each row of the text column into sentences and looking for the needles inside each resulting sentence.
The reduce calls flatten the nested lists one level at a time.
Code:
library(tidyverse)
needles <- c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c(
"This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
),
findings = c(
"This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA
)
)
(map(mydata$text, ~ str_split(., "\\.\\s")) %>%
map_depth(2, function(row) map(needles, ~ str_subset(row, .))) %>%
map_depth(2, ~ reduce(., c)) %>%
map(~ reduce(., c)) %>%
map_if(~ length(.) > 1, ~ reduce(., paste, sep = ". ")) %>%
reduce(c) -> findings)
#> [1] "This is the first sentence"
#> [2] "I hope this project will work fine. Then I will analyze the third sentence."
#> [3] "I am really happy about that."
Created on 2021-11-26 by the reprex package (v2.0.1)
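Note that the output above has three elements for four input rows: concatenating with reduce(c) drops empty results, so the vector cannot be assigned back as a column as it stands. A small sketch of that effect and one possible workaround follows. The per_row list is a hand-written stand-in for the intermediate result, not output copied from the pipeline.

```r
library(purrr)

# Stand-in for the per-row results just before the final reduce(c);
# the second row has no matching sentence.
per_row <- list("This is the first sentence",
                character(0),
                "I am really happy about that.")

# Concatenation silently drops the empty element
flattened <- reduce(per_row, c)

# Replacing empty results with NA first keeps one element per row
padded <- map_chr(per_row, ~ if (length(.x) == 0) NA_character_ else .x)
```

With the NA padding in place, the result has the same length as the data frame and lines up with its rows.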