
Extract sentences from texts in data frame

I have a data frame with a column "text". In each row, "text" contains several sentences (maybe only two, maybe 100 or more). I would like to search the text in every row for specific keywords. If a keyword is found in the text of a row, I would like to extract the sentences that contain keywords into a separate column, for example:

needles = c("first", "hope", "analyze", "happy")

mydata <- data.frame(
  text = c("This is the first sentence. It is the beginning of this project",
           "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
           "And this is the last sentence. Finally my work ends. I am really happy about that.",
           "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
  findings = c("This is the first sentence.",
               "I hope this project will work fine. Then I will analyze the third sentence.",
               "I am really happy about that.",
               NA)
)

So column "text" contains the sentences I want to check for keywords, and "findings" is the result I would like to have in the end.

Can anyone help me apply a solution to all rows of the data frame? Thank you!

What about something like this:

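library(dplyr)
library(tidyr)
library(stringr)

# Return the sentences of `text` that contain `word`, wrapped in a list
# so that the result can be stored in a list column.
find_sentence <- function(text, word){
  # Split on a period followed by whitespace.
  x <- c(str_split(text, "\\.\\s+", simplify=TRUE))
  inds <- which(str_detect(x, word))
  if(length(inds) > 0){
    list(x[inds])
  }else{
    list(NA)
  }
}

mydata %>% 
  rowwise() %>% 
  mutate(res = find_sentence(text, "the")) %>% 
  unnest(res)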

# # A tibble: 4 × 3
#   text                                                                                                    findings                     res            
#   <chr>                                                                                                   <chr>                        <chr>          
# 1 This is the first sentence. It is the beginning of this project                                         This is the first sentence.  This is the fi…
# 2 This is the first sentence. It is the beginning of this project                                         This is the first sentence.  It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that.                      I am really happy about tha… And this is th…

This returns a new variable called res that has a separate row for each sentence containing the keyword. So, if two sentences contain the word (as in the first row of text), the text and findings columns are repeated for each of the matching sentences in res.
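If one row per text is wanted, as in the findings column from the question, the matching sentences can be collapsed back into a single string and all needles searched at once. This is only a minimal sketch, assuming stringr is installed; the helper name extract_sentences, the "|"-joined pattern and the lookbehind split are illustrative choices, not part of the answer above.

```r
library(stringr)

needles <- c("first", "hope", "analyze", "happy")

texts <- c(
  "This is the first sentence. It is the beginning of this project",
  "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
  "And this is the last sentence. Finally my work ends. I am really happy about that.",
  "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
)

# Hypothetical helper: returns the matching sentences of one text as a single
# string, or NA when no sentence contains any of the keywords.
extract_sentences <- function(text, words) {
  # Split on whitespace that follows sentence-ending punctuation,
  # so each sentence keeps its own period.
  sentences <- str_split(text, "(?<=[.!?])\\s+")[[1]]
  pattern <- str_c(words, collapse = "|")
  hits <- sentences[str_detect(sentences, pattern)]
  if (length(hits) == 0) NA_character_ else str_c(hits, collapse = " ")
}

# One result per input text, so it can be assigned directly as a column.
findings <- vapply(texts, extract_sentences, character(1),
                   words = needles, USE.NAMES = FALSE)
```

Because vapply always returns exactly one string per text, the result has the same length as the input and rows without a keyword stay NA.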

With base R:

lookup <- strsplit(as.character(mydata[, 1]), "\\.")

out <- lapply(lookup, function(x) {
  logic <- grepl(paste0(needles, collapse = "|"), x)
  paste0(x[logic], collapse = ".")
})

data.frame(findings = do.call(rbind, out))

gives:

#                                                                     findings
#1                                                  This is the first sentence
#2  I hope this project will work fine. Then I will analyze the third sentence
#3                                                I am really happy about that
#4                                                                            
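One caveat with pasting the needles into an alternation: the pattern matches substrings, so "hope" would also hit "hopeless". If whole-word matches are wanted, the alternation can be wrapped in word boundaries. A small sketch; the example strings and the names loose and strict are made up for illustration.

```r
needles <- c("first", "hope", "analyze", "happy")

# Plain alternation: matches the keywords anywhere, even inside longer words.
loose <- paste0(needles, collapse = "|")

# Same alternation wrapped in \b ... \b: matches whole words only.
strict <- paste0("\\b(", loose, ")\\b")

x <- c("It looks hopeless", "I hope it works", "Unhappy ending")

loose_hits  <- grepl(loose, x)
strict_hits <- grepl(strict, x, perl = TRUE)
```

Here loose_hits is TRUE for all three strings, while strict_hits is TRUE only for "I hope it works".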

This uses grep and strsplit to get the matches.

# Split only the text column; t(mydata) would also split any other columns.
mydata$findings <- sapply( strsplit( as.character(mydata$text), "\\. " ), function(x)
                     x[unlist( lapply( needles, function(y) grep(y, x) ) )] )

                                                                                                     text
1                                         This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3                      And this is the last sentence. Finally my work ends. I am really happy about that.
4   These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
                                                                     findings
1                                                  This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3                                               I am really happy about that.
4                                                                            
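Since the number of matches differs per row, sapply returns a list here, so findings is a list column; that appears to be why row 2 prints with a comma between the two sentences. A hedged sketch of turning the matches into a plain character column with NA for rows without a match; the names matches and findings_chr are introduced only for this example.

```r
needles <- c("first", "hope", "analyze", "happy")

text <- c(
  "This is the first sentence. It is the beginning of this project",
  "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
  "And this is the last sentence. Finally my work ends. I am really happy about that.",
  "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
)

# One character vector of matching sentences per row (possibly empty).
matches <- lapply(strsplit(text, "\\. "), function(x) {
  x[grepl(paste(needles, collapse = "|"), x)]
})

# Collapse each vector into a single string; empty results become NA.
findings_chr <- vapply(matches, function(m) {
  if (length(m) == 0) NA_character_ else paste(m, collapse = ". ")
}, character(1))
```

Splitting on "\\. " removes the period between sentences, so it is added back when pasting; only the last sentence of a text keeps its original trailing period.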

We can work with a nested list by splitting each row of the text column into sentences and looking for the needles inside each resulting sentence of each row.

The reduce calls flatten the nested lists, one level of depth at a time.

Code:

library(tidyverse)


needles <- c("first", "hope", "analyze", "happy")

mydata <- data.frame(
  text = c(
    "This is the first sentence. It is the beginning of this project",
    "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
    "And this is the last sentence. Finally my work ends. I am really happy about that.",
    "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
  ),
  findings = c(
    "This is the first sentence.",
    "I hope this project will work fine. Then I will analyze the third sentence.",
    "I am really happy about that.",
    NA
  )
)


(map(mydata$text, ~ str_split(., "\\.\\s")) %>%
  map_depth(2, function(row) map(needles, ~ str_subset(row, .))) %>%
  map_depth(2, ~ reduce(., c)) %>%
  map(~ reduce(., c)) %>%
  map_if(~ length(.) > 1, ~ reduce(., paste, sep = ". ")) %>%
  reduce(c) -> findings)
#> [1] "This is the first sentence"                                                 
#> [2] "I hope this project will work fine. Then I will analyze the third sentence."
#> [3] "I am really happy about that."
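Note that the result has three elements for four rows: flattening with reduce(c) drops empty matches, so the row without any keyword disappears and the vector no longer lines up with mydata. The effect can be seen in isolation with a toy list; the values below are stand-ins and not taken from the data above.

```r
library(purrr)

# Toy stand-in for per-row, per-needle matches: three rows, two needles each.
# character(0) means "this needle matched no sentence in this row".
nested <- list(
  list("a", character(0)),
  list(character(0), "b"),
  list(character(0), character(0))  # a row without any match
)

# First reduce: combine the per-needle results within each row.
inner <- map(nested, ~ reduce(., c))

# Second reduce: combine the rows into one vector; empty rows vanish.
flat <- reduce(inner, c)
```

Here flat has length 2 although nested has 3 rows, so a version that returns NA for empty rows would be needed before assigning the result as a column.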

Created on 2021-11-26 by the reprex package (v2.0.1)
