How to find two words in any order within a period-delimited sentence

I'm trying to extract every sentence (defined as the text between two periods) that contains both of the words column and Barr, in either order. This is tricky: the regex I have so far finds the two words in either order before a period, but if the two words occur in two different sentences, all the text between those sentences is selected as well. How can I make the regex sentence-specific?

Input

try<-c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.")

Desired output

[1] NA
[2] "I am a sentence and I contain column but also Barr."

Attempt

str_extract_all(try,"\\..*column.*Barr.*?\\.|.*Barr.*column.*?\\.")

Current output

[[1]]
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."

[[2]]
[1] ". I am a sentence and I contain column but also Barr. I only contain Barr."

To find two words in any order, you can use two positive lookaheads. For example, grepl("(?=.*Barr)(?=.*column)", x, perl=TRUE) returns TRUE whenever both words are present, regardless of their order, and FALSE otherwise. However, this does not take sentence structure into account. Since you want to extract text, and the two words must both occur between the same pair of dots, we can change it to:

library(stringr)
## Example data
x <- c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.","Barr and column and also column. But just Barr. And just column. Now again column and Barr")
> x
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."
[2] "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too."       
[3] "Barr and column and also column. But just Barr. And just column. Now again column and Barr"           

str_extract_all(x,"(\\.|^)(?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*(\\.|$)")

This looks for the start of the string or a dot, (\\.|^), followed by a run of non-dot characters that contains both Barr and column, (?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*, followed by a dot or the end of the string, (\\.|$). This returns a list:

[[1]]
character(0)

[[2]]
[1] ". I am a sentence and I contain column but also Barr."

[[3]]
[1] "Barr and column and also column." ". Now again column and Barr"
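One thing to keep in mind with the pattern above: the leading dot is part of the match, so it uses up the period that ends the previous sentence. Two adjacent sentences that both qualify are therefore not both returned, because the second one has no leading dot left to match. A minimal sketch of a variant that avoids this, assuming base R's gregexpr() with perl = TRUE is acceptable in place of stringr; the helper name extract_sentences and the sample string are made up for illustration:

```r
# Hypothetical helper: return every period-delimited sentence that
# contains both "Barr" and "column", without consuming a leading dot.
extract_sentences <- function(x) {
  # Both lookaheads start from the same position and cannot cross a "."
  pattern <- "(?=[^.]*Barr)(?=[^.]*column)[^.]+\\.?"
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Trim the space left over after the previous sentence's period
  lapply(hits, trimws)
}

adjacent <- "I contain column and Barr. I have Barr and column. I don't."
res <- extract_sentences(adjacent)
print(res)
```

Strings with no qualifying sentence come back as character(0), as with str_extract_all.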

This regex seems to do what you need:

(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)

It starts at a period (.) and grabs everything up to the next period, provided that stretch contains column followed by Barr. The second alternative is the same with the two words in the opposite order.

Example:

try = c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
        "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
        "I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
        "I contain column and Barr. I have Barr and column. I don't.",
        "Hello. I contain Barr and column but also Barr. I only contain Barr. I am too.") 

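library(stringr)

# Prepend "." so that the first sentence of each string can also match
k = sapply(try, function(x){
  str_extract(paste0(".",x), "(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)")
})
# Drop the names that sapply copies over from the input strings
names(k) = NULL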

Result:

[1] NA                                                    
[2] ". I am a sentence and I contain column but also Barr"
[3] ".I am a sentence and I contain column but also Barr" 
[4] ".I contain column and Barr"                          
[5] ". I contain Barr and column but also Barr"

If you use str_extract_all instead, keep in mind that it returns a list of matches:

[[1]]
character(0)

[[2]]
[1] ". I am a sentence and I contain column but also Barr"

[[3]]
[1] ".I am a sentence and I contain column but also Barr"

[[4]]
[1] ".I contain column and Barr" ". I have Barr and column"  

[[5]]
[1] ". I contain Barr and column but also Barr"

I've added paste0(".",x) so that a sentence containing both words is also detected when it comes first in the string (and therefore does not start with a period).
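The matches above still carry the period (and sometimes a space) that served as the left anchor. If only the bare sentence is wanted, one possible clean-up step is sketched here with base R. The vector k below simply re-types part of the result shown above rather than recomputing it, and the name clean is an assumption for illustration:

```r
# Re-typed from the result above (not recomputed)
k <- c(NA,
       ". I am a sentence and I contain column but also Barr",
       ".I am a sentence and I contain column but also Barr")

# Drop the leading "." plus any whitespace after it; NA stays NA
clean <- sub("^\\.[[:space:]]*", "", k)
print(clean)
```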

Here is a more general attempt that does not require writing out every permutation of the desired words, which helps when more than two words are required.

The strategy is to find the sentences that contain each individual word and then take the intersection of the results.

#split the long text into individual sentences.
sentences<-strsplit(try, "\\.")

#create list of matches for each desired word
columnlist<-lapply(sentences, function(x) {grep("(column)", x)})
barrlist<-lapply(sentences, function(x) {grep("(Barr)", x)})

#find intersection between lists
intersection<-lapply(seq_along(columnlist), function(i){intersect(columnlist[[i]], barrlist[[i]])} )

#extract the sentences out
answer<-sapply(seq_along(intersection), function(i) { 
  if(length(intersection[[i]])) 
    {trimws(sentences[[i]][intersection[[i]] ])}  
  else {NA}
})

Result

#[[1]]
#[1] NA
#
#[[2]]
#[1] "I am a sentence and I contain column but also Barr" 
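
The same split-and-intersect idea extends to any number of words. A minimal sketch, assuming a hypothetical helper sentences_with_all() and literal (fixed = TRUE) matching, neither of which appears in the answer above:

```r
# Hypothetical helper: for each string, return the sentences that contain
# every word in `words` (literal, case-sensitive matching).
sentences_with_all <- function(text, words) {
  lapply(strsplit(text, ".", fixed = TRUE), function(s) {
    # One integer vector of matching sentence indices per word
    idx <- lapply(words, function(w) grep(w, s, fixed = TRUE))
    # Keep only the indices common to all words
    keep <- Reduce(intersect, idx)
    if (length(keep)) trimws(s[keep]) else NA_character_
  })
}

texts <- c("I contain Barr. I contain column as well.",
           "Here we go. I contain column but also Barr. I only contain Barr.")
out <- sentences_with_all(texts, c("column", "Barr"))
print(out)
```

Adding a third word only means passing a longer vector, for example c("column", "Barr", "sentence"); no extra regex alternatives are needed.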
