How to find two words in any order within a period-delimited sentence
I'm trying to extract any sentence (defined as the text between two periods) that contains the two words column and Barr in any order. This is tricky: so far I have a regex that finds the two words in any order before a period, but if the two words occur in two different sentences, all the text between those sentences is selected as well. How can I make the regex sentence-specific?
Input
try<-c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.")
Desired output
[1] NA
[2] "I am a sentence and I contain column but also Barr."
Attempt
str_extract_all(try,"\\..*column.*Barr.*?\\.|.*Barr.*column.*?\\.")
Current output
[[1]]
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."
[[2]]
[1] ". I am a sentence and I contain column but also Barr. I only contain Barr."
To find two words present in any order, you can use two positive lookaheads. For example, grepl("(?=.*Barr)(?=.*column)", x, perl = TRUE) returns TRUE whenever both words are present, regardless of their order, and FALSE otherwise. This does not take the sentence structure into account, though. Since you want to extract text, and you want both words to sit between the same pair of dots, we can change it to:
library(stringr)
## Example data
x <- c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.","Barr and column and also column. But just Barr. And just column. Now again column and Barr")
> x
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."
[2] "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too."
[3] "Barr and column and also column. But just Barr. And just column. Now again column and Barr"
str_extract_all(x,"(\\.|^)(?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*(\\.|$)")
This looks for the start of the string or a dot (\\.|^), followed by characters that are not dots and that contain both Barr and column (?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*, followed by a dot or the end of the string (\\.|$). This returns a list:
[[1]]
character(0)
[[2]]
[1] ". I am a sentence and I contain column but also Barr."
[[3]]
[1] "Barr and column and also column." ". Now again column and Barr"
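One caveat with the pattern above: it consumes the closing dot, so when two qualifying sentences are adjacent, the second one no longer has a leading dot to match and is skipped. A minimal sketch of one way around this is to prepend a dot to each string and let every match stop just before the next dot. It uses base R with perl = TRUE rather than stringr, the helper name extract_both is hypothetical, and it assumes the two words contain no regex metacharacters:

```r
# Hypothetical helper: all sentences containing both words, in either order.
# Assumes `first` and `second` are plain words without regex metacharacters.
extract_both <- function(x, first = "Barr", second = "column") {
  # A dot, then two lookaheads that must both succeed before the next dot,
  # then the sentence itself. The closing dot is NOT consumed, so it can
  # serve as the opening dot of the following sentence.
  pattern <- sprintf("\\.(?=[^.]*%s)(?=[^.]*%s)[^.]*", first, second)
  padded <- paste0(".", x)  # lets the first sentence start with a dot too
  matches <- regmatches(padded, gregexpr(pattern, padded, perl = TRUE))
  # Drop the leading dot and surrounding whitespace from every match.
  lapply(matches, function(m) trimws(sub("^\\.", "", m)))
}

extract_both("I contain column and Barr. I have Barr and column. I don't.")
# [[1]]
# [1] "I contain column and Barr" "I have Barr and column"
```

Strings without a qualifying sentence yield character(0), as in the list above.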
This regex seems to do what you need:
(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)
It starts at a period (.) and grabs everything that is not a period, provided that stretch of text contains column followed by Barr. The second alternative does the same with the two words in the opposite order.
Example:
try = c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
"Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
"I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
"I contain column and Barr. I have Barr and column. I don't.",
"Hello. I contain Barr and column but also Barr. I only contain Barr. I am too.")
k = sapply(try, function(x){
  str_extract(paste0(".",x), "(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)")
})
names(k) = NULL
Result:
[1] NA
[2] ". I am a sentence and I contain column but also Barr"
[3] ".I am a sentence and I contain column but also Barr"
[4] ".I contain column and Barr"
[5] ". I contain Barr and column but also Barr"
If you use str_extract_all instead, keep in mind that it returns a list of matches:
[[1]]
character(0)
[[2]]
[1] ". I am a sentence and I contain column but also Barr"
[[3]]
[1] ".I am a sentence and I contain column but also Barr"
[[4]]
[1] ".I contain column and Barr" ". I have Barr and column"
[[5]]
[1] ". I contain Barr and column but also Barr"
I've added paste0(".",x) so that a sentence containing both words is also detected when it is the first sentence of the string, which does not start with a period.
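The matches keep their leading period (and sometimes a space), whereas the desired output in the question has neither. A minimal sketch of stripping it, written in base R so it runs without stringr; the variable names are illustrative only:

```r
texts <- c(
  "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
  "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too."
)
pattern <- "(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)"

# Prepend a period so a first sentence can match as well.
padded <- paste0(".", texts)

# regexpr() finds the first match per string; -1 means no match.
m <- regexpr(pattern, padded, perl = TRUE)
first_match <- rep(NA_character_, length(texts))
first_match[as.integer(m) > 0] <- regmatches(padded, m)

# Remove the leading period and any whitespace after it; NA stays NA.
cleaned <- sub("^\\.\\s*", "", first_match, perl = TRUE)
cleaned
# [1] NA
# [2] "I am a sentence and I contain column but also Barr"
```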
Here is a more general attempt that does not require writing out every permutation of the desired words, which helps when more than two words are required.
The strategy is to find the sentences that contain each individual word and then take the intersection of the results.
# Split the long text into individual sentences.
sentences <- strsplit(try, "\\.")
# Create a list of matching sentence indices for each desired word.
columnlist <- lapply(sentences, function(x) {grep("(column)", x)})
barrlist <- lapply(sentences, function(x) {grep("(Barr)", x)})
# Find the intersection between the lists.
intersection <- lapply(seq_along(columnlist), function(i) {
  intersect(columnlist[[i]], barrlist[[i]])
})
# Extract the matching sentences, or NA when there are none.
answer <- sapply(seq_along(intersection), function(i) {
  if (length(intersection[[i]])) {
    trimws(sentences[[i]][intersection[[i]]])
  } else {
    NA
  }
})
Result
#[[1]]
#[1] NA
#
#[[2]]
#[1] "I am a sentence and I contain column but also Barr"
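The same intersection idea can be folded into one function that takes any number of words. This is only a sketch under a few assumptions: the helper name sentences_with_all is hypothetical, words are matched literally (fixed = TRUE) and case-sensitively, and strings without a qualifying sentence give character(0) rather than NA:

```r
# Hypothetical helper: for each text, return the sentences that contain
# every word in `words` (at least one word is expected).
sentences_with_all <- function(texts, words) {
  lapply(strsplit(texts, ".", fixed = TRUE), function(sentences) {
    sentences <- trimws(sentences)
    # One logical vector per word, combined with AND across all words.
    keep <- Reduce(`&`, lapply(words, function(w) {
      grepl(w, sentences, fixed = TRUE)
    }))
    sentences[keep]
  })
}

texts <- c(
  "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
  "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too."
)
sentences_with_all(texts, c("column", "Barr", "also"))
# [[1]]
# character(0)
#
# [[2]]
# [1] "I am a sentence and I contain column but also Barr"
```

Adding a word only adds one more element to the words vector; no extra regex alternatives are needed.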