
grepl in R: matching impeded by intra-word dashes

I have 3 words: x, y, and z, from which two compound words can be built: xy and yz.

In naturally occurring text, x, y, and z can follow each other. In the first case, I have:

text="x-y z"

And I want to detect "xy" but not "yz". If I do:

v=c("x-y","y z")
vv=paste("\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

I get c(TRUE,TRUE). In other words, grepl does not capture the fact that y is already linked to x via the intra-word dash, and that "yz" is therefore not actually present in the text. So I use a lookbehind after adding whitespace at the beginning of the text:

text=paste("",text,sep=" ")
vv=paste("(?<= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

This time, I get what I want: c(TRUE, FALSE). Now, in the second case, I have:

text="x y-z"

And I want to detect "yz" but not "xy". Adopting a symmetrical approach, this time with a lookahead and whitespace added at the end of the text, I tried:

text=paste(text,"",sep=" ")
v=c("x y","y-z")
vv=paste("(?= )\\b",v,"\\b",sep="")
sapply(vv,grepl,text,perl=TRUE)

But this time I get c(FALSE,FALSE) instead of the c(FALSE,TRUE) I was expecting. The FALSE in first position is expected (I assume the lookahead detected the intra-word dash after y and prevented the match with "xy"). But I really do not understand what is preventing the match with "yz".

Thanks a lot in advance for your help,

I think this matches the description in your comment of what you want to accomplish.

spaceInvader <- function(a, b, text) {
  # look ahead of `a` to see if there is a space
  hasa <- grepl(paste0(a, '(?= )'), text, perl = TRUE)
  # look behind `b` to see if there is a space 
  hasb <- grepl(paste0('(?<= )', b), text, perl = TRUE)

  result <- c(hasa, hasb)
  names(result) <- c(a, b)
  cat('In: "', text, '"\n', sep = '')
  return(result)
}

spaceInvader('x-y', 'y z', 'x-y z')
# In: "x-y z"
#   x-y   y z 
#  TRUE FALSE 
spaceInvader('x y', 'y-z', 'x y-z')
# In: "x y-z"
#   x y   y-z 
# FALSE  TRUE 
spaceInvader('x-y', 'y z', 'x y-z')
# In: "x y-z"
#   x-y   y z 
# FALSE FALSE 
spaceInvader('x y', 'y-z', 'x-y z')
# In: "x-y z"
#   x y   y-z 
# FALSE FALSE 
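
As for why the lookahead attempt in the question returns c(FALSE, FALSE), one likely reading is that `(?= )` sits at the start of the pattern. There it requires the next character to be a space, while the term right after it requires that same character to be a letter, so no position can satisfy both. The sketch below reuses the question's `text` and `v` and contrasts that placement with a lookahead at the end of the pattern; the names `at_start` and `at_end` are only illustrative.

```r
# Second case from the question, padded with a trailing space as the asker did.
text <- paste("x y-z", "", sep = " ")
v <- c("x y", "y-z")

# Lookahead at the start: the next character would have to be a space
# and also the first letter of the term, so nothing can match.
at_start <- unname(sapply(paste0("(?= )\\b", v, "\\b"), grepl, text, perl = TRUE))

# Lookahead at the end: the term must be followed by a space.
at_end <- unname(sapply(paste0("\\b", v, "\\b(?= )"), grepl, text, perl = TRUE))

print(at_start)
print(at_end)
```

With the lookahead moved to the end, `at_end` comes out as c(FALSE, TRUE), which is the result the question was after for this case.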

Is the following result a problem?

spaceInvader('x-y', 'y-z', 'x-y-z')
# In: "x-y-z"
#   x-y   y-z 
# FALSE FALSE
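
If padding the text with spaces is inconvenient, a possible variant is to use negative lookarounds instead, so that a term counts only when it is not directly preceded or followed by a word character or a dash. This is a sketch under that assumption, not part of the original answer; `has_term` is a hypothetical helper, and it treats `term` as a regular expression, so terms containing regex metacharacters would need escaping first.

```r
# Hypothetical helper (not from the original answer): TRUE when `term` occurs
# in `text` with no word character or dash immediately before or after it.
has_term <- function(term, text) {
  pattern <- paste0("(?<![-\\w])", term, "(?![-\\w])")
  grepl(pattern, text, perl = TRUE)
}

# First case: "x-y" is present, "y z" is blocked by the dash before y.
first <- c(has_term("x-y", "x-y z"), has_term("y z", "x-y z"))

# Second case: "x y" is blocked by the dash after y, "y-z" is present.
second <- c(has_term("x y", "x y-z"), has_term("y-z", "x y-z"))

# Chained dashes: neither two-part compound stands on its own.
chained <- c(has_term("x-y", "x-y-z"), has_term("y-z", "x-y-z"))

print(first)
print(second)
print(chained)
```

Because the lookarounds are negative, they also succeed at the very start and end of the string, so no extra whitespace has to be pasted onto `text`. Like `spaceInvader`, this reports FALSE for both compounds in "x-y-z"; whether that is the desired behaviour depends on how such chains should be read.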
