Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large data frame, using grepl to search for a series of text patterns within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e. the word tumo(u)r within 3 words of the word colon).

I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.

The error I get is Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"

Example below: trying to search for tumo(u)r within 3 words of colon.

Can anyone help?

library(tibble)
library(dplyr)  # provides %>% and filter()
example.df <- tibble(
    number = 1:4,
    AB = c('tumor of the colon is a very hard disease to cure',
           'breast cancer is also known as a neoplasia of the breast',
           'tumour of the colon is bad',
           'colon cancer is also bad'))

filtered.df <- example.df %>% 
    filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T) 

R uses backslashes as escapes in string literals, and the regex engine uses them too, so you need to double your backslashes. This is explained in multiple prior questions on StackOverflow as well as in the help page brought up at ?regex. You should try the escaped operators in a simpler set of tests before attempting complex operations, and you should pay closer attention to the placement of parentheses and quotes in the pattern argument.
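As a rough illustration of those "simpler tests", the sketch below checks a few escaped operators one at a time in base R. The test strings and variable names are made up for the example and do not come from the question.

```r
# A regex escape such as \W must be written as "\\W" inside an R string.
nchar("\\W")               # 2: one backslash plus the letter W
writeLines("\\bcolon\\b")  # prints \bcolon\b, which is what the regex engine sees

# Try each escaped operator on its own before combining them.
x <- c("tumor of the colon", "semicolon")
word_match <- grepl("\\bcolon\\b", x)                        # TRUE FALSE
nonword    <- grepl("colon\\W", c("colon cancer", "colon"))  # TRUE FALSE

# R 4.0.0 and later also accept raw strings, which need no doubling.
raw_match <- grepl(r"(\bcolon\b)", x)                        # TRUE FALSE
```

Once each piece behaves as expected, the pieces can be combined into the full pattern.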

filtered.df <- example.df %>% 
   #filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, 
# errors here ....^.^..............^..^...^..^.............^.^
    filter(grepl("(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
                 ignore.case = TRUE))

> filtered.df
# A tibble: 2 × 2
  number AB                                               
   <int> <chr>                                            
1      1 tumor of the colon is a very hard disease to cure
2      3 tumour of the colon is bad   
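Note that the corrected pattern still has a top-level `|`, so its first alternative matches any "tumor"/"tumour" followed by a non-word character, whether or not "colon" is nearby. If a genuine proximity match is wanted, something along these lines might work. This is only a sketch: the `gap` and `proximity` names, the either-order alternation, and the two extra test strings are assumptions, and "within 3 words" is read here as "at most three words in between".

```r
# Hypothetical pattern, not from the original answer: "tumor"/"tumour" and
# "colon" in either order, with at most three words between them.
gap <- "\\W+(?:\\w+\\W+){0,3}?"
proximity <- paste0("\\b(?:tumou?r", gap, "colon|colon", gap, "tumou?r)\\b")

AB <- c("tumor of the colon is a very hard disease to cure",
        "breast cancer is also known as a neoplasia of the breast",
        "tumour of the colon is bad",
        "colon cancer is also bad",
        "a tumor was found far away from the colon",  # too many words between
        "colon tumour")                               # reverse order

# perl = TRUE so the non-capturing groups and lazy quantifier use PCRE syntax.
hits <- grepl(proximity, AB, ignore.case = TRUE, perl = TRUE)
AB[hits]  # elements 1, 3 and 6
```

The same pattern can be dropped into the dplyr call as `filter(grepl(proximity, AB, ignore.case = TRUE, perl = TRUE))`.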
