Using negative lookbehind in R

I have these files in a folder. Assume there are other, similar files in the same folder.

 [3] "/farm/chickens_industrial_meat_location_df.csv"   
 [4] "farm/goats_grassland_meat_location_df.csv" 

I am trying to list the files whose names contain the string location_df while excluding the files whose names contain both chickens and location_df.

I thought I could do this by typing: list.files(pattern = "location_df(?<!(chickens))")

My understanding is that a negative lookaround would drop the strings that contain chickens. What am I misunderstanding about regex here, and what is the solution to my problem?

An option with grepl would be

str1[!grepl('chickens_.*location_df', str1) & grepl('location_df', str1)]
#[1] "farm/goats_grassland_meat_location_df.csv"

Or a simpler version would be

str1[!grepl('chickens_', str1) & grepl('location_df', str1)]

data

str1 <- c("/farm/chickens_industrial_meat_location_df.csv",
          "farm/goats_grassland_meat_location_df.csv")

> list.files(pattern = "location_df")
[1] "chickens_industrial_meat_location_df.csv" "goats_grassland_meat_location_df.csv"    

> setdiff(list.files(pattern = "location_df"), list.files(pattern = "chickens"))
[1] "goats_grassland_meat_location_df.csv"

> setdiff(list.files(pattern = "location_df"), list.files(pattern = "goats"))
[1] "chickens_industrial_meat_location_df.csv"

According to the R help page for regex, "...functions which use regular expressions (often via the use of grep) include apropos, browseEnv, help.search, list.files and ls. These will all use extended regular expressions." (ERE).

This indicates that list.files() does not support lookarounds, which are generally available only with Perl-compatible regular expressions (PCRE). A small clue is that the R help page for list.files() / list.dirs() doesn't list a perl = TRUE argument.
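Even with a PCRE-aware function, the lookbehind from the question would not exclude anything, because it is tested right after "location_df" rather than against the whole name. Here is a minimal sketch of that, assuming you are happy to filter a character vector with grepl(perl = TRUE) afterwards. The vector files stands in for the output of list.files(), and the lookahead pattern is just one possible alternative.

```r
# Example file names taken from the question (directory prefixes dropped).
# In practice this vector would come from list.files().
files <- c("chickens_industrial_meat_location_df.csv",
           "goats_grassland_meat_location_df.csv")

# The lookbehind is evaluated immediately after "location_df". The text
# just before that position is "location_df" itself, never "chickens",
# so the assertion succeeds for both names and nothing is excluded.
lookbehind_hits <- grepl("location_df(?<!(chickens))", files, perl = TRUE)
print(lookbehind_hits)

# A negative lookahead anchored at the start scans the whole name for
# "chickens" before "location_df" is allowed to match.
keep <- grepl("^(?!.*chickens).*location_df", files, perl = TRUE)
print(files[keep])
```

In a real directory the same filter could be applied to the result of list.files(pattern = "location_df").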

So instead of lookarounds, the code shown above uses setdiff() to interrogate a directory. With that code the two patterns can match anywhere in the file name, but you can tighten the search by using "location_df.csv" or "location_df.csv$" (the ".csv" extension comes at the end of the file name, and the zero-width assertion $ anchors the pattern to the end of the string). You can also use ^ to anchor "chickens" or "goats" to the beginning of the string. Putting it all together gives the code below:

> setdiff(list.files(pattern = "location_df.csv$"), list.files(pattern = "^chickens"))
[1] "goats_grassland_meat_location_df.csv"

> setdiff(list.files(pattern = "location_df.csv$"), list.files(pattern = "^goats"))
[1] "chickens_industrial_meat_location_df.csv"
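If you want to try the anchored setdiff() approach without touching your own folder, the following sketch recreates the situation in a scratch directory. The directory name lookaround_demo and the extra file goats_notes.txt are made up for illustration. The dot is also escaped here, since an unescaped . in a regex matches any character.

```r
# Build an empty scratch directory under tempdir(); the name is arbitrary.
demo_dir <- file.path(tempdir(), "lookaround_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)

# Two files from the question plus one hypothetical file that should not match.
invisible(file.create(file.path(demo_dir, c(
  "chickens_industrial_meat_location_df.csv",
  "goats_grassland_meat_location_df.csv",
  "goats_notes.txt"
))))

# "\\." matches a literal dot and "$" anchors the match to the end of the name.
wanted <- list.files(demo_dir, pattern = "location_df\\.csv$")

# "^" anchors "chickens" to the start of the file name.
unwanted <- list.files(demo_dir, pattern = "^chickens")

# Keep every wanted file that is not also in the unwanted set.
result <- setdiff(wanted, unwanted)
print(result)
```

Because list.files() matches the pattern against file names only, not full paths, the ^ anchor works the same way when a directory is passed as the first argument.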

https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
https://www.r-project.org/

 