subset() with grepl() using REGEX for filtering a dataframe in R

I am learning R and experimenting with subset() and grepl() with Regex for filtering a dataframe. I have created a very small dataframe to play with:

x   y   z   w
1   10  a   k
2   12  b   l
3   14  c   m
4   16  d   n
5   18  e   o
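
For reproducibility, the data frame above could be built as follows. This is only a sketch: the name df14 is taken from the code below, and it assumes that z and w are plain character columns rather than factors, which the printed table does not show.

```r
# Assumption: z and w are character vectors, not factors.
df14 <- data.frame(
  x = 1:5,
  y = c(10, 12, 14, 16, 18),
  z = c("a", "b", "c", "d", "e"),
  w = c("k", "l", "m", "n", "o"),
  stringsAsFactors = FALSE
)

# Inspect the column types before filtering.
str(df14)
```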

My code is the following:

subset(df14, grepl('^c | [l - n]', c(df14$z , df14$w) ), grepl('[yz]', colnames(df14)) )

In my mind, the second argument should return the indices of the rows found by grepl() to match the pattern in the columns with names: 'z' or 'w'. However, this is not what happens (returns an empty dataframe with columns y and z).

I would expect it to return rows 2, 3 and 4, since column 'w' contains the letters l, m, n specified in the [l-n] regex pattern, and the columns z and w, since these names match the regex [yz] in the third argument of subset().

(I suspect that it is looking for a match in the names of the columns rather than in the contents of the columns, which is what interests me.)

Obviously, I am not interested in the result per se. This is an experiment to understand how the functions work. So, what I am looking for is an explanation and a method to correct the specific code -- not an alternative solution.

Your advice will be appreciated.

There are a variety of problems.

One issue is the extra spaces in your patterns. Drop them or use the free-spacing modifier (?x) with perl = TRUE. Either way, you have to get rid of the spaces in the character class: [l-n] matches "m" and [l - n] does not, even with (?x). In free-spacing mode, whitespace outside a character class is ignored, but whitespace inside one is still treated literally.
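
The effect of the spaces can be checked directly on a few single-letter strings. This is a small sketch; the variable names (tight, spaced, and so on) are made up for illustration.

```r
letters_lmn <- c("l", "m", "n")

# Without spaces, the range l-n covers l, m and n.
tight <- grepl("[l-n]", letters_lmn)
tight
# TRUE TRUE TRUE

# With spaces, the class holds "l", " " and "n", so "m" is not matched.
spaced <- grepl("[l - n]", letters_lmn)
spaced
# TRUE FALSE TRUE

# (?x) ignores the spaces around the alternation, but not inside a class.
free_tight  <- grepl("(?x)^c | [l-n]",   c("c", "m", "a"), perl = TRUE)
free_spaced <- grepl("(?x)^c | [l - n]", c("c", "m", "a"), perl = TRUE)
free_tight
# TRUE TRUE FALSE
free_spaced
# TRUE FALSE FALSE
```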

Another issue is that in your first grepl, you're searching within a vector (character vector? we can't tell from the example) of length 10. What would a TRUE in the 6th position mean for a 5-row data.frame? It doesn't make sense to return the 6th row of a 5-row data frame. Instead, you can check whether your pattern is found in column "w" or (|) in column "z". Look within each column, not in a concatenation of columns.
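
To make the length mismatch concrete, here is a short sketch. It rebuilds only the z and w columns and assumes they are character vectors.

```r
# Assumption: z and w are character columns.
df14 <- data.frame(
  z = c("a", "b", "c", "d", "e"),
  w = c("k", "l", "m", "n", "o"),
  stringsAsFactors = FALSE
)

# Concatenating the columns gives one result per element, not one per row.
combined <- grepl("[l-n]", c(df14$z, df14$w))
length(combined)  # 10, although df14 has only 5 rows

# Testing each column separately gives one logical value per row.
in_z <- grepl("^c|[l-n]", df14$z)
in_w <- grepl("^c|[l-n]", df14$w)
keep <- in_z | in_w
which(keep)  # 2 3 4
```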

Another issue is that in your second grepl, "w" is not a match for [yz]. If you want to select the columns whose name contains a "w" or a "z", one way would be with [wz].

There is no need for the ^ anchor since all your strings contain a single character, but I'll leave it in anyway:

subset(df14, 
       subset = grepl('^c|[l-n]', df14$z) | 
           grepl('^c|[l-n]', df14$w),
       select = grepl('[wz]', colnames(df14)))
#  z w
#2 b l
#3 c m
#4 d n

Or with the free-spacing mode modifier and a different pattern (w|z instead of [wz]) for the second grepl:

subset(df14, 
       subset = grepl('(?x)^c | [l-n]', df14$z, perl = TRUE) | 
           grepl('(?x)^c | [l-n]', df14$w, perl = TRUE),
       select = grepl('w|z', colnames(df14)))
#  z w
#2 b l
#3 c m
#4 d n

The search expression '^c | [l - n]' can't find anything in those columns. Also, a more intuitive approach is to use [ , ] for this type of subsetting. See http://adv-r.had.co.nz/Subsetting.html.
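
As a rough sketch of the [ , ] approach, assuming the same df14 with character columns and reusing the corrected patterns (the names rows, cols and result are only illustrative):

```r
# Assumption: z and w are character columns.
df14 <- data.frame(
  x = 1:5,
  y = c(10, 12, 14, 16, 18),
  z = c("a", "b", "c", "d", "e"),
  w = c("k", "l", "m", "n", "o"),
  stringsAsFactors = FALSE
)

# One logical value per row and one per column.
rows <- grepl("^c|[l-n]", df14$z) | grepl("^c|[l-n]", df14$w)
cols <- grepl("[wz]", colnames(df14))

# Logical indexing on both dimensions.
result <- df14[rows, cols]
result
#   z w
# 2 b l
# 3 c m
# 4 d n
```

If only one column were selected, adding drop = FALSE inside the brackets would keep the result a data frame instead of a vector.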
