I am learning R and experimenting with subset() and grepl() with Regex for filtering a dataframe. I have created a very small dataframe to play with:
x y z w
1 10 a k
2 12 b l
3 14 c m
4 16 d n
5 18 e o
My code is the following:
subset(df14, grepl('^c | [l - n]', c(df14$z , df14$w) ), grepl('[yz]', colnames(df14)) )
In my mind, the second argument should return the indices of the rows found by grepl() to match the pattern in the columns with names: 'z' or 'w'. However, this is not what happens (returns an empty dataframe with columns y and z).
I would expect it to return rows 2, 3, 4, since column 'w' contains the letters l, m, n specified in the [l - n] regex pattern, and the columns z and w, since these names match the regex [yz] in the third argument of subset().
(I suspect that it is looking for a match in the names of the columns rather than the contents of the columns, which is what interests me.)
Obviously, I am not interested in the result per se. This is an experiment to understand how the functions work. So, what I am looking for is an explanation and a method to correct the specific code -- not an alternative solution.
Your advice will be appreciated.
There are a variety of problems.
One issue is the extra spaces in your patterns. Drop them or use the free-spacing modifier (?x) with perl = TRUE. Either way, you have to get rid of the spaces in the character class: [l-n] matches "m" and [l - n] does not, even with (?x). You can read more about the free-spacing modifier and its impact inside and outside character classes here.
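The effect of the spaces can be checked directly on a single test string. This is a minimal sketch; the variable names are only illustrative, and "m" stands in for one of the single-character values in column w.

```r
# "m" stands in for a value from column w
range_ok       <- grepl('[l-n]', 'm')                        # TRUE: the range l..n includes m
range_spaces   <- grepl('[l - n]', 'm')                      # FALSE: the class holds only l, space, n
range_spaces_x <- grepl('(?x)[l - n]', 'm', perl = TRUE)     # FALSE: (?x) does not strip spaces inside [...]
alt_spaces     <- grepl('^c | [l-n]', 'm')                   # FALSE: the spaces are literal characters
alt_spaces_x   <- grepl('(?x)^c | [l-n]', 'm', perl = TRUE)  # TRUE: spaces outside the class are ignored

print(c(range_ok, range_spaces, range_spaces_x, alt_spaces, alt_spaces_x))
```

Only the patterns with no space inside the brackets, and no literal space outside them, find the "m".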
Another issue is that in your first grepl, you're searching within a vector (character vector? we can't tell from the example) of length 10. What would a TRUE in the 6th position mean for a 5-row data.frame? It doesn't make sense to return the 6th row of a 5-row data frame. Instead, you can see if your pattern is found for column "w" or (|) column "z". Look within each column, not a concatenation of columns.
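The length mismatch is easy to see on a rebuilt copy of the data. The sketch below assumes z and w are character columns (hence stringsAsFactors = FALSE), which the question does not confirm; the helper names combined, hits_z, hits_w and keep are illustrative.

```r
# Rebuild the example data; character columns are an assumption
df14 <- data.frame(x = 1:5, y = seq(10, 18, by = 2),
                   z = letters[1:5], w = letters[11:15],
                   stringsAsFactors = FALSE)

# Concatenating the columns gives 10 values for a 5-row data frame
combined <- c(df14$z, df14$w)
combined_hits <- which(grepl('^c|[l-n]', combined))
print(length(combined))   # 10
print(combined_hits)      # 3 7 8 9: positions 7 to 9 do not correspond to any row

# Testing each column separately gives one logical value per row
hits_z <- grepl('^c|[l-n]', df14$z)
hits_w <- grepl('^c|[l-n]', df14$w)
keep <- hits_z | hits_w
print(which(keep))        # 2 3 4
```

Combining the two per-column results with | keeps the logical vector at length 5, so each element lines up with exactly one row.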
Another issue is in your second grepl: "w" is not a match for [yz]. If you want to select the columns with a name containing a "w" or a "z", one way would be with [wz].
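The column-name match can be tried on its own, outside subset(). A small sketch, with the column names typed in by hand rather than taken from df14:

```r
# Column names of the example data frame, entered by hand
cols <- c("x", "y", "z", "w")

picked_yz <- cols[grepl('[yz]', cols)]
picked_wz <- cols[grepl('[wz]', cols)]

print(picked_yz)   # "y" "z"
print(picked_wz)   # "z" "w"
```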
There is no need for the ^ anchor since all your strings contain a single character, but I'll leave it in anyway:
subset(df14,
       subset = grepl('^c|[l-n]', df14$z) |
                grepl('^c|[l-n]', df14$w),
       select = grepl('[wz]', colnames(df14)))
#   z w
# 2 b l
# 3 c m
# 4 d n
Or with the free-spacing mode modifier and a different pattern ([wz] vs w|z) for the second grepl:
subset(df14,
       subset = grepl('(?x)^c | [l-n]', df14$z, perl = TRUE) |
                grepl('(?x)^c | [l-n]', df14$w, perl = TRUE),
       select = grepl('w|z', colnames(df14)))
#   z w
# 2 b l
# 3 c m
# 4 d n
The '^c | [l - n]' search expression can't find anything in those columns. Also, a more intuitive approach is to use [ , ] to do this type of subsetting. See http://adv-r.had.co.nz/Subsetting.html.
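As a rough sketch of the [ , ] approach, the same row and column conditions can be built as logical vectors and passed straight to the brackets. The data frame is rebuilt here on the assumption that z and w are character columns; rows, cols and result are illustrative names, and the pattern has its spaces removed.

```r
# Rebuild the example data; character columns are an assumption
df14 <- data.frame(x = 1:5, y = seq(10, 18, by = 2),
                   z = letters[1:5], w = letters[11:15],
                   stringsAsFactors = FALSE)

# One logical value per row, one per column
rows <- grepl('^c|[l-n]', df14$z) | grepl('^c|[l-n]', df14$w)
cols <- grepl('[wz]', colnames(df14))

result <- df14[rows, cols]
print(result)   # rows 2, 3, 4 of columns z and w
```

Keeping the two logical vectors in named variables also makes it easy to inspect each condition before subsetting.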