I'm learning R and I'm a little lost. I have a data.frame with 13 columns. My 13th column holds ratings for a brand. However, I have a lot of bad data in that column. How would I filter that column? For example, for any product a rating of 1-5 is fine, but in my .csv file there are blanks, words like INC, words like "bar", etc. So I only want to use rows with a rating of 1-5 and skip any row that has anything else. Do I have to write a function? Use ddply? Thank you for any help.
I'll just make a simple 2-column data set.
dd <- data.frame(
band=letters[1:8],
rating=c("1","5","INC","3","bar",NA,"2","1")
)
# band rating
# 1 a 1
# 2 b 5
# 3 c INC
# 4 d 3
# 5 e bar
# 6 f <NA>
# 7 g 2
# 8 h 1
I can subset this to only the rows where rating is 1, 2, 3, 4, or 5 with
dd[which(as.numeric(as.character(dd$rating)) %in% 1:5), ]
# band rating
# 1 a 1
# 2 b 5
# 4 d 3
# 7 g 2
# 8 h 1
Your column is probably a factor in R, so I use as.character to get the labels and then as.numeric to turn those labels into numbers. If a label is not a number, it is turned into an NA value. Now I check which values are in the set 1:5; NA never matches, so the bad rows come back FALSE. I wrap that in which to turn the result into row numbers, and then use this numeric vector to subset the data.frame to just the rows I'm interested in. You can assign this result to a new variable. You will get a warning that NAs were introduced by coercion, but that's OK and what we expect.
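If this filter is needed more than once, the same steps can be wrapped in a small helper. This is only a sketch: the name keep_valid_ratings and its arguments are made up for illustration, and suppressWarnings is used here to silence the expected coercion warning.

```r
# Hypothetical helper: keep only the rows whose `col` holds a value in `valid`.
keep_valid_ratings <- function(df, col, valid = 1:5) {
  # as.character() guards against factor columns; non-numeric labels become NA
  vals <- suppressWarnings(as.numeric(as.character(df[[col]])))
  # %in% returns FALSE for NA, so rows with bad values are dropped
  df[vals %in% valid, , drop = FALSE]
}

dd <- data.frame(
  band = letters[1:8],
  rating = c("1", "5", "INC", "3", "bar", NA, "2", "1")
)
clean <- keep_valid_ratings(dd, "rating")
# Convert the surviving ratings to numbers for later calculations
clean$rating <- as.numeric(as.character(clean$rating))
```

For the 13-column data in the question, col could be the column's name or simply 13, since df[[col]] accepts either.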
First, welcome to the best open-source software on the planet.
Okay, here's an example. Take this messy data frame x
> x <- data.frame(a = c("foo", "bar", "2", "INC", "5"),
b = c("1", "NO", "foo", "3", "no"))
> x
# a b
# 1 foo 1
# 2 bar NO
# 3 2 foo
# 4 INC 3
# 5 5 no
We can find the numeric values in numerous ways, but I like grep. The following shows us that rows 1 and 4 of column b contain numeric values
> grep('[0-9]+', as.character(x$b))
# [1] 1 4
Then we can save that as numsb
> numsb <- grep('[0-9]+', as.character(x$b))
And subset the data frame for those rows with vector operations
> x[numsb, ]
# a b
# 1 foo 1
# 4 INC 3
Notice that you could also just put grep into the above subset. But I'll use grepl, the logical grep, for column a.
> x[grepl('[0-9]+', as.character(x$a)), ]
# a b
# 3 2 foo
# 5 5 no
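One caveat: the pattern '[0-9]+' matches any string that contains a digit, so values such as "10" or "4x" would pass too. If only the single ratings 1 to 5 should count, as in the question, an anchored pattern may be a closer fit. A minimal sketch with made-up values:

```r
ratings <- c("1", "10", "4x", "INC", "5", "", NA, "3")
# ^ and $ anchor the match to the whole string;
# [1-5] allows exactly one digit between 1 and 5
ok <- grepl("^[1-5]$", ratings)
# grepl() returns FALSE for NA, so missing values are dropped as well
ratings[ok]
```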
The same subsetting works for the other columns. You'll need to coerce the columns to class numeric if you need them for calculations
> z <- x[numsb,]
> z$b <- as.numeric(as.character(z$b))
and the same applies to the other subsets.
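The as.character step matters when a column is a factor, which was the default for strings in data.frame and read.csv before R 4.0. Calling as.numeric directly on a factor returns the internal level codes rather than the labels. A small illustration with a made-up factor:

```r
f <- factor(c("3", "5", "1"))
# Levels are sorted as "1", "3", "5", so this gives the level codes
codes <- as.numeric(f)
# Going through as.character() recovers the labels as numbers
values <- as.numeric(as.character(f))
```

Here codes holds positions within the levels, while values holds the ratings themselves, which is what calculations need.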