
Filtering data.frame values

I'm learning R and I'm a little lost. I have a data.frame with 13 columns. My 13th column is ratings for a brand. However, I have a lot of bad data in that column. How would I filter that column? For example, for any product a rating of 1-5 is fine, but in my .csv file there are blanks, words like INC, words like "bar", etc. So I only want to use rows with a rating of 1-5 in them and skip any row that has anything else. Do I have to write a function? Use ddply? Thank you for any help.

I'll just make a simple 2-column data set.

dd <- data.frame(
    band=letters[1:8],
    rating=c("1","5","INC","3","bar",NA,"2","1")
)
#   band rating
# 1    a      1
# 2    b      5
# 3    c    INC
# 4    d      3
# 5    e    bar
# 6    f   <NA>
# 7    g      2
# 8    h      1

I can subset this to only values in rating that are 1, 2, 3, 4, or 5 with

dd[which(as.numeric(as.character(dd$rating)) %in% 1:5), ]
#   band rating
# 1    a      1
# 2    b      5
# 4    d      3
# 7    g      2
# 8    h      1

Your column is probably a factor (or a character vector) in R. So I use as.character to get the labels, and then as.numeric to get the numeric value of each label. If a label is not a number, it is turned into an NA value. Next I check which values are in the set 1:5. %in% returns FALSE for NA, so those rows are dropped, and I wrap the test in which to turn it into a vector of row numbers. Then I use this numeric vector to subset the data.frame to just the rows I'm interested in. You can assign this result to a new variable. You will get a warning that NAs were introduced by coercion, but that's OK and what we expect.
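The steps above can also be written out one at a time. This is only a sketch: the names rating_num, keep and clean are made up for illustration, stringsAsFactors = TRUE is an assumption to mimic a factor column, and suppressWarnings is used just to silence the expected coercion warning.

```r
# Example data mirroring dd above; stringsAsFactors = TRUE is an
# assumption so that rating behaves like a factor column.
dd <- data.frame(
    band = letters[1:8],
    rating = c("1", "5", "INC", "3", "bar", NA, "2", "1"),
    stringsAsFactors = TRUE
)

# Convert labels to numbers; labels that are not numbers become NA.
# suppressWarnings hides the expected "NAs introduced by coercion" warning.
rating_num <- suppressWarnings(as.numeric(as.character(dd$rating)))

# TRUE only where the rating is 1, 2, 3, 4 or 5 (NA %in% 1:5 is FALSE)
keep <- rating_num %in% 1:5

# Keep the good rows and store the rating as a real numeric column
clean <- dd[keep, ]
clean$rating <- rating_num[keep]
```

After this, clean$rating is numeric, so something like mean(clean$rating) works directly.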

First, welcome to the best open-source software on the planet.

Okay, here's an example. Take this messy data frame x

> x <- data.frame(a = c("foo", "bar", "2", "INC", "5"), 
                  b = c("1", "NO", "foo", "3", "no"))
> x
#     a   b
# 1 foo   1
# 2 bar  NO
# 3   2 foo
# 4 INC   3
# 5   5  no

We can find the numeric values in several different ways, but I like grep. The following shows us that rows 1 and 4 of column b contain numeric values:

> grep('[0-9]+', as.character(x$b))
# [1] 1 4

Then we can save that as numsb

> numsb <- grep('[0-9]+', as.character(x$b))

And subset the data frame for those rows with vector operations

> x[numsb, ]
#     a b
# 1 foo 1
# 4 INC 3

Notice that you could also just put grep directly into the above subset. But I'll use grepl, the logical grep, for column a.

> x[grepl('[0-9]+', as.character(x$a)), ]
#   a   b
# 3 2 foo
# 5 5  no

The same follows for the other columns. You'll need to coerce the columns to class numeric if you need them for calculations

> z <- x[numsb,]
> z$b <- as.numeric(as.character(z$b))  # as.character first, in case b is a factor

and the same for the other subsets.
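One caveat with the pattern '[0-9]+': it matches any string that contains a digit, so values such as "10" or "7up" would also pass. If only the ratings 1 to 5 should be kept, as in the question, an anchored pattern is one option. The sketch below uses made-up data, and the names ok and z are only for illustration.

```r
# Made-up data with a few tricky values ("10", "7up", NA)
x <- data.frame(
    a = c("foo", "bar", "2", "INC", "5", "10", "7up", NA),
    b = letters[1:8],
    stringsAsFactors = FALSE
)

# ^ and $ anchor the match to the whole string, and [1-5] allows
# exactly one digit from 1 to 5; grepl returns FALSE for NA
ok <- grepl("^[1-5]$", as.character(x$a))

# Subset the rows and convert the cleaned column to numeric
z <- x[ok, ]
z$a <- as.numeric(as.character(z$a))
```

Here only the rows holding "2" and "5" are kept.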
