
Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).

RawData <- function(x)
{
  for(i in 1:nrow(x))
  {
    if(grep(".DERIVED", x[i,]) >= 1)
    {
      x <- x[-i,]
    }
  }
  for(i in 1:ncol(x))
  {
    if(is.numeric(x[,i]) != TRUE)
    {
      x <- x[,-i]
    }
  }
  return(x)
}

The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:

if(grep(".DERIVED", x[i,]) >= 1)

The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.

If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).

library(dplyr)
library(stringr)

# filter_all() needs its predicate wrapped in all_vars() or any_vars()
df %>%
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)

You can make it a function just as you would anything else:

mattsFunction <- function(dat){
  # Keep rows where no cell contains ".DERIVED", then keep numeric columns
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}

You should probably give it a better name, though.

The error is from the line

if(grep(".DERIVED", x[i,]) >= 1)

When grep doesn't find the term ".DERIVED", it returns integer(0), a vector of zero length. Comparing that with 1 doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot use logical(0) as its condition.

A simple example:

if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since grep returns 1 and the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero

You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
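To see why the original condition fails and the length-based one does not, here is a small sketch on two made-up strings (hit and miss are hypothetical inputs, not taken from the question's data). The any(grepl(...)) form is an alternative that I'm adding for comparison; it is not part of the suggestion above.

```r
# Hypothetical inputs for illustration only
hit  <- "1234.DERIVEDabcdefg"
miss <- "1234abcdefg"

grep(".DERIVED", hit)    # index of the matching element
grep(".DERIVED", miss)   # integer(0): nothing matched

# Comparing a zero-length vector yields a zero-length logical,
# which if() cannot use as a condition
cmp <- grep(".DERIVED", miss) >= 1
length(cmp)

# Two conditions that always return a single TRUE or FALSE
cond_length <- length(grep(".DERIVED", miss)) != 0
cond_any    <- any(grepl(".DERIVED", miss))
```

Either cond_length or cond_any can go straight into an if statement, because each is always a length-one logical.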

There's something else you haven't noticed yet: you're removing rows and columns inside the loops that iterate over them. Say you remove the 5th column; the next loop iteration (when i = 6) will then be handling what was originally the 7th column, so the original 6th column is never checked. The loop also still runs up to the original column count, which will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.
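One way to sidestep the shifting indices is to work out first which rows and columns to keep, and only then subset once. The following is a minimal base R sketch under that assumption; the name RawData2 and the toy data frame are hypothetical, and it treats ".DERIVED" as literal text via fixed = TRUE.

```r
# Sketch of a corrected RawData(); RawData2 is a hypothetical name
RawData2 <- function(x) {
  # TRUE for each row in which any cell contains the literal text ".DERIVED"
  has_derived <- rep(FALSE, nrow(x))
  for (col in seq_along(x)) {
    has_derived <- has_derived |
      grepl(".DERIVED", as.character(x[[col]]), fixed = TRUE)
  }
  # TRUE for each numeric column
  is_num <- vapply(x, is.numeric, logical(1))
  # Subset once; drop = FALSE keeps a data frame even if one column remains
  x[!has_derived, is_num, drop = FALSE]
}

# Made-up data for illustration
toy <- data.frame(a = c("ok", "x.DERIVED", "ok"),
                  b = c(1, 2, 3),
                  stringsAsFactors = FALSE)
res <- RawData2(toy)
```

The loop here runs over columns only to build a logical mask; nothing is deleted until the single subsetting step at the end, so no index can go stale.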

I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.

Note that you should use the escaped regex "\\.DERIVED" rather than ".DERIVED", which would mean "any character followed by DERIVED".
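The difference between the patterns can be checked directly. This is a small sketch on made-up strings (the vector s is hypothetical); fixed = TRUE is a further option that I'm adding here, which turns off regex interpretation altogether.

```r
# Hypothetical strings: only the first contains a literal ".DERIVED"
s <- c("data.DERIVED", "dataXDERIVED", "data")

unescaped <- grepl(".DERIVED", s)                # "." matches any character
escaped   <- grepl("\\.DERIVED", s)              # "\\." matches a literal dot
literal   <- grepl(".DERIVED", s, fixed = TRUE)  # no regex interpretation
```

Here unescaped also flags "dataXDERIVED", while escaped and literal flag only "data.DERIVED".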

I don't have example data or output, so here's my best go...

# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
                   b = (c(1,2,3,4,5)),
                   c = c("A","B","C","D","E"),
                   d = c(2,5,6,8,9),
                   stringsAsFactors = FALSE)

# Note: The following code assumes that the columns to keep have class numeric,
# because the example code provided assumed the same. It will not detect
# character columns whose values are all numbers stored as strings.

# Using the base subset command
test2 <- subset(test,
                subset = !grepl("\\.DERIVED",test$a),
                select = sapply(test,is.numeric))

# > test2
#   b d
# 1 1 2
# 3 3 6
# 4 4 8


# Using []. Note: drop = FALSE keeps the result a data.frame even if only
# 1 column is numeric (without it, [] would return a vector)
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test2,is.numeric), drop = FALSE]

# > test2
#   b d
# 1 1 2
# 3 3 6
# 4 4 8
