Don't understand R's apply function

I have a data frame in which one column represents a numerical value and I'd like to add a column to the data frame that is a discretized version of this column. Here is a reproducible example:

# create example data
smallData <- data.frame(name = as.character(c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC")), 
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)))

This creates the small example here:

 smallData
   name value
1    IC    29
2    IC    29
3    IC    29
4    IC    29
5    IC    29
6    BC    29
7    BC    29
8    BC    29
9    BC    43
10   BC    26

Now I'd like to add a column to the data frame that discretizes the rows based on the 'value' column:

# add new column to data frame
smallData$category <- ""
# define function to categorize data frame objects
categorize <- function(r)
{
  target <- r[c("value")]

  if(target < 27)
  {
    r[c("category")] <- "A"
  } else if(target < 30) {
    r[c("category")] <- "B"
  } else {
    r[c("category")] <- "C"
  }
  return(r)
}
# call to apply
smallData <- apply(smallData,1,categorize)
smallData

The output for this code is:

> smallData
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
name     "IC" "IC" "IC" "IC" "IC" "BC" "BC" "BC" "BC" "BC" 
value    "29" "29" "29" "29" "29" "29" "29" "29" "43" "26" 
category "B"  "B"  "B"  "B"  "B"  "B"  "B"  "B"  "C"  "A"  

Here is the output of the str() function for smallData :

> str(smallData)
 chr [1:3, 1:10] "IC" "29" "B" "IC" "29" "B" "IC" "29" "B" "IC" "29" "B" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "name" "value" "category"
  ..$ : NULL

I'm unfamiliar with this data type. Is smallData now a list, a vector, or something else? I thought that since apply() returns a vector or array, feeding it the rows of the smallData data frame would return the result in the same data format. Why is that not the case? I've also looked at sapply() and lapply(), but they seem to explicitly return a list, which doesn't seem to be what I want.

I seem to have a misunderstanding of the apply() function. I thought it was essentially a vectorized replacement for a 'for' loop but converting a simple for loop to use apply() isn't as straightforward as it seems like it should be.

smallData[ ,"category"] <- c("A","B","C")[ 
                   findInterval(smallData[, "value"], c(-Inf,27,30, Inf))]

The suggestion to use cut would also make sense. My preference is to use cut2 from the Hmisc package. You could also have used a couple of ifelse assignments. The reason you got a matrix (and a character matrix at that) is that apply first coerces a data frame to a matrix with as.matrix, and a matrix can hold only one type, so the integers are converted to character strings. apply is tempting to use, but often very damaging to your data structure.
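The "couple of ifelse assignments" mentioned above might look like the following sketch. It rebuilds the example data so it runs on its own; the nesting order is one possible arrangement, not the only one.

```r
# Rebuild the example data so the snippet is self-contained
smallData <- data.frame(name  = c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC"),
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)),
                        stringsAsFactors = FALSE)

# ifelse() is vectorized: each test is applied to the whole column at once
smallData$category <- ifelse(smallData$value < 27, "A",
                      ifelse(smallData$value < 30, "B", "C"))

str(smallData)   # still a data.frame, and 'value' is still an integer column
```

Because nothing is passed through as.matrix, the data frame keeps its column types.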

A further note. When you use cut you get a factor object, whereas the method I outlined above gives you a character vector. There are situations where you would want a factor, such as in the immediate preparation of data for regression functions, but I find it better to put off constructing factors. They can be kind of a pain to work with.

As @Adrian says, you can use cut() :

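# right=FALSE makes the intervals [0,27), [27,30), [30,Inf),
# matching the "< 27" and "< 30" tests in the question
smallData$category <- cut(smallData$value,breaks=c(0,27,30,Inf),
                          labels=c("A","B","C"),right=FALSE)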

(use as.character() on the result if, as @DWin suggests, you want a character rather than a factor result ...)
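Two details of cut() are easy to overlook: it returns a factor, and by default its intervals are closed on the right, so a value of exactly 27 or 30 is binned differently than under the question's `< 27` and `< 30` tests. A small sketch, using made-up boundary values (27 and 30 do not occur in the original data):

```r
# Illustrative values chosen to hit the break points
value <- c(26L, 27L, 29L, 30L, 43L)

# Default right = TRUE: intervals are (0,27], (27,30], (30,Inf]
closed_right <- cut(value, breaks = c(0, 27, 30, Inf),
                    labels = c("A", "B", "C"))

# right = FALSE: intervals are [0,27), [27,30), [30,Inf)
closed_left <- cut(value, breaks = c(0, 27, 30, Inf),
                   labels = c("A", "B", "C"), right = FALSE)

class(closed_right)          # "factor"
as.character(closed_right)   # "A" "A" "B" "B" "C"
as.character(closed_left)    # "A" "B" "B" "C" "C"
```

For the data in the question (26, 29 and 43 only) both variants give the same categories.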

There are two reasons apply isn't working the way you think:

  • it coerces its input (your data frame) into a matrix before calling your function, which means that all the elements will be of type character (the most general type that includes all the data in the matrix): from ?apply ,

    If 'X' is not an array but an object of a class with a non-null 'dim' value (such as a data frame), 'apply' attempts to coerce it to an array via 'as.matrix' if it is two-dimensional (e.g., a data frame) or via 'as.array'.

  • apply() effectively transposes your array in this case:

    If each call to 'FUN' returns a vector of length 'n', then 'apply' returns an array of dimension 'c(n, dim(X)[MARGIN])' if 'n > 1'.
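Both effects can be seen in isolation. The sketch below rebuilds the example data, uses an identity function in place of categorize(), and then shows one possible way to get back to a data frame; the names `m`, `res` and `fixed` are only illustrative.

```r
smallData <- data.frame(name  = c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC"),
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)),
                        stringsAsFactors = FALSE)

# Effect 1: as.matrix() on mixed character/integer columns yields a character matrix
m <- as.matrix(smallData)
typeof(m)   # "character"
dim(m)      # 10 rows, 2 columns

# Effect 2: each call returns a length-2 vector, and apply() binds them as columns
res <- apply(smallData, 1, function(r) r)
dim(res)    # 2 rows, 10 columns

# Undo the transposition and restore the integer type by hand
fixed <- as.data.frame(t(res), stringsAsFactors = FALSE)
fixed$value <- as.integer(fixed$value)
str(fixed)
```

Transposing fixes the shape, but the column types have to be restored explicitly, which is why the vectorized approaches avoid the round trip altogether.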

The other two answers here are great, and they are a more elegant solution to your problem. I am adding my own post here so that you can see how an apply statement would accomplish what you were trying to do:

smallData <- data.frame(name = as.character(c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC")), 
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)))

# Create custom categorize function
categorize <- function(r)
{
  if(r < 27) {
    return("A")
  } else if(r < 30) {
    return("B")
  } else {
    return("C")
  }
}

# call to apply
# "value" is selected as a one-column data frame, so apply() sees a numeric matrix
smallData$category <- apply(smallData["value"], 1, categorize)
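Because categorize() uses if, it can only handle one value per call, so some element-wise loop is needed. As a hedged alternative to the apply() call, vapply() can loop over the column directly without any conversion to a matrix. This is a sketch of that variant, not part of the original answer:

```r
smallData <- data.frame(name  = c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC"),
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)),
                        stringsAsFactors = FALSE)

# Scalar categorizer: expects a single number
categorize <- function(r) {
  if (r < 27) "A" else if (r < 30) "B" else "C"
}

# vapply() calls categorize() once per element and checks that
# every result is a single character string
smallData$category <- vapply(smallData$value, categorize, character(1))

smallData$category   # "B" "B" "B" "B" "B" "B" "B" "B" "C" "A"
```

The third argument, character(1), declares the expected shape of each result, so a categorizer that accidentally returned something else would fail loudly instead of silently changing the output type.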
