Trouble understanding R's apply() function

Question

I have a data frame in which one column holds a numeric value, and I'd like to add a column that is a discretized version of it. Here is a reproducible example:

# create example data
smallData <- data.frame(name = as.character(c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC")), 
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)))

This creates the following small data frame:

> smallData
   name value
1    IC    29
2    IC    29
3    IC    29
4    IC    29
5    IC    29
6    BC    29
7    BC    29
8    BC    29
9    BC    43
10   BC    26

Now I'd like to add a column to the data frame that discretizes the rows based on the 'value' column:

# add new column to data frame
smallData$category <- ""
# define function to categorize data frame objects
categorize <- function(r)
{
  target <- r[c("value")]

  if(target < 27)
  {
    r[c("category")] <- "A"
  } else if(target < 30) {
    r[c("category")] <- "B"
  } else {
    r[c("category")] <- "C"
  }
  return(r)
}
# call to apply
smallData <- apply(smallData,1,categorize)
smallData

The output of this code is:

> smallData
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
name     "IC" "IC" "IC" "IC" "IC" "BC" "BC" "BC" "BC" "BC" 
value    "29" "29" "29" "29" "29" "29" "29" "29" "43" "26" 
category "B"  "B"  "B"  "B"  "B"  "B"  "B"  "B"  "C"  "A"

Here is the output of str() for smallData:

> str(smallData)
 chr [1:3, 1:10] "IC" "29" "B" "IC" "29" "B" "IC" "29" "B" "IC" "29" "B" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3] "name" "value" "category"
  ..$ : NULL

I'm unfamiliar with this data type. Is smallData now a list, a vector, or something else? Since apply() returns a vector or array, I expected that feeding it rows from the smallData data frame would return the result in the same format. Why is that not the case? I've also looked at sapply() and lapply(), but they seem to explicitly return a list, which doesn't seem to be what I want.

I seem to misunderstand the apply() function. I thought it was essentially a vectorized replacement for a 'for' loop, but converting a simple for loop to use apply() isn't as straightforward as it seems it should be.

Answer 1

smallData[ ,"category"] <- c("A","B","C")[
                   findInterval(smallData[, "value"], c(-Inf, 27, 30, Inf))]

The suggestion to use cut would also make sense. My preference is to use cut2 from the Hmisc package. You could also have used a couple of ifelse assignments. The reason you got a matrix (and a character matrix at that) is that apply first coerces a data frame to a matrix, and it returns a matrix whenever the function returns a vector longer than one element. It is tempting to use, but often very damaging to your data structure.
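As a rough sketch of the ifelse route mentioned above (the thresholds 27 and 30 are taken from the question; the layout is only one way to write it), nested ifelse() calls test the whole column at once and leave the data frame intact:

```r
# Recreate the example data from the question
smallData <- data.frame(
  name  = c("IC", "IC", "IC", "IC", "IC", "BC", "BC", "BC", "BC", "BC"),
  value = as.integer(c(29, 29, 29, 29, 29, 29, 29, 29, 43, 26))
)

# ifelse() is vectorized: each test is applied to the entire column,
# so no row-by-row apply() call is needed
smallData$category <- ifelse(smallData$value < 27, "A",
                      ifelse(smallData$value < 30, "B", "C"))

str(smallData)  # still a data.frame; 'value' is still an integer column
```

Because only a new column is assigned, the existing columns keep their types.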

A further note. When you use cut you get a factor object, whereas the method I outlined above gives you a character vector. There are situations where you would want a factor, such as in the immediate preparation of data for regression functions, but I find it better to put off constructing factors. They can be kind of a pain to work with.
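To make the factor versus character distinction concrete, here is a small illustration. The three sample values and the variable names as_chr and as_fct are made up for this sketch:

```r
value <- c(29L, 43L, 26L)
labs  <- c("A", "B", "C")

# findInterval() returns interval indices, used here to pick labels
as_chr <- labs[findInterval(value, c(-Inf, 27, 30, Inf))]

# cut() returns a factor with the labels as its levels
as_fct <- cut(value, breaks = c(0, 27, 30, Inf), labels = labs)

class(as_chr)       # character
class(as_fct)       # factor

# A factor is stored as integer codes plus a fixed set of levels
as.integer(as_fct)
levels(as_fct)

# Converting to character later is a one-liner
identical(as.character(as_fct), as_chr)
```

Going from character to factor when a modelling function needs one is just as short (factor(as_chr)), which is part of why deferring the conversion costs little.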

Answer 2

As @Adrian says, you can use cut():

smallData$category <- cut(smallData$value,breaks=c(0,27,30,Inf),
                          labels=c("A","B","C"))

(Use as.character() on the result if, as @DWin suggests, you want a character rather than a factor result.)
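One detail worth checking with cut() is how values that fall exactly on a break are handled. By default the intervals are closed on the right, so 27 lands in "A", whereas the question's test (value < 27) would put it in "B". The sketch below uses made-up boundary values to show the difference; right = FALSE is a standard cut() argument:

```r
x <- c(26, 27, 29, 30, 43)

# Default (right = TRUE): intervals are (0,27], (27,30], (30,Inf]
default_cut <- cut(x, breaks = c(0, 27, 30, Inf), labels = c("A", "B", "C"))

# right = FALSE: intervals are [0,27), [27,30), [30,Inf),
# which matches the question's "< 27" and "< 30" tests
left_closed <- cut(x, breaks = c(0, 27, 30, Inf), labels = c("A", "B", "C"),
                   right = FALSE)

data.frame(x, default_cut, left_closed)
```

For the data in the question (26, 29 and 43) both versions give the same categories, so this only matters if a value can equal 27 or 30.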

There are two reasons apply() isn't working the way you expect:

First, it coerces its input to a matrix, which means that all the elements will be of type character (the most general type that can hold all the data in the matrix). From ?apply:

If 'X' is not an array but an object of a class with a non-null 'dim' value (such as a data frame), 'apply' attempts to coerce it to an array via 'as.matrix' if it is two-dimensional (e.g., a data frame) or via 'as.array'.

Second, apply() effectively transposes your array in this case:

If each call to 'FUN' returns a vector of length 'n', then 'apply' returns an array of dimension 'c(n, dim(X)[MARGIN])' if 'n > 1'.
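Both effects can be seen on a cut-down version of the data. This is only an illustrative sketch: the four-row data frame, the placeholder category "X", and the names m, res and fixed are invented here.

```r
smallData <- data.frame(name  = c("IC", "IC", "BC", "BC"),
                        value = c(29L, 29L, 43L, 26L))

# Effect 1: apply() starts by calling as.matrix() on the data frame.
# A matrix has a single type, so the integer column becomes character.
m <- as.matrix(smallData)
typeof(m)

# Effect 2: each call returns a length-3 vector, so the result has
# dimension c(3, nrow(smallData)): the original rows are now columns.
res <- apply(smallData, 1, function(r) c(r, category = "X"))
dim(res)

# Getting a data frame back means transposing and restoring types by hand
fixed <- as.data.frame(t(res), stringsAsFactors = FALSE)
fixed$value <- as.integer(fixed$value)
str(fixed)
```

The amount of repair work in the last three lines is a fair argument for avoiding apply() on data frames with mixed column types.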

Answer 3

The other two answers here are great, and they are more elegant solutions to your problem. I am adding my own post so that you can see how an apply statement would accomplish what you were trying to do:

smallData <- data.frame(name = as.character(c("IC","IC","IC","IC","IC","BC","BC","BC","BC","BC")), 
                        value = as.integer(c(29,29,29,29,29,29,29,29,43,26)))

# Create custom categorize function
categorize <- function(r)
{
  if(r < 27) {
    return("A")
  } else if(r < 30) {
    return("B")
  } else {
    return("C")
  }
}

# call to apply
smallData$category <- apply(smallData[match("value", names(smallData))],1,categorize)
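Since categorize() here takes a single value, a possible variation (not part of the original answer) is to run it over the column itself with vapply(), which avoids the data frame to matrix conversion altogether and checks that every result is a single string:

```r
smallData <- data.frame(
  name  = c("IC", "IC", "IC", "IC", "IC", "BC", "BC", "BC", "BC", "BC"),
  value = as.integer(c(29, 29, 29, 29, 29, 29, 29, 29, 43, 26))
)

# Same scalar function as above, written compactly
categorize <- function(r) {
  if (r < 27) "A" else if (r < 30) "B" else "C"
}

# vapply() calls categorize() once per element of the column and
# requires each result to be a character vector of length 1
smallData$category <- vapply(smallData$value, categorize, character(1))

smallData
```

sapply(smallData$value, categorize) would give the same result here; vapply() just makes the expected return type explicit.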


Answer 1
3 2013-11-19 21:29:25

Answer 2
3 accepted 2013-11-19 21:30:41

Answer 3
2 2013-11-19 21:40:02
