
I can calculate the min or max of a row or column and I can calculate the mean of a column, but I can't calculate the mean of a row. Why not?

Given a simple 2x2 data frame, I can calculate the min or max of a row or column and I can calculate the mean of a column, but I can't calculate the mean of a row. Why not?

> dat <- data.frame( A=c(1,2),B=c(3,4))
> dat
  A B
1 1 3
2 2 4
> min(dat[1,])
[1] 1
> max(dat[1,])
[1] 3
> mean(dat[,1])
[1] 1.5
> mean(dat[1,])
[1] NA
Warning message:
In mean.default(dat[1, ]) :
  argument is not numeric or logical: returning NA

max and min accept multiple vectors as arguments and calculate the maximum/minimum across all of them.
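As a small illustration of that behaviour (the vectors a and b are made up for this sketch and are not part of the original question), both functions pool all of their arguments before reducing:

```r
# Two separate numeric vectors (illustrative values)
a <- c(1, 2)
b <- c(3, 4)

# max() and min() pool every argument and return a single value
overall_max <- max(a, b)   # 4
overall_min <- min(a, b)   # 1
```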

mean is more limited: it takes a single argument of a supported type. For example, a vector is a supported type.

For more details see ?max and ?mean, especially the Usage, Arguments, and Details sections.

The class of dat is data.frame. So is the class of dat[1,], because a row of a data frame is also a data frame, with a single value in each of its columns.
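A quick way to check this yourself, sketched with the dat from the question (row_class and col_class are just names chosen here for illustration):

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4))

# Selecting a row keeps the data frame structure
row_class <- class(dat[1, ])   # "data.frame"

# Selecting a single column drops down to a plain vector
col_class <- class(dat[, 1])   # "numeric"
```

This is why mean(dat[,1]) works while mean(dat[1,]) does not.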

When you pass a data frame to max, it operates on the columns (vectors) of the data frame, returning the maximum value of all of them.

When you pass a data frame to mean, it returns NA with a warning ("argument is not numeric or logical"), because a data frame is not one of the supported types.

You can use unlist to get a vector from a data frame. It does that, in practice, by concatenating all the vectors of the data frame. For example, unlist(dat) will return the vector 1 2 3 4. dat[1,] is the first row of dat, which holds the values 1 and 3, so unlist(dat[1,]) will return the vector 1 3. You can call mean on that.
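Put together with the dat from the question, that looks roughly like this (row1 and row1_mean are illustrative names):

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4))

# Flatten the one-row data frame into a named numeric vector (A = 1, B = 3)
row1 <- unlist(dat[1, ])

# mean() accepts a numeric vector
row1_mean <- mean(row1)   # 2
```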

If all of your columns are numeric, you can just use rowMeans(dat). To compactly select the numeric ones, you could do (for example) rowMeans(iris[, 1:4]).

If you don't want to have to worry about identifying which columns are numeric, you could also use sapply() to generate logical column indices for subsetting:

rowMeans(iris[, sapply(iris, is.numeric)])

Note also that rowMeans() has an na.rm parameter, which you can set to TRUE if you think your data might have missing values.
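To see the effect of na.rm, here is a small sketch using a made-up data frame (dat_na is hypothetical and not from the question) with one missing value:

```r
# Illustrative data with a missing value in the second row
dat_na <- data.frame(A = c(1, NA), B = c(3, 4))

# By default a row containing NA yields NA
with_na <- rowMeans(dat_na)

# With na.rm = TRUE the NA is dropped and the mean uses the remaining values
without_na <- rowMeans(dat_na, na.rm = TRUE)   # row 1: 2, row 2: 4
```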

Adding to lefft's answer, you don't need to know the numeric columns, and can use Filter to find them.

rowMeans(Filter(is.numeric, dat), na.rm = TRUE)

will do the trick. That being said, if you know the columns, is.numeric and Filter in conjunction are slower than simply listing out the columns.

EDIT

Sorry, I wish I could have left that as a comment on the previous answer, as I thought it was a useful clarification, but I had no other way of posting. To give a little more information about the overhead, I ran a microbenchmark on the ways of grabbing the numeric columns:

library(microbenchmark)
df.mb<-data.frame(
  c(runif(10000)),c(runif(10000)),c(runif(10000)),
  c(rep("A",10000)),c(rep("A",10000)),c(rep("A",10000)),
  c(rep("A",10000)),c(rep("A",10000)),c(rep("A",10000)))
names(df.mb)<-c("a","b","c","d","e","f","g","h","i")


function1<-function(x) {rowMeans(Filter(is.numeric,x))}
function2<-function(x) {rowMeans(x[,1:3])}
function3<-function(x) {rowMeans(x[,c("a","b","c")])}
function4<-function(x) {rowMeans(x[ ,sapply(x,is.numeric)])}

microbenchmark(
  function1(df.mb),
  function2(df.mb),
  function3(df.mb),
  function4(df.mb)
)

Unit: microseconds
         expr     min       lq     mean   median       uq       max neval cld
 function1(df.mb) 351.148 372.4810 768.2310 464.0005 492.5875 16216.321   100   a
 function2(df.mb) 317.441 338.5605 667.6871 429.6545 442.0270 15281.921   100   a
 function3(df.mb) 317.867 340.4810 581.0908 421.1205 439.0410  8965.121   100   a
 function4(df.mb) 363.521 385.2810 735.4673 461.6535 519.2545 15701.334   100   a

As long as you know the columns by name or number, selecting them directly is faster, but barring that either Filter or sapply will help.
