Multivariate random forest on a community matrix

Question

I want to use random forest modeling to understand variable importance in community assembly; my response data is a community matrix.

library(randomForestSRC)

# simulated species matrix
species 
# site       species 1    species2     species 3
# 1             1            1            0
# 2             1            0            1
# 3             1            1            1
# 4             1            0            1
# 5             1            0            0
# 6             1            1            0
# 7             1            1            0
# 8             1            0            0
# 9             1            0            0
# 10            1            1            0


# environmental data
data
# site   elevation_m     PRECIPITATION_mm  
# 1        500                28
# 2        140                37
# 3        445                15
# 4        340                45
# 5        448                20
# 6        55                 70
# 7        320                18
# 8        200                42
# 9        420                22
# 10       180                8


# adding my species matrix into the environmental data frame
data[["species"]] <-(species)

# running the model
rf_model <- rfsrc(Multivar(species) ~ ., data = data, importance = TRUE)

But I'm getting an error message:

Error in parseFormula(formula, data, ytry) : 
  the y-outcome must be either real or a factor.

I'm guessing that the issue is the presence/absence data, but I'm not sure how to move past that. Is this a limitation of the function?

Answer 1

I think it might have to do with how you built your "data" data frame. When you used data[["species"]] <- (species), you ended up with a data frame nested inside a data frame. If you run str(data) after that step, the output is this:

> str(data)
'data.frame':   10 obs. of  4 variables:
$ site     : int  1 2 3 4 5 6 7 8 9 10
$ elevation: num  500 140 445 340 448 55 320 200 420 180
$ precip   : num  28 37 15 45 20 70 18 42 22 8
$ species  :'data.frame':      10 obs. of  4 variables: #2nd data frame
..$ site     : int  1 2 3 4 5 6 7 8 9 10
..$ species.1: num  1 1 1 1 1 1 1 1 1 1
..$ species2 : num  1 0 1 0 0 1 1 0 0 1
..$ species.3: num  0 1 1 1 0 0 0 0 0 0
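The nesting can be reproduced with a small base R sketch. The data frames below are stand-ins built from the values printed in the question; the column names follow the str() output above, and the name `nested` is introduced here only for illustration.

```r
# Stand-ins for the question's `data` and `species` (values copied from the post).
data <- data.frame(
  site      = 1:10,
  elevation = c(500, 140, 445, 340, 448, 55, 320, 200, 420, 180),
  precip    = c(28, 37, 15, 45, 20, 70, 18, 42, 22, 8)
)
species <- data.frame(
  site      = 1:10,
  species.1 = rep(1, 10),
  species2  = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  species.3 = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

nested <- data
nested[["species"]] <- species   # stores the whole data frame as ONE column

ncol(nested)                     # 4: site, elevation, precip, species
is.data.frame(nested$species)    # TRUE: the column is itself a data frame
is.numeric(nested$species)       # FALSE: neither numeric nor a factor
```

Because the single column called species is neither numeric nor a factor, the error message about the y-outcome is at least consistent with this structure.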

If you instead build your data frame as data2 <- as.data.frame(cbind(data, species)), then

rfsrc(Multivar(species.1, species2, species.3) ~ ., data = data2, importance = TRUE)

seems to work: I don't get an error message, and I get some reasonable-looking output:

Sample size: 10
Number of trees: 1000
Forest terminal node size: 5
Average no. of terminal nodes: 2
No. of variables tried at each split: 2
Total no. of variables: 4
Total no. of responses: 3
User has requested response: species.1
Resampling used to grow trees: swr
Resample size used to grow trees: 10
Analysis: mRF-R
Family: regr+
Splitting rule: mv.mse *random*
Number of random split points: 10
% variance explained: NaN
Error rate: 0
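One detail in this output is the line "% variance explained: NaN", reported for species.1, which is 1 at every site. A plausible explanation (an assumption here, not something the output states) is that a constant response has zero variance, so a fraction of variance explained is undefined. A quick base R check for constant species columns, using a stand-in `species` data frame with the values from the question:

```r
# Stand-in species matrix with the values printed in the question.
species <- data.frame(
  species.1 = rep(1, 10),
  species2  = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  species.3 = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Variance of each response column; a value of 0 means the species is constant.
resp_var <- vapply(species, var, numeric(1))
constant <- names(resp_var)[resp_var == 0]
constant                          # "species.1"

# Dividing zero error by zero variance is undefined.
0 / resp_var[["species.1"]]       # NaN
```

Dropping species that are present (or absent) at every site before fitting is one way to avoid this, since such columns carry no information about the environment.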

I don't think your method for building the data frame is the customary way, but I could be wrong. I think rfsrc() did not know how to read a nested data frame, and I doubt most modeling functions can without extra custom code.

Answer 2

Here's an example, using example data from the vegan package, of automatically constructing a formula that includes all of the species names in the response:

library(vegan)
library(randomForestSRC)
data("dune.env")
data("dune")

all <- as.data.frame(cbind(dune,dune.env))
form <- formula(sprintf("Multivar(%s) ~ .",
                        paste(colnames(dune),collapse=",")))

rfsrc(form, data=all)
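To see what the generated formula looks like before fitting anything, it can be inspected in base R. This is a minimal sketch with three hypothetical species names (`sp1`, `sp2`, `sp3`); it assumes the column names are syntactically valid R names, since names containing spaces or other special characters would need backticks or cleaning (for example with make.names()).

```r
sp_names <- c("sp1", "sp2", "sp3")   # hypothetical species column names

# Same construction as above: paste the names into a Multivar(...) response.
form <- formula(sprintf("Multivar(%s) ~ .", paste(sp_names, collapse = ",")))

print(form)               # shows the assembled formula
all.vars(form[[2]])       # "sp1" "sp2" "sp3": the response columns
```

Checking all.vars() on the left-hand side against the species column names is a cheap way to confirm that every species ended up in the response.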

Suppose we want to do this with 2000 species. Here's a simulated example:

nsp <- 2000
nsamp <- 100
nenv <- 10
set.seed(101)
spmat <- matrix(rpois(nsp*nsamp, lambda=5), ncol=nsp,
                dimnames=list(NULL,paste0("sp",seq(nsp))))
envmat <- matrix(rnorm(nenv*nsamp), ncol=nenv,
                dimnames=list(NULL,paste0("env",seq(nenv))))

all2 <- as.data.frame(cbind(spmat,envmat))
form2 <- formula(sprintf("Multivar(%s) ~ .",
                        paste(colnames(spmat),collapse=",")))

rfsrc(form2, data=all2)

In this particular example we seem to explain -3% (!!) of the variance, but it doesn't crash, so that's a good thing ...
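Since the original question was about variable importance, here is a hedged sketch of pulling per-response importance out of a multivariate fit. It assumes the fit was grown with importance = TRUE and uses get.mv.vimp() from randomForestSRC; the simulated data and the names `env`, `sp`, `dat` and `vimp` are made up for illustration, and the fitting step is skipped if the package is not installed.

```r
set.seed(101)
n <- 50
env <- data.frame(env1 = rnorm(n), env2 = rnorm(n))

# Hypothetical counts: sp1 depends on env1, sp2 is pure noise.
sp <- data.frame(
  sp1 = as.numeric(rpois(n, lambda = exp(1 + 0.8 * env$env1))),
  sp2 = as.numeric(rpois(n, lambda = 5))
)
dat <- cbind(sp, env)
form <- formula(sprintf("Multivar(%s) ~ .", paste(colnames(sp), collapse = ",")))

vimp <- NULL
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  suppressPackageStartupMessages(library(randomForestSRC))
  fit <- rfsrc(form, data = dat, importance = TRUE)
  vimp <- get.mv.vimp(fit)   # importance of each predictor for each response
  print(vimp)
}
```

With real community data the same call would be applied to the fitted object from the examples above; with thousands of species the resulting importance table gets large, so summarising it per predictor may be more practical than reading it directly.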

Answer 1
1 2019-04-11 22:11:16

Answer 2
0 2020-09-06 02:13:24