Issue with getting only NA responses from my random forest model

Question

This is my first post on Stack Overflow, so please ask me follow-up questions if more info is needed!

Situation: I've compiled water chemistry data for freshwater ecosystems in the Maritimes (Atlantic Canada) because I am trying to create a predictive species distribution model for an invasive species using a random forest model (RFM). Unfortunately, Atlantic Canada lacks consistent water monitoring programs, and the ones that do exist don't monitor the same parameters as one another. So, my databases (both training and testing) have many NAs.

Issue: This is the response I keep getting from my RFM:

> p1 <- predict(model2, newdata=Test_Dataset,type="prob")[,2]
> p1

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
35 36 37
NA NA NA

What I have tried:

I built the RFM (i.e., model2) using various predictors. I did include:
model2 <- randomForest(CMS ~ Lat + Lon + pH + Alkalinity + Ca + Hardness + DO + TOC + T_P + T_N + Cond + Na + No_Stocking + No_Fish_Species + Dist_Hwy + No_Boat_Launches + Connected_Lakes + Invasives, importance = TRUE, data = TrainSet, na.action = na.roughfix)
model2

**Note that the long list of variables are the predictors and CMS is the species.**

I tried matching the test dataset (Test_Dataset) with the training dataset (Validation_Dataset).
Test_Dataset <- rbind(Validation_Dataset[1, ], Validation_Dataset)
Test_Dataset <- Test_Dataset[-1, ]
I have searched for and read multiple resources (including the obvious R pages and the references linked there).
I have mutated the data frame as follows (I'll just show the Validation_Dataset, as the mutations are the same for both):
# Mutate dataset to fix issues with R reading NA cells
Validation_Dataset <- Validation_Dataset %>% dplyr::mutate(
  # convert year into a categorical variable
  Year = factor(Year),
  # convert Chlorophyll concentrations from a character file to a number file
  # convert "NA" into a missing value whenever appropriate
  Chlorophyll = dplyr::na_if(Chlorophyll, "NA"), Chlorophyll = factor(Chlorophyll),
  Hardness = dplyr::na_if(Hardness, "NA"), Hardness = factor(Hardness),
  Alkalinity = dplyr::na_if(Alkalinity, "NA"), Alkalinity = factor(Alkalinity),
  Ca = dplyr::na_if(Ca, "NA"), Ca = factor(Ca),
  TOC = dplyr::na_if(TOC, "NA"), TOC = factor(TOC),
  Cond = dplyr::na_if(Cond, "NA"), Cond = factor(Cond),
  Na = dplyr::na_if(Na, "NA"), Na = factor(Cond),
  NH4 = dplyr::na_if(NH4, "NA"), NH4 = factor(NH4),
  NO3 = dplyr::na_if(NO3, "NA"), NO3 = factor(NO3),
  pH = dplyr::na_if(pH, "NA"), pH = factor(pH),
  T_N = dplyr::na_if(T_N, "NA"), T_N = factor(T_N),
  T_P = dplyr::na_if(T_P, "NA"), T_P = factor(T_P),
  DO = dplyr::na_if(DO, "NA"), DO = factor(DO),
  Salinity = dplyr::na_if(Salinity, "NA"), Salinity = factor(Salinity),
  No_Stocking = dplyr::na_if(No_Stocking, "NA"), No_Stocking = factor(No_Stocking),
  No_Fish_Species = dplyr::na_if(No_Fish_Species, "NA"), No_Fish_Species = factor(No_Fish_Species),
  Dist_Hwy = dplyr::na_if(Dist_Hwy, "NA"), Dist_Hwy = factor(Dist_Hwy),
  No_Boat_Launches = dplyr::na_if(No_Boat_Launches, "NA"), No_Boat_Launches = factor(No_Boat_Launches),
  Connected_Lakes = dplyr::na_if(Connected_Lakes, "NA"), Connected_Lakes = factor(Connected_Lakes),
  Invasives = dplyr::na_if(Invasives, "NA"), Invasives = factor(Invasives),
  Lat = factor(Lat),
  Lon = factor(Lon),
  CMS = factor(CMS))

Question: Does anybody know how to actually make the code work so that model2 predicts on Test_Dataset? I think this problem might actually be very small, but I'm not seeing it.

Here is a glimpse of the training dataset (Validation_Dataset):

> str(Validation_Dataset)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    37 obs. of  31 variables:
 $ Name            : chr  "Canard River" "Cedar Creek" "Holland River" "Speed River" ...
 $ STN #/COUNTY    : chr  "10000200202" "16001800202" "3007700202" "16018403402" ...
 $ Province        : chr  "ON" "ON" "ON" "ON" ...
 $ Lat             : Factor w/ 37 levels "42.03204214",..: 2 1 11 9 10 8 7 5 6 3 ...
 $ Lon             : Factor w/ 37 levels "-83.01879548",..: 1 2 11 8 10 6 7 9 5 4 ...
 $ Year            : Factor w/ 9 levels "2007, 2011","2010, 2015, 2011",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Month           : chr  "4" "4" "4" "4" ...
 $ Day             : chr  "11" "12" "26" "27" ...
 $ Data Source     : chr  "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" ...
 $ pH              : Factor w/ 35 levels "6.073333","6.13",..: 18 21 28 29 25 34 30 32 19 26 ...
 $ Alkalinity      : Factor w/ 31 levels "1.8","2.8","3.933333333",..: 19 22 31 30 27 NA NA 26 NA 21 ...
 $ Hardness        : Factor w/ 13 levels "14.8","36.8",..: 7 8 11 10 9 NA NA 13 NA NA ...
 $ Ca              : Factor w/ 24 levels "3.833333333",..: 18 19 24 20 21 NA NA 22 NA NA ...
 $ Chlorophyll     : Factor w/ 15 levels "0.423601","0.453791",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DO              : Factor w/ 26 levels "0.27","6.2","6.96",..: 21 24 18 16 4 25 17 14 2 7 ...
 $ TOC             : Factor w/ 3 levels "4.8","5.5","8.8": NA NA NA NA NA NA NA NA NA NA ...
 $ T_P             : Factor w/ 24 levels "0.002","0.003",..: 23 22 18 10 15 14 16 13 21 20 ...
 $ T_N             : Factor w/ 32 levels "0.006","0.13",..: 30 31 27 28 17 29 24 32 21 25 ...
 $ NO3+NO2         : num  2.173 2.292 1.092 1.695 0.426 ...
 $ NO3             : Factor w/ 32 levels "0.027","0.035",..: 30 31 26 27 11 29 24 32 22 8 ...
 $ NH4             : Factor w/ 27 levels "0.005","0.006",..: 26 25 22 17 9 11 13 19 23 27 ...
 $ Cond            : Factor w/ 34 levels "41","97","134",..: 24 21 29 23 22 14 34 31 21 17 ...
 $ Salinity        : Factor w/ 9 levels "0.11","0.15",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Na              : Factor w/ 34 levels "41","97","134",..: 24 21 29 23 22 14 34 31 21 17 ...
 $ No_Stocking     : Factor w/ 3 levels "0","1","2": 1 2 2 3 1 2 1 2 1 2 ...
 $ No_Fish_Species : Factor w/ 9 levels "0","1","2","3",..: 1 4 6 4 1 5 1 9 1 9 ...
 $ Dist_Hwy        : Factor w/ 16 levels "0.003","0.006",..: NA NA 16 NA NA NA NA 8 NA 5 ...
 $ No_Boat_Launches: Factor w/ 8 levels "0","1","2","3",..: 1 1 5 1 1 1 1 8 1 3 ...
 $ Connected_Lakes : Factor w/ 11 levels "0","1","2","3",..: 7 2 3 4 9 6 2 3 2 5 ...
 $ Invasives       : Factor w/ 3 levels "0","1","2": NA NA NA NA NA NA NA NA NA NA ...
 $ CMS             : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...

Answer 1

To use na.roughfix, first apply it to the data outside the randomForest function. I will use the iris dataset as an example.

library(randomForest)
# iris.na is assumed to be a copy of iris with some values set to NA
iris.roughfix <- na.roughfix(iris.na)
iris.narf <- randomForest(Species ~ ., iris.na, na.action = na.roughfix)
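
To connect this to the predict() call in the question, here is a minimal sketch of the same idea applied to a test set. It assumes the NA predictions come from missing predictor values in the new data, since predict() on a formula-based randomForest returns NA for rows with incomplete predictors. The names train, test, test_rows, fit, p_raw and p_fixed are made up for illustration, and the NA values are injected artificially into iris.

```r
library(randomForest)

set.seed(111)

# Copy of iris with some predictor values artificially set to NA
iris.na <- iris
for (i in 1:4) iris.na[sample(150, 15), i] <- NA

# Hypothetical split into training and test rows
test_rows <- sample(150, 30)
train <- iris.na[-test_rows, ]
test  <- iris.na[test_rows, ]

# na.action = na.roughfix imputes the training data while fitting
fit <- randomForest(Species ~ ., data = train, na.action = na.roughfix)

# Rows of the test set with missing predictors come back as NA
p_raw <- predict(fit, newdata = test, type = "prob")

# Imputing the test predictors first gives a prediction for every row
test_fixed <- na.roughfix(test[, 1:4])
p_fixed <- predict(fit, newdata = test_fixed, type = "prob")

sum(is.na(p_raw[, 1]))    # number of rows that could not be predicted
sum(is.na(p_fixed[, 1]))  # no NA rows remain after imputation
```

Note that na.roughfix fills numeric columns with the median and factor columns with the most frequent level, so the sketch only makes sense if the measurement columns are stored as numeric rather than as factors.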

Answer 1: 0 votes, 2019-11-20 09:25:02
