
Issue with getting NA only response from my Random Forest Model

This is my first post on Stack Overflow, so please ask me follow-up questions if more info is needed!

Situation: I've compiled water chemistry data for freshwater ecosystems in the Maritimes (Atlantic Canada) because I am trying to create a predictive species distribution model for an invasive species using a random forest model (RFM). Unfortunately, Atlantic Canada lacks consistent water monitoring programs, and the ones that do exist don't monitor the same parameters as one another. So, my datasets (both training and testing) have many NAs.

Issue: This is the response I keep getting from my RFM:

> p1 <- predict(model2, newdata=Test_Dataset,type="prob")[,2]
> p1

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
35 36 37
NA NA NA

What I have tried:

  1. I built the RFM (i.e. model2) using various predictors. This is the call I used:

    model2 <- randomForest(CMS ~ Lat + Lon + pH + Alkalinity + Ca + Hardness + DO + TOC +
                             T_P + T_N + Cond + Na + No_Stocking + No_Fish_Species +
                             Dist_Hwy + No_Boat_Launches + Connected_Lakes + Invasives,
                           importance = TRUE, data = TrainSet, na.action = na.roughfix)
    model2

**Note that the variables in the long list are the predictors and CMS is the species.**

  2. I tried matching the test dataset (Test_Dataset) with the training dataset (Validation_Dataset).

    Test_Dataset <- rbind(Validation_Dataset[1, ], Validation_Dataset)
    Test_Dataset <- Test_Dataset[-1, ]

  3. I have searched for and read multiple resources (including the obvious R pages and the references linked there).

  4. I have mutated the data frame as follows (I'll just show Validation_Dataset, as the mutations are the same for both):

    # Mutate dataset to fix issues with R reading NA cells
    Validation_Dataset <- Validation_Dataset %>%
      dplyr::mutate(
        # convert year into a categorical variable
        Year = factor(Year),
        # convert Chlorophyll concentrations from a character file to a number file
        # convert "NA" into a missing value data whenever appropriate
        Chlorophyll = dplyr::na_if(Chlorophyll, "NA"),
        Chlorophyll = factor(Chlorophyll),
        Hardness = dplyr::na_if(Hardness, "NA"),
        Hardness = factor(Hardness),
        Alkalinity = dplyr::na_if(Alkalinity, "NA"),
        Alkalinity = factor(Alkalinity),
        Ca = dplyr::na_if(Ca, "NA"),
        Ca = factor(Ca),
        TOC = dplyr::na_if(TOC, "NA"),
        TOC = factor(TOC),
        Cond = dplyr::na_if(Cond, "NA"),
        Cond = factor(Cond),
        Na = dplyr::na_if(Na, "NA"),
        Na = factor(Cond),
        NH4 = dplyr::na_if(NH4, "NA"),
        NH4 = factor(NH4),
        NO3 = dplyr::na_if(NO3, "NA"),
        NO3 = factor(NO3),
        pH = dplyr::na_if(pH, "NA"),
        pH = factor(pH),
        T_N = dplyr::na_if(T_N, "NA"),
        T_N = factor(T_N),
        T_P = dplyr::na_if(T_P, "NA"),
        T_P = factor(T_P),
        DO = dplyr::na_if(DO, "NA"),
        DO = factor(DO),
        Salinity = dplyr::na_if(Salinity, "NA"),
        Salinity = factor(Salinity),
        No_Stocking = dplyr::na_if(No_Stocking, "NA"),
        No_Stocking = factor(No_Stocking),
        No_Fish_Species = dplyr::na_if(No_Fish_Species, "NA"),
        No_Fish_Species = factor(No_Fish_Species),
        Dist_Hwy = dplyr::na_if(Dist_Hwy, "NA"),
        Dist_Hwy = factor(Dist_Hwy),
        No_Boat_Launches = dplyr::na_if(No_Boat_Launches, "NA"),
        No_Boat_Launches = factor(No_Boat_Launches),
        Connected_Lakes = dplyr::na_if(Connected_Lakes, "NA"),
        Connected_Lakes = factor(Connected_Lakes),
        Invasives = dplyr::na_if(Invasives, "NA"),
        Invasives = factor(Invasives),
        Lat = factor(Lat),
        Lon = factor(Lon),
        CMS = factor(CMS)
      )

Question: Does anybody know how to actually make the code work so that model2 predicts on Test_Dataset? I think this problem might actually be very small, but I'm not seeing it.

Here is a glimpse of the training dataset (Validation_Dataset):

> str(Validation_Dataset)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    37 obs. of  31 variables:
 $ Name            : chr  "Canard River" "Cedar Creek" "Holland River" "Speed River" ...
 $ STN #/COUNTY    : chr  "10000200202" "16001800202" "3007700202" "16018403402" ...
 $ Province        : chr  "ON" "ON" "ON" "ON" ...
 $ Lat             : Factor w/ 37 levels "42.03204214",..: 2 1 11 9 10 8 7 5 6 3 ...
 $ Lon             : Factor w/ 37 levels "-83.01879548",..: 1 2 11 8 10 6 7 9 5 4 ...
 $ Year            : Factor w/ 9 levels "2007, 2011","2010, 2015, 2011",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Month           : chr  "4" "4" "4" "4" ...
 $ Day             : chr  "11" "12" "26" "27" ...
 $ Data Source     : chr  "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" ...
 $ pH              : Factor w/ 35 levels "6.073333","6.13",..: 18 21 28 29 25 34 30 32 19 26 ...
 $ Alkalinity      : Factor w/ 31 levels "1.8","2.8","3.933333333",..: 19 22 31 30 27 NA NA 26 NA 21 ...
 $ Hardness        : Factor w/ 13 levels "14.8","36.8",..: 7 8 11 10 9 NA NA 13 NA NA ...
 $ Ca              : Factor w/ 24 levels "3.833333333",..: 18 19 24 20 21 NA NA 22 NA NA ...
 $ Chlorophyll     : Factor w/ 15 levels "0.423601","0.453791",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DO              : Factor w/ 26 levels "0.27","6.2","6.96",..: 21 24 18 16 4 25 17 14 2 7 ...
 $ TOC             : Factor w/ 3 levels "4.8","5.5","8.8": NA NA NA NA NA NA NA NA NA NA ...
 $ T_P             : Factor w/ 24 levels "0.002","0.003",..: 23 22 18 10 15 14 16 13 21 20 ...
 $ T_N             : Factor w/ 32 levels "0.006","0.13",..: 30 31 27 28 17 29 24 32 21 25 ...
 $ NO3+NO2         : num  2.173 2.292 1.092 1.695 0.426 ...
 $ NO3             : Factor w/ 32 levels "0.027","0.035",..: 30 31 26 27 11 29 24 32 22 8 ...
 $ NH4             : Factor w/ 27 levels "0.005","0.006",..: 26 25 22 17 9 11 13 19 23 27 ...
 $ Cond            : Factor w/ 34 levels "41","97","134",..: 24 21 29 23 22 14 34 31 21 17 ...
 $ Salinity        : Factor w/ 9 levels "0.11","0.15",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Na              : Factor w/ 34 levels "41","97","134",..: 24 21 29 23 22 14 34 31 21 17 ...
 $ No_Stocking     : Factor w/ 3 levels "0","1","2": 1 2 2 3 1 2 1 2 1 2 ...
 $ No_Fish_Species : Factor w/ 9 levels "0","1","2","3",..: 1 4 6 4 1 5 1 9 1 9 ...
 $ Dist_Hwy        : Factor w/ 16 levels "0.003","0.006",..: NA NA 16 NA NA NA NA 8 NA 5 ...
 $ No_Boat_Launches: Factor w/ 8 levels "0","1","2","3",..: 1 1 5 1 1 1 1 8 1 3 ...
 $ Connected_Lakes : Factor w/ 11 levels "0","1","2","3",..: 7 2 3 4 9 6 2 3 2 5 ...
 $ Invasives       : Factor w/ 3 levels "0","1","2": NA NA NA NA NA NA NA NA NA NA ...
 $ CMS             : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
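
The str() output above shows measured quantities such as pH, Lat and Lon stored as factors with nearly one level per row, so each distinct reading is treated as its own category. It also shows Na with exactly the same levels as Cond, which is what the posted `Na = factor(Cond)` line would produce. Below is a minimal base-R sketch of an alternative cleanup that turns literal "NA" strings into real missing values while keeping the columns numeric, and then counts the rows with a gap. The data frame `toy`, its values, and the helper `to_numeric` are made up for illustration and are not from the original post.

```r
# Hypothetical toy data standing in for the water chemistry table;
# the values are invented for illustration only.
toy <- data.frame(
  pH   = c("6.07", "NA", "7.40", "8.10"),
  Ca   = c("3.83", "12.5", "NA", "20.1"),
  Cond = c("41", "97", "134", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn the literal string "NA" into a real missing
# value and keep the column numeric instead of converting it to a factor.
to_numeric <- function(x) {
  x <- as.character(x)        # also works for columns that are already factors
  x[x %in% "NA"] <- NA        # literal "NA" text becomes a true NA
  as.numeric(x)
}

toy[] <- lapply(toy, to_numeric)
str(toy)

# Number of rows with at least one missing predictor
n_incomplete <- sum(!complete.cases(toy))
n_incomplete
```

Running the same `complete.cases()` check on just the predictor columns of Test_Dataset would show how many rows have every predictor present.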

To use na.roughfix, it must first be applied outside the randomForest function. I will use the iris dataset as an example.

iris.roughfix <- na.roughfix(iris.na)
iris.narf <- randomForest(Species ~ ., iris.na, na.action=na.roughfix)
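
Building on the iris example above, here is a minimal sketch of why the predictions may come back as NA and one way around it. It rests on my reading of the randomForest package, not on anything in the original post: `na.action = na.roughfix` only affects the data used for fitting, and `predict()` on a formula-based model returns NA for any `newdata` row with a missing predictor. The way `iris.na` is built here, and the names `train`, `test`, `fit`, `p_raw` and `p_fixed`, are made up for illustration.

```r
library(randomForest)

set.seed(111)

# iris with some values deliberately removed (illustrative stand-in for
# a data set with gaps)
iris.na <- iris
iris.na[seq(3, 150, by = 10), "Sepal.Width"]  <- NA  # rows 3, 13, ..., 143
iris.na[seq(1, 150, by = 10), "Petal.Length"] <- NA  # rows 1, 11, ..., 141

# Hold out every fifth row as a test set
test_idx <- seq(1, 150, by = 5)
train <- iris.na[-test_idx, ]
test  <- iris.na[test_idx, ]

# na.action = na.roughfix imputes the training data while fitting
fit <- randomForest(Species ~ ., data = train, na.action = na.roughfix)

# Predicting on the raw test set: rows with a missing predictor get NA
p_raw <- predict(fit, newdata = test, type = "prob")[, 2]

# Impute the predictor columns of the test set before predicting
predictors <- setdiff(names(test), "Species")
test_fixed <- test
test_fixed[predictors] <- na.roughfix(test[predictors])
p_fixed <- predict(fit, newdata = test_fixed, type = "prob")[, 2]

sum(is.na(p_raw))    # number of test rows that could not be predicted
sum(is.na(p_fixed))  # after imputing the test set
```

Applied to the question, this would mean running `na.roughfix()` on the predictor columns of Test_Dataset before calling `predict(model2, ...)`. As far as I know, `na.roughfix()` only accepts numeric and factor columns, so character columns such as Name or Province would need to be left out first. This sketch fills gaps with the test set's own medians; reusing values computed from the training data is another option.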
