
Why do I get 'Error in T[, col] <- data[, col]' when I use SMOTE in R?

I have a large dataset of forest fires, and I want to predict when a fire ignites. Ignition is very rare: 290 cases out of 620 000 observations.

A tibble: 62,905 x 13
   amplitude polarity DEM_avg   DC   DMC   DSR    FFMC    Pd    RH  TEMP  WS  tree_cover  fire
       <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl> <fct>
 1     -37.8      0     165.   269.  21.9  0.607  84.0   0    65.1  290. 4.36      8        0
 2     -68.1      0     303.   168.  44.5  1.41   89.9   0    46.6  296. 0.692     34.7     0
 3     -54.3      0     332.   168.  44.5  1.41   89.9   0    46.6  296. 0.692     35.8     1
 4    -108.       0     338.   168.  44.5  1.41   89.9   0    46.6  296. 0.692     30.3     0
 5     -60.3      0     374.   171.  35.7  2.30   88.9   0.3  51.7  295. 4.01      29.6     1
 6     -82.8      0     48.2   133.  18.4  0.210  84.9   0    65.1  289. 1.35      18.7     0
 7     -99.6      0     299.   219.  42.6  2.09   90.8   0    34.2  297. 1.42       7       1
 8     -98.1      0     116.   153.  44.7  0.988  89.0   0    41.3  298. 0.235     32.6     0

I tried to use SMOTE to balance my highly imbalanced dataset with the changes suggested by StupidWolf. I do the following:

library(readr)
library(tidyverse)
library(caret)
library(DMwR)
data <- read_csv("data/fire2018.csv", 
    col_types = cols(fire = col_factor(levels = c("0", 
        "1"))))
training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]
SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC + Pd + RH + T + VPD + WS + tree_cover, data = data.frame(train.data), perc.over = 600, perc.under = 100)

However, when I use SMOTE from the DMwR package I now get the following error:

Error in factor(newCases[, a], levels = 1:nlevels(data[, a]), labels = levels(data[,  : 
  invalid 'labels'; length 0 should be 1 or 2
In addition: Warning messages:
1: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
3: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion

I have looked at different solutions. One suggested converting the variables to numeric and factor, but that did not help in my case and I got a similar error. My variables already have the correct types: the dependent variable is a factor with 2 levels, the independent variables are numeric, and none of these variables contains NAs.
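Both claims (column types and missing values) can be checked directly before calling SMOTE. The sketch below uses a small made-up data frame `df` as a stand-in for `data.frame(train.data)`. The values are invented for illustration, and the check deliberately covers every column, not only those named in the formula.

```r
# Toy stand-in for data.frame(train.data); the values are made up.
df <- data.frame(
  amplitude  = c(-37.8, -68.1, -54.3, NA),
  tree_cover = c(8, 34.7, 35.8, 30.3),
  fire       = factor(c("0", "0", "1", "0"), levels = c("0", "1"))
)

# First class of every column: the response should be a factor,
# the predictors numeric.
col_classes <- sapply(df, function(x) class(x)[1])
print(col_classes)

# Missing values per column, including columns not used in the formula.
na_counts <- colSums(is.na(df))
print(na_counts)

# Number of rows that are complete across all columns.
n_complete <- sum(complete.cases(df))
print(n_complete)
```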

In the example you showed, the dependent variable is still numeric; you need to encode it as a factor. The SMOTE function also doesn't work well with tibbles. I cannot reproduce the exact error you got, but I suspect that doing what I did below should work. Otherwise, please provide a reproducible example:

library(DMwR)
library(tibble)
data = iris
data$Species = ifelse(data$Species=="versicolor",1,0)
data = as_tibble(data)

In the above example, Species is the dependent variable, encoded as 0/1. You can see from the structure that the dependent variable is numeric, like yours was (see the <dbl> under Species, as under fire in your original example):

head(data)
# A tibble: 6 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <dbl>
1          5.1         3.5          1.4         0.2       0
2          4.9         3            1.4         0.2       0
3          4.7         3.2          1.3         0.2       0

Both of these calls throw an error:

# dependent variable is numeric and data is a tibble
newData <- SMOTE(Species ~ Sepal.Width+Sepal.Length,data=data,perc.over = 100, perc.under = 200)

# convert to factor; data is still a tibble
data$Species = factor(data$Species)

newData <- SMOTE(Species ~ Sepal.Width+Sepal.Length,data=data,perc.over = 100, perc.under = 200)
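A likely reason the tibble version fails: selecting a single column with `[, col]` returns a plain vector for a data.frame, but a tibble keeps it as a one-column table. A check such as `class(data[, col])` (visible in the first warning in the question) then sees several classes instead of one. The sketch below mimics the tibble behaviour on a data.frame with `drop = FALSE`. The tibble part only runs if the tibble package is installed, and the toy data frame is made up.

```r
df <- data.frame(x = c(1.5, 2.5), y = factor(c("0", "1")))

# Plain data.frame: single-column subsetting drops to a vector.
class_vec <- class(df[, "y"])
print(class_vec)                 # "factor"

# drop = FALSE keeps a one-column data.frame, similar to what a tibble does.
class_kept <- class(df[, "y", drop = FALSE])
print(class_kept)                # "data.frame"

# With a real tibble the class vector has several elements, so a test like
# if (class(data[, col]) %in% ...) receives a condition of length > 1.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb[, "y"]))
  # Converting back to a data.frame restores the vector behaviour.
  print(class(data.frame(tb)[, "y"]))
}
```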

If you convert the tibble to a data.frame first, it works:

newData <- SMOTE(Species ~ Sepal.Width+Sepal.Length,
                 data=data.frame(data),perc.over = 100, perc.under = 200)

dim(newData)
[1] 200   5
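The 200 rows are consistent with how I understand perc.over and perc.under to work in DMwR: perc.over = 100 generates one synthetic case per minority case, and perc.under = 200 samples two majority cases per synthetic case. The helper `smote_sizes` below is hypothetical (it is not part of DMwR), only covers perc.over >= 100, and is a sketch of the arithmetic rather than the package's actual code.

```r
# Hypothetical helper, not part of DMwR: expected row counts after SMOTE,
# assuming perc.over >= 100.
smote_sizes <- function(n_minority, perc.over, perc.under) {
  stopifnot(perc.over >= 100)
  # Synthetic minority cases: a whole number of new cases per original case.
  n_synthetic <- as.integer(perc.over / 100) * n_minority
  # Majority cases sampled, relative to the number of synthetic cases.
  n_majority  <- as.integer(perc.under / 100 * n_synthetic)
  list(
    minority = n_minority + n_synthetic,
    majority = n_majority,
    total    = n_minority + n_synthetic + n_majority
  )
}

# iris example above: the 50 versicolor rows are the minority class.
sizes <- smote_sizes(50, perc.over = 100, perc.under = 200)
print(sizes$total)  # 200
```

Under these assumptions the result has 100 minority and 100 majority rows, which matches the dim() output above.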

After spending hours on this problem, I finally arrived at the following solution with help from StupidWolf. I had to clean up my dataset, which included many variables that I did not use, and some of them contained NAs. Apparently SMOTE could not handle those, even though I was not using these variables. To sum up: I cleaned up the dataset and changed the data argument in the SMOTE call to data.frame(train.data). My code ended up like this:

library(readr)
library(tidyverse)
library(caret)
library(DMwR)
data <- read_csv("data/test.csv", 
                 col_types = cols(fire = col_factor(levels = c("0", 
                                                               "1"))))
training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]
newData <- SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC + Pd + RH + T + VPD + WS + tree_cover, data = data.frame(train.data), perc.over = 10000, perc.under = 1000)
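The clean-up step itself is not shown in the code above. A minimal sketch of one way to do it: keep only the columns named in the formula, then drop incomplete rows, before passing the result to SMOTE. The toy data frame and the column name `unused_var` are made up for illustration.

```r
# Toy stand-in for train.data; column names and values are made up.
train_df <- data.frame(
  amplitude  = c(-37.8, -68.1, -54.3, -108),
  tree_cover = c(8, 34.7, 35.8, 30.3),
  unused_var = c(NA, 1, NA, 2),
  fire       = factor(c("0", "0", "1", "0"), levels = c("0", "1"))
)

form <- fire ~ amplitude + tree_cover

# all.vars() lists every variable named in the formula, response included.
keep <- all.vars(form)

# Keep only those columns, as a plain data.frame, and drop rows with NAs.
smote_input <- na.omit(as.data.frame(train_df[, keep]))

print(names(smote_input))
print(anyNA(smote_input))   # FALSE
# smote_input could then be passed as the data argument of SMOTE().
```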
