MSE using h2o for anomaly detection

I was using the example given by h2o for ECG anomaly detection. When I tried to compute the MSE manually, I got different results. To demonstrate the difference I used the last test case, but all 23 cases differ. The full code is attached:

Thanks, Eli.

suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
                nthreads = -1) # use all CPUs (8 on my personal computer :3)

# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
                          header = FALSE,
                          sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
                         header = FALSE,
                         sep = ",")
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
                                 training_frame = train_ecg,
                                 activation = "Tanh",
                                 autoencoder = TRUE,
                                 hidden = c(50,20,50),
                                 l1 = 1e-4,
                                 epochs = 100)

# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)

# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)

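# manually compute the MSE for the last (23rd) test row in the
# original, un-normalized space and compare it to h2o.anomaly()'s value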
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]

> mse.23
[1] 2.607374
> recon_error[23,]
[1] 8.264768

It is not really an answer, but I did what @Arno Candel suggested. I tried combining the test and train data and normalizing to 0 - 1. After that, I split the combined, normalized data back into test and train sets and ran the script as given by the OP. However, I am still getting a different MSE from the manual calculation. The MSE is also different when I normalize the test and train data separately. Is there something I can do to get the manual calculation right?

suppressMessages(library(purrr))
suppressMessages(library(dplyr))
suppressMessages(library(h2o))

localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
                nthreads = -1) # use all CPUs (8 on my personal computer :3)

# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
                          header = FALSE,
                          sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
                         header = FALSE,
                         sep = ",")
### adding this section
# normalize data 
train_ecg <- as.data.frame(train_ecg)
test_ecg <- as.data.frame(test_ecg)

dat <- rbind(train_ecg,test_ecg)

get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
}

normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min)/(.y$max - .y$min))
}

desc <- dat %>%
  get_desc()

dat <- dat %>%
  normalization_minmax(desc)

# convert the normalized data back into H2OFrames for training and scoring
train_ecg <- as.h2o(dat[1:20, ])
test_ecg  <- as.h2o(dat[21:43, ])

# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
                                 training_frame = train_ecg,
                                 activation = "Tanh",
                                 autoencoder = TRUE,
                                 hidden = c(50,20,50),
                                 l1 = 1e-4,
                                 epochs = 100)

# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)

# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)

t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]


> mse.23
[1] 23.14947
> recon_error[23,]
[1] 8.076866

For autoencoders in H2O, the MSE math is done in the normalized space to avoid numerical scaling issues. For example, if you have categorical features or very large numbers, the neural network autoencoder can't operate on those values directly; instead, it first does dummy one-hot encoding of categoricals and normalization of numeric features, and then performs the forward/back propagation and the computation of reconstruction errors in that normalized and expanded space. For purely numerical data, you can manually divide each column by its range (max - min) first, and your results should match.
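For purely numeric data like this ECG set, that check takes only a few lines of R. Below is a minimal sketch of the rescaling, assuming the objects from the OP's first script (train_ecg, test_ecg, test_recon, recon_error) are still in the session, and assuming (as the JUnit linked below does) that the normalization ranges come from the training data; the names col_range and mse.23.scaled are just illustrative. Because h2o.predict() returns the reconstruction in the original space, dividing the per-column differences by the training ranges reproduces the normalized-space errors: any per-column shift used in the normalization cancels in the subtraction, so only the scale matters.

# pull everything into plain R data frames
train_df <- as.data.frame(train_ecg)
test_df  <- as.data.frame(test_ecg)
recon_df <- as.data.frame(test_recon)

# per-column range (max - min), taken from the training data,
# which is what the model's normalization is derived from
col_range <- apply(train_df, 2, function(col) max(col) - min(col))

t <- unlist(test_df[23, ])
r <- unlist(recon_df[23, ])

# MSE of the rescaled differences; the per-column shift cancels
# in (t - r), so dividing by the range is all that is needed
mse.23.scaled <- mean(((t - r) / col_range)^2)
mse.23.scaled   # should now agree with recon_error[23, ]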

Here is a JUnit that does this check explicitly (on that very dataset): https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/test/java/hex/deeplearning/DeepLearningAutoEncoderTest.java#L86-L104

You can also see https://0xdata.atlassian.net/browse/PUBDEV-2078 for more info.
