
Aggregating Max using h2o in R

I have started using h2o to aggregate large datasets, and I have found peculiar behaviour when computing the per-group maximum with h2o's h2o.group_by function. My data frames often contain variables that are partly or entirely NA within a given group. Below is an example data frame.

df <- data.frame("ID" = 1:16)
df$Group<- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)

   ID Group VarA VarB VarD
1   1     1   NA   NA   10
2   2     1    1   NA   12
3   3     1    2   NA   14
4   4     1    3   NA   16
5   5     2   12   10   10
6   6     2   12   12   12
7   7     2   12   14   14
8   8     3   12   16   16
9   9     3    0   10   10
10 10     3   14   12   12
11 11     4   NA   14   14
12 12     4   14   16   16
13 13     5   16   10   10
14 14     5   16   12   12
15 15     5   NA   14   14
16 16     5   16   16   16

In this data frame, VarB is entirely missing for Group == 1. That is important information to keep, so the aggregated maximum for that group should be NA. VarA has only one missing value for Group == 1, so its maximum should be 3.

The behaviour of the na.methods argument is described here: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html

If I set na.methods = 'all' as below, the aggregated output for Group 1 is missing (NaN) for both VarA and VarB (which is not what I want, but I completely understand this behaviour).

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))

  Group max_ID max_VarA max_VarB max_VarD
1     1      4      NaN      NaN       16
2     2      7       12       14       14
3     3     10       14       16       16
4     4     12      NaN       16       16
5     5     16      NaN       16       16

If I set the na.methods = 'rm' as below then the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense) but for VarB is -1.80e308 (which is not what I want, and I do not understand this behaviour).

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))

  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12  1.4 e  1       14
3     3     10       14  1.6 e  1       16
4     4     12       14  1.6 e  1       16
5     5     16       16  1.6 e  1       16

I get the same output if I set na.methods = 'ignore'.

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))

  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12  1.4 e  1       14
3     3     10       14  1.6 e  1       16
4     4     12       14  1.6 e  1       16
5     5     16       16  1.6 e  1       16

I am not sure why something as common as completely missing data for a variable within a group is given a value of -1.80e308. I tried the same workflow in dplyr and got results that match my expectations (this is not a solution, as I cannot process datasets of this size in dplyr, hence my need for a solution in h2o). I realise dplyr gives me -Inf rather than NA, and I could easily recode both -1.80e308 and -Inf to NA, but I want to make sure that this is not a symptom of a larger problem in h2o, and that I am not doing something fundamentally wrong when aggregating in h2o. I also have to aggregate normalised datasets that often contain values close to -1.80e308, so I do not want to accidentally recode legitimate values to NA.

library(dplyr)
df %>%
  group_by(Group) %>% 
  summarise(across(everything(), ~max(.x, na.rm = TRUE)))

  Group    ID  VarA  VarB  VarD
  <dbl> <int> <dbl> <dbl> <dbl>
1     1     4     3  -Inf    16
2     2     7    12    14    14
3     3    10    14    16    16
4     4    12    14    16    16
5     5    16    16    16    16
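For reference, the result I am after (NA when a group has no data, otherwise the maximum of the non-missing values) can be sketched in plain R. This is only a small illustration, not an h2o solution; the helper name max_or_na is made up for this example, and the vectors are VarA and VarB from the data frame above.

```r
# Hypothetical helper: NA when every value is missing, otherwise the
# maximum of the non-missing values.
max_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

group <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5)
var_a <- c(NA, 1, 2, 3, 12, 12, 12, 12, 0, 14, NA, 14, 16, 16, NA, 16)
var_b <- c(NA, NA, NA, NA, 10, 12, 14, 16, 10, 12, 14, 16, 10, 12, 14, 16)

# Split each variable by group and apply the helper to every piece.
max_a <- vapply(split(var_a, group), max_or_na, numeric(1))
max_b <- vapply(split(var_b, group), max_or_na, numeric(1))

print(rbind(VarA = max_a, VarB = max_b))
```

Group 1 comes out as 3 for VarA and NA for VarB, with no -Inf and no sentinel value.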

 

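This happens because H2O treats -Double.MAX_VALUE, the most negative finite double (about -1.797693e308, which is printed above as -1.80e308), as the lowest representable floating-point number. It appears to be the starting value for the maximum, so when every value in a group is NA and gets removed, that value is returned unchanged. I agree this is confusing and I would consider it a bug. You can file an issue in our bug tracker: https://h2oai.atlassian.net/ (PUBDEV project)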
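Until that is fixed, one way to recode without touching legitimate values is to compare against the exact constant rather than a rounded number. A minimal sketch in plain R, assuming the value arrives in R bit-for-bit as -Double.MAX_VALUE after converting the aggregated frame (I have not verified this for every conversion path). The name recode_sentinel and the sample vector are made up for illustration.

```r
# R's equivalent of Java's -Double.MAX_VALUE.
sentinel <- -.Machine$double.xmax

# The rounded literal is outside the double range and parses as -Inf,
# so comparing against -1.80e308 would never match the sentinel.
print(is.infinite(-1.80e308))

# Hypothetical helper: turn only the exact sentinel into NA.
recode_sentinel <- function(x) {
  x[!is.na(x) & x == sentinel] <- NA_real_
  x
}

# Stand-in for the max_VarB column from the output above.
max_var_b <- c(sentinel, 14, 16, 16, 16)
recoded <- recode_sentinel(max_var_b)
print(recoded)
```

Because the comparison is exact, a large but legitimate value such as -1e308 is left alone. Only a genuine data value equal to -Double.MAX_VALUE itself would be recoded by mistake.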

I am not sure how to achieve that with h2o.group_by(); I get the same weird value when running your code. If you are open to a somewhat hacky workaround, you might want to try the following (I included the H2O initialization for future reference):

  1. Convert your frame to long format, i.e. a key-value representation.
  2. Split by group and apply the aggregate function using h2o.ddply().
  3. Convert your frame back to wide format.

## initialize h2o
library(h2o)

h2o.init(
  nthreads = max(1, parallel::detectCores() %/% 2) # about half the cores, as a whole number
)

df_h2o = as.h2o(
  df
)

## aggregate per group
df_h2o |> 
  
  # convert to long format
  h2o.melt(
    id_vars = "Group"
    , skipna = TRUE # does not include `NA` in the result
  ) |> 
  
  # calculate `max()` per group
  h2o.ddply(
    .variables = c("Group", "variable")
    , FUN = function(df) {
      max(df[, 3])
    }
  ) |> 
  
  # convert back to wide format
  h2o.pivot(
    index = "Group"
    , column = "variable"
    , value = "ddply_C1"
  )

# Group ID VarA VarB VarD
#     1  4    3  NaN   16
#     2  7   12   14   14
#     3 10   14   16   16
#     4 12   14   16   16
#     5 16   16   16   16
# 
# [5 rows x 5 columns] 

## shut down h2o instance
h2o.shutdown(
  prompt = FALSE
)
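To see why this gives a missing value instead of the sentinel, here is a small base R analogue of the three steps. It does not use h2o: tapply() stands in for the ddply and pivot steps, and the data is just Groups 1 and 2 from the question. Once the NA rows are dropped in long format, the Group 1 / VarB combination has no rows at all, so its cell stays empty after reshaping (NA here, NaN in the H2O output above).

```r
df_small <- data.frame(
  Group = c(1, 1, 1, 1, 2, 2, 2),
  VarA  = c(NA, 1, 2, 3, 12, 12, 12),
  VarB  = c(NA, NA, NA, NA, 10, 12, 14)
)

# Step 1: long (key-value) format, dropping NA values.
long <- data.frame(
  Group    = rep(df_small$Group, times = 2),
  variable = rep(c("VarA", "VarB"), each = nrow(df_small)),
  value    = c(df_small$VarA, df_small$VarB)
)
long <- long[!is.na(long$value), ]

# Steps 2 and 3: max per (Group, variable), laid out as a wide table.
# Combinations with no rows are left as NA by tapply().
wide <- tapply(long$value, list(long$Group, long$variable), max)
print(wide)
```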
