Aggregating Max using h2o in R

I have started using h2o to aggregate large datasets, and I have found peculiar behaviour when aggregating the maximum value with h2o's h2o.group_by function. My dataframe often has variables that are partly or entirely NA within a given group. Below is an example dataframe.

df <- data.frame("ID" = 1:16)
df$Group<- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)

   ID Group VarA VarB VarD
1   1     1   NA   NA   10
2   2     1    1   NA   12
3   3     1    2   NA   14
4   4     1    3   NA   16
5   5     2   12   10   10
6   6     2   12   12   12
7   7     2   12   14   14
8   8     3   12   16   16
9   9     3    0   10   10
10 10     3   14   12   12
11 11     4   NA   14   14
12 12     4   14   16   16
13 13     5   16   10   10
14 14     5   16   12   12
15 15     5   NA   14   14
16 16     5   16   16   16

In this dataframe, VarB is completely missing for Group == 1 (this is important information to know, so the aggregated maximum should be NA), while VarA has only one missing value for Group == 1, so its maximum should be 3.

This link describes the behaviour of the na.methods argument: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html

If I set na.methods = 'all' as below, the aggregated output for Group 1 is NA (printed as NaN) for both VarA and VarB (which is not what I want, but I completely understand this behaviour).

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))

  Group max_ID max_VarA max_VarB max_VarD
1     1      4      NaN      NaN       16
2     2      7       12       14       14
3     3     10       14       16       16
4     4     12      NaN       16       16
5     5     16      NaN       16       16

If I set na.methods = 'rm' as below, the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense), but for VarB it is -1.80e308 (which is not what I want, and I do not understand this behaviour).

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))

  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12  1.4 e  1       14
3     3     10       14  1.6 e  1       16
4     4     12       14  1.6 e  1       16
5     5     16       16  1.6 e  1       16

Similarly, I get the same output if I set na.methods = 'ignore'.

h2o_agg <-  h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))

  Group max_ID max_VarA  max_VarB max_VarD
  <int>  <int>    <int>     <dbl>    <int>
1     1      4        3 -1.80e308       16
2     2      7       12  1.4 e  1       14
3     3     10       14  1.6 e  1       16
4     4     12       14  1.6 e  1       16
5     5     16       16  1.6 e  1       16

I am not sure why something as common as completely missing data for a given variable within a specific group is given a value of -1.80e308. I tried the same workflow in dplyr and got results that match my expectations (but this is not a solution, as I cannot process datasets of this size in dplyr, hence my need for a solution in h2o). I realise dplyr is giving me -Inf values rather than NA, and I can easily recode both -1.80e308 and -Inf to NA, but I want to make sure that this isn't a symptom of a larger problem in h2o (or that I am not doing something fundamentally wrong in my code when aggregating in h2o). I also have to aggregate normalised datasets, which often have values close to -1.80e308, so I do not want to accidentally recode legitimate values to NA.

library(dplyr)
df %>%
  group_by(Group) %>% 
  summarise(across(everything(), ~max(.x, na.rm = TRUE)))

  Group    ID  VarA  VarB  VarD
  <dbl> <int> <dbl> <dbl> <dbl>
1     1     4     3  -Inf    16
2     2     7    12    14    14
3     3    10    14    16    16
4     4    12    14    16    16
5     5    16    16    16    16

 

This happens because H2O treats -Double.MAX_VALUE as the lowest representable floating-point number. This value is approximately -1.80e308 (more precisely, -1.7976931348623157e308). I agree this is confusing, and I would consider it a bug. You can file an issue in our bug tracker: https://h2oai.atlassian.net/ (PUBDEV project)
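As a rough illustration of what that value is on the R side, the sketch below rebuilds the sentinel from `.Machine$double.xmax` and recodes it to NA by exact comparison. The `agg` data frame is a made-up stand-in for the aggregated result after converting it to a plain R data.frame, and the sketch assumes the sentinel survives that conversion bit-for-bit, which is not confirmed here.

```r
# Largest finite double in R; its negation matches Java's -Double.MAX_VALUE
sentinel <- -.Machine$double.xmax
print(sentinel)  # -1.797693e+308

# Hypothetical stand-in for the aggregated output (not produced by h2o here)
agg <- data.frame(
  Group    = 1:5,
  max_VarB = c(sentinel, 14, 16, 16, 16)
)

# Recode only exact matches of the sentinel; other values are left untouched
agg$max_VarB[agg$max_VarB == sentinel] <- NA_real_
print(agg)
```

Comparing with `==` against the exact sentinel, rather than using a threshold, means that legitimate values of large magnitude are not recoded by accident.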

I am not sure how to achieve that with h2o.group_by(); I get the same weird value when running your code. If you are open to a somewhat hacky workaround, you might want to try the following (I included the H2O initialization for future reference):

  1. convert your frame to long format, i.e. a key-value representation
  2. split by group and apply the aggregate function using h2o.ddply()
  3. convert your frame back to wide format

## initialize h2o
library(h2o)

h2o.init(
  nthreads = max(1, floor(parallel::detectCores() / 2)) # half the cores, as a whole number
)

df_h2o = as.h2o(
  df
)

## aggregate per group
df_h2o |> 
  
  # convert to long format
  h2o.melt(
    id_vars = "Group"
    , skipna = TRUE # does not include `NA` in the result
  ) |> 
  
  # calculate `max()` per group
  h2o.ddply(
    .variables = c("Group", "variable")
    , FUN = function(df) {
      max(df[, 3])
    }
  ) |> 
  
  # convert back to wide format
  h2o.pivot(
    index = "Group"
    , column = "variable"
    , value = "ddply_C1"
  )

# Group ID VarA VarB VarD
#     1  4    3  NaN   16
#     2  7   12   14   14
#     3 10   14   16   16
#     4 12   14   16   16
#     5 16   16   16   16
# 
# [5 rows x 5 columns] 

## shut down h2o instance
h2o.shutdown(
  prompt = FALSE
)
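
To sanity-check the result above on a small sample, the expected per-group maxima can be computed in base R without H2O. This is only a reference sketch for data that fits in memory; `max_or_na` is a helper name introduced here for illustration, and it returns NA for a group whose values are all missing instead of -Inf.

```r
# Example data from the question
df <- data.frame("ID" = 1:16)
df$Group <- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)

# Hypothetical helper: NA when every value is missing, otherwise the max
max_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

# Per-group maxima; rows with NA in the value columns are kept
expected <- aggregate(
  df[c("VarA", "VarB", "VarD")],
  by  = list(Group = df$Group),
  FUN = max_or_na
)
print(expected)
```

Group 1 should come out as NA for VarB and 3 for VarA, which matches the output of the workaround.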
