
Calculate hash of an h2o frame

I would like to calculate a hash value of an h2o.frame.H2OFrame, ideally in both R and Python. My understanding of h2o.frame.H2OFrame is that these objects basically "live" on the h2o server (i.e., they are represented by Java objects) and not within the R or Python session from which they were uploaded.

I want to calculate the hash value "as close as possible" to the actual training algorithm. That rules out calculating it on (serializations of) the underlying R or Python objects, as well as on the files from which the data was loaded. The reason is that I want to capture all (possible) changes that h2o's upload functions make to the underlying data.

Judging from the h2o docs, no hash-like functionality is exposed through h2o.frame.H2OFrame. One way to get a hash-like summary of the h2o data would be to sum over all numerical columns and do something similar for categorical columns. However, I would really like my hash function to have an avalanche effect, so that small changes in the input result in large differences in the output. This requirement rules out simple sums and the like.
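To illustrate why a plain sum falls short, here is a minimal sketch on ordinary Python lists (not on an H2OFrame). The function names are made up for this example, and SHA-256 is only a stand-in for "some hash with an avalanche effect", not something h2o is known to use.

```python
import hashlib
import struct


def column_sum(values):
    # Naive summary: insensitive to row order and to offsetting changes.
    return sum(values)


def column_digest(values):
    # Feed each value to SHA-256 as an 8-byte big-endian IEEE 754 double.
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack(">d", value))
    return digest.hexdigest()


def differing_bits(hex_a, hex_b):
    # Count the bits in which two hex digests differ.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = [5.1, 4.9, 4.7, 4.6]
swapped = [4.9, 5.1, 4.7, 4.6]  # same values, first two rows exchanged

print(column_sum(original) == column_sum(swapped))        # the sum cannot tell them apart
print(column_digest(original) == column_digest(swapped))  # the digest can
print(differing_bits(column_digest(original), column_digest(swapped)))
```

With a cryptographic hash, the number of differing bits is typically somewhere around half of the digest length, which is the kind of behaviour I am after.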

Is there already some interface which I might have overlooked? If not, how could I achieve the task described above?

import h2o
h2o.init()
iris_df=h2o.upload_file(path="~/iris.csv")

# what I would like to achieve
iris_df.hash()
# >>> ab2132nfqf3rf37 

# ab2132nfqf3rf37 is the (made up) hash value of iris_df

Thank you for your help.

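The functionality is available in the REST API (see the screenshot). You can probably get at it through the H2OFrame object in Python as well, but it is not directly exposed.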

So here is a complete solution in Python based on Michal Kurka's and Tom Kraljevic's suggestions:

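import h2o
import requests

h2o.init()

iris_df = h2o.upload_file(path="~/iris.csv")

# Frame metadata endpoint of the local h2o server (default host and port)
api_endpoint = "http://127.0.0.1:54321/3/Frames/"

# The checksum is part of the frame metadata returned by the REST API
res = requests.get(api_endpoint + iris_df.frame_id).json()
print("Checksum 1: ", res["frames"][0]["checksum"])

# Change a single cell by a small amount
iris_df[0, 1] = iris_df[0, 1] + 1e-3

# Fetch the metadata again; the checksum should now be different
res = requests.get(api_endpoint + iris_df.frame_id).json()
print("Checksum 2: ", res["frames"][0]["checksum"])

h2o.cluster().shutdown()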

This gives

Checksum 1:  8858396055714143663
Checksum 2:  -4953793257165767052
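For repeated use, the REST call above could be wrapped in a small helper. This is only a sketch: the names frame_checksum and extract_checksum are made up here and are not part of the h2o package, the default host and port are the ones from the snippet above, and it uses the standard library instead of requests.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def extract_checksum(frames_response):
    # Pull the checksum out of a parsed /3/Frames/<frame_id> response.
    return frames_response["frames"][0]["checksum"]


def frame_checksum(frame_id, base_url="http://127.0.0.1:54321"):
    # Hypothetical helper: ask the h2o server for the frame metadata
    # and return only the checksum field.
    url = f"{base_url}/3/Frames/{quote(frame_id, safe='')}"
    with urlopen(url) as response:
        return extract_checksum(json.load(response))
```

With a running cluster this would be called as frame_checksum(iris_df.frame_id). Keeping the JSON lookup in its own function means that part can be checked without a server.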

Thanks for your help!
