I would like to calculate some hash value of an h2o.frame.H2OFrame, ideally in both R and Python. My understanding of h2o.frame.H2OFrame is that these objects basically "live" on the h2o server (i.e., are represented by some Java objects) and not within R or Python, from where they might have been uploaded.
I want to calculate the hash value "as close as possible" to the actual training algorithm. That rules out calculation of the hash value on (serializations of) the underlying R or Python objects, as well as on any underlying files from where the data was loaded. The reason for this is that I want to capture all (possible) changes that h2o's upload functions perform on the underlying data.
Inferring from the h2o docs, there is no hash-like functionality exposed through h2o.frame.H2OFrame. One possibility to achieve a hash-like summary of the h2o data is through summing over all numerical columns and doing something similar for categorical columns. However, I would really like to have some avalanche effect in my hash function so that small changes in the function input result in large differences of the output. This requirement rules out simple sums and the like.
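To illustrate why a plain sum falls short, here is a minimal sketch comparing a sum with a cryptographic hash. It works on ordinary Python lists rather than on an H2OFrame, and the helper names (sum_summary, sha256_summary, differing_bits) are made up for this illustration:

```python
import hashlib
import struct


def sum_summary(values):
    """Naive summary: the plain sum of all values."""
    return sum(values)


def sha256_summary(values):
    """Hash summary: SHA-256 over the little-endian IEEE 754 bytes of each value."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack("<d", value))
    return digest.hexdigest()


def differing_bits(hex_a, hex_b):
    """Count the bits in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


original = [5.1, 3.5, 1.4, 0.2]
swapped = [3.5, 5.1, 1.4, 0.2]      # same values, first two exchanged
perturbed = [5.1, 3.501, 1.4, 0.2]  # one value changed by 1e-3

# The sum cannot tell the swapped data from the original.
print(sum_summary(original) == sum_summary(swapped))
# The hash can, and a tiny perturbation flips many output bits.
print(sha256_summary(original) == sha256_summary(swapped))
print(differing_bits(sha256_summary(original), sha256_summary(perturbed)))
```

A sum is blind to reordering and to changes that cancel each other out, whereas a hash with an avalanche effect changes in many bits for any of these modifications.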
Is there already some interface which I might have overlooked? If not, how could I achieve the task described above?
import h2o
h2o.init()
iris_df=h2o.upload_file(path="~/iris.csv")
# what I would like to achieve
iris_df.hash()
# >>> ab2132nfqf3rf37
# ab2132nfqf3rf37 is the (made up) hash value of iris_df
Thank you for your help.
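The functionality is available in the REST API (see screenshot). You can probably get to it through the H2OFrame object in Python as well, but it is not directly exposed.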
So here is a complete solution in Python, based on Michal Kurka's and Tom Kraljevic's suggestions:
import h2o
import requests

h2o.init()
iris_df = h2o.upload_file(path="~/iris.csv")

# Frames endpoint of the local H2O REST API (default host and port)
apiEndpoint = "http://127.0.0.1:54321/3/Frames/"

# Fetch the frame's metadata and read the checksum reported by the server
res = requests.get(apiEndpoint + iris_df.frame_id).json()
print("Checksum 1:", res["frames"][0]["checksum"])

# Change a single cell slightly
iris_df[0, 1] = iris_df[0, 1] + 1e-3
res = requests.get(apiEndpoint + iris_df.frame_id).json()
print("Checksum 2:", res["frames"][0]["checksum"])

h2o.cluster().shutdown()
This gives
Checksum 1: 8858396055714143663
Checksum 2: -4953793257165767052
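For reuse, the same lookup could be wrapped in a small helper. This is only a sketch: the function names extract_checksum and frame_checksum are made up here, the default base URL is the local address used above, and only the standard library is used so it does not depend on requests. The response layout (frames[0]["checksum"]) is taken from the snippet above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def extract_checksum(payload):
    """Pull the checksum out of a parsed /3/Frames/<frame_id> response."""
    frames = payload.get("frames") or []
    if not frames:
        raise ValueError("response contains no frames")
    return frames[0]["checksum"]


def frame_checksum(frame_id, base_url="http://127.0.0.1:54321"):
    """Ask a running H2O server for the checksum of the given frame.

    base_url is an assumption: adjust it if the cluster runs elsewhere.
    """
    url = "{}/3/Frames/{}".format(base_url.rstrip("/"), quote(frame_id, safe=""))
    with urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return extract_checksum(payload)
```

With a cluster running, frame_checksum(iris_df.frame_id) should return the same value as the lookups above. Separating the parsing step from the network call makes it possible to check the parsing without a server.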
Thanks for your help!