Pandas DataFrame Hash Values Differ Between Unix and Windows

I've noticed that hash values created from Pandas DataFrames change depending on whether the snippet below is executed on Unix or Windows.

import pandas as pd
import numpy as np
import hashlib

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                          columns=['a', 'b', 'c'])

hashvalue_new = hashlib.md5(df.values.flatten().data).hexdigest()
print(hashvalue_new)

The above code prints d0ecb84da86002807de1635ede730f0a on Windows machines and 586962852295d584ec08e7214393f8b2 on Unix machines. Can someone explain why this happens and suggest a way to create a consistent hash value across platforms? I'm running Python 3.8.5 and pandas 1.2.5.

Thx!

I'm unsure why this happens, but I'm able to get consistent results by relying on pandas.util.hash_pandas_object, as suggested elsewhere on Stack Overflow. In full, my solution looks like this:

import pandas as pd
import numpy as np
import hashlib

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])

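# hash_pandas_object returns one uint64 hash per row; md5 then digests that array's bytes
hashvalue_new = hashlib.md5(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
print(hashvalue_new)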

This consistently gives me 9762ced20d27292712e6a2065b6d6226 across operating systems.
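
A likely cause, though not confirmed in this thread: with NumPy releases before 2.0, np.array on a list of Python ints defaults to a 32-bit integer on Windows and a 64-bit integer on Linux and macOS, so df.values exposes a different number of bytes on each platform even though the numbers are the same. The sketch below simulates both cases on one machine by setting the dtype explicitly, then hashes after casting to a fixed type. The helper name stable_md5 and the choice of "<i8" (little-endian int64) are assumptions made for illustration, and the helper hashes only the values, not the column names or the index.

```python
import hashlib

import numpy as np
import pandas as pd


def stable_md5(df: pd.DataFrame, dtype: str = "<i8") -> str:
    """Hash the frame's values after casting to a fixed-width, fixed-endian dtype.

    Hypothetical helper for illustration; it ignores column names and the index.
    """
    # "<i8" means little-endian 64-bit integer, the same layout on every platform.
    values = np.ascontiguousarray(df.to_numpy(), dtype=dtype)
    return hashlib.md5(values.tobytes()).hexdigest()


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Simulate the two platform defaults by choosing the dtype explicitly.
df32 = pd.DataFrame(np.array(data, dtype=np.int32), columns=["a", "b", "c"])
df64 = pd.DataFrame(np.array(data, dtype=np.int64), columns=["a", "b", "c"])

# Same numbers, different buffer sizes: 9 * 4 bytes versus 9 * 8 bytes.
print(df32.to_numpy().nbytes, df64.to_numpy().nbytes)

# Hashing the raw buffers therefore disagrees between the two frames.
raw32 = hashlib.md5(df32.to_numpy().tobytes()).hexdigest()
raw64 = hashlib.md5(df64.to_numpy().tobytes()).hexdigest()
print(raw32 == raw64)

# Hashing after an explicit cast agrees.
print(stable_md5(df32) == stable_md5(df64))
```

Printing df.dtypes on each machine is a quick way to check whether this is what is happening in a given setup. If it is, passing dtype=np.int64 when building the array should also make the original one-liner agree across platforms.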
