Information lost when pickling, compressing Python object and storing in SQL

I am using the pickle and zlib libraries to serialize and compress 3 objects for storage in T-SQL as nvarchar(MAX), using pyodbc. When retrieving these objects from SQL on the other end, 2 successfully translate back to their original form (a 2d numpy array and an sklearn StandardScaler). The third object (an sklearn GradientBoostingRegressor) does not translate back. Below is a summary of the code used. I should also note that I'm using Pandas for some of the SQL work.

Below is the code for pickling, compressing, and uploading to SQL:

model_pickle = zlib.compress(pickle.dumps(est))
scaler_pickle = zlib.compress(pickle.dumps(scaler),1)
boundary_pickle = zlib.compress(pickle.dumps(bounds),1)

cnxn = pyodbc.connect('driver={SQL Server};server=XXX;database=XYZ;trusted_connection=true')
cursor = cnxn.cursor()

cursor.execute("""INSERT INTO AD_RLGNMNT_MDL_PICKLE (CHAIN_NUM,AD_LENGTH,GL_CODE,AD_TYPE,MODEL_PICKLE,SCALER_PICKLE,WFA_TEST,BOUNDS_PICKLE) VALUES (?,?,?,?,?,?,?,?);""",
               (chain_num.astype(np.int32),ad_length.astype(np.int32),gl_code.astype(np.int32),ad_type,model_pickle,scaler_pickle,WFA,boundary_pickle))
cursor.commit()
cursor.close()
cnxn.close()

Below is the code for pulling from SQL, decompressing, and unpickling the objects:

model_pickle = pd.read_sql("""SELECT A.MODEL_PICKLE from AD_RLGNMNT_MDL_PICKLE A WHERE A.CHAIN_NUM = %s and A.ad_length = %s and A.GL_CODE = %s and A.AD_TYPE in ('%s');"""
            % (Current_chain_num, Current_ad_length, Current_GL, Current_ad_type),cnxn)
scaler_pickle = pd.read_sql("""SELECT A.SCALER_PICKLE from AD_RLGNMNT_MDL_PICKLE A WHERE A.CHAIN_NUM = %s and A.ad_length = %s and A.GL_CODE = %s and A.AD_TYPE in ('%s');"""
            % (Current_chain_num, Current_ad_length, Current_GL, Current_ad_type),cnxn)
bounds_pickle = pd.read_sql("""SELECT A.BOUNDS_PICKLE from AD_RLGNMNT_MDL_PICKLE A WHERE A.CHAIN_NUM = %s and A.ad_length = %s and A.GL_CODE = %s and A.AD_TYPE in ('%s');"""
            % (Current_chain_num, Current_ad_length, Current_GL, Current_ad_type),cnxn)

combos_list.set_value(i, 'model', model_pickle[['MODEL_PICKLE']].iloc[0].values[0])
combos_list.set_value(i, 'scaler', scaler_pickle[['SCALER_PICKLE']].iloc[0].values[0])
combos_list.set_value(i, 'bounds', bounds_pickle[['BOUNDS_PICKLE']].iloc[0].values[0])

model  = pickle.loads(zlib.decompress(model_pickle[['MODEL_PICKLE']].iloc[0].values[0]))
scaler = pickle.loads(zlib.decompress(scaler_pickle[['SCALER_PICKLE']].iloc[0].values[0]))
bounds = pickle.loads(zlib.decompress(bounds_pickle[['BOUNDS_PICKLE']].iloc[0].values[0]))

When I run the "model = pickle.loads" line, I receive the following error (again, the last two lines work correctly):

error: Error -5 while decompressing data: incomplete or truncated stream

I have scoured SO and the web for solutions, and have tried countless variations of the above code. I have compared the raw ASCII input and output: the input is 203,663 characters long, while the output is 203,649. The output is missing an "\xb3" at the end, as well as an "xaa\\" and a "\\n" in the middle.
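For what it's worth, a small standalone snippet (a toy payload, not my real objects) reproduces the same exception just by dropping a single byte from the end of a zlib stream, which matches the missing characters above:

import pickle
import zlib

# Stand-in payload; any picklable object will do
payload = zlib.compress(pickle.dumps({"weights": list(range(1000))}))

# Losing even one byte from the stream triggers the same error
try:
    pickle.loads(zlib.decompress(payload[:-1]))
except zlib.error as exc:
    print(exc)  # Error -5 while decompressing data: incomplete or truncated stream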

Is there something unique about the GradientBoostingRegressor object? Pulling my hair out here.

You need to use a binary BLOB to store your data in the database (more complex to implement; a rough sketch follows below)
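A minimal sketch of that route, assuming the pickle columns are altered to VARBINARY(MAX), only the two relevant columns are shown, and est, chain_num and the connection string are reused from the question:

import pickle
import zlib
import pyodbc

cnxn = pyodbc.connect('driver={SQL Server};server=XXX;database=XYZ;trusted_connection=true')
cursor = cnxn.cursor()

# With a VARBINARY(MAX) column, pyodbc binds the raw bytes directly,
# so no character conversion happens on the way in or out
model_pickle = zlib.compress(pickle.dumps(est))
cursor.execute("INSERT INTO AD_RLGNMNT_MDL_PICKLE (CHAIN_NUM, MODEL_PICKLE) VALUES (?, ?);",
               (int(chain_num), pyodbc.Binary(model_pickle)))
cursor.commit()

# Reading it back returns a bytes object that decompresses cleanly
row = cursor.execute("SELECT MODEL_PICKLE FROM AD_RLGNMNT_MDL_PICKLE WHERE CHAIN_NUM = ?;",
                     (int(chain_num),)).fetchone()
model = pickle.loads(zlib.decompress(row.MODEL_PICKLE))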

or

Simply use a character CLOB to store your data after base64-encoding it:

import base64

model_pickle=base64.b64encode(zlib.compress(pickle.dumps(est)))

model=pickle.loads(zlib.decompress(base64.b64decode(model_pickle[['MODEL_PICKLE']].iloc[0].values[0])))

Note: base64 increases the data volume (+33%), but I think that's not very significant.

**Update:** This only worked for one example. The selected answer above by user indent fixed my problem.

Figured it out! I modified the first line of code to the following, which selects a more recent pickle protocol.

model_pickle = zlib.compress(pickle.dumps(est, protocol=2))
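As a quick sanity check (a standalone sketch with a toy model, not the original est), the protocol-2 pickle round-trips through compression locally:

import pickle
import zlib
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-in for the real fitted model
est = GradientBoostingRegressor(n_estimators=5).fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

model_pickle = zlib.compress(pickle.dumps(est, protocol=2))
restored = pickle.loads(zlib.decompress(model_pickle))
print(type(restored).__name__)  # GradientBoostingRegressor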
