
keras model evaluation with quantized weights post training

I have a model trained in Keras and saved as a .h5 file. The model was trained with single-precision floating-point values using the TensorFlow backend. Now I want to implement a hardware accelerator that performs the convolution operation on a Xilinx FPGA. However, before I decide on the fixed-point bit width to use on the FPGA, I need to evaluate the model accuracy by quantizing the weights to 8- or 16-bit numbers. I came across the TensorFlow quantization ops, but I am not sure how to take the weights from each layer, quantize them, and store them in a list of numpy arrays. After all layers are quantized, I want to set the weights of the model to the newly formed quantized weights. Could someone help me do this?

This is what I have tried so far to reduce precision from float32 to float16. Please let me know if this is the correct approach.

import numpy as np

w_fp_16_test = []
for i in range(len(w_original)):
    temp_shape = w_original[i].shape
    print('Shape of index ' + str(i) + ' array is:')
    print(temp_shape)
    # Casting element by element into a float32 array leaves the dtype float32,
    # so cast the whole flattened array to float16 instead.
    temp_array_flat = w_original[i].flatten().astype(np.float16)
    w_fp_16_test.append(temp_array_flat.reshape(temp_shape))

Sorry, I'm not familiar with TensorFlow, so I can't give you the code, but maybe my experience quantizing a Caffe model will be helpful.

If I understand you correctly, you have a TensorFlow model (float32) that you want to quantize to int8 and save as a numpy.array.

Firstly, you should read all the weights of each layer; whether they come back as a Python list, a numpy.array, or something else doesn't matter.
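In Keras that step might look like the sketch below ('model.h5' is a placeholder for your file; model.get_weights() is the standard Keras call that returns one numpy array per weight tensor):

import numpy as np
from tensorflow.keras.models import load_model

# Load the trained float32 model and collect every weight array
# (kernels and biases) as a list of numpy arrays.
model = load_model('model.h5')   # 'model.h5' is a placeholder for your file
w_original = model.get_weights()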

Then, the quantization algorithm will influence the accuracy significantly, so you must choose the best one for your model. However, these algorithms all share the same core: scaling. All you need to do is scale the weights into the range -127 to 127 (int8), like a scale layer without bias, and record the scale factor.
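A minimal numpy sketch of that idea, assuming symmetric per-tensor scaling with a max-abs scale (the function name quantize_int8 is my own):

import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of a weight array into [-127, 127]."""
    scale = max(np.max(np.abs(w)), 1e-12) / 127.0   # keep this factor for dequantization
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale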

Meanwhile, if you want to implement it on an FPGA, the data (activations) should be quantized too. Here we have a new problem: the result of int8 * int8 is an int16, which obviously overflows the int8 range.

To solve this, we introduce a new parameter, shift, to bring the int16 result back into int8. Note that the shift parameter won't be a constant 8: suppose we have 0 * 0 = 0, then we don't need to shift the result at all.
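A small numpy illustration of this, where the shift value 7 is purely illustrative and would be derived per layer from the recorded scale factors:

import numpy as np

# int8 * int8 needs a wider accumulator; a per-layer right shift
# maps the product back into the int8 range.
a = np.int32(100)                 # int8 operand widened before the multiply
b = np.int32(90)
acc = a * b                       # 9000 -- does not fit in int8
shift = 7                         # illustrative; chosen per layer in practice
y = np.clip(acc >> shift, -128, 127).astype(np.int8)   # back in int8 range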

The last question to think over is that if the net is deep, a layer's result could overflow because of unreasonable scale parameters, so we can't quantize each layer in isolation without considering the other layers.
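One common way to keep the scales consistent is to measure the activation range of every layer on a small calibration batch; a rough Keras sketch, assuming a single-input, single-output-per-layer model (the function name and the max-abs statistic are my own choices):

import numpy as np
from tensorflow.keras.models import Model

def activation_scales(model, calib_x):
    """Record max |activation| per layer on a calibration batch, so each layer's
    scale accounts for what the earlier (already quantized) layers produce."""
    probe = Model(inputs=model.input, outputs=[l.output for l in model.layers])
    outputs = probe.predict(calib_x)
    return [np.max(np.abs(o)) / 127.0 for o in outputs]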

After the whole net has run on the FPGA, if you want to dequantize the int8 result back to float32, just use the last scale parameter (of the final result) to do a multiply/divide (depending on how you define scale).
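In numpy terms that is a single multiply with the recorded output scale (q_result and output_scale are hypothetical names):

float_result = q_result.astype(np.float32) * output_scale   # or divide, per your scale definition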

This is a basic quantization algorithm; others, like tf.quantization, may give higher accuracy. Once you have the quantized model, you can save it in whatever format you like; that part is not hard.
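To answer the original question about checking accuracy in Keras, one way to put these pieces together is to fake-quantize the weights (quantize, then dequantize back to float32), load them with set_weights, and run the normal evaluation. This reuses the quantize_int8 sketch above; x_test / y_test stand for your own evaluation data, and model.evaluate returns an accuracy only if the model was compiled with an accuracy metric:

# Fake-quantize: quantize each weight array, dequantize it back to float32,
# load it into the model, and evaluate accuracy with the usual Keras pipeline.
w_quantized = []
for w in model.get_weights():
    q, scale = quantize_int8(w)              # sketch defined above
    w_quantized.append(q.astype(np.float32) * scale)

model.set_weights(w_quantized)
loss, acc = model.evaluate(x_test, y_test)   # x_test / y_test: your evaluation data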

PS: Why numpy? A bin file is best for an FPGA, isn't it?

Also, do you have any ideas about implementing softmax on an FPGA? I'm confused about it...
