TensorRT/AccuracyIssues

From eLinux.org

Latest revision as of 02:55, 17 September 2019


How to fix FP16 accuracy issue?

The following table lists the dynamic range and the minimum positive value of FP32, FP16 and INT8:

Type   Dynamic Range                   Min Positive Value
FP32   -3.4 x 10^38 ~ +3.4 x 10^38     1.4 x 10^-45
FP16   -65504 ~ +65504                 5.96 x 10^-8
INT8   -128 ~ +127                     1
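The FP16 limits in the table follow directly from the IEEE 754 half-precision layout (5 exponent bits with bias 15, 10 mantissa bits). The following C++ sketch derives them and reads the FP32 limits from std::numeric_limits. The function names are illustrative only and are not part of TensorRT.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// IEEE 754 binary16 (FP16): 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
constexpr int kFp16MantissaBits = 10;
constexpr int kFp16ExponentBias = 15;
constexpr int kFp16MaxExponent = 15;  // largest biased exponent of a finite value (30) minus the bias

// Largest finite FP16 value: (2 - 2^-10) * 2^15 = 65504.
double fp16Max() {
    return (2.0 - std::ldexp(1.0, -kFp16MantissaBits)) * std::ldexp(1.0, kFp16MaxExponent);
}

// Smallest positive (subnormal) FP16 value: 2^(1 - 15 - 10) = 2^-24, about 5.96e-8.
double fp16MinPositive() {
    return std::ldexp(1.0, 1 - kFp16ExponentBias - kFp16MantissaBits);
}

// FP32 counterparts come straight from the standard library.
float fp32Max() { return std::numeric_limits<float>::max(); }                // about 3.4e38
float fp32MinPositive() { return std::numeric_limits<float>::denorm_min(); }  // 2^-149, about 1.4e-45
```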

Unlike INT8, FP16 computation generally does not overflow (that is, produce activations or weights larger than 65504 or smaller than -65504). However, underflow (magnitudes smaller than 5.96e-8) can still occur for values that FP32 represents without a problem.
To debug FP16 accuracy, dump the output of intermediate layers and check whether the FP16 activations deviate significantly from the FP32 ones (refer to page for how to dump and analyze layers).
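Before comparing layer by layer, it can help to check whether a layer's FP32 output fits in FP16 at all. The sketch below scans a dumped FP32 activation buffer and counts values that fall outside the FP16 limits from the table above. It assumes the dump has already been loaded into a std::vector<float>. screenFp16Range and Fp16RangeReport are hypothetical names. The check is a rough screen rather than an exact model of FP16 rounding: values just below 5.96e-8 may round up instead of to zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Thresholds taken from the FP16 row of the range table.
constexpr float kFp16Max = 65504.0f;
constexpr float kFp16MinPositive = 5.96e-8f;

struct Fp16RangeReport {
    std::size_t overflow = 0;   // |x| > 65504: would become +/-inf in FP16
    std::size_t underflow = 0;  // 0 < |x| < 5.96e-8: loses all or most precision in FP16
    std::size_t total = 0;
};

// Count FP32 activations that FP16 cannot represent (rough screen only).
Fp16RangeReport screenFp16Range(const std::vector<float>& fp32Activations) {
    Fp16RangeReport report;
    report.total = fp32Activations.size();
    for (float x : fp32Activations) {
        const float mag = std::fabs(x);
        if (mag > kFp16Max) {
            ++report.overflow;
        } else if (mag > 0.0f && mag < kFp16MinPositive) {
            ++report.underflow;
        }
    }
    return report;
}
```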

In our experience, batch normalization and activation (ReLU) layers can effectively reduce the information loss of FP16. The following statistics, collected from a UNet semantic segmentation network, illustrate this:

Network  Layer  Diff_num (|FP32 - FP16| / |FP32| > 10%)  Total_num             Deviation ratio (Diff_num / Total_num * 100%)
UNet     Conv0  23773                                    2621440 (40*256*256)  0.9069%
UNet     bn0    371                                      2621440 (40*256*256)  0.0142%
UNet     relu0  196                                      2621440 (40*256*256)  0.0075%
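The deviation ratio in the table can be computed from two dumps of the same layer: one from the FP32 engine and one from the FP16 engine. Here is a minimal sketch. It assumes both dumps are loaded as std::vector<float> with identical element order. deviationRatioPercent is a hypothetical helper. Its handling of elements whose FP32 value is exactly zero is our own assumption, because the relative error is undefined there.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Percentage of elements whose relative deviation |FP32 - FP16| / |FP32|
// exceeds `threshold` (0.10 matches the table).
// Assumption: if the FP32 value is exactly 0, the element counts as deviating
// only when the FP16 value is non-zero.
double deviationRatioPercent(const std::vector<float>& fp32,
                             const std::vector<float>& fp16,
                             double threshold = 0.10) {
    assert(fp32.size() == fp16.size() && !fp32.empty());
    std::size_t diffNum = 0;
    for (std::size_t i = 0; i < fp32.size(); ++i) {
        const double ref = fp32[i];
        const double absDiff = std::fabs(ref - static_cast<double>(fp16[i]));
        if (ref == 0.0) {
            if (absDiff != 0.0) ++diffNum;
        } else if (absDiff / std::fabs(ref) > threshold) {
            ++diffNum;
        }
    }
    return 100.0 * static_cast<double>(diffNum) / static_cast<double>(fp32.size());
}
```

For the UNet Conv0 row, 23773 of the 2621440 elements exceed the threshold, which gives 23773 / 2621440 * 100%, or about 0.9069%.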

NOTE: To dump the FP16 result of the first layer, we have to mark that layer as an output. However, marking a layer as an output may cause the TensorRT builder to run it in FP32 instead of FP16. This is probably because the layer's input and output are then both FP32, so running it in FP16 would require reformatting before and after it, and that reformatting overhead might exceed the benefit of FP16. In this case, use the following API to force the network to run strictly in FP16 without regard to performance optimization:

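builder->setFp16Mode(true);               // allow FP16 kernels
builder->setStrictTypeConstraints(true);  // honor the requested precision even when slower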

From the above results, we can see that:

  • In the FP16 convolution output, about 0.9% of the elements deviate by more than 10% from the FP32 result.
  • Batch normalization reduces the deviation ratio significantly, from 0.9% to 0.014%.
  • Activation (ReLU) also helps, roughly halving the ratio again (from 0.014% to 0.0075%). This is presumably because negative values are clipped to zero in both FP16 and FP32, so their deviations disappear.

How to fix INT8 accuracy issue?

You should get a fully correct result in FP32 mode and a roughly correct result in INT8 mode, either after TensorRT auto calibration or after inserting external dynamic ranges. If the FP32 result is as expected but the INT8 result is completely wrong, the cause is probably an invalid calibration procedure or inaccurate dynamic ranges.
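One quick way to tell a "roughly correct" INT8 result from a completely wrong one is to run the same input through the FP32 and INT8 engines and compare the output vectors. The sketch below assumes a classification network whose scores have already been copied back to the host. It checks top-1 agreement and cosine similarity. The helper names are hypothetical, and this page implies no particular similarity threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Index of the largest score (the top-1 class).
std::size_t argmax(const std::vector<float>& v) {
    assert(!v.empty());
    return static_cast<std::size_t>(std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}

// Cosine similarity between FP32 and INT8 outputs for the same input.
// Values close to 1 suggest a roughly correct INT8 engine; values near 0 or
// negative suggest broken calibration.
double cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na += static_cast<double>(a[i]) * a[i];
        nb += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;  // undefined for all-zero vectors
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```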

If you are using the TensorRT auto calibration mechanism, perform the following checks to rule out calibration issues (refer to https://elinux.org/TensorRT/Int8CFAQ for how to perform calibration without the BatchStream approach).

IInt8Calibrator contains four pure virtual methods that must be implemented, shown below. The most important and most error-prone one is getBatch():

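// Number of samples in each calibration batch.
virtual int getBatchSize() const = 0;
// Put the next batch in device memory and set bindings[i] for input names[i]; return false when no data is left.
virtual bool getBatch(void* bindings[], const char* names[], int nbBindings) = 0;
// Return a previously saved calibration table and set length, or return nullptr to force calibration.
virtual const void* readCalibrationCache(std::size_t& length) = 0;
// Persist the calibration table produced by TensorRT.
virtual void writeCalibrationCache(const void* ptr, std::size_t length) = 0;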
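Many getBatch() bugs come from handing TensorRT the wrong data: a partial batch, reads past the end of the dataset, or inputs preprocessed differently from inference. The following host-only sketch shows the batching logic that a getBatch() implementation typically wraps. CalibrationBatchSource is a hypothetical class, not a TensorRT API. A real calibrator would also copy each batch into a device buffer (for example with cudaMemcpy) and store that device pointer in bindings[0]. That step is omitted here so the sketch runs without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: serves consecutive batches of already-preprocessed
// samples, the way a getBatch() implementation would consume them.
class CalibrationBatchSource {
public:
    CalibrationBatchSource(std::vector<std::vector<float>> samples, int batchSize)
        : samples_(std::move(samples)), batchSize_(batchSize) {
        assert(batchSize_ > 0);
    }

    int getBatchSize() const { return batchSize_; }

    // Fills `batch` with the next batchSize samples laid out contiguously.
    // Returns false once fewer than batchSize samples remain, which is how
    // getBatch() tells TensorRT that calibration data is exhausted.
    bool nextBatch(std::vector<float>& batch) {
        if (next_ + static_cast<std::size_t>(batchSize_) > samples_.size()) return false;
        batch.clear();
        for (int i = 0; i < batchSize_; ++i) {
            const auto& sample = samples_[next_++];
            batch.insert(batch.end(), sample.begin(), sample.end());
        }
        return true;
    }

private:
    std::vector<std::vector<float>> samples_;
    int batchSize_;
    std::size_t next_ = 0;
};
```

This sketch simply drops a trailing partial batch, so every batch it hands out has exactly getBatchSize() samples.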
  • Is the calibration input preprocessed identically to the FP32 inference input? If you are not sure, dump both buffers before feeding them into TensorRT and compare them.
  • Is the calibration dataset large enough? Ensure that it is diverse and representative.
  • Is a stale or incorrect cached calibration table being loaded unexpectedly?
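For the first check, dump both buffers for the same input image and compare them element by element. Here is a minimal sketch. It assumes both buffers are available as std::vector<float>. compareBuffers and BufferDiff are hypothetical names, and the default tolerance of zero assumes both paths run the same preprocessing code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of comparing two dumped input buffers element by element.
struct BufferDiff {
    bool sameSize = false;
    std::size_t firstMismatch = 0;  // meaningful only when mismatches > 0
    std::size_t mismatches = 0;
    double maxAbsDiff = 0.0;
};

// Compare the calibration-path buffer with the FP32-inference-path buffer
// produced from the same image.
BufferDiff compareBuffers(const std::vector<float>& calib,
                          const std::vector<float>& infer,
                          double tolerance = 0.0) {
    BufferDiff d;
    d.sameSize = calib.size() == infer.size();
    if (!d.sameSize) return d;  // different sizes already indicate a mismatch
    for (std::size_t i = 0; i < calib.size(); ++i) {
        const double diff = std::fabs(static_cast<double>(calib[i]) - infer[i]);
        if (diff > d.maxAbsDiff) d.maxAbsDiff = diff;
        if (diff > tolerance) {
            if (d.mismatches == 0) d.firstMismatch = i;
            ++d.mismatches;
        }
    }
    return d;
}
```

A large difference on many elements, for example 255 against 1.0, usually means a normalization step (such as scaling pixel values into [0, 1]) is applied in only one of the two paths.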


Once INT8 mode produces a roughly correct result, you can start evaluating its accuracy against your whole test dataset.


If the classification or detection accuracy is poor compared to FP32 mode, try the following approaches to fix it. (What counts as a ‘poor’ result? For example, we see less than 1% INT8 accuracy loss for popular classification CNNs such as AlexNet, VGG19 and ResNet50/101/152, and for detection networks such as VGG16_FasterRCNN_500x375 and VGG16_SSD_300x300. If your accuracy loss is much larger than 1%, it is probably the ‘poor’ case.)
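Deciding whether you are in the ‘poor’ case requires the same accuracy metric for both modes. The sketch below computes top-1 accuracy from predicted and ground-truth class ids, then flags the ‘poor’ case. Both helpers are hypothetical. The sketch assumes the 1% guideline means one percentage point of absolute top-1 accuracy. Detection networks would use a metric such as mAP instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Top-1 accuracy (in percent) of predicted class ids against ground-truth labels.
double top1AccuracyPercent(const std::vector<int>& predictions, const std::vector<int>& labels) {
    assert(predictions.size() == labels.size() && !labels.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (predictions[i] == labels[i]) ++correct;
    }
    return 100.0 * static_cast<double>(correct) / static_cast<double>(labels.size());
}

// Assumption: "1% accuracy loss" is read as 1 percentage point of absolute accuracy.
bool isPoorInt8Accuracy(double fp32AccuracyPercent, double int8AccuracyPercent,
                        double maxLossPoints = 1.0) {
    return fp32AccuracyPercent - int8AccuracyPercent > maxLossPoints;
}
```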

  • Mixed-precision inference

Follow the approach on https://elinux.org/TensorRT/LayerDumpAndAnalyzeForINT8 to analyze the accuracy of all layers. Then set a higher precision for any layer whose loss is much larger than that of the other layers:

virtual void setPrecision(DataType dataType) = 0;  // member of nvinfer1::ILayer

NOTE: Don't forget to enable strict type constraints for your network. Otherwise, this precision setting may be overridden during network optimization.

builder->setStrictTypeConstraints(true);
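Once per-layer errors have been collected with the layer-dump approach, choosing which layers to raise to a higher precision can be automated. The following sketch uses a hypothetical rule: flag a layer when its error is more than `factor` times the median error across all layers. The rule, the default factor of 10 and the LayerError structure are assumptions for illustration only. The selected names would then be matched to their ILayer objects and passed to setPrecision().

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct LayerError {
    std::string name;  // layer name as reported by the dump/analysis step
    double error;      // any per-layer error metric, e.g. deviation ratio vs FP32
};

// Hypothetical rule: select layers whose error exceeds factor * median error.
// For an even number of layers this uses the upper of the two middle values.
std::vector<std::string> layersNeedingHigherPrecision(const std::vector<LayerError>& layers,
                                                      double factor = 10.0) {
    assert(!layers.empty());
    std::vector<double> errors;
    for (const auto& l : layers) errors.push_back(l.error);
    std::nth_element(errors.begin(), errors.begin() + errors.size() / 2, errors.end());
    const double median = errors[errors.size() / 2];
    std::vector<std::string> selected;
    for (const auto& l : layers) {
        if (l.error > factor * median) selected.push_back(l.name);
    }
    return selected;
}
```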


  • TensorRT does provide a built-in quantization method, but it is post-training quantization and exposes few controls to users, so it cannot handle every network. If your model happens to be one of these cases, consider an external quantization method and insert the resulting dynamic ranges into TensorRT through the following API:
virtual bool setDynamicRange(float min, float max) = 0;  // member of nvinfer1::ITensor
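Conceptually, an external quantization tool only has to produce one range per tensor. The sketch below shows symmetric per-tensor INT8 quantization. It assumes the range is taken as max |x| over sample activations, with scale = range / 127, and that values outside the range saturate to ±127. This is a common scheme, not necessarily what TensorRT does internally, and the helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric dynamic range of a tensor from sample activations: max |x|.
float symmetricRange(const std::vector<float>& samples) {
    float range = 0.0f;
    for (float x : samples) range = std::max(range, std::fabs(x));
    return range;
}

// Symmetric INT8 quantization with scale = range / 127; out-of-range values saturate.
std::int8_t quantize(float x, float range) {
    assert(range > 0.0f);
    const float scale = range / 127.0f;
    const float q = std::round(x / scale);
    return static_cast<std::int8_t>(std::max(-127.0f, std::min(127.0f, q)));
}

// Map an INT8 value back to the real-valued domain.
float dequantize(std::int8_t q, float range) {
    return static_cast<float>(q) * (range / 127.0f);
}
```

With this scheme, a tool would pass (-range, range) to setDynamicRange() for the corresponding tensor.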


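Further reading on quantization methods in other frameworks: TensorFlow Post-training Quantization (https://www.tensorflow.org/lite/performance/post_training_quantization), TensorFlow Quantization-aware Training (https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize), PyTorch Glow Quantization (https://github.com/pytorch/glow/blob/master/docs/Quantization.md).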