TensorRT/INT8 Accuracy

This page intends to share some guidance on how to triage, debug and fix TensorRT INT8 accuracy issues.


Refer to the TensorRT official documentation for how to enable INT8 inference.


And run the sample below to experiment with INT8 inference.



The calibrator in the above sample utilizes BatchStream to prepare the calibration data. That is too complicated in practice, so here we provide a helper class, BatchFactory, which utilizes OpenCV for data preprocessing.
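
As a rough illustration only (the details of BatchFactory are not shown here), a minimal OpenCV preprocessing routine of the kind such a helper might wrap could look like the sketch below; the resize, mean values, scale and channel order are assumptions and must match whatever your training pipeline does.

#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

// Hypothetical preprocessing routine: load an image with OpenCV, resize it,
// subtract a per-channel mean, apply a scale, and convert HWC (BGR) to planar CHW.
// Every constant here is an example; match your training pipeline exactly.
std::vector<float> preprocessImage(const std::string& path, int width, int height,
                                   const float mean[3], float scale)
{
    cv::Mat img = cv::imread(path);                    // 8-bit BGR
    if (img.empty())
        return {};
    cv::resize(img, img, cv::Size(width, height));
    img.convertTo(img, CV_32FC3);                      // convert to float

    std::vector<float> chw(3 * width * height);
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                chw[c * width * height + y * width + x] =
                    (img.at<cv::Vec3f>(y, x)[c] - mean[c]) * scale;
    return chw;                                        // planar CHW buffer, ready to feed the network
}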


How to triage an INT8 accuracy issue

When customers or users encounter an INT8 accuracy issue, they very likely suspect that it is caused by TensorRT INT8 quantization, or wonder whether TensorRT INT8 quantization is suitable for their particular network model. If you are in this situation, don't panic; please go through the following checks to rule out some simple problems first. In our experience, most INT8 accuracy queries fall into this category, since TensorRT INT8 has been deployed successfully across a wide range of scenarios.

Ensure the TensorRT FP32 result is identical to what your training framework produces

When people find that the TensorRT FP32 result differs from the training framework's, they often ask whether this is expected. We can definitely tell you that there must be some problem with your input, and the possibility that a TensorRT native layer produces a wrong FP32 result should always be the last suspect. One fact supporting this is that most training frameworks use cuDNN or cuBLAS as their GPU backend, and TensorRT does the same, although it also has its own kernels, which have been validated many times across various networks.

In this case, the first step we recommend is to dump and compare the input data of the training framework and of TensorRT; they are supposed to be identical. If not, you will need to examine all the preprocessing steps one by one carefully. Some training frameworks have their own customized preprocessing steps that may be hard to port into your application, so if you use other preprocessing methods, like OpenCV, the best thing you can do is to try different conversion arguments to make the results as close as possible.
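
As an illustration, comparing the two dumped buffers can be as simple as the sketch below; the tolerance used here is an arbitrary example value.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare two dumped input buffers element by element, e.g. one dumped from the
// training framework and one produced by your TensorRT preprocessing.
void compareBuffers(const std::vector<float>& a, const std::vector<float>& b, float tol = 1e-5f)
{
    size_t mismatches = 0;
    float maxDiff = 0.f;
    const size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i)
    {
        const float diff = std::fabs(a[i] - b[i]);
        maxDiff = std::max(maxDiff, diff);
        if (diff > tol)
            ++mismatches;
    }
    std::printf("compared %zu values, mismatches: %zu, max abs diff: %g\n", n, mismatches, maxDiff);
}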

Ultimately, we hope you have validated TensorRT FP32 accuracy against your whole test dataset before enabling INT8 inference, and ensured its KPI matches that of the last epoch you got during training.

Ensure the preprocessing steps within your calibrator are identical to those of FP32 inference

Here we assume you have followed our documentation or sample and finished implementing your own INT8 calibrator.

virtual int getBatchSize() const = 0;                                              // batch size used during calibration
virtual bool getBatch(void* bindings[], const char* names[], int nbBindings) = 0;  // fill the device bindings with the next calibration batch
virtual const void* readCalibrationCache(std::size_t& length) = 0;                 // return a previously saved calibration table, or nullptr
virtual void writeCalibrationCache(const void* ptr, std::size_t length) = 0;       // persist the generated calibration table

IInt8Calibrator contains four virtual methods that need to be implemented, as shown above. The most important and most error-prone one is getBatch(), within which you need to prepare the calibration data the same way you do for FP32 inference and feed the GPU data into TensorRT. The purpose of calibration is to collect the activation distribution of all your network layers by running FP32 inference, so no matter how carefully you copy and paste the preprocessing steps from FP32 inference into your calibrator, it is always worth double-checking that they are exactly the same. As a final check, you can dump and compare the inputs.
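
For illustration, a minimal sketch of such a calibrator (based on IInt8EntropyCalibrator2) might look like the following; loadNextBatchHost() is a hypothetical hook standing in for your real preprocessing code, and a single input binding is assumed.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstddef>
#include <vector>

class CustomizedInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    CustomizedInt8Calibrator(int batchSize, size_t inputSizeBytes)
        : mBatchSize(batchSize), mInputSizeBytes(inputSizeBytes)
    {
        mHostBuffer.resize(inputSizeBytes / sizeof(float));
        cudaMalloc(&mDeviceInput, mInputSizeBytes);
    }
    ~CustomizedInt8Calibrator() override { cudaFree(mDeviceInput); }

    int getBatchSize() const override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        // Reuse the FP32 preprocessing path here; any mismatch skews the calibration.
        if (!loadNextBatchHost())
            return false;                                   // calibration data exhausted
        cudaMemcpy(mDeviceInput, mHostBuffer.data(), mInputSizeBytes, cudaMemcpyHostToDevice);
        bindings[0] = mDeviceInput;                         // assumes a single input binding
        return true;
    }

    // Returning nullptr forces calibration to run instead of loading a stale cache.
    const void* readCalibrationCache(std::size_t& length) override { length = 0; return nullptr; }
    void writeCalibrationCache(const void* cache, std::size_t length) override { /* persist if desired */ }

private:
    // Hypothetical hook: fill mHostBuffer with the next preprocessed batch
    // (e.g. via OpenCV as sketched earlier) and return false when the data set is exhausted.
    bool loadNextBatchHost() { return false; }

    int mBatchSize{};
    size_t mInputSizeBytes{};
    void* mDeviceInput{nullptr};
    std::vector<float> mHostBuffer;
};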

Ensure the calibration dataset is diverse and representative

This is the hardest check to rule out. People frequently ask us what kind of dataset counts as diverse and representative. The rough guide is to ensure your calibration dataset contains typical images for all classes. And in our experience, more calibration images do not necessarily mean better accuracy; experiments indicate that about 500 images are sufficient for calibrating ImageNet classification networks.

Ensure no stale or incorrect calibration table is being loaded unexpectedly

Customers/users often migrate their application from one platform to another, or from one TensorRT version to another, and the old calibration table remains and gets loaded by TensorRT unexpectedly. All we need to do is remove the calibration table before each calibration. Moreover, sometimes we find the accuracy doesn't change even after we modify something; one possible reason is that the previous calibration table was not removed. So keep this in mind at all times.
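
For instance, a small cleanup helper like the sketch below can be called before creating the calibrator; the file name "CalibrationTable" is only a placeholder for whatever name your calibrator actually writes.

#include <cstdio>

// Delete any stale calibration table so that readCalibrationCache() cannot
// silently reuse an old, possibly invalid table on the next run.
inline void removeStaleCalibrationTable(const char* path = "CalibrationTable")  // placeholder file name
{
    std::remove(path);
}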

Ensure you have tried all the calibration methods TensorRT provides

For example, TensorRT 6.0 provides the following methods; apart from the first one, which has been deprecated, ensure you have tried all the other three types.

enum class CalibrationAlgoType : int 
{                                    
    kLEGACY_CALIBRATION = 0,         
    kENTROPY_CALIBRATION = 1,        
    kENTROPY_CALIBRATION_2 = 2,      
    kMINMAX_CALIBRATION = 3,         
};                                   

By the way, since the last three calibrators share the same base class, it's very simple to switch the calibration algorithm, as in the patch below, which replaces kENTROPY_CALIBRATION_2 with kMINMAX_CALIBRATION:

//class CustomizedInt8Calibrator : public IInt8EntropyCalibrator2
class CustomizedInt8Calibrator : public IInt8MinMaxCalibrator   


Identify an INT8 quantization issue

After you've done all the above checks,

  • If you still get a totally incorrect result for INT8 inference, it probably indicates that something is wrong in your implementation. Basically, no matter what kind of inference task you run, a roughly correct result should be seen after TensorRT calibration.
  • If you observe that INT8 inference is partially working now, but its accuracy is worse than FP32's, for example the mAP of only one class is unsatisfactory while the others are good, then you are probably facing an INT8 quantization issue.

What does 'partially working' mean?

It means that you get correct detection or classification results for most of the input data.

What does 'worse than FP32' mean?

It means that after you evaluate the accuracy against the whole test dataset, the KPI (like TOP1/TOP5 for a classification task or mAP for a detection task) is much lower than FP32's. Typically, in our experience, we see less than 1% INT8 accuracy loss for popular classification CNNs, like VGG, ResNet and MobileNet, and for some detection networks, like VGG16_FasterRCNN and VGG16_SSD. If your accuracy loss from FP32 to INT8 is much larger than 1%, like 5% or even 10%, it might be the kind of case we are trying to solve here.

How to debug an INT8 quantization issue

Why does it exist?

Mathematically, INT8 quantization is based on the following formula:

FP32_value = INT8_value * scaling  // don't consider zero point

For weights, all the values are known constants after training is done, so it's easy to determine the scaling factor.

But for activations (the input or output of each layer/block), since the network input is unknown in advance, the activation distribution of each layer is not predictable, so it's hard to determine an ideal scaling factor to convert FP32 activations into INT8.
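
As a small illustration of the formula above, symmetric max-abs quantization (the simple scheme typically used for weights; activation scales come from calibration instead) might look like this sketch:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Pick a symmetric scale so that the observed range [-maxAbs, maxAbs] maps to [-127, 127].
float computeScale(const std::vector<float>& values)
{
    float maxAbs = 0.f;
    for (float v : values)
        maxAbs = std::max(maxAbs, std::fabs(v));
    return maxAbs > 0.f ? maxAbs / 127.f : 1.f;   // guard against an all-zero tensor
}

// FP32_value ~ INT8_value * scale, so INT8_value = round(FP32_value / scale), clamped.
int8_t quantize(float v, float scale)
{
    const int q = static_cast<int>(std::round(v / scale));
    return static_cast<int8_t>(std::max(-127, std::min(127, q)));
}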

TensorRT introduces INT8 calibration to solve this problem: it runs the calibration dataset in FP32 mode to build a histogram of the FP32 activations, then tries different scaling factors and evaluates the distribution loss through KL divergence (we call this the relative entropy calibration algorithm).
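
For reference, the KL divergence (relative entropy) between the original FP32 distribution P and a quantized candidate Q can be computed as in the sketch below; this only illustrates the metric itself, not TensorRT's actual calibration implementation.

#include <cmath>
#include <cstddef>
#include <vector>

// KL(P || Q): how much information is lost when Q is used to approximate P.
// P and Q are normalized histograms (each sums to 1) over the same bins.
double klDivergence(const std::vector<double>& p, const std::vector<double>& q)
{
    double kl = 0.0;
    for (std::size_t i = 0; i < p.size() && i < q.size(); ++i)
    {
        if (p[i] > 0.0 && q[i] > 0.0)
            kl += p[i] * std::log(p[i] / q[i]);
    }
    return kl;   // the calibration picks the scaling factor whose Q minimizes this loss
}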

Entropy calibration is a good way to achieve INT8 quantization, but it is never the best one for all network layers. As we have emphasized all along, TensorRT INT8 quantization has been validated hundreds of times and covers most cases, but there are always some cases it can't cover. Once we understand this, we can work out the solutions accordingly.

Generally, the first step in addressing an INT8 quantization issue is to break down which layer causes the significant accuracy loss in your network when deploying INT8.

Approach to figure out the loss layers

1. Find an image with poor INT8 accuracy.
2. Use this image to perform FP32 inference and dump the output activation values.
3. Iterate over all layers and run the following experiment for each target layer:
     a. Set the layers up to and including the target layer to run in INT8 mode.
     b. Set the layers after the target layer to run in FP32 mode.
     c. Perform INT8 inference and save the output activation values.
     d. Compare the output activation values with FP32's.
        If the loss is big (there is no fixed threshold for what counts as a 'big' loss,
        but you can get a sense from the loss trend over recent iterations), set this layer to run in FP32 mode
        in subsequent iterations.

The process looks something like the following (a sketch of how to force these per-layer precisions with the builder API follows the diagrams):

 #1: layer1_int8 --> layer2_fp32 --> layer3_fp32 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
 #2: layer1_int8 --> layer2_int8 --> layer3_fp32 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
 #3: layer1_int8 --> layer2_int8 --> layer3_int8 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)

If we observe a big accuracy loss when layer3 starts running in INT8, then we set layer3 back to a higher-precision mode and continue the experiments:

 #4: layer1_int8 --> layer2_int8 --> layer3_fp32 --> layer4_int8 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
 …
 #N: layer1_int8 --> layer2_int8 --> layer3_fp32 --> … --> layerN_int8 --> layer_output (compare it with FP32)
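
One way to carry out each experiment is to rebuild the engine with mixed precision through the TensorRT 6 builder API, as sketched below; config, network and lastInt8Layer are assumed to come from your own build code, and layers already identified as lossy in earlier iterations should also be forced back to FP32.

#include <NvInfer.h>

// Force INT8 for layers [0, lastInt8Layer] and FP32 for the rest, so that only
// the effect of the newly enabled INT8 layer shows up at the network output.
void setMixedPrecision(nvinfer1::IBuilderConfig* config,
                       nvinfer1::INetworkDefinition* network, int lastInt8Layer)
{
    config->setFlag(nvinfer1::BuilderFlag::kINT8);
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);   // make TensorRT obey per-layer precisions

    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        if (i > lastInt8Layer)
        {
            layer->setPrecision(nvinfer1::DataType::kFLOAT);
            for (int j = 0; j < layer->getNbOutputs(); ++j)
                layer->setOutputType(j, nvinfer1::DataType::kFLOAT);
        }
    }
}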


Why can't we compare the layer-by-layer activations directly?

Because we don't actually care what the intermediate activations look like compared to FP32's. Sometimes, even if the accuracy loss of some middle layer is very big, the final result may not be affected (perhaps because your network has a large tolerance for the current task). Hence, the loss at the network output layer is more useful for evaluating accuracy.

Why can't we dump all layers' INT8 results in one pass?

For example, if running layer3 in INT8 mode generates a big loss at the final output, then the layers after layer3 might also appear to have a big accuracy loss. So before we continue to identify the other potentially problematic layers, we should first rule out layer3 (by running it in FP32 mode) to get rid of its influence on the accuracy of subsequent layers.

Exercise with sampleMNIST