TensorRT/INT8 Accuracy

This page shares guidance on how to triage, debug, and fix TensorRT INT8 accuracy issues.

Refer to 8-bit-inference-with-tensorrt for the specification of TensorRT INT8.

Refer to the official TensorRT documentation to learn how to enable INT8 inference:
 * Enabling INT8 Inference Using C++
 * Enabling INT8 Inference Using Python

Run the samples below to experiment with INT8 inference:
 * sampleINT8 (uses internal auto-calibration)
 * sampleINT8API (imports external dynamic ranges)

The calibrator in the above samples uses BatchStream to prepare calibration data, which is cumbersome in practice. Here we provide a helper class, BatchFactory, that uses OpenCV for data preprocessing:
 * Calibration with OpenCV for preprocessing

How to triage an INT8 accuracy issue
When customers/users encounter an INT8 accuracy issue, they very likely suspect it is caused by TensorRT INT8 quantization, or wonder whether TensorRT INT8 quantization is suitable for their particular network model. If you are in this situation, don't panic; go through the following checks to rule out common mistakes first. In our experience, most INT8 accuracy queries fall into this category, since TensorRT INT8 has been deployed successfully across a wide range of scenarios.

Ensure the TensorRT FP32 result is identical to what your training framework produces
As long as the input is identical for TensorRT FP32 and your training framework, the inference results on both sides should be identical (unless you implement a plugin layer, which might introduce a discrepancy in your network).


 * If not, you will need to examine all the preprocessing steps one by one, and dump and compare their input data carefully.


 * Some training frameworks have their own customized preprocessing steps that may be hard to port into your application. If you end up using another preprocessing method, like OpenCV, the best you can do is to try different conversion arguments to make the input as close as possible.


 * Ultimately, we hope you have validated TensorRT FP32 accuracy against your full test dataset before enabling INT8 inference, and confirmed that its KPI matches the last epoch of your training.
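Once the inputs on both sides are dumped, the comparison itself is mechanical. Below is a minimal sketch; the helper name and flat-list layout are illustrative, not part of any sample:

```python
def compare_inputs(framework_input, trt_input):
    """Compare two flattened preprocessed inputs element by element.

    Returns the maximum absolute difference and the index where it occurs,
    so a mismatching preprocessing step can be traced back to a pixel.
    """
    assert len(framework_input) == len(trt_input), "input sizes differ"
    max_diff, max_idx = 0.0, -1
    for i, (a, b) in enumerate(zip(framework_input, trt_input)):
        d = abs(a - b)
        if d > max_diff:
            max_diff, max_idx = d, i
    return max_diff, max_idx

# Identical inputs give a zero difference; any nonzero max_diff points at
# the first place the two preprocessing pipelines diverge.
diff, idx = compare_inputs([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
```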

Ensure the preprocessing steps within your calibrator are identical to FP32 inference
Here we assume you have followed our documentation or samples and implemented your own INT8 calibrator.

    virtual int getBatchSize() const = 0;
    virtual bool getBatch(void* bindings[], const char* names[], int nbBindings) = 0;
    virtual const void* readCalibrationCache(std::size_t& length) = 0;
    virtual void writeCalibrationCache(const void* ptr, std::size_t length) = 0;

IInt8Calibrator contains four virtual methods that need to be implemented, as shown above. The most important and error-prone one is getBatch, in which you need to prepare the calibration data the same way you do for FP32 inference and feed the GPU data into TensorRT. The purpose of calibration is to collect the activation distribution of all your network layers by running FP32 inference. No matter how carefully you copy and paste the preprocessing steps from FP32 inference into your calibrator, it is always worth double-checking that they are exactly the same. The last resort is to dump and compare the inputs.

Ensure the calibration dataset is diverse and representative
The rough guideline is to ensure your calibration dataset contains typical images for all classes. In our experience, more calibration images do not necessarily yield better accuracy; experiments indicate that about 500 images are sufficient for calibrating ImageNet classification networks.
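One way to assemble such a dataset is to sample a fixed number of images per class, so every class is represented. The sketch below is illustrative; the (path, label) layout and helper name are our own:

```python
import random
from collections import defaultdict

def pick_calibration_set(labeled_images, per_class=50, seed=0):
    """Pick an equal number of images per class so the calibration set
    is diverse and representative.

    labeled_images: list of (image_path, class_label) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in labeled_images:
        by_class[label].append(path)
    chosen = []
    for label in sorted(by_class):
        paths = by_class[label]
        chosen.extend(rng.sample(paths, min(per_class, len(paths))))
    return chosen

# e.g. 10 classes x 50 images each gives ~500 calibration images, in line
# with the rule of thumb above for ImageNet-style classifiers.
```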

Ensure no cached and incorrect calibration table is loaded unexpectedly
Sometimes the accuracy doesn't change even after we have modified something, for example after we:
 * update the preprocessing steps
 * introduce/remove some calibration images
 * migrate from one platform to another
 * migrate from one TensorRT version to another

This is probably because the cached calibration table was not removed, so always keep this in mind for each experiment.

Ensure you have tried all the calibration methods TensorRT provides
For example, TensorRT 6.0 provides the following methods. The first one has been deprecated; ensure you have tried all the other three.

    enum class CalibrationAlgoType : int
    {
        kLEGACY_CALIBRATION = 0,
        kENTROPY_CALIBRATION = 1,
        kENTROPY_CALIBRATION_2 = 2,
        kMINMAX_CALIBRATION = 3,
    };

By the way, since the last three calibrators share the same base class, it's very simple to switch the calibration algorithm, as in the patch below, which replaces kENTROPY_CALIBRATION_2 with kMINMAX_CALIBRATION:

    //class CustomizedInt8Calibrator : public IInt8EntropyCalibrator2
    class CustomizedInt8Calibrator : public IInt8MinMaxCalibrator

Ensure you have tried BatchNorm layers
BatchNorm normalizes the value distribution, making it more uniform and better suited to the symmetric quantization that TensorRT supports.

Identify an INT8 quantization issue
After you've done all the above checks:
 * If you still get totally incorrect results for INT8 inference, it probably indicates something is wrong in your implementation. Basically, no matter which kind of inference task, a roughly correct result should be seen after TensorRT calibration.
 * If you observe that INT8 inference is partially working but its accuracy is worse than FP32, for example the mAP of only one class is unsatisfactory while the others are good, you are probably facing an INT8 quantization issue.

What does 'partially working' mean? That you can get correct detection or classification results for most input data.

What does 'worse than FP32' mean? That after you evaluate accuracy against the full test dataset, the KPI (like TOP1/TOP5 for a classification task or mAP for a detection task) is much lower than FP32's. Typically, in our experience, we see within 1% INT8 accuracy loss for popular classification CNNs, like VGG, ResNet, and MobileNet, and for some detection networks, like VGG16_FasterRCNN and VGG16_SSD. If your accuracy loss from FP32 to INT8 is much larger than 1%, like 5% or even 10%, it might be a case we are trying to solve.

Why does it exist?
In math, INT8 quantization is based on the following formula:

    FP32_value = INT8_value * scaling  // don't consider zero point
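As a toy illustration of this formula (not TensorRT's implementation), symmetric quantization with a given scaling factor looks like this:

```python
def quantize(fp32_value, scaling):
    """Symmetric INT8 quantization: round(x / scaling), clamped to [-127, 127]."""
    q = round(fp32_value / scaling)
    return max(-127, min(127, q))

def dequantize(int8_value, scaling):
    """FP32_value = INT8_value * scaling (zero point ignored)."""
    return int8_value * scaling

# If the chosen dynamic range is 6.0, the scaling factor maps 6.0 -> 127.
scaling = 6.0 / 127
q = quantize(2.5, scaling)        # 53
approx = dequantize(q, scaling)   # ~2.504, i.e. a small quantization error
```

The quality of the scaling factor decides how much of this error accumulates, which is exactly what the calibration step tries to optimize.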

For weights, all the values are constants known after training is done, so it's easy to determine the scaling factor.

But for activations (the input or output of each layer/block), since the network input is unknown in advance, the activation distribution of each layer is not predictable, so it's hard to determine an ideal scaling factor to convert FP32 activations into INT8.

TensorRT introduces INT8 calibration to solve this problem: it runs the calibration dataset in FP32 mode to build a histogram of the FP32 activations, then evaluates candidate scaling factors by measuring the distribution loss with KL divergence (we call this the relative entropy calibration algorithm).
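To get a feel for the loss measure, here is a toy KL divergence comparison between a reference FP32 activation histogram and two candidate quantized approximations. The numbers and coarse binning are made up; TensorRT's actual algorithm uses much finer histograms and re-quantizes the clipped range:

```python
import math

def kl_divergence(p, q):
    """Relative entropy KL(P || Q) between two histograms.
    Lower means the candidate distribution q loses less information
    about the reference distribution p. Bins where either count is
    zero are skipped in this toy version."""
    total_p, total_q = sum(p), sum(q)
    kl = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / total_p, qi / total_q
        if pi > 0 and qi > 0:
            kl += pi * math.log(pi / qi)
    return kl

# Reference FP32 activation histogram (toy) and two candidate
# quantized approximations of it.
reference = [10, 40, 30, 15, 5]
candidate_a = [12, 38, 29, 16, 5]   # close to the reference
candidate_b = [5, 15, 30, 40, 10]   # badly skewed

# The calibrator keeps the scaling factor whose resulting distribution
# gives the smaller divergence (candidate_a here).
```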

Generally, the first step in addressing an INT8 quantization issue is to break down which layer causes the significant accuracy loss for your network when deployed in INT8.

Approach to figure out the loss layers
1. Find an image with poor INT8 accuracy.
2. Use that image to perform FP32 inference and dump the output activation values.
3. Iterate over all layers and do the following experiment:
   a. Set the layers up to and including the scoped layer to run in INT8 mode.
   b. Set the layers after the scoped layer to run in FP32 mode.
   c. Perform INT8 inference and save the output activation values.
   d. Compare the output activation values with FP32's. If the loss is big (there is no fixed threshold for what counts as a 'big' loss, but you can get a sense from the loss trend over recent iterations), set this layer to run in FP32 mode in subsequent iterations.

The process looks something like this:

#1: layer1_int8 --> layer2_fp32 --> layer3_fp32 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
#2: layer1_int8 --> layer2_int8 --> layer3_fp32 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
#3: layer1_int8 --> layer2_int8 --> layer3_int8 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)

If we observe a big accuracy loss when layer3 starts running in INT8, we set layer3 to run in a higher precision mode and continue the experiments.

#4: layer1_int8 --> layer2_int8 --> layer3_fp32 --> layer4_int8 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
…
#N: layer1_int8 --> layer2_int8 --> layer3_fp32 --> … --> layerN_int8 --> layer_output (compare it with FP32)
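The iteration above can be sketched in Python. Here run_inference is a stand-in for rebuilding the engine with the given set of INT8 layers and measuring the similarity of the network output against the FP32 reference; all names are hypothetical:

```python
def find_loss_layers(num_layers, run_inference, similarity_threshold=0.98):
    """Scope layers into INT8 one at a time. Any layer whose inclusion
    drops the network-output similarity below the threshold is ruled out
    (kept in FP32) for all later iterations, so losses don't compound."""
    loss_layers = []
    int8_layers = []
    for layer in range(num_layers):
        similarity = run_inference(int8_layers + [layer])
        if similarity < similarity_threshold:
            loss_layers.append(layer)   # rule it out: keep it FP32
        else:
            int8_layers.append(layer)   # keep it INT8 from now on
    return loss_layers

# Mock network for illustration: layers 1 and 2 are the 'loss layers'.
def mock_run(int8_layers):
    return 0.5 if (1 in int8_layers or 2 in int8_layers) else 1.0
```

With this mock, find_loss_layers(5, mock_run) identifies layers 1 and 2, mirroring the #1..#N experiments above.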

Why can't we compare layer by layer activation directly?

Because we don't actually care how the intermediate activations look compared to FP32's. Sometimes, even if the accuracy loss for some middle layer is very big, the final result may not be influenced (perhaps because your network has a large tolerance for the current task). Hence, the loss at the network output layer is more useful for evaluating accuracy.

Why can't we dump all layer INT8 result one time?

For example, if running layer3 in INT8 mode generates a big loss at the final output, then the layers after layer3 might also show big accuracy loss. So before we continue to look for other potentially problematic layers, we should rule out layer3 first (by running it in FP32 mode) to get rid of its interactive influence on accuracy.

Exercise with sampleMNIST
Here we take sampleMNIST as an example and follow the above approach to figure out which layers have bigger loss than the others.


 * Download the samplemnist_accuracy_int8. It's based on the public sampleMNIST but packaged as a standalone sample (it can be built and run on its own).


 * Follow the README.md to build the sample.


 * Run the script to evaluate which layers have big loss.

    // the meaning of each layer_analyzer.py option can be found in its help (-h).
    python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 0 -t 0.9

Here is the output:

    ScopedLayerIndex|              LayerName|             OutputShape|        OutputSimilarity|
                   0|                  scale|           [1, 10, 1, 1]|                  1.0000|
                   1|                  conv1|           [1, 10, 1, 1]|                  0.6986|
    Choose layer [1] running higher precison mode because it leads the similarity of network output becomes lower than specified threshold [0.698629793342 vs 0.98]!
                   2|                  pool1|           [1, 10, 1, 1]|                  0.5000|
    Choose layer [2] running higher precison mode because it leads the similarity of network output becomes lower than specified threshold [0.500004345456 vs 0.98]!
                   3|                  conv2|           [1, 10, 1, 1]|                  1.0000|
                   4|                  pool2|           [1, 10, 1, 1]|                  1.0000|
                   5|                    ip1|           [1, 10, 1, 1]|                  1.0000|
                   6|                  relu1|           [1, 10, 1, 1]|                  1.0000|
                   7|                    ip2|           [1, 10, 1, 1]|                  1.0000|
                   8|                   prob|           [1, 10, 1, 1]|                  1.0000|

NOTE:

1. We deliberately changed the dynamic range of the conv1 output to 1 for demonstration, so that the network becomes less accurate. From the above result, we can see layers [1, 2] do show much lower similarity than the others. (The reason pool1 becomes a loss layer is that the dynamic range of conv1 is used as the input scaling factor for pool1, so when we change the dynamic range of conv1, pool1 is influenced as well.)

2. We removed the code that parses the inference result, since we only care about the similarity of the network output.
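The OutputSimilarity column is a score comparing the dumped INT8 and FP32 network outputs. Cosine similarity is one common choice for such a metric; the sketch below is our assumption for illustration (check layer_analyzer.py for the exact metric it uses):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened output activations:
    1.0 means the outputs point in the same direction, lower values
    mean a larger divergence from the FP32 reference."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

fp32_out = [0.1, 0.7, 0.2]
good_int8 = [0.1, 0.69, 0.21]  # nearly identical output
bad_int8 = [0.7, 0.1, 0.2]     # badly quantized output
```

A well-quantized layer keeps the score close to 1.0, matching the 1.0000 rows in the table above, while a loss layer drags it down.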

Step-by-step guide to applying the above approach to your own application
You can compare the source files with the public sampleMNIST to see what we have modified.

Add two specific options
@samplemnist_accuracy_int8/common/argsParser.h

    // scopedLayerIndex is the cutoff point,
    // [0 - scopedLayerIndex]  --> Run INT8 (includes scoped layer)
    // (scopedLayerIndex, NumberOfLayers] --> Run FP32 or FP16 (excludes scoped layer)
    int scopedLayerIndex{-1};
    // layerStickToFloat lists the layers with big loss; we rule them out before
    // scoping the subsequent layers to avoid interactive effects.
    std::vector<int> layerStickToFloat;

Handle the two options
We have to set strict type constraints for the precision settings, otherwise the final precision will be determined by the builder according to its profiled performance.

@ samplemnist_accuracy_int8/sampleMNIST.cpp

    builder->setStrictTypeConstraints(true);
    if (args.scopedLayerIndex != -1 && mParams.int8)
    {
        for (int i = args.scopedLayerIndex + 1; i < network->getNbLayers(); i++)
        {
            network->getLayer(i)->setPrecision(nvinfer1::DataType::kFLOAT);
        }
    }
    if (!args.layerStickToFloat.empty() && mParams.int8)
    {
        for (size_t i = 0; i < args.layerStickToFloat.size(); i++)
        {
            network->getLayer(args.layerStickToFloat[i])->setPrecision(nvinfer1::DataType::kFLOAT);
        }
    }

Dump network output into files
@ samplemnist_accuracy_int8/sampleMNIST.cpp

    for (auto& s : mParams.outputTensorNames)
    {
        std::string fnameStr;
        // Desired file name:
        //     fp32-.txt
        //     int8-_.txt ...
        std::string filePath = "./results/" + fnameStr + ".txt";
        std::ofstream file(filePath);
        buffers.dumpBuffer(file, s);
    }

How to fix INT8 quantization issue
Currently we have two solutions that have been qualified in some customers' cases.

Mixed precision
In the sample above, we figured out that running layers [1, 2] in INT8 mode leads to big loss at the network output. So we can choose to run them in a higher precision mode (like FP32, or FP16 if your platform supports half precision).

python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 0 -t 0.9 -l 1,2

    ScopedLayerIndex|              LayerName|             OutputShape|        OutputSimilarity|
                   0|                  scale|           [1, 10, 1, 1]|                  1.0000|
                   1|                  conv1|           [1, 10, 1, 1]|                  1.0000|
                   2|                  pool1|           [1, 10, 1, 1]|                  1.0000|
                   3|                  conv2|           [1, 10, 1, 1]|                  1.0000|
                   4|                  pool2|           [1, 10, 1, 1]|                  1.0000|
                   5|                    ip1|           [1, 10, 1, 1]|                  1.0000|
                   6|                  relu1|           [1, 10, 1, 1]|                  1.0000|
                   7|                    ip2|           [1, 10, 1, 1]|                  1.0000|
                   8|                   prob|           [1, 10, 1, 1]|                  1.0000|

We can see all the layers achieve higher similarity than in the previous result. But this approach incurs a performance penalty, since some layers are running in FP32 mode.

Mixed quantization
An intuitive explanation is that the TensorRT calibration algorithm behaves badly when quantizing the value distribution of the loss layers. So this solution tries to use another quantization method to estimate the dynamic range, such as choosing the mean of the TOP5 max values as the dynamic range, and then inserting that dynamic range into TensorRT. In this case, most layers still use the scaling factors from entropy calibration, while the few loss layers use the external quantization method.
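The 'mean of the TOP5 max values' estimate can be sketched as follows. This is a toy helper with made-up numbers; in practice the activations come from running the calibration set through the network:

```python
def topk_mean_dynamic_range(activations, k=5):
    """Estimate a layer's dynamic range as the mean of the k largest
    absolute activation values, which is less sensitive to a single
    outlier than a plain max."""
    magnitudes = sorted((abs(v) for v in activations), reverse=True)
    top = magnitudes[:k]
    return sum(top) / len(top)

acts = [0.1, -0.4, 2.0, 1.8, -1.9, 2.2, 0.3, 9.0]  # 9.0 is an outlier
plain_max = max(abs(v) for v in acts)              # 9.0
robust = topk_mean_dynamic_range(acts)             # (9.0+2.2+2.0+1.9+1.8)/5 = 3.38
```

The resulting value would then be handed to TensorRT for the loss layer via ITensor::setDynamicRange, while the other layers keep their calibrated scales.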

Taking the above sample as an example, we can tune the dynamic range of the loss layer (derived from an external quantization method in practice) to check whether the output accuracy/similarity improves.

    if (!strcmp(layer->getName(), "conv1"))
    {
        for (int j = 0; j < layer->getNbOutputs(); j++)
        {
            //layer->getOutput(j)->setDynamicRange(1, 1);
            layer->getOutput(j)->setDynamicRange(255, 255);
        }
    }

Make the above change to sampleMNIST.cpp, rebuild your inference application, and rerun the evaluation script.

We can see the result does look better now.

    python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 0 -t 0.9
    ScopedLayerIndex|              LayerName|             OutputShape|        OutputSimilarity|
                   0|                  scale|           [1, 10, 1, 1]|                  1.0000|
                   1|                  conv1|           [1, 10, 1, 1]|                  1.0000|
                   2|                  pool1|           [1, 10, 1, 1]|                  1.0000|
                   3|                  conv2|           [1, 10, 1, 1]|                  1.0000|
                   4|                  pool2|           [1, 10, 1, 1]|                  1.0000|
                   5|                    ip1|           [1, 10, 1, 1]|                  1.0000|
                   6|                  relu1|           [1, 10, 1, 1]|                  1.0000|
                   7|                    ip2|           [1, 10, 1, 1]|                  1.0000|
                   8|                   prob|           [1, 10, 1, 1]|                  1.0000|

QAT (Quantization-Aware Training)
More and more training frameworks are developing QAT support, which introduces fake quant/dequant nodes and accounts for quantization error during the training phase, so the pre-quantized model may be less sensitive to INT8 inference.

TensorRT 6.0 provides an explicit precision feature that allows users to add fake quant/dequant nodes through scale layers (only symmetric quantization is supported for both weights and activations). But it's not mature at present; for example, it requires the user to extract the dynamic ranges (or scaling factors) from the pre-quantized model and insert the fake quant/dequant nodes manually.