TensorRT/INT8 Accuracy

This page shares guidance on how to triage, debug, and fix TensorRT INT8 accuracy issues.

Refer to the official TensorRT documentation to learn how to enable INT8 inference:


 * Enabling INT8 Inference Using C++


 * Enabling INT8 Inference Using Python

Then run the samples below to experiment with INT8 inference:


 * sampleINT8 (utilize internal auto calibration)


 * sampleINT8API (import external dynamic range)

The calibrator in the samples above uses BatchStream to prepare calibration data, which is too complicated in practice. Here we provide a helper class, BatchFactory, which uses OpenCV for data preprocessing:


 * Calibration with OpenCV for preprocessing

How to triage INT8 accuracy issue
When customers or users encounter an INT8 accuracy issue, they will very likely suspect that it is caused by TensorRT INT8 quantization, or wonder whether TensorRT INT8 quantization is suitable for their particular network model. If you are in this situation, don't panic; please go through the following checks first to rule out some simple problems. In our experience, most INT8 accuracy queries fall into this category, since TensorRT INT8 has been deployed successfully across a wide range of scenarios.

Ensure the TensorRT FP32 result is identical to what your training framework produces
When people find that the TensorRT FP32 result differs from the training framework's, they often ask whether this is expected. It is not: in that case there is most likely a problem with your input. The possibility that a native TensorRT layer produces a wrong FP32 result should be the last suspect. One supporting fact is that most training frameworks use cuDNN or cuBLAS as their GPU backend, and TensorRT does the same; TensorRT also has its own kernels, but they have been validated many times on a variety of networks.

In this case, the first step we recommend is to dump the input data of both the training framework and TensorRT and compare them; they are supposed to be identical. If they are not, examine every preprocessing step carefully, one by one. Some training frameworks have their own customized preprocessing steps that are hard to port into your application, so you may have to use another preprocessing library such as OpenCV. In that case, the best you can do is try different conversion arguments to make the two inputs as close as possible.
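The comparison of dumped inputs could look roughly like the following. This is a minimal sketch that assumes both pipelines can hand over their preprocessed input as an array of floats; the helper name compare_inputs and the tolerance value are illustrative assumptions, not part of TensorRT or of any sample.

```python
import numpy as np


def compare_inputs(framework_input, tensorrt_input, atol=1e-5):
    """Report how far two preprocessed input tensors are from each other."""
    a = np.asarray(framework_input, dtype=np.float32).ravel()
    b = np.asarray(tensorrt_input, dtype=np.float32).ravel()
    if a.size != b.size:
        raise ValueError(f"element count differs: {a.size} vs {b.size}")
    abs_diff = np.abs(a - b)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "num_mismatched": int((abs_diff > atol).sum()),
    }


# Illustration with synthetic data: a reversed channel order (for example
# BGR vs RGB) shows up as many mismatched elements.
rng = np.random.default_rng(0)
image = rng.random((3, 4, 4), dtype=np.float32)  # CHW layout
swapped = image[::-1]                            # channel order reversed

report_same = compare_inputs(image, image.copy())
report_swapped = compare_inputs(image, swapped)
print(report_same)
print(report_swapped)
```

A non-zero mismatch count usually points at a difference in channel order, mean/scale normalization, resize interpolation, or memory layout.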

Ultimately, you should validate TensorRT FP32 accuracy against your entire test dataset before enabling INT8 inference, and ensure its KPI matches that of the last epoch of training.

Ensure the preprocessing steps within your calibrator are identical to those of FP32 inference
Here we assume you have followed our documentation or samples and finished implementing your own INT8 calibrator.

    virtual int getBatchSize() const = 0;
    virtual bool getBatch(void* bindings[], const char* names[], int nbBindings) = 0;
    virtual const void* readCalibrationCache(std::size_t& length) = 0;
    virtual void writeCalibrationCache(const void* ptr, std::size_t length) = 0;

IInt8Calibrator contains four virtual methods that need to be implemented, as shown above. The most important and most error-prone one is getBatch, in which you prepare the calibration data the same way you do for FP32 inference and feed the GPU data into TensorRT. The purpose of calibration is to collect the activation distribution of every network layer by running in FP32. No matter how carefully you copy the preprocessing steps from FP32 inference into your calibrator, it is always worth double-checking that they are exactly the same. The ultimate check is to dump and compare the input.

Ensure the calibration dataset is diverse and representative
This is the hardest check to rule out. People frequently ask what kind of dataset counts as diverse and representative. As a rough guide, ensure your calibration dataset contains typical images for all classes. In our experience, more calibration images do not necessarily yield better accuracy; experiments indicate that about 500 images are sufficient for calibrating ImageNet classification networks.
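One simple way to follow this guide is to draw the same number of images from every class. The sketch below assumes the dataset is available as a list of (path, label) pairs; the function name pick_calibration_set and the path naming are hypothetical.

```python
import random
from collections import defaultdict


def pick_calibration_set(labeled_images, total, seed=0):
    """Pick about `total` image paths, spread evenly across all classes."""
    by_class = defaultdict(list)
    for path, label in labeled_images:
        by_class[label].append(path)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    per_class = max(1, total // len(by_class))
    selected = []
    for label in sorted(by_class):
        paths = by_class[label]
        selected.extend(rng.sample(paths, min(per_class, len(paths))))
    return selected


# Illustration with a synthetic dataset: 10 classes with 100 images each.
dataset = [(f"class{label}/img{i}.jpg", label)
           for label in range(10) for i in range(100)]
calibration_set = pick_calibration_set(dataset, total=50)
print(len(calibration_set), "images selected")
```

Even sampling only guarantees class coverage; whether the chosen images are typical for each class still needs a manual look.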

Ensure no stale or incorrect cached calibration table is loaded unexpectedly
Customers and users often migrate their application from one platform to another, or from one TensorRT version to another, and a leftover calibration table gets loaded by TensorRT unexpectedly. The remedy is simply to remove the calibration table before each calibration. Likewise, if the accuracy does not change even after you modify something, one possible reason is that you forgot to remove the previous calibration table. Keep this in mind at all times.
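If calibration is driven from a script, the removal can be automated. The sketch below is only an illustration; the cache file name depends on what your own readCalibrationCache/writeCalibrationCache implementation uses, so the name in the usage comment is a placeholder.

```python
import os


def remove_stale_cache(cache_path):
    """Delete a calibration cache file if present; return True if deleted."""
    if os.path.isfile(cache_path):
        os.remove(cache_path)
        return True
    return False


# Usage before each calibration run (placeholder file name):
#     remove_stale_cache("CalibrationTable")
```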

Ensure you have tried all the calibration methods TensorRT provides
For example, TensorRT 6.0 provides the following methods. The first one has been deprecated; make sure you have tried the other three.

    enum class CalibrationAlgoType : int
    {
        kLEGACY_CALIBRATION = 0,
        kENTROPY_CALIBRATION = 1,
        kENTROPY_CALIBRATION_2 = 2,
        kMINMAX_CALIBRATION = 3,
    };

Since the last three calibrators share the same base class, it is very simple to switch the calibration algorithm. For example, the patch below replaces kENTROPY_CALIBRATION_2 with kMINMAX_CALIBRATION:

    //class CustomizedInt8Calibrator : public IInt8EntropyCalibrator2
    class CustomizedInt8Calibrator : public IInt8MinMaxCalibrator

Identify INT8 quantization issue
After you have completed all the checks above:
 * If you still get a totally incorrect result for INT8 inference, there is probably something wrong in your implementation. Whatever the inference task, a roughly correct result should be seen after TensorRT calibration.


 * If INT8 inference is now partially working but its accuracy is worse than FP32 (for example, the mAP of only one class is unsatisfactory while the others are good), you are probably facing an INT8 quantization issue.

What does 'partially working' mean?

It means that you get correct detection or classification results for most input data.

What does 'worse than FP32' mean?

It means that after you evaluate accuracy against the whole test dataset, the KPI (such as TOP1/TOP5 for a classification task or mAP for a detection task) is much lower than FP32's. Typically, in our experience, INT8 accuracy loss stays within 1% for popular classification CNNs such as VGG, ResNet, and MobileNet, and for some detection networks such as VGG16_FasterRCNN and VGG16_SSD. If your accuracy loss from FP32 to INT8 is much larger than 1%, such as 5% or even 10%, it might be the case we are trying to solve.

Why does it exist?
Mathematically, INT8 quantization is based on the following formula (ignoring the zero point):

    FP32_value = INT8_value * scaling

For weights, all values are constant once training is done, so it is easy to determine the scaling factor.

But for activations (the input or output of each layer/block), the network input is unknown in advance, so the activation distribution of each layer is not predictable. It is therefore hard to determine an ideal scaling factor for converting FP32 activations into INT8.
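To see why the choice of scaling factor matters, the formula can be tried on a few numbers. The sketch below assumes symmetric quantization to the range [-127, 127] with scaling = dynamic_range / 127; the function names and the sample activation values are illustrative only.

```python
import numpy as np


def quantize(x, dynamic_range):
    """Map FP32 values to INT8 using scaling = dynamic_range / 127."""
    scale = dynamic_range / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover FP32 values: FP32_value = INT8_value * scaling."""
    return q.astype(np.float32) * scale


activations = np.array([-3.0, -0.5, 0.0, 0.25, 1.0, 6.0], dtype=np.float32)


def max_error(dynamic_range):
    """Largest absolute round-trip error for the sample activations."""
    q, scale = quantize(activations, dynamic_range)
    return float(np.abs(dequantize(q, scale) - activations).max())


# Too small a range clips large values; too large a range wastes resolution.
for dynamic_range in (1.0, 6.0, 600.0):
    print(f"dynamic range {dynamic_range:6.1f}: max error {max_error(dynamic_range):.4f}")
```

A range that matches the actual activation magnitude gives the smallest error here, which is what calibration tries to find for each tensor.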

TensorRT introduces INT8 calibration to solve this problem: it runs the calibration dataset in FP32 mode to build a histogram of the FP32 activations, then tries different scaling factors and evaluates the distribution loss of each through KL divergence (we call this the relative entropy calibration algorithm).
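The idea can be illustrated with a simplified sketch. This is not TensorRT's actual implementation; the bin count, the number of quantization levels, the epsilon smoothing, and the function name entropy_threshold are all assumptions made for the illustration.

```python
import numpy as np


def entropy_threshold(values, num_bins=2048, num_levels=128, eps=1e-6):
    """Pick a clipping threshold for |values| by minimizing KL divergence."""
    abs_values = np.abs(np.asarray(values, dtype=np.float64)).ravel()
    hist, edges = np.histogram(abs_values, bins=num_bins,
                               range=(0.0, float(abs_values.max())))
    hist = hist.astype(np.float64)

    best_kl, best_i = np.inf, num_bins
    for i in range(num_levels, num_bins + 1):
        # Reference distribution P: keep the first i bins and fold the
        # clipped outliers into the last kept bin.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()

        # Candidate distribution Q: merge the i bins into num_levels groups
        # (the quantization levels), then spread each group's count evenly
        # over the bins of that group that were non-empty.
        q = np.zeros(i)
        for idx in np.array_split(np.arange(i), num_levels):
            chunk = hist[idx]
            nonzero = chunk > 0
            if nonzero.any():
                q[idx[nonzero]] = chunk.sum() / nonzero.sum()
        if q.sum() == 0:
            continue

        # Avoid log(0) where P has mass but Q has none (only the last bin).
        q = np.where((p > 0) & (q == 0), eps, q)
        p_norm, q_norm = p / p.sum(), q / q.sum()
        mask = p_norm > 0
        kl = float(np.sum(p_norm[mask] * np.log(p_norm[mask] / q_norm[mask])))
        if kl < best_kl:
            best_kl, best_i = kl, i

    # The upper edge of the last kept bin is the chosen dynamic range.
    return float(edges[best_i])


# Synthetic activations: a Gaussian bulk plus two large outliers.
rng = np.random.default_rng(0)
activations = np.concatenate([rng.normal(0.0, 1.0, 20000), [25.0, -30.0]])
threshold = entropy_threshold(activations, num_bins=512)
print("chosen dynamic range:", threshold, "scaling:", threshold / 127.0)
```

Each candidate threshold trades clipping error (outliers folded into the last bin) against resolution error (many bins merged into one level); the KL divergence scores that trade-off.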

Entropy calibration is a good way to achieve INT8 quantization, but it is not the best choice for every network layer. As we have emphasized, TensorRT INT8 quantization has been validated hundreds of times and works for most cases, but there will be some cases it cannot cover. With this understanding, we can work out our solutions.

Generally, the first step in addressing an INT8 quantization issue is to narrow down which layers cause significant accuracy loss in your network when running in INT8.

Approach to figure out the loss layers
1. Find an image with poor INT8 accuracy.
2. Use that image to perform FP32 inference and dump the output activation values.
3. Iterate over all layers and run the following experiment for each target layer:
   a. Set the layers up to and including the target layer to run in INT8 mode.
   b. Set the layers after the target layer to run in FP32 mode.
   c. Perform INT8 inference and save the output activation values.
   d. Compare the output activation values with the FP32 ones. If the loss is big, set this layer to run in FP32 mode in subsequent iterations. There is no fixed threshold for what counts as a 'big' loss, but you can get a sense of it from the loss trend over recent iterations.

The process looks like this:

    #1: layer1_int8 --> layer2_fp32 --> layer3_fp32 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
    #2: layer1_int8 --> layer2_int8 --> layer3_fp32 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
    #3: layer1_int8 --> layer2_int8 --> layer3_int8 --> layer4_fp32 --> … --> layerN_fp32 --> layer_output (compare it with FP32)

If we observe big accuracy loss once layer3 starts running in INT8, we set layer3 to run in a higher precision mode and continue the experiments.

    #4: layer1_int8 --> layer2_int8 --> layer3_fp32 --> layer4_int8 --> … --> layerN_fp32 --> layer_output (compare it with FP32)
    …
    #N: layer1_int8 --> layer2_int8 --> layer3_fp32 --> … --> layerN_int8 --> layer_output (compare it with FP32)

Why can't we compare layer by layer activation directly?

Because we don't actually care how the intermediate activations compare to FP32's. Sometimes, even if the accuracy loss of some middle layer is very big, the final result may not be affected (perhaps because your network has a high tolerance for the current task). Hence, the loss at the network output layer is more useful for evaluating accuracy.

Why can't we dump all layer INT8 result one time?

For example, if running layer3 in INT8 mode causes big loss in the final output, then the layers after layer3 will also appear to have big accuracy loss. So before we continue looking for other potentially problematic layers, we should first rule out layer3 (by running it in FP32 mode) to eliminate the interaction between layers.
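The iteration described in this section can be sketched as a small loop. In the sketch below, run_with_int8_layers is a hypothetical callback standing in for "rebuild the engine with these layers in INT8 and return the network output", and the toy network with per-layer penalties exists only to make the example runnable.

```python
def find_loss_layers(num_layers, run_with_int8_layers, similarity,
                     reference, threshold):
    """Return indices of layers that should stay in higher precision."""
    stick_to_float = []
    for scoped in range(num_layers):
        # Layers [0..scoped] run in INT8, except those already ruled out;
        # every layer after `scoped` stays in FP32.
        int8_layers = [i for i in range(scoped + 1) if i not in stick_to_float]
        output = run_with_int8_layers(int8_layers)
        if similarity(output, reference) < threshold:
            # Rule this layer out before scoping the subsequent layers.
            stick_to_float.append(scoped)
    return stick_to_float


# Toy stand-in for a network: each INT8 layer subtracts a fixed penalty
# from an ideal output of 1.0 (purely illustrative numbers).
penalties = [0.001, 0.03, 0.002, 0.04, 0.001]


def toy_run(int8_layers):
    return 1.0 - sum(penalties[i] for i in int8_layers)


def toy_similarity(output, reference):
    return 1.0 - abs(output - reference)


loss_layers = find_loss_layers(len(penalties), toy_run, toy_similarity,
                               reference=1.0, threshold=0.98)
print("layers to keep in higher precision:", loss_layers)
```

Because a ruled-out layer is excluded from all later iterations, its loss no longer masks the behaviour of the layers that follow it.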

Exercise with sampleMNIST
Here we take sampleMNIST as an example and follow the approach above to figure out which layers have bigger loss than others.


 * Download samplemnist_accuracy_int8. It is based on the public sampleMNIST but packaged as a standalone sample, so it can be built and run by itself.


 * Follow the README.md to build the sample.


 * Run the script to evaluate which layers have big loss.

python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 1 -t 0.98

You will get the following output:

    ScopedLayerIndex|              LayerName|             OutputShape|        OutputSimilarity|
    0|                  scale|           [1, 10, 1, 1]|                  0.9821|
    1|                  conv1|           [1, 10, 1, 1]|                  0.9797|
    Choose layer [1] running higher precison mode because it leads the similarity of network output becomes lower than specified threshold [0.979734370761 vs 0.98]!
    2|                  pool1|           [1, 10, 1, 1]|                  0.9803|
    3|                  conv2|           [1, 10, 1, 1]|                  0.9773|
    Choose layer [3] running higher precison mode because it leads the similarity of network output becomes lower than specified threshold [0.977308892198 vs 0.98]!
    4|                  pool2|           [1, 10, 1, 1]|                  0.9784|
    Choose layer [4] running higher precison mode because it leads the similarity of network output becomes lower than specified threshold [0.978408826535 vs 0.98]!
    5|                    ip1|           [1, 10, 1, 1]|                  0.9808|
    6|                  relu1|           [1, 10, 1, 1]|                  0.9808|
    7|                    ip2|           [1, 10, 1, 1]|                  0.9808|
    8|                   prob|           [1, 10, 1, 1]|                  0.9808|

The meaning of each layer_analyzer.py option can be found in its help output.

From the output above, we can see that layers [1, 3, 4] have bigger loss than the others (that is, running them in INT8 mode makes the similarity of the network output drop below the threshold we specified).

NOTE:

1. We intentionally change the predefined dynamic range of each tensor to 50, which makes the network layers less accurate. Otherwise, the similarity for MNIST would be very close to 1, which would not make a convincing example.

2. In our experience, the metric used in this exercise is too strict for loss evaluation. In practice, cosine similarity might be a preferable choice. But in this case, even when we change the dynamic range of each tensor to an inaccurate value, the cosine similarity still comes out as 1.0 and the network result is very accurate (probably because the feature maps of the MNIST network are too simple and tolerant enough).
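For reference, cosine similarity between an FP32 output and an INT8 output can be computed as below. This is a generic sketch, not code taken from layer_analyzer.py; the function name and the sample vectors are illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened output tensors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # undefined for all-zero vectors; report no similarity
    return float(np.dot(a, b) / denom)


fp32_output = [0.01, 0.02, 0.90, 0.07]
int8_output = [0.02, 0.02, 0.88, 0.08]
print(cosine_similarity(fp32_output, int8_output))
```

Cosine similarity only compares direction, so it ignores a uniform scaling of the output; that is one reason it can stay close to 1.0 even when individual values have shifted.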

Add two specific options
@ samplemnist_accuracy_int8/common/argsParser.h

    // scopedLayerIndex is the cutoff point,
    // [0 - scopedLayerIndex]  --> Run INT8 (includes scoped layer)
    // (scopedLayerIndex, NumberOfLayers] --> Run FP32 or FP16 (excludes scoped layer)
    int scopedLayerIndex{-1};

    // layerStickToFloat means those layers have big loss, so we rule them out before scoping the subsequent layers
    // to avoid interactive effect.
    std::vector<int> layerStickToFloat;

Handle the two options
We have to set strict type constraints so that the precision settings are honored; otherwise the final precision is chosen by the builder according to its profiled performance.

@ samplemnist_accuracy_int8/sampleMNIST.cpp

    builder->setStrictTypeConstraints(true);
    if (args.scopedLayerIndex != -1 && mParams.int8)
    {
        // Every layer after the scoped layer runs in FP32.
        for (int i = args.scopedLayerIndex + 1; i < network->getNbLayers(); i++)
        {
            network->getLayer(i)->setPrecision(nvinfer1::DataType::kFLOAT);
        }
    }
    if (!args.layerStickToFloat.empty() && mParams.int8)
    {
        // Layers already identified as loss layers also run in FP32.
        for (size_t i = 0; i < args.layerStickToFloat.size(); i++)
        {
            network->getLayer(args.layerStickToFloat[i])->setPrecision(nvinfer1::DataType::kFLOAT);
        }
    }

Dump network output into files
@ samplemnist_accuracy_int8/sampleMNIST.cpp

    for (auto& s : mParams.outputTensorNames)
    {
        std::string fnameStr;
        // Desired file name:
        //     fp32-.txt
        //     int8-_.txt
        ...
        std::string filePath = "./results/" + fnameStr + ".txt";
        std::ofstream file(filePath);
        buffers.dumpBuffer(file, s);
    }

Hacks to facilitate loss evaluation
    //samplesCommon::setAllTensorScales(network.get(), maxMean, maxMean);
    samplesCommon::setAllTensorScales(network.get(), 50, 50);
 * Change the dynamic range to 50 on purpose.

    // Pick a fixed digit to compare accuracy
    const int digit = 4;
 * Fix the input image.


 * Remove the logic that verifies the output result, since it's not necessary.

Invoke the script to start
python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 1 -t 0.98

How to fix INT8 quantization issue
Currently we have two solutions that have been validated in some customer cases.

Mixed precision
In the exercise above, we found that running layers [1, 3, 4] in INT8 mode causes big loss in the network output. So we can choose to run them in a higher precision mode (FP32, or FP16 if your platform supports half precision).

python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 1 -t 0.98 -l 1,3,4

    ScopedLayerIndex|              LayerName|             OutputShape|        OutputSimilarity|
    0|                  scale|           [1, 10, 1, 1]|                  0.9821|
    1|                  conv1|           [1, 10, 1, 1]|                  0.9821|
    2|                  pool1|           [1, 10, 1, 1]|                  0.9803|
    3|                  conv2|           [1, 10, 1, 1]|                  0.9803|
    4|                  pool2|           [1, 10, 1, 1]|                  0.9803|
    5|                    ip1|           [1, 10, 1, 1]|                  0.9808|
    6|                  relu1|           [1, 10, 1, 1]|                  0.9808|
    7|                    ip2|           [1, 10, 1, 1]|                  0.9808|
    8|                   prob|           [1, 10, 1, 1]|                  0.9808|

We can see that no layer drops below the threshold now, and every similarity is at least as high as in the previous result. But this approach costs performance, since some layers run in FP32 mode. In that case, consider the next solution.

Mixed quantization
The intuition is that the TensorRT calibration algorithm performs poorly when quantizing the value distribution of the loss layers. So this solution uses another quantization method to estimate their dynamic range, for example taking the mean of the top 5 maximum values as the dynamic range, and then inserts that dynamic range into TensorRT. In this case, most layers still use the scaling factor from entropy calibration, while a few loss layers use the external quantization method.
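The "mean of the top 5 maximum values" estimate could be computed as in the sketch below. It assumes the FP32 activations of the loss layer have been dumped into an array, and it interprets "maximum values" as the largest absolute values; both the interpretation and the function name are assumptions.

```python
import numpy as np


def topk_mean_dynamic_range(activations, k=5):
    """Estimate a dynamic range as the mean of the k largest |activation|."""
    flat = np.abs(np.asarray(activations, dtype=np.float32)).ravel()
    k = min(k, flat.size)
    # After partitioning, the last k entries are the k largest values.
    top = np.partition(flat, flat.size - k)[-k:]
    return float(top.mean())


dumped = [0.1, -8.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dynamic_range = topk_mean_dynamic_range(dumped, k=5)
print("estimated dynamic range:", dynamic_range)
```

Averaging the largest few values makes the estimate less sensitive to a single outlier than taking the plain maximum.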

Taking the sample above as an example, we can insert a more accurate dynamic range to check whether the accuracy/similarity improves.

    //samplesCommon::setAllTensorScales(network.get(), 50, 50);
    samplesCommon::setAllTensorScales(network.get(), 200, 200);

Rebuild your inference application and re-run the evaluation script. We can see that the result does look better now.

    python layer_analyzer.py -a ./sample_mnist -n 9 -o prob -d ./results/ -m 1 -t 0.98

    ScopedLayerIndex|              LayerName|             OutputShape|        OutputSimilarity|
    0|                  scale|           [1, 10, 1, 1]|                  0.9993|
    1|                  conv1|           [1, 10, 1, 1]|                  0.9998|
    2|                  pool1|           [1, 10, 1, 1]|                  0.9998|
    3|                  conv2|           [1, 10, 1, 1]|                  0.9997|
    4|                  pool2|           [1, 10, 1, 1]|                  0.9997|
    5|                    ip1|           [1, 10, 1, 1]|                  0.9995|
    6|                  relu1|           [1, 10, 1, 1]|                  0.9995|
    7|                    ip2|           [1, 10, 1, 1]|                  0.9995|
    8|                   prob|           [1, 10, 1, 1]|                  0.9995|

Quantization-Aware Training
More and more training frameworks are developing QAT support, which introduces fake quantize/dequantize nodes during the training phase.

The pre-quantized model might be less sensitive to INT8 quantization error during inference.

TensorRT 6.0 provides an explicit precision feature to support this case, but it is not yet mature.

Further reading,
 * TensorFlow QAT
 * PyTorch QAT