TensorRT/Int8CFAQ

How to do INT8 calibration without using BatchStream?
Using BatchStream for calibration is too complicated to adapt for practical use.

Here we provide a sample class, BatchFactory, which utilizes OpenCV for calibration data pre-processing. It simplifies the calibration procedure and makes it easy to understand what is really needed for calibration.

NOTE: The pre-processing flow in BatchFactory should be adjusted according to the requirements of your network.
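As an illustration of what such a pre-processing flow typically does, here is a minimal sketch: subtract a per-channel (BGR) mean and convert an HWC image buffer into the CHW layout the network input expects. The helper name `preprocessHWC2CHW` is hypothetical, not part of TensorRT or the BatchFactory class; adjust the math to match your own network.

```cpp
#include <vector>

// Hypothetical pre-processing helper: per-channel mean subtraction and
// HWC (interleaved BGR) -> CHW (planar) conversion, as commonly required
// by networks trained with Caffe-style inputs.
std::vector<float> preprocessHWC2CHW(const unsigned char* hwc, int h, int w,
                                     const float mean[3], float scale)
{
    std::vector<float> chw(3 * h * w);
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                // Read channel c of pixel (y, x) from the interleaved buffer,
                // subtract the channel mean, scale, and write to plane c.
                chw[c * h * w + y * w + x] =
                    (static_cast<float>(hwc[(y * w + x) * 3 + c]) - mean[c]) * scale;
    return chw;
}
```

In practice you would fill `hwc` from an image decoded and resized with OpenCV, then copy the returned buffer into the calibration batch.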

Then, when implementing IInt8EntropyCalibrator, we can use the loadBatch API from the assistant class to load batch data directly:

```cpp
bool getBatch(void* bindings[], const char* names[], int nbBindings) override
{
    float mean[3]{102.9801f, 115.9465f, 122.7717f}; // also in BGR order
    float* batchBuf = mBF.loadBatch(mean, 1.0f);
    // A null pointer indicates that calibration data feeding is done
    if (!batchBuf)
        return false;
    CHECK(cudaMemcpy(mDeviceInput, batchBuf, mInputCount * sizeof(float), cudaMemcpyHostToDevice));
    assert(!strcmp(names[0], INPUT_BLOB_NAME0));
    bindings[0] = mDeviceInput;
    return true;
}
```

Can INT8 calibration table be compatible among different TRT versions or HW platforms?
The INT8 calibration table is absolutely NOT compatible between different TRT versions, because the optimized network graph is likely to differ among TRT versions. If you force TRT to use a mismatched table, it may not find the corresponding scaling factor for a given tensor. As long as the installed TensorRT version is identical across HW platforms, the INT8 calibration table is compatible. That means you can perform INT8 calibration on a faster computation platform, like V100 or T4, and then deploy the calibration table to Tegra for INT8 inferencing, as long as these platforms have the same TensorRT version installed (at least the same major and minor version, like 5.1.5 and 5.1.6).

How to do INT8 calibration for networks with multiple inputs?
TensorRT uses bindings to denote the input and output buffer pointers, and they are arranged in order. Hence, if your network has multiple input nodes/layers, you can pass the input buffer pointers into bindings (void**) separately, as in the following network with two inputs:

```cpp
bool getBatch(void* bindings[], const char* names[], int nbBindings) override
{
    // Prepare the batch data (on GPU) for mDeviceInput and imInfoDev
    ...
    assert(!strcmp(names[0], INPUT_BLOB_NAME0));
    bindings[0] = mDeviceInput;
    assert(!strcmp(names[1], INPUT_BLOB_NAME1));
    bindings[1] = imInfoDev;
    return true;
}
```

NOTE: If your calibration batch size is 10, then for each calibration cycle, you will need to fill each of your input buffers with 10 images accordingly.

How to interpret the value of calibration table?
Each number in the calibration table is a float value, printed in hex, denoting the scaling factor of the corresponding output tensor. For example,

(Unnamed Layer* 50) [Convolution]_output: 3c55b689

0x3c55b689 converted to float in decimal: 0.013044

dynamic range: 0.013044 * 127 = 1.656588

Here is a tool to decode them all at once.

Why is the inference result not stable even with the same calibration dataset and the same input image?
Sometimes you may encounter the issue that FP32 inference is correct, but when you switch to INT8, the result is not stable across multiple runs (even though the calibration dataset or calibration table is finalized). You may wonder whether TensorRT could produce inconsistent INT8 inference results in some corner case. Generally, it is almost impossible for a TensorRT kernel to show random or unstable behavior; there are only two possibilities, correct or incorrect, as long as you perform inference with a fixed engine (all runs use the same engine). However, if you generate a new engine for each run, inconsistent results can appear in INT8 mode.

This is because, when you run in INT8 mode, if your network is very simple or your batch is very small, the TensorRT builder may produce a different engine each time. In such a case, when the builder profiles all kernels, running FP32 for a certain layer may be even faster than running INT8. Hence the precision mode of the network layers may not be consistent across multiple engine generations (you can use nvprof to profile those engines and figure out whether different precision kernels were chosen for the same layer).

Why can running FP32 be even faster?

If your network is too simple and you also run it with a very small batch size, the performance gain from running some layers in INT8 probably cannot offset the overhead of the reformat conversions.

To be more specific, consider the following two cases:

a. input_fp32 + reformat_to_int8 + layer1_int8 ... + layerN_int8 + reformat_to_fp32 + softmax_f32
b. input_fp32 + layer1_fp32 ... + layerN_fp32 + softmax_f32

Case a might be slower than case b for a very small network and batch size due to the 'reformat' cost.

There is an API which can force these layers to run in INT8 as long as an INT8-based implementation exists: builder->setStrictTypeConstraints(true); Once you enable it, you will see consistent results no matter how many times you build the TensorRT engine. But this option might impact performance a bit, since it breaks the TensorRT builder's logic of always choosing precision based on latency.
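For context, the call sits alongside the other INT8-related builder settings. The fragment below is a sketch of a TensorRT 5.x-era builder configuration (the surrounding variables `gLogger`, `calibrator`, and `network` are assumed to exist in your application); it is not runnable on its own and may need adjusting for newer TensorRT versions, which moved these settings onto IBuilderConfig.

```cpp
// Sketch: enable INT8 with strict type constraints at engine-build time
// (TensorRT 5.x-era IBuilder API; names other than the builder calls are assumed).
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
builder->setInt8Mode(true);                 // allow INT8 kernels
builder->setInt8Calibrator(&calibrator);    // supply the calibration data source
builder->setStrictTypeConstraints(true);    // honor the requested precision per layer,
                                            // even if an FP32 kernel profiles faster
nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);
```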