TensorRT/CommonErrorFix

From eLinux.org

How to fix the error “Could not find scales for tensor xxxx” for INT8 mode?

Generally, after INT8 calibration completes, Int8Calibrator saves the scaling factors into a local file (through the API writeCalibrationCache), so that subsequent runs do not need to repeat calibration and can load the cached calibration table directly (through the API readCalibrationCache).
If you modify the network, or run it on a different GPU platform or with a different TensorRT version, you will probably get the error "Could not find scales for tensor xxxx", which indicates that the builder could not find the corresponding scaling factor in the locally cached calibration table. This is expected, since the network graph after layer fusion can change across GPU platforms, TensorRT versions, or modifications to the network itself. The fix is simple: remove the local calibration table and run calibration again.
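The cache read/write behavior described above can be sketched as follows. This is a minimal illustration only: the class and file names are hypothetical, and the TensorRT base class (IInt8EntropyCalibrator2) and the batch-feeding logic a real calibrator needs are omitted so that just the cache handling is visible.

```python
import os

class CacheHandlingCalibrator:
    """Sketch of the cache-handling half of an INT8 calibrator.

    A real calibrator would also subclass trt.IInt8EntropyCalibrator2
    and implement get_batch(); only the cache logic is shown here.
    """

    def __init__(self, cache_file="calibration.cache"):  # hypothetical file name
        self.cache_file = cache_file

    def read_calibration_cache(self):
        # If a cached table exists, the builder skips calibration and reuses
        # the stored scales -- the source of the stale-scale error when the
        # network, GPU, or TensorRT version has changed since the cache was
        # written.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None  # no cache -> the builder runs calibration again

    def write_calibration_cache(self, cache):
        # Called after calibration finishes, with the serialized table bytes.
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

Deleting the cache file (so that read_calibration_cache returns None) is exactly the "remove the local calibration table" fix described above.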


How to fix "LogicError: explicit_context_dependent failed" during running TRT Python in multi-thread?

If you are using the common.py from the TensorRT samples to do inference with multiple threads and getting the error below, this FAQ will help you fix it.

 "pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently active context?"

Per the PyCUDA FAQ entry "How does PyCUDA handle threading?", this error is caused by a missing active context in the worker thread.
Please create a context as below before launching the GPU task that reported the error:

   dev = cuda.Device(0)  # 0 is your GPU number
   ctx = dev.make_context()

and clean up after the GPU task using:

   ctx.pop()
   del ctx

Why does (Unnamed Layer* <N>) appear in the calibration table or verbose log?

For example, here is the calibration table for mnist:

TRT-5105-EntropyCalibration2
data: 3c000889
conv1: 3c8954be
pool1: 3c8954be
conv2: 3dd33169
pool2: 3dd33169
(Unnamed Layer* 4) [Fully Connected]_output: 3dcbd455
ip1: 3daeff02
ip2: 3e7d50e9
prob: 3c010a14

(Unnamed Layer* 4) [Fully Connected]_output actually denotes the ip1 layer, and the subsequent ip1 entry denotes the relu1 layer.

How does it happen?

It's because the parser uses the top attribute to name layers and tensors, but ip1 and relu1 in mnist.prototxt share the same top name ip1 (an in-place layer). Hence one of them must fall back to a system-assigned name.

layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1.0
  }
  param {
    lr_mult: 2.0
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1.0
  }
  # ... remaining params as in mnist.prototxt
}

This has no impact on the execution behavior of the network.

It can be avoided by removing all in-place layers: for mnist, update the top attribute of relu1 to its layer name relu1 instead of ip1 (and update the bottom attribute of ip2 to relu1 accordingly).
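Applying that fix, the relevant portion of mnist.prototxt would look like this (a sketch; only the two affected layers are shown, with ip2's remaining parameters unchanged):

```
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "relu1"   # was "ip1"; no longer in-place
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "relu1"   # was "ip1"; now consumes relu1's output
  top: "ip2"
  # ... remaining params as in mnist.prototxt
}
```

With every layer owning a distinct top name, the parser no longer needs to invent (Unnamed Layer* <N>) names, so calibration tables and verbose logs show the expected layer names.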


UFF Parser Error Messages

The following table captures common UFF parser error messages and how to fix them.
Other common UFF parser errors can be found in the TensorRT Developer Guide.

Error message:
 uff/UffParser.cpp:1130: std::shared_ptr UffParser::parseConvTranspose(const uff::Node&, const Fields&, NodesMap&): Assertion `outSpatial[0] == padLayer->getOutput(0)->getDimensions().d[1]' failed.
Description: Known issue in TRT 5.0.6.3; fixed in TRT 6.0.
Fix: Update TensorRT to 6.0 if you get this error.

Error message:
 UffParser: Parser error: XXX_LAYER: Order size is not matching the number dimensions of TensorRT
Description: Happens when the UFF parser tries to add a transpose for a layer whose format is not NCHW and whose tensor is 5-D.
Fix: Modify your model to make sure it is all in NCHW format.