Trying Out TVM

It has been quite a while since I ran TensorRT 5, but this time I tried running TVM.

The official documentation is reasonably thorough, but there doesn't seem to be much information in Japanese, so I hope this helps.

The code is here.

The official TVM documentation is here.

Environment

This is the environment I ran everything on. I assume all of it is already installed.

OS: Ubuntu 16.04

Python: 3.8 (not recommended; for now it seems safer to stay on 3.7 or earlier)

GPU: GTX 1080 Ti

CUDA: 10.2

cuDNN: 7.6.5

Installing TVM

I built from source, following the official documentation.

docs.tvm.ai

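If you just want to get it running, Docker looks easier than building from source.

Depending on what you enable, the preparation before the build can get quite involved.

Here is what I enabled on my machine:

CUDA, LLVM, MKL BLAS, MKL DNN, NNPACK, CUDNN, DNNL CODEGEN, ANTLR

OpenBLAS would probably be fine for BLAS, and I can't really tell whether MKL-DNN or NNPACK are actually being used. DNNL CODEGEN and ANTLR are probably not needed for what this post covers (maybe they are needed for something like AutoTVM?).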

Installing Python Packages

Install the Python packages needed to run the code for this post.

$ pip install -r requirements.txt

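As an aside, there is still no official TensorFlow package on PyPI for Python 3.8. Official support for 3.8 apparently is still a little way off. This time I built it with bazel, fixing small problems along the way.

Preparing the Model

Download the pretrained Keras MobileNetV1 model.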

github.com

Running

$ python run_tvm.py cat.jpg

.
.
.

Classification Result:
1 tiger cat 0.439913
2 tabby 0.434570
3 Egyptian cat 0.104559
4 lynx 0.011487
5 tiger 0.003490

Evaluate inference time cost...
Inference time 0: 0.990710
Inference time 1: 0.999410
Inference time 2: 0.995719
Inference time 3: 0.984265
Inference time 4: 0.985593
Inference time 5: 0.986964
Inference time 6: 0.991390
Inference time 7: 0.992035
Inference time 8: 1.000641
Inference time 9: 1.001794
Mean inference time (std dev): 0.992852 ms (0.006003 ms)

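The script converts the Keras model with Relay and then runs inference 10 times on the image passed as an argument. That comes to roughly 1 msec / image = 1000 fps. The CUDA and cuDNN versions are different, but there is almost no difference from the TensorRT result last time.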
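As a quick sanity check on the numbers above, here is a minimal sketch that recomputes the summary line from the ten per-run timings. It assumes the reported standard deviation is the population one (dividing by N), which is what seems to match the printed value; the variable names are mine and not taken from run_tvm.py.

```python
import statistics

# Per-run latencies in milliseconds, copied from the output above.
times_ms = [
    0.990710, 0.999410, 0.995719, 0.984265, 0.985593,
    0.986964, 0.991390, 0.992035, 1.000641, 1.001794,
]

mean_ms = statistics.mean(times_ms)
# Population standard deviation (divides by N, not N - 1).
# Assumption: this is the convention behind the reported value.
std_ms = statistics.pstdev(times_ms)
# Throughput for batch size 1: images per second.
fps = 1000.0 / mean_ms

print(f"Mean inference time (std dev): {mean_ms:.6f} ms ({std_ms:.6f} ms)")
print(f"Throughput: {fps:.1f} images/sec")
```

The throughput comes out slightly above 1000 images/sec, which is where the "roughly 1000 fps" figure comes from.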

The code is below.

github.com

As before, this is basically a patchwork of several pieces of sample code.

Afterword

Around build_module, a large number of "Cannot find config ..." messages are printed.

Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 3, 225, 225, 'float32'), (32, 3, 3, 3, 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 32, 112, 112, 'float32'), (64, 32, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 64, 56, 56, 'float32'), (128, 64, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 128, 56, 56, 'float32'), (128, 128, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 128, 28, 28, 'float32'), (256, 128, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 256, 28, 28, 'float32'), (256, 256, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 256, 14, 14, 'float32'), (512, 256, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 512, 14, 14, 'float32'), (512, 512, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 512, 7, 7, 'float32'), (1024, 512, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 1024, 7, 7, 'float32'), (1024, 1024, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 1024, 1, 1, 'float32'), (1000, 1024, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 1024, 7, 7, 'float32'), (1024, 1, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 512, 15, 15, 'float32'), (512, 1, 3, 3, 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 512, 14, 14, 'float32'), (512, 1, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 256, 29, 29, 'float32'), (256, 1, 3, 3, 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 256, 28, 28, 'float32'), (256, 1, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 128, 57, 57, 'float32'), (128, 1, 3, 3, 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 128, 56, 56, 'float32'), (128, 1, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 64, 113, 113, 'float32'), (64, 1, 3, 3, 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 32, 112, 112, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
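To get a feel for how many workloads fall back, the messages above can be tallied per operator. The following is only a sketch: `count_fallbacks` is a hypothetical helper of mine, the regular expression assumes the message format exactly as printed above, and the sample text holds three lines copied from that log rather than the whole thing.

```python
import re
from collections import Counter

# Matches the operator name, i.e. the first element of the workload tuple.
# Assumption: the message format is exactly as in the log above.
FALLBACK_RE = re.compile(r"Cannot find config for target=.*?, workload=\('([^']+)'")


def count_fallbacks(log_text):
    """Count fallback messages per operator name (hypothetical helper)."""
    return Counter(FALLBACK_RE.findall(log_text))


# Three lines copied from the log above; in practice, pass the captured output.
sample_log = """\
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 3, 225, 225, 'float32'), (32, 3, 3, 3, 'float32'), (2, 2), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('conv2d', (1, 32, 112, 112, 'float32'), (64, 32, 1, 1, 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=1080ti, workload=('depthwise_conv2d_nchw', (1, 1024, 7, 7, 'float32'), (1024, 1, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
"""

counts = count_fallbacks(sample_log)
for op_name, n in counts.most_common():
    print(f"{op_name}: {n}")
```

Run against the full log above, this should report 11 conv2d and 9 depthwise_conv2d_nchw workloads, 20 fallbacks in total.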

The messages say that a fallback configuration is used and that performance may regress, so tuning with AutoTVM might improve things here. With this many fallbacks, I have some hopes for AutoTVM. Next time I plan to try tuning with AutoTVM.