TensorFlow Serving Series: Installing the Server from Source

Updated: 2024-10-10 06:20:04

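0 Background

An earlier article in this series covered deploying TFS (TensorFlow Serving) with Docker. In real testing, however, the server crashed easily under highly concurrent requests. To rule out possible causes, we chose to build from source instead. Before installing anything, make sure all versions agree, otherwise you will run into all sorts of pitfalls. Start from the CUDA and cuDNN versions already installed on your server, choose the TensorFlow version that matches them, and then install the same version of TensorFlow Serving. On my server, for example, CUDA is 9.0.176 and cuDNN is 7.3.1.

Environment: Ubuntu 16.04, tensorflow-gpu 1.12.0, CUDA 9.0, cuDNN 7.3, TensorFlow Serving 1.12.0, Bazel 0.15.0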
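
The version-matching rule above can be checked mechanically before a long build. The following is a minimal sketch, not part of the original setup: the helper names major_minor and same_release are made up for illustration, and the two version variables simply repeat the environment listed above.

```shell
#!/bin/sh
# Reduce a full version such as 9.0.176 to its major.minor part (9.0).
major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed only if two versions agree on major.minor.
same_release() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Versions from this article's environment; replace them with your own.
TF_VERSION=1.12.0
TFS_VERSION=1.12.0

if same_release "$TF_VERSION" "$TFS_VERSION"; then
    echo "tensorflow $TF_VERSION and tensorflow serving $TFS_VERSION match"
else
    echo "version mismatch: tensorflow $TF_VERSION vs serving $TFS_VERSION" >&2
fi
```

The installed CUDA toolkit version can usually be read with nvcc --version and reduced the same way, for example major_minor 9.0.176.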

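Articles in this series

(1) TensorFlow Serving Series: Installation and Usage

(2) TensorFlow Serving Series: Exporting Your Own Trained Model

(3) TensorFlow Serving Series: Client-Side gRPC Calls

(4) TensorFlow Serving Series: gRPC Basics

(5) TensorFlow Serving Series: Installing the Server from Source

(6) TensorFlow Serving Series: Multi-Model and Multi-Version Control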

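1 Installing Bazel

Bazel is Google's open-source build and test tool, similar to Make, Maven, and Gradle. It uses a high-level, human-readable build language. Bazel supports projects in multiple languages and builds for multiple platforms. It also supports large codebases that span multiple repositories and large numbers of users.

Building from source requires Bazel. Be sure to choose a Bazel version that matches your TensorFlow version, following the compatibility table in the figure below.

[Figure: TensorFlow and Bazel version compatibility table]

For the installation steps, see the official instructions. Here we download version 0.15.0. First install the dependencies:

sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python3

Then download the binary installer: find the matching version on the official release page and download the file for your system.

Once it is downloaded, make it executable and run it:

chmod +x bazel-0.15.0-installer-linux-x86_64.sh
./bazel-0.15.0-installer-linux-x86_64.sh --user

After installation, set the environment variable (you can also add this line to ~/.bashrc):

export PATH="$PATH:$HOME/bin"

If another version was installed earlier, uninstall it first: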

rm -rf ~/.bazel
rm -f ~/bin/bazel          # remove only the bazel launcher, not the whole ~/bin directory
sudo rm -f /usr/bin/bazel

Then install the version you need.
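
After switching versions it is worth confirming which Bazel is actually picked up from PATH. The sketch below is one way to do that: it reads the "Build label:" line printed by bazel version. The helper name bazel_label and the WANTED variable are made up for this example.

```shell
#!/bin/sh
# Extract the version from the "Build label:" line of `bazel version`
# output, which is read from stdin.
bazel_label() {
    sed -n 's/^Build label: *//p' | head -n 1
}

WANTED=0.15.0   # the version used in this article

if command -v bazel >/dev/null 2>&1; then
    found=$(bazel version 2>/dev/null | bazel_label)
    if [ "$found" = "$WANTED" ]; then
        echo "bazel $found is installed"
    else
        echo "expected bazel $WANTED, found '$found'" >&2
    fi
else
    echo "bazel is not on PATH" >&2
fi
```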

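2 Installing NCCL

NCCL is short for NVIDIA Collective Communications Library. It implements collective communication across multiple GPUs, and NVIDIA has optimized it heavily to reach high communication speeds over PCIe, NVLink, and InfiniBand.

Before installing, run whereis nccl to check whether NCCL is already present on your machine. If it is not, install it as described below.

First download the deb repository installer from the official site, making sure it matches your CUDA version.

Here I chose the local repository installer (I am not sure how it differs from the network repository). Once it is downloaded, install it:

sudo dpkg -i nccl-repo-ubuntu1604-2.4.8-ga-cuda9.0_1-1_amd64.deb
sudo apt update
sudo apt-get install libnccl2=2.4.8-1+cuda9.0 libnccl-dev=2.4.8-1+cuda9.0

If no errors are reported, the installation succeeded.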

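3 Installing TensorFlow

Method 1: pip install (GPU version)

To keep it isolated from other environments, create a conda environment and install the GPU version of TensorFlow in it:

conda create -n tensorflow python=3.6
source activate tensorflow
pip install tensorflow-gpu==1.12.0

Start python and import tensorflow. If no error appears, the installation succeeded.

Method 2: build from source (CPU version)

The stock TensorFlow binaries target the widest possible range of hardware, so they may not be a good fit for your own CPU. To optimize for it, build from source and add compiler options at build time that enable the instruction sets your CPU supports. To find out which instruction sets those are, run cat /proc/cpuinfo: the flags field lists them for each CPU.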
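
Reading the flags field by eye is error-prone, so the lookup can be scripted. The following is a minimal sketch under stated assumptions: the function name cpu_copts is made up, and it only checks the five instruction sets that appear in this article's build command, mapping each /proc/cpuinfo flag to the corresponding --copt option.

```shell
#!/bin/sh
# Print a --copt option for each instruction set named in the given
# flags string. Only the sets used in this article's build command
# are checked; output order follows the order of the input flags.
cpu_copts() {
    out=""
    for flag in $1; do
        case "$flag" in
            avx)    out="$out --copt=-mavx" ;;
            avx2)   out="$out --copt=-mavx2" ;;
            fma)    out="$out --copt=-mfma" ;;
            sse4_1) out="$out --copt=-msse4.1" ;;
            sse4_2) out="$out --copt=-msse4.2" ;;
        esac
    done
    # Strip the leading space before printing.
    printf '%s\n' "${out# }"
}

# The first "flags" line of /proc/cpuinfo describes the first logical CPU.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2) || true
echo "suggested options: $(cpu_copts "$flags")"
```

Options that the script does not print should be left out of the bazel build command, since a binary compiled for an instruction set the CPU lacks will not run on it.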

For the installation steps, follow the official tutorial:

# Install dependencies
pip install -U --user pip six numpy wheel setuptools mock 'future>=0.17.1'
pip install -U --user keras_applications==1.0.6 --no-deps
pip install -U --user keras_preprocessing==1.0.5 --no-deps
# If downloads are slow, you can point pip at a mirror
# pip install -i  numpy scipy
# Download the source
git clone .git
cd tensorflow
git checkout v1.12.0  # check out the version to build (v1.12.0 is the release tag; the release branch is r1.12)
./configure      # for the CPU build, answer N to everything except the jemalloc option
# Build
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.1 --copt=-msse4.2  //tensorflow/tools/pip_package:build_pip_package
# Build the pip package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# Install the package
pip install /tmp/tensorflow_pkg/tensorflow-1.12.0-cp35-cp35m-linux_x86_64.whl

If you use the GPU version of TensorFlow, however, this optimization is unnecessary, because most of the computation runs on the GPU.

4 Installing TensorFlow Serving

Install the dependency:

pip install grpcio

Download the source for the specific version, then build it:

git clone -b 1.12.0 .git
cd serving
# GPU build
export TF_NEED_CUDA=1 && bazel build --config=cuda --config=nativeopt --copt="-fPIC" tensorflow_serving/...
# CPU build
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.1 --copt=-msse4.2 --copt=-march=native tensorflow_serving/...

If the build fails with the following error:

ERROR: error loading package '': Encountered error while reading extension file 'tensorflow/workspace.bzl': no such package '@org_tensorflow//tensorflow': java.io.IOException: thread interrupted
ERROR: error loading package '': Encountered error while reading extension file 'tensorflow/workspace.bzl': no such package '@org_tensorflow//tensorflow': java.io.IOException: thread interrupted

the cause is leftover state from a previously installed version. Clean it up:

bazel clean --expunge

If the build fails with the following error:

ERROR: /home/lthpc/workspace_zong/tensorflow/tensorflow_serving/tensorflow_serving_source/serving/tensorflow_serving/util/net_http/socket/testing/BUILD:9:1: Linking of rule '//tensorflow_serving/util/net_http/socket/testing:ev_print_req_server' failed (Exit 1)
/usr/bin/ld: bazel-out/k8-opt/bin/external/com_google_absl/absl/strings/libstrings.a(charconv.o): undefined reference to symbol 'nanf@@GLIBC_2.2.5'
//lib/x86_64-linux-gnu/libm.so.6: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

three BUILD files need to be modified (listed below). In each cc_binary rule, add linkopts = ["-lm"], and then rebuild.

tensorflow_serving/util/net_http/socket/testing/BUILD (two places)

tensorflow_serving/util/net_http/server/testing/BUILD

tensorflow_serving/util/net_http/client/testing/BUILD

If the following error is reported:

ERROR: /home/lthpc/.cache/bazel/_bazel_lthpc/6520a4c6caf958480b8678c1941eafd8/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:24:1: error while parsing .d file: /home/lthpc/.cache/bazel/_bazel_lthpc/6520a4c6caf958480b8678c1941eafd8/execroot/tf_serving/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow/contrib/nccl/_objs/python/ops/_nccl_ops_gpu/external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.pic.d (No such file or directory)
In file included from external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager:15:0:
external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.h:30:35: fatal error: third_party/nccl/nccl.h: No such file or directory

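the NCCL install path has not been set. Set the two environment variables TF_NCCL_VERSION and NCCL_INSTALL_PATH in the build command: first locate the install path with whereis nccl, then pass it in NCCL_INSTALL_PATH.

# GPU build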
export TF_NEED_CUDA=1 && export TF_NCCL_VERSION='1.3' && export NCCL_INSTALL_PATH=/usr/local/nccl/build/lib && bazel build --config=cuda --config=nativeopt --copt="-fPIC" tensorflow_serving/...

After installation, check the version:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --version

The output looks like this:

TensorFlow ModelServer: 1.12.3-rc0+dev.sha.ff9191b
TensorFlow Library: 1.12.3

Add the binary to your PATH:

vim ~/.bashrc
# Note: change the path to your actual install location
export PATH="$PATH:/home/lthpc/workspace_zong/tensorflow/tensorflow_serving/tensorflow_serving_source/serving/bazel-bin/tensorflow_serving/model_servers"
source ~/.bashrc

Once that is done, the command can be run from any directory, for example:

tensorflow_model_server --help
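
Re-sourcing ~/.bashrc with a plain export line appends the same directory to PATH again each time. A minimal sketch of an idempotent variant follows. The function name path_append and the SERVER_DIR value are hypothetical; substitute your own build directory.

```shell
#!/bin/sh
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Hypothetical location of the built server; adjust to your checkout.
SERVER_DIR="$HOME/serving/bazel-bin/tensorflow_serving/model_servers"
path_append "$SERVER_DIR"
path_append "$SERVER_DIR"            # second call leaves PATH unchanged
export PATH
```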

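If other errors occur during the build, see this blogger's summary of common problems.

5 Running the service

With tensorflow_model_server installed, we can use it to run the service. As with the docker run approach described earlier, you need to specify the port, the model name, and the model path, for example: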

tensorflow_model_server --port=8500 --model_name=detection --model_base_path=$(pwd)/detection

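Once it has started, clients can call it using the methods described earlier in this series.

You can also put GPU settings in front of the command, for example: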

CUDA_VISIBLE_DEVICES=0 tensorflow_model_server --port=8500 --model_name=detection --model_base_path=$(pwd)/detection
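
TensorFlow Serving loads models from numbered version subdirectories under --model_base_path (for example detection/1/saved_model.pb), so a quick check of that layout before starting the server can save a confusing startup failure. The following is a sketch only; the function name has_servable_version is made up for this example, and MODEL_DIR repeats the path used in the commands above.

```shell
#!/bin/sh
# Succeed if the directory contains at least one numeric version
# subdirectory holding a saved_model.pb or saved_model.pbtxt file.
has_servable_version() {
    for d in "$1"/*/; do
        name=$(basename "$d")
        case "$name" in
            ''|*[!0-9]*) continue ;;   # skip non-numeric names
        esac
        if [ -f "$d/saved_model.pb" ] || [ -f "$d/saved_model.pbtxt" ]; then
            return 0
        fi
    done
    return 1
}

MODEL_DIR="$(pwd)/detection"   # same path as in the commands above
if has_servable_version "$MODEL_DIR"; then
    echo "ok: $MODEL_DIR has a servable version"
else
    echo "no numeric version directory with a saved model under $MODEL_DIR" >&2
fi
```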

When the service is started this way, concurrent requests trigger a memory corruption error:

Error in `tensorflow_model_server': double free or corruption (!prev)

The fix is:

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
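
The library path above is the one from this article's Ubuntu 16.04 machine; on other systems the package may place the file elsewhere. As a hedged sketch, the lookup can be scripted instead of hard-coded. The function name find_tcmalloc is made up, and the second candidate directory is an assumption about multiarch layouts rather than something taken from the original setup.

```shell
#!/bin/sh
# Print the first existing libtcmalloc_minimal.so.4 found in the
# directories given as arguments; fail if none of them has it.
find_tcmalloc() {
    for dir in "$@"; do
        if [ -f "$dir/libtcmalloc_minimal.so.4" ]; then
            printf '%s\n' "$dir/libtcmalloc_minimal.so.4"
            return 0
        fi
    done
    return 1
}

# Candidate directories; the second one is an assumption.
if lib=$(find_tcmalloc /usr/lib /usr/lib/x86_64-linux-gnu); then
    export LD_PRELOAD="$lib"
    echo "LD_PRELOAD=$LD_PRELOAD"
else
    echo "libtcmalloc_minimal.so.4 not found; is libtcmalloc-minimal4 installed?" >&2
fi
```

Start tensorflow_model_server from the same shell afterwards so that it inherits LD_PRELOAD.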

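====================== Update 2019-10-28 =======================

While testing the CPU version of TFS, I found that it still tended to crash under highly concurrent requests, again with a memory error. I then tried installing via apt, and with that build the same program no longer failed. The apt-installed version appears to be more stable, so the installation method is recorded here.

# Add the package source
echo "deb [arch=amd64]  stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl .release.pub.gpg | sudo apt-key add -
# Update the package lists and install
sudo apt-get update && sudo apt-get install tensorflow-model-server
# To upgrade to the latest version later
sudo apt-get upgrade tensorflow-model-server

After installation, whereis tensorflow_model_server shows that the binary is at /usr/bin/tensorflow_model_server. Point to that executable and start the service again as described above.

Note: the tensorflow-model-server installed via apt runs on the CPU by default. To use the GPU you still need to build from source.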

Published: 2024-02-14 14:37:23

Article link: https://www.elefans.com/category/jswz/34/1763652.html