Deploy your machine learning models with TensorFlow Serving and Kubernetes

Machine learning applications are booming, and yet there are not many tools available for Data Engineers to integrate those powerful models into production systems. Here I discuss how TensorFlow Serving can help you accelerate delivering models in production.
Serving is how you apply an ML model after you’ve trained it
To illustrate the capabilities of TensorFlow Serving, I will go through the steps of serving an object detection model. Find all the code related to this article on my GitHub: https://github.com/fpaupier/tensorflow-serving_sidecar

TensorFlow Serving in a nutshell
TensorFlow Serving enables you to seamlessly serve your machine learning models.
- Deploy a new version of your model and let TensorFlow Serving gracefully finish current requests while starting to serve new requests with the new model.
- Separate concerns: data scientists can focus on building great models while Ops can focus on building highly resilient and scalable architectures that serve those models.
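To give you a feel for the interface, TensorFlow Serving exposes a REST endpoint per model, with an optional version segment in the URL. The snippet below is only a generic sketch: my_model and the toy payload are placeholders, and the actual request format depends on your model's signature.
# Generic shape of a TensorFlow Serving REST call (placeholders, not our object detection model)
curl -d '{"instances": [1.0, 2.0, 3.0]}' \
  -X POST http://localhost:8501/v1/models/my_model:predict
# Pin a specific version, handy while rolling out a new model
curl -d '{"instances": [1.0, 2.0, 3.0]}' \
  -X POST http://localhost:8501/v1/models/my_model/versions/2:predict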
Part 1 — Warm up: Set up a local TensorFlow server
Before going online, it's good to make sure your server works locally. I only give the big steps here; find more documentation in the project readme.
- git clone https://github.com/fpaupier/tensorflow-serving_sidecar, create a Python 3.6.5 virtual env and install the requirements.txt
- Get TensorFlow Serving docker image:
docker pull tensorflow/serving
- Get a model to serve → I use faster_rcnn_resnet101_coco, which performs object detection.
- Go to the model directory and rename the saved model subdirectory with a version number. Since we are doing a v1 here, let's call it 00001 (it has to be digits). We do this because the TensorFlow Serving docker image searches for folders named with that convention when looking for a model to serve.
- Now run the TensorFlow server:
# From tensorflow-serving_sidecar/
docker run -t --rm -p 8501:8501 \
-v "$(pwd)/data/faster_rcnn_resnet101_coco_2018_01_28:/models/faster_rcnn_resnet" \
-e MODEL_NAME=faster_rcnn_resnet \
tensorflow/serving &

Here we bind the host directory faster_rcnn_resnet101_coco_2018_01_28 (where the model is stored) to the container path /models/faster_rcnn_resnet. Just keep in mind that at this point the saved_model.pb file is solely on your machine, not in the container.
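Before calling the model, you can quickly check that the server picked it up. The layout in the comments is what the mounted folder should roughly look like after the renaming step (illustrative), and the model status endpoint is part of the TensorFlow Serving REST API:
# Expected layout of the mounted folder (illustrative):
#   data/faster_rcnn_resnet101_coco_2018_01_28/
#   └── 00001/
#       ├── saved_model.pb
#       └── variables/
# Ask the server which versions are loaded; version 1 should be listed as AVAILABLE
curl http://localhost:8501/v1/models/faster_rcnn_resnet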
- Perform the client call:
# Don't forget to activate your python3.6.5 venv
# From tensorflow-serving_sidecar/
python client.py --server_url "http://localhost:8501/v1/models/faster_rcnn_resnet:predict" \
--image_path "$(pwd)/object_detection/test_images/image1.jpg" \
--output_json "$(pwd)/object_detection/test_images/out_image1.json" \
--save_output_image "True" \
--label_map "$(pwd)/data/labels.pbtxt"
Take a look at the file specified by --output_json and enjoy the result (JSON and JPEG output available).
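If you are curious about the exact tensors behind this REST call, you can inspect the model's serving signature with the saved_model_cli tool that ships with the tensorflow pip package. The path below assumes the versioned folder created earlier; adjust it to wherever your SavedModel lives.
# Print the inputs and outputs of the default serving signature
saved_model_cli show \
  --dir "$(pwd)/data/faster_rcnn_resnet101_coco_2018_01_28/00001" \
  --tag_set serve \
  --signature_def serving_default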

Great, now that our model works well, let’s deploy it on the cloud.
Part 2 — Serve your machine learning application on a Kubernetes cluster with TensorFlow Serving
In a production setting, you want to be able to scale as the load on your app increases; you don't want your server to be overwhelmed.

To avoid this issue, you will use a Kubernetes cluster to serve your tensorflow-serving app. The main improvements to expect:
- The load will be balanced among your replicas without you having to think about it.
- Do you want to deploy a new model with no downtime? No problem, Kubernetes has your back. Perform a rolling update to progressively serve your new model while gracefully terminating the current requests on the former model (see the sketch below).
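To make the rolling update point concrete, once the deployment from Part 2 is running, shipping a new image version boils down to something like the sketch below. The deployment and container names are assumptions on my part (check them in faster_rcnn_resnet_k8s.yaml or with kubectl get deployments), and v0.2.0 is a hypothetical next tag.
# Point the deployment at a new image; Kubernetes replaces the pods progressively
kubectl set image deployment/faster-rcnn-resnet-deployment \
  faster-rcnn-resnet=gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.2.0
# Watch the rollout until every replica runs the new version
kubectl rollout status deployment/faster-rcnn-resnet-deployment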

Let's dive in
Create a custom tensorflow-serving docker image
- Run a serving image as a daemon:
docker run -d --name serving_base tensorflow/serving
- Copy the faster_rcnn_resnet101_coco model data to the container's models/ folder:
# From tensorflow-serving_sidecar/
docker cp $(pwd)/data/faster_rcnn_resnet101_coco_2018_01_28 serving_base:/models/faster_rcnn_resnet
- Commit the container to serve the faster_rcnn_resnet model:
docker commit --change "ENV MODEL_NAME faster_rcnn_resnet" serving_base faster_rcnn_resnet_serving
Note: if you use a different model, change faster_rcnn_resnet in the --change argument accordingly.
faster_rcnn_resnet_serving will be our new serving image. You can check this by running docker images; you should see a new docker image:
Figure 6: docker images result after creating a custom tensorflow-serving image
- Stop the serving base container:
docker kill serving_base
docker rm serving_base
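Optionally, you can sanity check that the model data really ended up inside the new image; this simply overrides the image entrypoint to list the model folder, so it should print the 00001 version directory.
# List the model folder baked into the freshly committed image
docker run --rm --entrypoint ls faster_rcnn_resnet_serving /models/faster_rcnn_resnet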
Great, the next step is to test our brand-new faster_rcnn_resnet_serving image.
Test the custom server
Before deploying our app on Kubernetes, let’s make sure it works correctly.
- Start the server:
docker run -p 8501:8501 -t faster_rcnn_resnet_serving &
Note: make sure you have stopped (docker stop CONTAINER_NAME) the previously running server, otherwise port 8501 may be locked.
- We can use the same client code to call the server:
# From tensorflow-serving_sidecar/
python client.py --server_url "http://localhost:8501/v1/models/faster_rcnn_resnet:predict" \
--image_path "$(pwd)/object_detection/test_images/image1.jpg" \
--output_json "$(pwd)/object_detection/test_images/out_image2.json" \
--save_output_image "True" \
--label_map "$(pwd)/data/labels.pbtxt"
We can check that we get the same result. Let's now run this on a Kubernetes cluster.
Deploy our app on Kubernetes
Unless you already have run a project on GCP, I advise you to check the Google Cloud setup steps.
I assume you have created and logged into a gcloud project named tensorflow-serving. You will use the container image faster_rcnn_resnet_serving built previously to deploy a serving cluster with Kubernetes on the Google Cloud Platform.
- Log in to your project: first list the available projects with gcloud projects list, select the PROJECT_ID of your project and run:
# Get the PROJECT_ID, not the name
gcloud projects list
# Set the project with the right PROJECT_ID, i.e. for me it is tensorflow-serving-229609
gcloud config set project tensorflow-serving-229609
gcloud auth login
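A quick way to double check you are on the right project before creating billable resources:
# Should print the PROJECT_ID you just set, e.g. tensorflow-serving-229609
gcloud config get-value project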
- Create a container cluster:
First, we create a Google Kubernetes Engine cluster for service deployment. Due to the free trial limitation, you cannot have more than 2 nodes here; you can either upgrade or go with two nodes, which is good enough for our use case. (You are limited to a quota of 8 CPUs in your free trial.)
# Create the cluster
gcloud container clusters create faster-rcnn-serving-cluster \
--num-nodes 2 \
--zone 'us-east1'
You may update the zone arg; you can choose among e.g. europe-west1, asia-east1. You can check all the zones available with gcloud compute zones list. You should see something like this:
Figure 7: kubernetes cluster creation output
- Set the default cluster for the gcloud container command and pass cluster credentials to kubectl:
# Set the default cluster
gcloud config set container/cluster faster-rcnn-serving-cluster
# Pass cluster credentials to kubectl
gcloud container clusters get-credentials faster-rcnn-serving-cluster \
--zone 'us-east1'
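At this point kubectl should be wired to the new cluster; a quick check before deploying anything:
# Both nodes should show up with STATUS Ready after a couple of minutes
kubectl get nodes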
- Upload the custom tensorflow-serving docker image we built previously:
Let's push our image to the Google Container Registry so that we can run it on the Google Cloud Platform. Tag the faster_rcnn_resnet_serving image using the Container Registry format and our project id; replace tensorflow-serving-229609 with your PROJECT_ID. Also change the tag at the end; since this is our first version, I set the tag to v0.1.0.
docker tag faster_rcnn_resnet_serving gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.1.0
If you run docker images, you now see an additional gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.1.0 image. This gcr.io prefix allows us to push the image directly to the Container Registry:
# To do only once
gcloud auth configure-docker
docker push gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.1.0
Figure 8: docker image successfully pushed on Google Container Registry
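Besides the web console shown in Figure 8, you can also confirm the push from the command line (swap in your own PROJECT_ID):
# List the tags stored in Container Registry for our serving image
gcloud container images list-tags gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving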
Create Kubernetes Deployment and Service
The deployment consists of a single replica of the faster-rcnn inference server, controlled by a Kubernetes Deployment. The replica is exposed externally by a Kubernetes Service along with an External Load Balancer.
Using a single replica does not really make sense; I only do so to stay within the free tier. Load balancing is useless when there is only one instance to direct your queries to. In a production setup, use multiple replicas.
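For reference, once the deployment below is created, scaling out to multiple replicas is a one-liner. The deployment name here is an assumption on my part; grab the real one with kubectl get deployments.
# Run more replicas of the serving pod behind the same service
kubectl scale deployment faster-rcnn-resnet-deployment --replicas=3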
We create them using the example Kubernetes config faster_rcnn_resnet_k8s.yaml. You simply need to update the docker image to use in the file; replace the line image: <YOUR_FULL_IMAGE_NAME_HERE> with your actual full image name:
# Update the image in faster_rcnn_resnet_k8s.yaml
image: gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving@sha256:9f7eca6da7d833b240f7c54b630a9f85df8dbdfe46abe2b99651278dc4b13c53
Figure 9: find your docker full image name on Google Container Registry
# Run the following command from tensorflow-serving_sidecar/
kubectl create -f faster_rcnn_resnet_k8s.yaml
To check the status of the deployment and pods, use kubectl get deployments for the whole deployment, kubectl get pods to monitor each replica of your deployment, and kubectl get services for the service.
Figure 10: Sanity check for deployment
It can take a while for everything to be up and running. The service external IP address is listed next to LoadBalancer Ingress. You can check it with the kubectl describe service command:
# Describe the service
kubectl describe service faster-rcnn-resnet-service
Figure 11: Find the IP address to query upon to perform inference
Query your online model:
And finally, let's test this. We can use the same client code. Simply replace the previously used localhost in the --server_url arg with the IP address of the LoadBalancer Ingress as specified above.
# From tensorflow-serving_sidecar/
python client.py --server_url "http://34.73.137.228:8501/v1/models/faster_rcnn_resnet:predict" \
--image_path "$(pwd)/object_detection/test_images/image1.jpg" \
--output_json "$(pwd)/object_detection/test_images/out_image3.json" \
--save_output_image "True" \
--label_map "$(pwd)/data/labels.pbtxt"
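If you would rather not copy the IP address by hand, you can pull it straight from kubectl and reuse it in the same client call; this assumes the service exposes port 8501 as in the example above.
# Fetch the LoadBalancer Ingress IP of the service
EXTERNAL_IP=$(kubectl get service faster-rcnn-resnet-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Same client call as before, with the IP injected
python client.py --server_url "http://${EXTERNAL_IP}:8501/v1/models/faster_rcnn_resnet:predict" \
  --image_path "$(pwd)/object_detection/test_images/image1.jpg" \
  --output_json "$(pwd)/object_detection/test_images/out_image3.json" \
  --save_output_image "True" \
  --label_map "$(pwd)/data/labels.pbtxt"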
Takeaways
TensorFlow Serving offers a great basis you can rely on to quickly deploy your models in production with very little overhead.
- Containerizing machine learning applications for deployment makes it possible to separate concerns between Ops and Data Scientists.
- Container orchestration solutions such as Kubernetes, combined with TensorFlow Serving, make it possible to deploy highly available models in minutes, even for people not familiar with distributed computing.
References
- TensorFlow Serving explained by Noah Fiedel, a Software Engineer at Google who worked on TensorFlow Serving; it gives insights into how it was built and for what purposes: https://www.youtube.com/watch?v=q_IkJcPyNl0
- Libraries of pre-trained models freely available: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
- Tyler Labonte's Medium post on exporting a TensorFlow model from saved checkpoints for serving: https://medium.com/@tmlabonte/serving-image-based-deep-learning-models-with-tensorflow-servings-restful-api-d365c16a7dc4
- An example of how cumbersome it can be to serve an ML model without a proper framework: https://towardsdatascience.com/how-to-build-and-deploy-a-lyrics-generation-model-framework-agnostic-589f3026fd53
- Cloud ML Engine, Google's managed solution for deploying ML models: https://cloud.google.com/ml-engine/docs/tensorflow/deploying-models