Introduction

inference-server is a Python library for deploying an AI/ML model to Amazon SageMaker for real-time inference. It simplifies deploying a model from your own Docker container image.

Basic steps to deploy a model

  1. Define a Docker container image containing the model and all dependencies.

  2. Write the inference-server hooks: simple Python functions that define how the model is loaded and invoked (a sketch follows this list). Details: Implementing server hooks.

  3. Install a Python WSGI web server of your choice into the Docker image and configure it to start the inference-server application (a sample configuration sketch follows this list). Details: Deployment.

  4. Build the container image and push it to a registry such as Amazon ECR.

  5. Deploy the Amazon SageMaker real-time inference endpoint using the AWS Console or your preferred CI/CD pipeline.
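
As a concrete illustration of step 2, below is a minimal sketch of the four hooks. The hook names model_fn, input_fn, predict_fn and output_fn follow the common SageMaker naming convention and are assumptions here, as are the pickled model file and the JSON payloads; the exact names, signatures and registration mechanism that inference-server expects are described in Implementing server hooks.

    import json
    import pickle
    from typing import Any

    def model_fn(model_dir: str) -> Any:
        """Load the model once at server start-up (hypothetical hook name)."""
        with open(f"{model_dir}/model.pkl", "rb") as f:
            return pickle.load(f)

    def input_fn(request_body: bytes, content_type: str) -> Any:
        """Deserialize the request payload (hypothetical hook name)."""
        if content_type != "application/json":
            raise ValueError(f"Unsupported content type: {content_type}")
        return json.loads(request_body)

    def predict_fn(data: Any, model: Any) -> Any:
        """Invoke the model on the deserialized input (hypothetical hook name)."""
        return model.predict(data)

    def output_fn(prediction: Any, accept: str) -> bytes:
        """Serialize the prediction into the response body (hypothetical hook name)."""
        return json.dumps(prediction).encode("utf-8")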

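For step 3, the Deployment section recommends Gunicorn. The gunicorn.conf.py sketch below rests on two assumptions: that the inference-server WSGI application is importable as "inference_server:app" (check Deployment for the actual entry point) and that the container listens on port 8080, the port to which SageMaker sends its /ping and /invocations requests.

    # gunicorn.conf.py -- configuration sketch; see Deployment for the supported setup.
    # SageMaker routes /ping and /invocations requests to port 8080 in the container.
    bind = "0.0.0.0:8080"
    workers = 2    # tune to the CPU count of the endpoint instance
    timeout = 60   # allow head-room for slow model invocations

    # The WSGI entry point below is an assumption; use the application path
    # documented under Deployment.
    wsgi_app = "inference_server:app"

The Docker image's entrypoint can then simply run "gunicorn --config gunicorn.conf.py".
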
Comparison with Amazon’s SageMaker Inference Toolkit

Amazon’s “SageMaker Inference Toolkit” (https://github.com/aws/sagemaker-inference-toolkit) is an alternative library for serving machine learning models on SageMaker.

The following compares inference-server with the SageMaker Inference Toolkit, aspect by aspect.

  Model integration
      inference-server: Defined through 4 plain Python functions. Input / output functions can be packaged independently. Details: Implementing server hooks.
      SageMaker Inference Toolkit: Defined through an “InferenceHandler” class with 4 methods, plus a “HandlerService” class, plus a Python entrypoint script.

  Web server
      inference-server: Any Python WSGI server; documentation is provided for the recommended Gunicorn. Details: Deployment.
      SageMaker Inference Toolkit: Java web server (using Amazon Multi Model Server).

  Testing
      inference-server: Testing functions are included to test the model integration functions and web server invocation. Details: Testing.
      SageMaker Inference Toolkit: Not available.
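
inference-server's own testing helpers are described in Testing. Independent of those helpers, the hooks are plain Python functions and can also be unit-tested directly; below is a pytest-style sketch reusing the hypothetical hook functions from the earlier example (the my_model.hooks module name is made up for illustration).

    import json

    from my_model import hooks  # hypothetical module holding the hook functions

    class StubModel:
        """Stand-in for the real model object."""

        def predict(self, data):
            return [sum(row) for row in data]

    def test_invocation_round_trip():
        # Exercise input_fn -> predict_fn -> output_fn end to end with a stub model.
        body = json.dumps([[1, 2, 3], [4, 5, 6]]).encode("utf-8")
        data = hooks.input_fn(body, "application/json")
        prediction = hooks.predict_fn(data, StubModel())
        response = hooks.output_fn(prediction, "application/json")
        assert json.loads(response) == [6, 15]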