Introduction

inference-server is a Python library for deploying an AI/ML model to Amazon SageMaker for real-time inference. It simplifies deploying a model from your own Docker container image.

Basic steps to deploy a model

  1. Define a Docker container image containing the model and all dependencies.

  2. Write the inference-server hooks: simple Python functions that define how the model is loaded and invoked (a sketch follows this list). Details: Implementing server hooks.

  3. Install a Python WSGI web server of your choice into the Docker image and configure it to start the inference-server application (a sample configuration sketch follows this list). Details: Deployment.

  4. Build the container image and push it to a registry such as Amazon ECR.

  5. Deploy the Amazon SageMaker real-time inference endpoint using the AWS Console or your preferred CI/CD pipeline.
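
As a concrete illustration of step 2, below is a minimal sketch of the four hooks. The hook names model_fn, input_fn, predict_fn and output_fn follow the common SageMaker naming convention and are assumptions here, as are the pickled model file and the JSON payloads; the exact names, signatures and registration mechanism that inference-server expects are described in Implementing server hooks.

    import json
    import pickle
    from typing import Any

    def model_fn(model_dir: str) -> Any:
        """Load the model once at server start-up (hypothetical hook name)."""
        with open(f"{model_dir}/model.pkl", "rb") as f:
            return pickle.load(f)

    def input_fn(request_body: bytes, content_type: str) -> Any:
        """Deserialize the request payload (hypothetical hook name)."""
        if content_type != "application/json":
            raise ValueError(f"Unsupported content type: {content_type}")
        return json.loads(request_body)

    def predict_fn(data: Any, model: Any) -> Any:
        """Invoke the model on the deserialized input (hypothetical hook name)."""
        return model.predict(data)

    def output_fn(prediction: Any, accept: str) -> bytes:
        """Serialize the prediction into the response body (hypothetical hook name)."""
        return json.dumps(prediction).encode("utf-8")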

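For step 3, the Deployment section recommends Gunicorn. The gunicorn.conf.py sketch below rests on two assumptions: that the inference-server WSGI application is importable as "inference_server:app" (check Deployment for the actual entry point) and that the container listens on port 8080, the port to which SageMaker sends its /ping and /invocations requests.

    # gunicorn.conf.py -- configuration sketch; see Deployment for the supported setup.
    # SageMaker routes /ping and /invocations requests to port 8080 in the container.
    bind = "0.0.0.0:8080"
    workers = 2    # tune to the CPU count of the endpoint instance
    timeout = 60   # allow head-room for slow model invocations

    # The WSGI entry point below is an assumption; use the application path
    # documented under Deployment.
    wsgi_app = "inference_server:app"

The Docker image's entrypoint can then simply run "gunicorn --config gunicorn.conf.py".
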
Comparison with Amazon’s SageMaker Inference Toolkit

Amazon’s “SageMaker Inference Toolkit” (https://github.com/aws/sagemaker-inference-toolkit) is an alternative library for serving machine learning models on SageMaker.

The following compares inference-server with the SageMaker Inference Toolkit, aspect by aspect.

  Model integration
      inference-server: Defined through 4 plain Python functions. Input / output functions can be packaged independently. Details: Implementing server hooks.
      SageMaker Inference Toolkit: Defined through an “InferenceHandler” class with 4 methods, plus a “HandlerService” class, plus a Python entrypoint script.

  Web server
      inference-server: Any Python WSGI server; documentation is provided for the recommended Gunicorn. Details: Deployment.
      SageMaker Inference Toolkit: Java web server (using Amazon Multi Model Server).

  Testing
      inference-server: Testing functions are included to test the model integration functions and web server invocation. Details: Testing.
      SageMaker Inference Toolkit: Not available.
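
inference-server's own testing helpers are described in Testing. Independent of those helpers, the hooks are plain Python functions and can also be unit-tested directly; below is a pytest-style sketch reusing the hypothetical hook functions from the earlier example (the my_model.hooks module name is made up for illustration).

    import json

    from my_model import hooks  # hypothetical module holding the hook functions

    class StubModel:
        """Stand-in for the real model object."""

        def predict(self, data):
            return [sum(row) for row in data]

    def test_invocation_round_trip():
        # Exercise input_fn -> predict_fn -> output_fn end to end with a stub model.
        body = json.dumps([[1, 2, 3], [4, 5, 6]]).encode("utf-8")
        data = hooks.input_fn(body, "application/json")
        prediction = hooks.predict_fn(data, StubModel())
        response = hooks.output_fn(prediction, "application/json")
        assert json.loads(response) == [6, 15]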