Introduction
inference-server is a Python library for deploying an AI/ML model to Amazon SageMaker for real-time inference. The library simplifies serving a model from your own Docker container image.
Basic steps to deploy a model
1. Define a Docker container image containing the model and all its dependencies.
2. Write the four inference-server hooks: plain Python functions defining how the model should be loaded and invoked (see the sketch after this list). Details: Implementing server hooks.
3. Install a Python WSGI web server of your choice into the Docker image and configure it to start the inference-server application. Details: Deployment.
4. Build the container image and push it to a registry such as Amazon ECR.
5. Deploy the Amazon SageMaker real-time inference endpoint using the AWS Console or your preferred CI/CD pipeline.
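As a minimal sketch of step 2: assuming the hooks follow the SageMaker-style naming convention of model_fn, input_fn, predict_fn, and output_fn (the authoritative names and signatures are documented under Implementing server hooks), a pickled scikit-learn model could be integrated like this:

```python
"""my_model.py -- a minimal sketch of the four server hooks.

Assumes SageMaker-style hook names (model_fn, input_fn, predict_fn,
output_fn); check "Implementing server hooks" for the actual API.
"""
import json
import pickle


def model_fn(model_dir):
    """Load the model once, from the directory it is unpacked into."""
    with open(f"{model_dir}/model.pkl", "rb") as f:
        return pickle.load(f)


def input_fn(input_data, content_type):
    """Deserialize the request body into model features."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(input_data)  # e.g. [[1.0, 2.0], [3.0, 4.0]]


def predict_fn(data, model):
    """Invoke the model on the deserialized features."""
    return model.predict(data)


def output_fn(prediction, accept):
    """Serialize the prediction into the response body."""
    return json.dumps({"predictions": prediction.tolist()}).encode()
```

Because the hooks are plain functions, the input/output pair can live in a shared package and be reused across models. A WSGI server such as Gunicorn then serves the inference-server application inside the container; the exact application module path to pass to the server is given under Deployment.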
Comparison with Amazon’s SageMaker Inference Toolkit
Amazon’s “SageMaker Inference Toolkit” (https://github.com/aws/sagemaker-inference-toolkit) is an alternative library for serving machine learning models with SageMaker.
The following table compares inference-server with SageMaker Inference Toolkit.
| Aspect | inference-server | SageMaker Inference Toolkit |
|---|---|---|
| Model integration | Defined through 4 plain Python functions. Input/output functions can be packaged independently. Details: Implementing server hooks. | Defined through an “InferenceHandler” class with 4 methods, plus a “HandlerService” class, plus a Python entrypoint script. |
| Web server | Any Python WSGI server. Documentation for the recommended Gunicorn is provided. Details: Deployment. | Java web server (using Amazon Multi Model Server). |
| Testing | Testing functions included for testing the model integration functions and web server invocation (see the sketch below the table). Details: Testing. | Not available. |
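One consequence of the plain-function design is that the hooks can also be unit tested with ordinary pytest, independently of the library's bundled testing helpers (which additionally cover web server invocation; see Testing). A minimal sketch, reusing the hypothetical my_model.py module from the earlier example:

```python
"""test_my_model.py -- pytest sketch for the hooks in the hypothetical my_model.py."""
import json

import numpy as np

import my_model


def test_input_fn_decodes_json_rows():
    payload = b"[[1.0, 2.0], [3.0, 4.0]]"
    assert my_model.input_fn(payload, "application/json") == [[1.0, 2.0], [3.0, 4.0]]


def test_output_fn_encodes_predictions():
    body = my_model.output_fn(np.array([0, 1]), "application/json")
    assert json.loads(body) == {"predictions": [0, 1]}
```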