Frequently Asked Questions (FAQ)

Is my “model_fn” called at each invocation?

No.

The model_fn() function is called during the very first invocation only. Once the model has been loaded, it is retained in memory for as long as the service runs.

To speed up the very first invocation, it is possible to trigger the model_fn hook in advance. To do this, simply call inference_server.warmup().

For example, when using Gunicorn, this could be done from its post_fork server hook:

import inference_server

def post_fork(server, worker):
    worker.log.info("Warming up worker...")
    inference_server.warmup()

Does inference-server support async/ASGI webservers?

No.

inference-server is a WSGI application to be used by synchronous webservers.

For most ML models this is the right choice, since model inference is typically CPU-bound. A multi-process WSGI server therefore works well, with the number of worker processes equal to the number of CPU cores available.
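For example, when using Gunicorn, a minimal gunicorn.conf.py could size the worker pool to the machine as a starting point (this is only an illustrative sketch; tune it for your workload):

# gunicorn.conf.py -- illustrative sketch only
import multiprocessing

# One synchronous worker per CPU core for CPU-bound inference
workers = multiprocessing.cpu_count()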

For more details see Configuring Gunicorn workers.

My model is leaking memory, how do I address that?

If the memory leak is outside your control, one approach would be to periodically restart the webserver workers.

For example, when using Gunicorn, it is possible to specify a maximum number of HTTP requests (max_requests) after which a given worker should be restarted. Gunicorn additionally allows a random offset (max_requests_jitter) to be added such that worker restarts are staggered.
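Continuing the Gunicorn example, the two settings could be placed in gunicorn.conf.py; the values below are purely illustrative and should be tuned to the leak rate and traffic of your service:

# gunicorn.conf.py -- illustrative values only
max_requests = 1000        # recycle each worker after roughly 1000 requests
max_requests_jitter = 100  # random offset of up to 100 requests so restarts are staggered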

For more details see Gunicorn settings documentation.

How do I invoke my model using a data stream from my favourite message queue system?

By design, inference-server is an HTTP web server and uses a simple request-response model.

This is so it can be deployed in most environments, including not only AWS SageMaker but also a local Dockerized service. The web server can then be reached from a range of environments, whether from within AWS itself or from other providers in a multi-cloud setup.

Depending on the messaging/queueing system and cloud environment, you have various options to integrate a model deployed with inference-server with a message stream.

For example, in AWS you could deploy a Lambda function which consumes messages from an SQS queue and forwards each one as an HTTP request to the SageMaker endpoint. Equally, the Lambda function could write the SageMaker response to another SQS queue. Of course, instead of a Lambda function you could deploy similar logic on any other compute platform, such as an EKS pod or an ECS task.
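As a rough sketch of that first option (assuming a Python Lambda handler using boto3 and JSON request bodies; the endpoint name and output queue URL are placeholders supplied via environment variables):

import os

import boto3

# Placeholder configuration -- replace with your own endpoint and queue
ENDPOINT_NAME = os.environ["SAGEMAKER_ENDPOINT_NAME"]
OUTPUT_QUEUE_URL = os.environ["OUTPUT_QUEUE_URL"]

sagemaker_runtime = boto3.client("sagemaker-runtime")
sqs = boto3.client("sqs")

def handler(event, context):
    """Forward each SQS message to the SageMaker endpoint and queue the response."""
    for record in event["Records"]:
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=record["body"],
        )
        sqs.send_message(
            QueueUrl=OUTPUT_QUEUE_URL,
            MessageBody=response["Body"].read().decode("utf-8"),
        )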