Deployment
==========

This page explains how to deploy a model to Amazon SageMaker with **inference-server**, using the example from
:ref:`hooks:Implementing server hooks`.


Docker container image
----------------------

SageMaker models are deployed using Docker container images. To create a Docker image, we might use a :file:`Dockerfile`
like this:

.. code-block:: docker

   FROM python:3.10

   COPY entrypoint.sh /usr/local/bin/
   RUN python -m pip install \
       gunicorn  \
       inference-server  \
       shipping-forecast  # Our package implementing the hooks

   EXPOSE 8080
   ENTRYPOINT ["sh", "/usr/local/bin/entrypoint.sh"]

The :file:`entrypoint.sh` script contains the following command only:

.. code-block:: sh

   python -m gunicorn --bind=0.0.0.0:8080 'inference_server:create_app()'

Here, ``gunicorn`` is the web server that handles incoming HTTP requests and executes the **inference-server**
WSGI application. The package ``shipping-forecast`` implements the server hooks (in other words, our business logic);
see the example :ref:`hooks:Hook definitions`.

.. note::

   Unfortunately, we cannot add our Python command directly in the Dockerfile under ``ENTRYPOINT``. This is because
   SageMaker starts the Docker container like this, and Docker appends the extra ``serve`` argument to the
   ``ENTRYPOINT`` command:

   .. code-block:: console

      docker run {image} serve

   The entrypoint script simply ignores this extra argument.
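
To check the image locally before deploying, we can imitate the two HTTP calls SageMaker makes against a hosted
container: ``GET /ping`` for health checks and ``POST /invocations`` for predictions. The following is a minimal sketch
and not part of **inference-server**: the helper names, the JSON content type and the sample payload are assumptions,
and it expects the container to be running locally with port 8080 published.

.. code-block:: python

   """Local smoke test for a SageMaker-style container (hypothetical helper script)."""
   import json
   import urllib.request

   BASE_URL = "http://localhost:8080"  # Assumes: docker run -p 8080:8080 {image} serve


   def build_ping_request(base_url: str = BASE_URL) -> urllib.request.Request:
       """Build the health check request: GET /ping."""
       return urllib.request.Request(f"{base_url}/ping", method="GET")


   def build_invocation_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
       """Build the prediction request: POST /invocations with a JSON body (assumed format)."""
       return urllib.request.Request(
           f"{base_url}/invocations",
           data=json.dumps(payload).encode("utf-8"),
           headers={"Content-Type": "application/json", "Accept": "application/json"},
           method="POST",
       )


   if __name__ == "__main__":
       with urllib.request.urlopen(build_ping_request(), timeout=5) as response:
           print("ping:", response.status)
       # Placeholder payload: replace it with whatever input your hooks expect.
       request = build_invocation_request({"location": "Fair Isle"})
       with urllib.request.urlopen(request, timeout=5) as response:
           print("invocation:", response.status, response.read())

Start the container with ``docker run -p 8080:8080 {image} serve`` and run the script. ``urlopen`` raises
``urllib.error.HTTPError`` for any 4xx or 5xx response, so a failing health check or invocation is hard to miss.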


Configuring Gunicorn HTTP server
--------------------------------

For a typical deployment, we may need to configure additional Gunicorn options. Instead of adding command line options
one by one, we could simply specify all options in a single configuration file.

Create :file:`conf.py` with the following content::

   accesslog = "-"
   bind = "0.0.0.0:8080"
   logconfig = "/opt/gunicorn/logging.ini"
   loglevel = "DEBUG"
   workers = 2
   wsgi_app = "inference_server:create_app()"

And :file:`logging.ini` like this:

.. code-block:: ini

   [loggers]
   keys = root

   [handlers]
   keys = std_out

   [formatters]
   keys = default

   [logger_root]
   level = DEBUG
   handlers = std_out

   [handler_std_out]
   class = StreamHandler
   formatter = default
   args = (sys.stdout,)

   [formatter_default]
   format = %(asctime)s %(levelname)s %(message)s
   datefmt =
   class = logging.Formatter
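
A mistake in :file:`logging.ini` would otherwise only surface when Gunicorn starts inside the container, so it can be
worth loading the file with :func:`logging.config.fileConfig` first. This is a minimal sketch; the function name
``check_logging_config`` and the default file location are assumptions made for illustration.

.. code-block:: python

   """Load a logging INI file locally to catch mistakes before building the image."""
   import logging
   import logging.config
   import sys
   from pathlib import Path


   def check_logging_config(path: str) -> logging.Logger:
       """Apply the INI file at ``path`` and emit one test record through the root logger."""
       if not Path(path).is_file():
           # Older Python versions do not report a missing file clearly, so check explicitly.
           raise FileNotFoundError(path)
       logging.config.fileConfig(path, disable_existing_loggers=False)
       root = logging.getLogger()
       root.debug("logging configuration loaded from %s", path)
       return root


   if __name__ == "__main__":
       check_logging_config(sys.argv[1] if len(sys.argv) > 1 else "logging.ini")

If the file is valid, the script prints a single ``DEBUG`` record to standard output in the configured format.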

Then we *replace* the content in :file:`entrypoint.sh` with this:

.. code-block:: sh

   python -m gunicorn --config=/opt/gunicorn/conf.py

Finally, we need to copy the configuration files into the container image in the :file:`Dockerfile`:

.. code-block:: docker

   COPY conf.py logging.ini /opt/gunicorn/

.. seealso::

   Configuration Overview
      https://docs.gunicorn.org/en/latest/configure.html
   :mod:`logging.config` — Logging configuration
      https://docs.python.org/3/library/logging.config.html
   Use Your Own Inference Code with Hosting Services
      https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html


Configuring Gunicorn workers
----------------------------

ML model predictions are typically CPU-bound, so Gunicorn's default synchronous, multi-process workers are a
good choice.

The optimal **number** of workers should be established empirically. It depends both on the model algorithm and the AWS
EC2 compute instance type. It is recommended to choose a *compute optimized* instance type as these types are designed
and priced for sustained high CPU utilization. Using a 4 vCPU instance, for example, the hypervisor would allocate 4
concurrent processor threads to our application. In theory, such an instance could achieve a CPU utilization of 400% as
shown in AWS CloudWatch Metrics.
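
The arithmetic behind the 400% figure is simply 100% per vCPU. The sketch below spells this out; the function names and
the sample reading of 260% are hypothetical and only illustrate how an observed value relates to the ceiling.

.. code-block:: python

   def cpu_ceiling_percent(vcpus: int) -> float:
       """Theoretical maximum CPU utilization, counting 100% per vCPU."""
       return vcpus * 100.0


   def utilization_fraction(observed_percent: float, vcpus: int) -> float:
       """Share of the theoretical ceiling reached by an observed reading."""
       return observed_percent / cpu_ceiling_percent(vcpus)


   print(cpu_ceiling_percent(4))          # 400.0
   print(utilization_fraction(260.0, 4))  # 0.65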

A good starting point is to set the number of Gunicorn workers equal to the vCPU count, 4 in the above example.
To fine-tune the number of workers, we deploy a SageMaker model endpoint with a single EC2 instance, then send a large
batch of model invocation requests. CloudWatch Metrics should then be evaluated to identify the maximum CPU utilization.
A value well below 400% suggests there may be some I/O overhead, and the number of Gunicorn workers may be increased to
achieve greater concurrency and CPU utilization.
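
Because a Gunicorn configuration file is ordinary Python, the worker count does not have to be hard-coded. The
following variant of :file:`conf.py` is a sketch under two assumptions: ``GUNICORN_WORKERS`` is a hypothetical
environment variable name invented here (Gunicorn does not read it by itself), and :func:`os.cpu_count` is taken to
match the instance's vCPU count, which may not hold if the container is restricted to fewer CPUs.

.. code-block:: python

   import os


   def default_workers(environ=os.environ) -> int:
       """Return the override from GUNICORN_WORKERS (hypothetical name), else one worker per CPU."""
       override = environ.get("GUNICORN_WORKERS")
       return int(override) if override else (os.cpu_count() or 1)


   accesslog = "-"
   bind = "0.0.0.0:8080"
   logconfig = "/opt/gunicorn/logging.ini"
   loglevel = "DEBUG"
   workers = default_workers()  # Starting point only; tune it using the load test described above
   wsgi_app = "inference_server:create_app()"

With the override in place, different worker counts can be tried during load testing by changing only the container's
environment, without rebuilding the image.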

.. seealso::

   Choosing a Worker Type
      https://docs.gunicorn.org/en/latest/design.html#choosing-a-worker-type
   Automatically Scale Amazon SageMaker Models
      https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html