Hello World - Python

AMD Inference Server’s Python API allows you to start the server and send requests to it from Python. This example walks you through how to start the server and use Python to send requests to a simple model. This example and script are intended to run in the development container. The complete script used here is available in the repository.

Import the library

In the development container, the Python library is automatically built and installed as part of the CMake build process.

from time import sleep

import numpy as np

import amdinfer

Create our client and server objects

This example assumes that the server will be running locally with the default HTTP port. Since you’ll be using HTTP/REST to communicate with the server, you should create an HttpClient using the address of the server as the argument.

client = amdinfer.HttpClient("http://127.0.0.1:8998")

Is AMD Inference Server already running?

If the server is already started externally, you don't want to start it again. You can check whether the server is live using the client object. If it's not live, you can start the server from Python by instantiating the Server object. Depending on which protocol you're using to communicate with the server, you may need to start that protocol on the server. The server remains active as long as the server object stays in scope.

start_server = not client.serverLive()
if start_server:
    print("No server detected. Starting locally...")
    server = amdinfer.Server()
    server.startHttp(8998)
amdinfer.waitUntilServerReady(client)
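
The waitUntilServerReady helper blocks until the server responds as ready. If you prefer to control the polling yourself, you can loop over client.serverLive() directly. The snippet below is only a minimal sketch; the retry count and delay are arbitrary choices, not part of the API, and the helper remains the simpler option.

# minimal manual polling sketch; amdinfer.waitUntilServerReady is the simpler option
for _ in range(10):
    if client.serverLive():
        break
    sleep(1)  # arbitrary one-second delay between liveness checks
else:
    raise RuntimeError("server did not become live in time")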

Load a worker

Inference requests in AMD Inference Server are made to workers. Workers are started as threads in AMD Inference Server and have a defined lifecycle. Before you can make an inference request to a worker, the worker must first be loaded. Loading a worker returns an endpoint that the client should use for subsequent operations.

This worker, Echo, is a simple example worker that accepts an integer as input, adds one to it and returns the sum.

endpoint = client.workerLoad("echo")
amdinfer.waitUntilModelReady(client, endpoint)

Inference

Once the worker is ready, you can make an inference request to it. To do so, first construct a request. Here, the request contains a single integer, which is sent to AMD Inference Server.

data = 3
request = make_request(data)
response = client.modelInfer(endpoint, request)

To make a request, you need to create, at minimum, an amdinfer.InferenceRequest object and add amdinfer.InferenceRequestInput objects to it. Each input object represents an input tensor for the request and has a number of metadata attributes associated with it, such as a name, datatype, and shape. This format is based on KServe's v2 specification. For images, there's also a helper method called amdinfer.ImageInferenceRequest that you can use to create requests. It's used in the ResNet50 Python examples.

def make_request(data):
    """
    Make a request containing an integer

    Args:
        data (int): Data to send

    Returns:
        amdinfer.InferenceRequest: Request
    """
    request = amdinfer.InferenceRequest()
    # each request has one or more input tensors, depending on the worker/model that's going to process it
    input_0 = amdinfer.InferenceRequestInput()
    input_0.name = f"input0"
    input_0.setUint32Data(np.array([data], np.uint32))
    input_0.datatype = amdinfer.DataType.UINT32
    input_0.shape = [1]
    request.addInputTensor(input_0)
    return request

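For image models, the amdinfer.ImageInferenceRequest helper mentioned above packages image data into a request for you. The line below is only a sketch: it assumes the helper accepts a NumPy image array, so check the ResNet50 Python examples for the authoritative usage.

# hedged sketch: assumes ImageInferenceRequest accepts a NumPy image array
image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder image data
image_request = amdinfer.ImageInferenceRequest(image)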

Validate the response

After making the inference, you should check the response. First, make sure the inference didn't fail by checking if the response is erroneous. Then, you can get the outputs and examine them. The format and contents of the output data depend on the worker and model used for inference. In this case, the Echo worker returns a single output tensor containing one integer that should be one larger than what was sent. The assertions here check these conditions.

assert not response.isError(), response.getError()
outputs = response.getOutputs()
assert len(outputs) == 1
for output in outputs:
    recv_data = output.getUint32Data()
    assert len(recv_data) == 1
    assert recv_data[0] == data + 1

Clean up

Workers that are loaded in AMD Inference Server persist until the server shuts down or they're explicitly unloaded. While it's not shown here, the Python API provides an unload() method for this purpose. As the script ends, the Server object's destructor will clean up any active workers on the server and shut it down.
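
For example, an explicit unload could look like the line below. This is only a sketch: it assumes the client exposes the unload operation as workerUnload taking the endpoint returned by workerLoad, so consult the API reference for the exact name.

# assumed client-side unload call; the actual method name may differ
client.workerUnload(endpoint)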

Next steps

This example demonstrates the complete process of starting the server, creating a client, loading a worker, and making an inference. Depending on your use case, you may only perform a subset of these actions, but this provides an overview of what's happening behind the scenes. You can look at the more sophisticated examples for a closer look at the steps involved in making inferences.