Hello World (Python)

AMD Inference Server’s Python API allows you to start the server and send inference requests to it directly from Python. This example walks through starting the server and sending requests to a simple model. The complete script used here is available at examples/python/hello_world_rest.py.

Import the library

We need to bring in the AMD Inference Server Python library, proteus. The library’s source code is in src/python, and it is installed in the development container on startup. We also import sleep from the standard library so we can poll the server while we wait for it to come up.

from time import sleep
import proteus

Create the client object

We assume that the server will be running locally on the default HTTP port (8998) and pass that address to the client. In this example, we’ll be using REST to communicate with the server, so we create an HttpClient object.

client = proteus.clients.HttpClient("http://127.0.0.1:8998")

Is AMD Inference Server already running?

If the server is already started externally, we don’t want to start it again, so we first check whether the server is live. If it isn’t, we start the server ourselves and wait for it to come up. Either way, the client will communicate with the server at http://127.0.0.1:8998.

# Only start the server if it isn't already live
start_server = not client.serverLive()
if start_server:
    server = proteus.servers.Server()
    server.startHttp(8998)
    # Wait for the server's HTTP endpoint to come up before continuing
    while not client.serverLive():
        sleep(1)

Load a worker

Inference requests in AMD Inference Server are made to workers. Workers run as threads inside the server and have a defined lifecycle. Before a worker can accept inference requests, it must first be loaded. Loading a worker returns an identifier that the client uses for all subsequent operations on that worker.

This worker, Echo, is a simple example worker that accepts integers as input, adds one to each, and returns the results. We’ll use it to demonstrate the Python flow.

worker_name = client.workerLoad("Echo")

# Wait until the worker reports that it's ready to accept requests
while not client.modelReady(worker_name):
    sleep(1)

Inference

Once the worker is ready, we can make an inference request to it. We construct a request containing five integers and send it to AMD Inference Server. NumericalInferenceRequest is a helper class that simplifies building a request in the right format.

data = [3, 1, 4, 1, 5]
request = NumericalInferenceRequest(data)
response = client.modelInfer(worker_name, request)
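
To give a sense of what the right format looks like on the wire, the sketch below builds a roughly equivalent request body by hand and posts it over REST. It assumes the server exposes a KServe-style v2 endpoint at /v2/models/<worker>/infer and that each integer becomes its own UINT32 tensor of shape [1]; the tensor names here are invented for illustration, so treat this as a sketch of the idea rather than the exact payload NumericalInferenceRequest produces.

import requests

# Illustrative only: a hand-built request body in a KServe-style v2 format.
# The endpoint path, tensor names, and datatype are assumptions, not the
# exact payload the helper produces.
payload = {
    "inputs": [
        {"name": f"input{i}", "shape": [1], "datatype": "UINT32", "data": [value]}
        for i, value in enumerate(data)
    ]
}
raw_response = requests.post(
    f"http://127.0.0.1:8998/v2/models/{worker_name}/infer", json=payload
)
print(raw_response.json())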

Validate the response

Now, we want to check what our response is. Here, we can simplify our checks because we already know what to expect: the number of outputs should match the number of inputs we sent, and each output should hold a single value that is one more than the corresponding input, since that’s what the Echo worker does.

assert not response.isError(), response.getError()
outputs = response.getOutputs()
assert len(outputs) == len(data)
for index, output in enumerate(outputs):
    recv_data = output.getUint32Data()
    assert len(recv_data) == 1
    assert recv_data[0] == data[index] + 1

Clean up

Workers that are loaded in AMD Inference Server persist until the server shuts down or they’re explicitly unloaded. While it’s not shown here, the Python API provides an unload() method for this purpose. Finally, if we started the server from Python, we shut it down before finishing. Any workers still loaded at that point are cleaned up before shutdown.
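
As a rough sketch, explicit cleanup could look something like the following. The workerUnload() and stopHttp() calls are assumptions that mirror the workerLoad() and startHttp() calls used above, so check the API reference for the exact names.

# Illustrative cleanup sketch; workerUnload() and stopHttp() are assumed
# names that mirror the workerLoad() and startHttp() calls used above.
client.workerUnload(worker_name)

# Only stop the server if we were the ones who started it
if start_server:
    server.stopHttp()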