Client API

The client API enables users to interact with the inference server. This page highlights some of the important methods that you can use.

Include the API

After installing the client API, you can include it in your code with a single line:

#include "amdinfer/amdinfer.hpp"

Create a client object

A client object enables you to talk to the server using the protocol of your choice. The inference server supports the following protocols, any of which you can use independently:

// native - the server must be started in the same process
amdinfer::Server server;
amdinfer::NativeClient client(&server);

// HTTP/REST
amdinfer::HttpClient client{"http://127.0.0.1:8998"};

// gRPC
amdinfer::GrpcClient client{"127.0.0.1:50051"};

All clients share the same interface after the initial construction, so you can use any of them in the steps that follow.

Server status

You can check the state and health of the server using the following methods, which are available on all client objects: serverMetadata(), serverLive(), serverReady(), modelReady(), modelMetadata() and hasHardware().

These base methods of client objects enable the following helper functions that take a client object as the first argument: serverHasExtension(), waitUntilServerReady(), waitUntilModelReady() and waitUntilModelNotReady().
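
For example, a startup check might look like the following sketch. It assumes the HTTP client from above, a hypothetical "mnist" endpoint, and that the helper functions take a pointer to the client; check the API documentation for the exact signatures in your version.

// confirm the server is live, then block until it and a model are ready
amdinfer::HttpClient client{"http://127.0.0.1:8998"};
if (client.serverLive()) {
    amdinfer::waitUntilServerReady(&client);
    amdinfer::waitUntilModelReady(&client, "mnist");  // "mnist" is a placeholder endpoint
}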

You can see more information about these functions in the API documentation for C++ and Python.

Loading a backend

If the server you are using is not already set up to serve inference requests for your model, you may need to load a backend for it first.

Client objects provide two methods to load backends: modelLoad() and workerLoad(). The former loads the named model from the model repository, while the latter is a lower-level method that loads a backend directly, given a path to a particular model file. The path to the model file, and other load-time parameters, can be passed to the server with these methods. Each backend defines its own load-time parameters, so check the documentation for the backend you want to use.

When you load a backend, you get back an endpoint that you can use to make further requests. For modelLoad(), the endpoint is the name of the model you pass to the method. For workerLoad(), the server assigns an endpoint and returns it from the call.
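
The following sketch contrasts the two methods. The model name, worker name, file path and the ParameterMap type used for load-time parameters are illustrative, so check the API documentation for your version.

amdinfer::ParameterMap parameters;

// load a model by name from the server's model repository;
// the endpoint is the same as the model name
client.modelLoad("mnist", parameters);
std::string repository_endpoint = "mnist";

// load a backend directly with a path to a particular model file;
// the server assigns an endpoint and returns it
parameters.put("model", "/path/to/mnist.onnx");
std::string worker_endpoint = client.workerLoad("migraphx", parameters);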

You can also load ensembles with loadEnsemble().

Making an inference request

A basic inference request to the server consists of a list of input tensors. You can construct a request with something like this:

amdinfer::InferenceRequest request;

// data points to the raw tensor data, e.g. a 224x224x3 FP32 image
// void* data = ...;
amdinfer::InferenceRequestInput input_tensor{
    data, {224, 224, 3}, amdinfer::DataType::FP32, "input"
};
request.addInputTensor(input_tensor);

// call addInputTensor() again for each additional input tensor the model expects

Once you have a request, you can use the client’s modelInfer() method:

// endpoint is a string from loading a backend or provided to you
auto response = client.modelInfer(endpoint, request);

The client API also provides other methods for making inferences, such as modelInferAsync() and inferAsyncOrdered(). You can see more information about the available methods in the API documentation for C++ and Python.
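
For example, modelInferAsync() lets you overlap other work with the request. This sketch assumes it returns a std::future holding the response:

// submit the request without blocking and collect the response later
auto future = client.modelInferAsync(endpoint, request);
// ... do other work ...
auto response = future.get();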

Parsing the response

The basic response from the inference server consists of an array of output tensors.

auto response = client.modelInfer(endpoint, request);
if (!response.isError()) {
    auto output_tensors = response.getOutputs();
    for (const auto& output_tensor : output_tensors) {
        // use the tensor's methods to get its name, shape, datatype and data
    }
}
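
To read the actual values, you can cast the tensor's data to the expected type. This sketch continues the loop above and assumes an FP32 output tensor whose buffer and element count are exposed by getData() and getSize():

// inside the loop above: interpret the output buffer as FP32 values
const auto* values = static_cast<const float*>(output_tensor.getData());
for (size_t i = 0; i < output_tensor.getSize(); ++i) {
    // process values[i], e.g. compare class scores
}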

You can see more information about the available methods in the API documentation for C++ and Python.

Next steps

Take a look at the examples to see these APIs used in practice.