Quickstart - Inference¶
This quickstart is intended for a user who is not configuring or maintaining the inference server and is just making inference requests to an existing server.
There are multiple ways to make requests to the server, but this quickstart only covers making requests using the inference server's client library, amdinfer.
To make requests, you need the address of the server and the endpoint(s) for the models you want to use for inference. Depending on how it is configured, the server may support HTTP/REST, gRPC or both protocols. Your server administrator can provide this information.
Get the library¶
The amdinfer library allows you to create clients that you can use to communicate with the server over any protocol that the server supports.
Clients for different protocols have the same base set of methods, so you can easily swap one for another.
The library can be used from C++ or Python.
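For example, because the clients share the same interface, a small helper that takes a client object works with either protocol. This is only an illustrative sketch: the run_inference name is hypothetical, and modelInfer is the inference call covered later in this quickstart.
import amdinfer

def run_inference(client, endpoint, request):
    # 'client' can be an amdinfer.HttpClient or an amdinfer.GrpcClient;
    # both expose the same modelInfer method
    return client.modelInfer(endpoint, request)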
To use the Python library, install it with pip:
$ pip install amdinfer
To use both the C++ and Python libraries, you can work inside the development container:
$ git clone https://github.com/Xilinx/inference-server.git
$ cd inference-server
$ python3 docker/generate.py
$ ./amdinfer dockerize
$ ./amdinfer run --dev --net-host
This builds the development container, starts it, and drops you into a terminal inside it. Then, inside the container:
$ amdinfer install
This will install both the C++ and Python libraries in the container, which you can confirm:
$ pip list | grep amdinfer
$ echo -e '#include "amdinfer/amdinfer.hpp"\nint main(){return 0;}' | g++ -x c++ -std=c++17 -o test.out /dev/stdin
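As an additional sanity check, you can confirm from Python itself that the package imports cleanly. This is plain Python and assumes nothing beyond a successful install:
import amdinfer  # raises ImportError if the Python library is not installed
print(amdinfer.__file__)  # shows where the package was installed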
Running the examples¶
The AMD Inference Server repository includes examples that demonstrate running an end-to-end inference request.
To run the examples listed here, you need to ensure that the amdinfer Python library is installed and that the server you're using has a compatible model loaded.
These examples assume that a ResNet50 model trained on the ImageNet dataset is available on the server. They use a sample image and labels from the inference server repository, but you can also use your own.
First, download the sample image and labels:
$ wget https://github.com/Xilinx/inference-server/raw/v0.3.0/examples/resnet50/imagenet_classes.txt
$ wget https://github.com/Xilinx/inference-server/raw/v0.3.0/tests/assets/dog-3619020_640.jpg
Then, download and run the example script that matches the backend serving the model. For TensorFlow + ZenDNN (over gRPC):
$ wget https://github.com/Xilinx/inference-server/raw/v0.3.0/examples/resnet50/tfzendnn.py
$ python3 tfzendnn.py --ip <ip_address> --grpc-port <port> --endpoint <endpoint> --image ./dog-3619020_640.jpg --labels ./imagenet_classes.txt
For MIGraphX (over HTTP):
$ wget https://github.com/Xilinx/inference-server/raw/v0.3.0/examples/resnet50/migraphx.py
$ python3 migraphx.py --ip <ip_address> --http-port <port> --endpoint <endpoint> --image ./dog-3619020_640.jpg --labels ./imagenet_classes.txt
For Vitis AI (over HTTP):
$ wget https://github.com/Xilinx/inference-server/raw/v0.3.0/examples/resnet50/vitis.py
$ python3 vitis.py --ip <ip_address> --http-port <port> --endpoint <endpoint> --image ./dog-3619020_640.jpg --labels ./imagenet_classes.txt
Using the library¶
The examples above demonstrate a full end-to-end inference using the amdinfer Python library on a specific ResNet50 model.
You can write your own scripts and programs to make inference requests to other models.
The examples all follow the same basic pattern, so you can use them for reference.
The first step is to create a client. The type of client you create depends on which protocols the server you're using supports and which one you want to use.
# your server administrator must provide the values for these variables:
# - http_server_addr: HTTP address of the server, if supported
# - grpc_server_addr: gRPC address of the server, if supported
# - endpoint: string to identify the model for inference. If there are
# multiple models available, each model will have its own
# endpoint that you can use to request inferences from it
http_server_addr = "http://127.0.0.1:8998"
grpc_server_addr = "127.0.0.1:50051"
endpoint = "endpoint"
import amdinfer
# create a client to communicate to the server over HTTP
http_client = amdinfer.HttpClient(http_server_addr)
# create a client to communicate to the server over gRPC
grpc_client = amdinfer.GrpcClient(grpc_server_addr)
// your server administrator must provide the values for these variables:
// - http_server_addr: HTTP address of the server, if supported
// - grpc_server_addr: gRPC address of the server, if supported
// - endpoint: string to identify the model for inference. If there are
// multiple models available, each model will have its own
// endpoint that you can use to request inferences from it
const std::string http_server_addr = "http://127.0.0.1:8998";
const std::string grpc_server_addr = "127.0.0.1:50051";
const std::string endpoint = "endpoint";
#include "amdinfer/amdinfer.hpp"
// create a client to communicate to the server over HTTP
const amdinfer::HttpClient http_client{http_server_addr};
// create a client to communicate to the server over gRPC
const amdinfer::GrpcClient grpc_client{grpc_server_addr};
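Before sending requests, it can be useful to confirm that the server is reachable and that your model is available. The sketch below assumes the client exposes KServe-style serverLive() and modelReady() methods; check the amdinfer API documentation for the exact method names in your version.
# hedged sketch: serverLive() and modelReady() are assumed to be available
# on the client, following the KServe-style health API
client = amdinfer.HttpClient(http_server_addr)
if not client.serverLive():
    raise RuntimeError("server is not reachable")
if not client.modelReady(endpoint):
    raise RuntimeError("model is not ready for inference")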
The library also defines the object that you have to populate to send a request: an InferenceRequest.
A request is made up of, at minimum, one or more InferenceRequestInput objects that define the input tensor(s) of your request.
Each tensor must have a name, a data type, an associated shape and the data itself.
request = amdinfer.InferenceRequest()
input_tensor = amdinfer.InferenceRequestInput()
# depending on the model, the string used here may be significant
input_tensor.name = "input_0"
input_tensor.datatype = amdinfer.DataType.INT64
input_tensor.shape = [2, 3]
# the data should be flattened
input_tensor.setInt64Data([1, 2, 3, 4, 5, 6])
request.addInputTensor(input_tensor)
response = http_client.modelInfer(endpoint, request)
# either client can be used interchangeably
# response = grpc_client.modelInfer(endpoint, request)
amdinfer::InferenceRequest request;
amdinfer::InferenceRequestInput input_tensor;
// depending on the endpoint, the string used here may be significant
input_tensor.setName("input_0");
input_tensor.setDatatype(amdinfer::DataType::Int64);
input_tensor.setShape({2, 3});
// the data should be flattened; the element type must match the declared datatype (Int64)
std::vector<int64_t> data{1, 2, 3, 4, 5, 6};
input_tensor.setData(data.data());
request.addInputTensor(input_tensor);
auto response = http_client.modelInfer(endpoint, request);
// either client can be used interchangeably
// auto response = grpc_client.modelInfer(endpoint, request);
The inference returns an InferenceResponse object that you can examine to get the results.
assert not response.isError()
output_tensors = response.getOutputs()
for output_tensor in output_tensors:
    shape = output_tensor.shape
    datatype = output_tensor.datatype
    # this model's output holds FP32 data, so use the matching getter
    data = output_tensor.getFp32Data()
    size = output_tensor.getSize()
assert(!response.isError());
// vector of amdinfer::InferenceResponseOutput objects
auto output_tensors = response.getOutputs();
for (auto& output_tensor : output_tensors) {
  auto shape = output_tensor.getShape();
  auto datatype = output_tensor.getDatatype();
  // for an FP32 output, the raw data pointer can be cast to float*
  auto* data = static_cast<float*>(output_tensor.getData());
  auto size = output_tensor.getSize();
}
These examples show how to use the library to construct a request, make an inference, and examine the response. They do not show any model-specific pre- and post-processing that may be needed; if your model requires it, you must implement it yourself or reuse an existing implementation.
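For example, for a ResNet50 classifier like the one used above, post-processing typically means finding the highest-scoring classes in the output tensor. The sketch below is illustrative only: top_classes is a hypothetical helper (not part of amdinfer), and it assumes a single FP32 output of ImageNet class scores plus a labels file like imagenet_classes.txt.
def top_classes(response, labels_path, k=5):
    # illustrative ImageNet post-processing; not provided by amdinfer
    with open(labels_path) as f:
        labels = [line.strip() for line in f]
    output_tensor = response.getOutputs()[0]
    scores = output_tensor.getFp32Data()
    # indices of the k largest scores, best first
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(labels[i], scores[i]) for i in top_k]

# for example: print(top_classes(response, "./imagenet_classes.txt"))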
For more information about usage and the available methods, look at the examples or the documentation for the C++ and Python APIs.