Running an XModel (C++)

This example walks you through making an inference request to a custom XModel in C++. It is similar to the Running an XModel (Python) example but uses C++ to create a new executable instead of making requests to a server. The complete program used here is available at examples/cpp/custom_processing.cpp.

Include the library

Xilinx Inference Server’s C++ API allows you to write your own C++ applications that link directly against the server’s backend. This approach bypasses the serialization overhead that REST-based inferencing from Python incurs. The public API is defined in proteus/proteus.hpp, which we include here.

#include "proteus/proteus.hpp"  // for InferenceResponseFuture, terminate

User variables

Making an inference request to an XModel that accepts an image requires some additional data from the user. These variables are pulled out into a separate block to highlight them.

  • Batch size: The DPU your XModel targets may have a preferred batch size, so we use this value to create an optimally-sized request.

  • XModel Path: The XModel you want to run must exist at a path on the machine where the server runs. Here, we use a ResNet50 image classification model trained on the ImageNet dataset.

  • Image Path: To test this model, we need to use an image. Here, we use a sample image included for testing.

const auto batch_size = 4;
const auto* path_to_xmodel =
  "${AKS_XMODEL_ROOT}/artifacts/u200_u250/resnet_v1_50_tf/"
  "resnet_v1_50_tf.xmodel";
const auto path_to_image = root + "/tests/assets/dog-3619020_640.jpg";
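// "root" (used above) is defined earlier in the full example, presumably as the
// repository root; it is omitted from this snippet.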

Initialize

Before calling any other methods in the API, we first need to initialize Xilinx Inference Server.

proteus::initialize();

Load a worker

After initialization, we have to load the worker(s) we need for inference. To make an inference request to a custom XModel, we use the Xmodel worker. Some workers accept load-time parameters to configure different options, and the Xmodel worker is one such worker: here, we pass the path to the XModel we want to use. If a worker doesn’t accept or need load-time parameters, a null pointer can be passed instead.

proteus::RequestParameters parameters;
parameters.put("xmodel", path_to_xmodel);
auto workerName = proteus::load("Xmodel", &parameters);

The return value from the load is the qualified name that should be used for subsequent operations such as inference.

Get images

Now, we can prepare our request. In this example, we use one image and duplicate it batch_size times. The ResNet50 model we’re using requires some pre-processing before inference, so we add a pre-processing step that opens the images, resizes them to the size the model expects, normalizes their values, and returns the images as vectors of data. The implementation of the pre-processing function can be seen in the example’s source; a rough sketch follows the snippet below.

std::vector<std::string> paths;
paths.reserve(batch_size);

for (auto i = 0; i < batch_size; i++) {
  paths.emplace_back(path_to_image);
}

auto images = preprocess(paths);
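
The details of the pre-processing aren’t needed to follow the example, but as an illustration, the sketch below shows one plausible preprocess implementation using OpenCV. The signature is inferred from how the function is called above, and the per-channel mean values are assumptions for illustration; the authoritative implementation is in examples/cpp/custom_processing.cpp.

#include <algorithm>
#include <array>
#include <cstdint>
#include <string>
#include <vector>

#include <opencv2/opencv.hpp>

// Sketch only: open each image, resize it to 224x224, subtract assumed
// per-channel means, and pack the result into a signed 8-bit buffer.
std::vector<std::vector<int8_t>> preprocess(const std::vector<std::string>& paths) {
  const std::array<int, 3> mean = {123, 107, 104};  // assumed BGR channel means
  std::vector<std::vector<int8_t>> images;
  images.reserve(paths.size());
  for (const auto& path : paths) {
    auto img = cv::imread(path);               // load as 8-bit BGR
    cv::resize(img, img, cv::Size(224, 224));  // match the model's input size
    std::vector<int8_t> data;
    data.reserve(224 * 224 * 3);
    for (int h = 0; h < img.rows; h++) {
      for (int w = 0; w < img.cols; w++) {
        const auto& pixel = img.at<cv::Vec3b>(h, w);
        for (int c = 0; c < 3; c++) {
          // subtract the mean and clamp into the int8 range the model expects
          auto value = std::clamp(pixel[c] - mean[c], -128, 127);
          data.push_back(static_cast<int8_t>(value));
        }
      }
    }
    images.push_back(std::move(data));
  }
  return images;
}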

Inference

Using our images, we can construct a request to Xilinx Inference Server’s backend. We hardcode the shape of the image to the size we know the pre-processing enforces, and we pass the data type, which we also know from the pre-processing to be a signed 8-bit integer. We can then send the request to Xilinx Inference Server by passing in the name returned by a previous load() along with the request itself. Enqueueing the request returns a Future object that we can later check to get the response. For now, we push these Future objects into a queue to save them for later.

const std::initializer_list<uint64_t> shape = {224, 224, 3};
std::queue<proteus::InferenceResponseFuture> queue;

proteus::InferenceRequest request;
for (auto i = 0; i < batch_size; i++) {
  request.addInputTensor(static_cast<void*>(images[i].data()), shape,
                         proteus::types::DataType::INT8);
}
queue.push(proteus::enqueue(workerName, request));

Check the response

At some point later, we can use our queue of Future objects to get the response for each inference. Calling get() on a Future object blocks until the response is available.

auto front = std::move(queue.front());
queue.pop();
auto results = front.get();

The Future object returns a proteus::InferenceResponse object, which can be parsed to analyze the data. For this model, we need to post-process the raw output from Xilinx Inference Server to make a useful classification for our image. For each output, we post-process the results to extract the top k indices for the classification (a sketch of this appears after the snippet below). We can check these indices against our expected golden output to confirm that the inference is correct.

auto outputs = results.getOutputs();
for (auto& output : outputs) {
  auto top_k = postprocess(output, k);
  for (size_t j = 0; j < k; j++) {
    if (top_k[j] != gold_response_output[j]) {
      std::cerr << "Output (" << top_k[j] << ") does not match golden ("
                << gold_response_output[j] << ")\n";
      proteus::terminate();
      return 1;
    }
  }
}
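
The core of this post-processing is a top-k selection over the output scores. The sketch below shows one way to implement that selection on a plain vector of scores and is illustrative only: the real postprocess in the example’s source also extracts the raw data from the response output object (and may apply a softmax first), and both k and gold_response_output are defined earlier in the full program. The name top_k_indices is hypothetical.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch only: return the indices of the k largest scores, highest first.
// Assumes k <= scores.size().
std::vector<size_t> top_k_indices(const std::vector<float>& scores, size_t k) {
  std::vector<size_t> indices(scores.size());
  std::iota(indices.begin(), indices.end(), 0);  // 0, 1, ..., N-1
  std::partial_sort(
    indices.begin(), indices.begin() + k, indices.end(),
    [&scores](size_t a, size_t b) { return scores[a] > scores[b]; });
  indices.resize(k);
  return indices;
}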

Clean up

To safely end the program, we should signal the Xilinx Inference Server backend to shut down. During shutdown, Xilinx Inference Server stops any workers that haven’t been unloaded yet and safely shuts down its management threads. Note that this same function was also called before returning early in the validation step above.

proteus::terminate();