Running ResNet50 - C++

This page walks you through the C++ versions of the ResNet50 examples. These examples are intended to run in the development container because you need to build and compile these executables. You can see the full source files used here in the repository for more details.

The inference server binds its C++ API to Python, so the Python usage and functions look similar to their C++ counterparts, though there are some differences due to the features available in each language. You can read the Python version of this example to compare the two.

Note

These examples are intended to demonstrate the API and how to communicate with the server. They are not intended to show the most optimal performance for each backend.

Include the header

AMD Inference Server’s C++ API allows you to write your own C++ client applications that can communicate with the inference server. You can include the entire public API by including amdinfer/amdinfer.hpp or selectively include header files as needed.

#include "amdinfer/amdinfer.hpp"

Start the server

This example assumes that the server is not running elsewhere and starts it locally from C++ instead. Creating the amdinfer::Server object starts the inference server backend and it stays alive as long as the object remains in scope.

  std::optional<amdinfer::Server> server;

Depending on what protocol you want to use to communicate with the server, you may need to start it explicitly using one of the amdinfer::Server object’s methods.

For example, if you were using HTTP/REST:

  if (args.ip == "127.0.0.1" && !client.serverLive()) {
    std::cout << "No server detected. Starting locally...\n";
    server.emplace();
    server.value().startHttp(args.http_port);
  } else if (!client.serverLive()) {
    throw amdinfer::connection_error("Could not connect to server at " +
                                     server_addr);
  } else {
    // the server is reachable so continue on
  }

With gRPC:

  if (args.ip == "127.0.0.1" && !client.serverLive()) {
    std::cout << "No server detected. Starting locally...\n";
    server.emplace();
    server.value().startGrpc(args.grpc_port);
  } else if (!client.serverLive()) {
    throw amdinfer::connection_error("Could not connect to server at " +
                                     server_addr);
  } else {
    // the server is reachable so continue on
  }

The native C++ API does not need an explicit start: creating the Server object is sufficient.

If the server is already running somewhere, you don’t need to do this.

Create the client object

The amdinfer::Client base class defines how to communicate with the server over the supported protocols. This client protocol is based on KServe’s API. Each protocol inherits from this class and implements its methods. Some examples of clients that you can create:

  // vitis.cpp
  const auto http_port_str = std::to_string(args.http_port);
  const auto server_addr = "http://" + args.ip + ":" + http_port_str;
  amdinfer::HttpClient client{server_addr};
  // tfzendnn.cpp
  const auto grpc_port_str = std::to_string(args.grpc_port);
  const auto server_addr = args.ip + ":" + grpc_port_str;
  amdinfer::GrpcClient client{server_addr};

  // tfzendnn.cpp
  std::optional<amdinfer::Server> server;
  if (args.ip == "127.0.0.1" && !client.serverLive()) {
    std::cout << "No server detected. Starting locally...\n";
    server.emplace();
    server.value().startGrpc(args.grpc_port);
  } else if (!client.serverLive()) {
    throw amdinfer::connection_error("Could not connect to server at " +
                                     server_addr);
  } else {
    // the server is reachable so continue on
  }

  std::cout << "Waiting until the server is ready...\n";
  amdinfer::waitUntilServerReady(&client);

  // ptzendnn.cpp
  std::cout << "Starting server locally...\n";
  amdinfer::Server server;
  amdinfer::NativeClient client(&server);

  std::cout << "Waiting until the server is ready...\n";
  amdinfer::waitUntilServerReady(&client);

Load a worker

Once the server is ready, you have to load a worker to handle your request. This worker, and any load-time parameters it accepts, are backend-specific. After loading the worker, you get back an endpoint string that you use to make requests to this worker. If a worker is already ready on the server or you already have an endpoint, then you don’t need to do this.

Here are some of the different workers you can start to perform inference on a ResNet50 model.

XModel - Vitis AI on AMD FPGA:

amdinfer::ParameterMap parameters;
parameters.put("model", args.path_to_model);
parameters.put("batch_size", args.batch_size);
std::string endpoint = client->workerLoad("xmodel", parameters);

TF+ZenDNN - ZenDNN on AMD CPU:

amdinfer::ParameterMap parameters;
parameters.put("model", args.path_to_model);
parameters.put("input_size", args.input_size);
parameters.put("output_classes", args.output_classes);
parameters.put("input_node", args.input_node);
parameters.put("output_node", args.output_node);
parameters.put("batch_size", args.batch_size);
std::string endpoint = client->workerLoad("tfzendnn", parameters);
amdinfer::waitUntilModelReady(client, endpoint);

MIGraphX - MIGraphX on AMD GPU:

amdinfer::ParameterMap parameters;
const auto timeout_ms = 1000;  // batcher timeout value in milliseconds

// Required: specifies path to the model on the server for it to open
parameters.put("model", args.path_to_model);
// Optional: request a particular batch size to be sent to the backend. The
// server will attempt to coalesce incoming requests into a single batch of
// this size and pass it all to the backend.
parameters.put("batch", args.batch_size);
// Optional: specifies how long the batcher should wait for more requests
// before sending the batch on
parameters.put("timeout", timeout_ms);
std::string endpoint = client->workerLoad("migraphx", parameters);
amdinfer::waitUntilModelReady(client, endpoint);

After loading a worker, make sure it’s ready before attempting to make an inference.

  amdinfer::waitUntilModelReady(&client, endpoint);

Prepare images

Depending on the model, you may need to perform some preprocessing of the data before making an inference request. For ResNet50, this preprocessing generally consists of resizing the image, normalizing its values, and possibly converting types but its exact implementation depends on the model and on what the worker expects. The implementations of the preprocessing functions can be seen in the examples’ sources and they differ for each backend.
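As a rough illustration of the normalization step only, here is a minimal sketch in plain C++. The function name, the per-channel mean values (common ImageNet means), and the float output type are assumptions for this sketch, not taken from the examples' sources, which also handle resizing, type conversion, and backend-specific layouts.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical normalization sketch: subtract per-channel means from
// interleaved RGB pixel data. The mean values are common ImageNet
// means and are assumptions here; real backends may expect int8
// output or a different channel layout.
std::vector<float> normalize(const std::vector<std::uint8_t>& pixels) {
  const float means[3] = {123.68F, 116.78F, 103.94F};
  std::vector<float> out;
  out.reserve(pixels.size());
  for (std::size_t i = 0; i < pixels.size(); ++i) {
    out.push_back(static_cast<float>(pixels[i]) - means[i % 3]);
  }
  return out;
}
```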

In these examples, you can pass the executable a path to an image or to a directory. This single path is converted to a vector of paths, containing either just the path you passed in or the paths to all the files in the directory, and this vector is passed to the preprocess function. The file at each path is opened and stored in an std::vector<T>, where T depends on the data type that a backend works with. Since there may be many images, preprocess returns an std::vector<std::vector<T>>.

  std::vector<std::string> paths = resolveImagePaths(args.path_to_image);
  Images images = preprocess(paths);
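A minimal sketch of what a resolveImagePaths helper might look like using std::filesystem, assuming it only distinguishes files from directories; the actual helper in the examples may filter by file extension or handle errors differently.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical sketch: if the path is a directory, collect every
// regular file inside it; otherwise, return a vector holding just the
// given path.
std::vector<std::string> resolveImagePaths(const std::string& path) {
  namespace fs = std::filesystem;
  std::vector<std::string> paths;
  if (fs::is_directory(path)) {
    for (const auto& entry : fs::directory_iterator(path)) {
      if (entry.is_regular_file()) {
        paths.push_back(entry.path().string());
      }
    }
  } else {
    paths.push_back(path);
  }
  return paths;
}
```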

Construct requests

Using the images after preprocessing, you can construct requests to the inference server. For each image, you create an InferenceRequest and add input tensors to it. The ResNet50 model only accepts a single input tensor so you just add one by specifying the image data, its shape, and data type. In this example, you create a vector of such requests.

std::vector<amdinfer::InferenceRequest> requests;
requests.reserve(images.size());

const std::initializer_list<uint64_t> shape = {input_size, input_size, 3};

for (const auto& image : images) {
  requests.emplace_back();
  // NOLINTNEXTLINE(google-readability-casting)
  requests.back().addInputTensor((void*)image.data(), shape,
                                 amdinfer::DataType::Int8);
}

Make an inference

There are multiple ways of making a request to the inference server, some of which are used in the different implementations of these examples. Before processing the response, you should verify it’s not an error.

Then, you can examine the outputs and, depending on the model, postprocess the results. For ResNet50, the raw results from the inference server list the probabilities for each output class. The postprocessing identifies the highest-probability classes, and the top few of these are printed using the labels file to map the indices to human-readable names.
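The top-k selection at the heart of this postprocessing can be sketched as follows. The topK name and the plain std::vector<float> input are assumptions for illustration; the real postprocess functions also extract the scores from the InferenceResponseOutput and differ per backend.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sketch of top-k selection: given one score per class, return the
// indices of the k highest scores in descending order.
std::vector<int> topK(const std::vector<float>& scores, int k) {
  std::vector<int> indices(scores.size());
  std::iota(indices.begin(), indices.end(), 0);  // 0, 1, 2, ...
  // sort only the first k positions, comparing by score
  std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                    [&scores](int a, int b) { return scores[a] > scores[b]; });
  indices.resize(k);
  return indices;
}
```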

Here are some examples of making an inference used in these examples:

This is the simplest way to make an inference: a blocking single inference, where you loop through all the requests and send them to the server one at a time.

    // vitis.cpp
    amdinfer::InferenceResponse response =
      client.modelInfer(endpoint, request);
    assert(!response.isError());

    std::vector<amdinfer::InferenceResponseOutput> outputs =
      response.getOutputs();
    // for resnet50, we expect a single output tensor
    assert(outputs.size() == 1);
    std::vector<int> top_indices = postprocess(outputs[0], args.top);
    printLabel(top_indices, args.path_to_labels, image_path);

You can also make a single asynchronous request to the server where you get back a std::future that you can use later to get the results of the inference.

// ptzendnn.cpp
amdinfer::InferenceResponseFuture future =
  client.modelInferAsync(endpoint, request);
amdinfer::InferenceResponse response = future.get();
assert(!response.isError());

std::vector<amdinfer::InferenceResponseOutput> outputs =
  response.getOutputs();
// for resnet50, we expect a single output tensor
assert(outputs.size() == 1);
std::vector<int> top_indices = postprocess(outputs[0], args.top);
printLabel(top_indices, args.path_to_labels, image_path);

There are also some helper methods that wrap the basic inference APIs provided by the client. The inferAsyncOrdered method accepts a vector of requests, makes all the requests asynchronously using the modelInferAsync API, waits until each request completes, and then returns a vector of responses. If there are multiple requests sent in this way, they may be batched together by the server.

  // migraphx.cpp
  std::vector<amdinfer::InferenceResponse> responses =
    amdinfer::inferAsyncOrdered(&client, endpoint, requests);
  assert(num_requests == responses.size());

  for (auto i = 0U; i < num_requests; ++i) {
    const amdinfer::InferenceResponse& response = responses[i];
    assert(!response.isError());

    std::vector<amdinfer::InferenceResponseOutput> outputs =
      response.getOutputs();
    // for resnet50, we expect a single output tensor
    assert(outputs.size() == 1);
    std::vector<int> top_indices = postprocess(outputs[0], args.top);
    printLabel(top_indices, args.path_to_labels, paths[i]);
  }