Convolution int8 inference example with Graph API
This example demonstrates how to build an int8 graph with the Graph API and run it on CPU.
Example code: cpu_inference_int8.cpp
Some assumptions in this example:
Only the workflow is demonstrated; the correctness of the results is not checked.
Unsupported partitions should be handled by users themselves.
Public headers
To start using oneDNN Graph, we must include the dnnl_graph.hpp header file in the application. All the C++ APIs reside in namespace dnnl::graph.
#include <iostream>
#include <memory>
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include <assert.h>
#include "oneapi/dnnl/dnnl_graph.hpp"
#include "example_utils.hpp"
#include "graph_example_utils.hpp"
using namespace dnnl::graph;
using data_type = logical_tensor::data_type;
using layout_type = logical_tensor::layout_type;
using property_type = logical_tensor::property_type;
using dim = logical_tensor::dim;
using dims = logical_tensor::dims;
simple_pattern_int8() function
Build Graph and Get Partitions
In this section, we build a graph that represents an int8 convolution with a ReLU post-op. After that, we can get all of the partitions, which are determined by the backend.
Create input/output dnnl::graph::logical_tensor and op for the first Dequantize.
logical_tensor dequant0_src_desc {0, data_type::u8};
logical_tensor conv_src_desc {1, data_type::f32};
op dequant0(2, op::kind::Dequantize, {dequant0_src_desc}, {conv_src_desc},
"dequant0");
dequant0.set_attr<std::string>(op::attr::qtype, "per_tensor");
dequant0.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
dequant0.set_attr<std::vector<int64_t>>(op::attr::zps, {10});
Create input/output dnnl::graph::logical_tensor and op for the second Dequantize.
logical_tensor dequant1_src_desc {3, data_type::s8};
logical_tensor conv_weight_desc {
4, data_type::f32, 4, layout_type::undef, property_type::constant};
op dequant1(5, op::kind::Dequantize, {dequant1_src_desc},
{conv_weight_desc}, "dequant1");
dequant1.set_attr<std::string>(op::attr::qtype, "per_channel");
// the memory format of the weight is XIO; the channel size of the
// convolution is 64, so per-channel quantization needs 64 scales and
// zero points
std::vector<float> wei_scales(64, 0.1f);
dims wei_zps(64, 0);
dequant1.set_attr<std::vector<float>>(op::attr::scales, wei_scales);
dequant1.set_attr<std::vector<int64_t>>(op::attr::zps, wei_zps);
dequant1.set_attr<int64_t>(op::attr::axis, 1);
Create input/output dnnl::graph::logical_tensor and the op for Convolution.
logical_tensor conv_bias_desc {
6, data_type::f32, 1, layout_type::undef, property_type::constant};
logical_tensor conv_dst_desc {7, data_type::f32, layout_type::undef};
// create the convolution op
op conv(8, op::kind::Convolution,
{conv_src_desc, conv_weight_desc, conv_bias_desc}, {conv_dst_desc},
"conv");
conv.set_attr<dims>(op::attr::strides, {1, 1});
conv.set_attr<dims>(op::attr::pads_begin, {0, 0});
conv.set_attr<dims>(op::attr::pads_end, {0, 0});
conv.set_attr<dims>(op::attr::dilations, {1, 1});
conv.set_attr<std::string>(op::attr::data_format, "NXC");
conv.set_attr<std::string>(op::attr::weights_format, "XIO");
conv.set_attr<int64_t>(op::attr::groups, 1);
Create input/output dnnl::graph::logical_tensor and the op for ReLU.
logical_tensor relu_dst_desc {9, data_type::f32, layout_type::undef};
op relu(10, op::kind::ReLU, {conv_dst_desc}, {relu_dst_desc}, "relu");
Create input/output dnnl::graph::logical_tensor and the op for Quantize.
logical_tensor quant_dst_desc {11, data_type::u8, layout_type::undef};
op quant(
12, op::kind::Quantize, {relu_dst_desc}, {quant_dst_desc}, "quant");
quant.set_attr<std::string>(op::attr::qtype, "per_tensor");
quant.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
quant.set_attr<std::vector<int64_t>>(op::attr::zps, {10});
Finally, the created ops are added to the graph, which internally maintains a list of all added ops. Creating a graph requires a dnnl::engine::kind because the returned partitions may vary across devices. For this example, we use the CPU engine kind.
Create graph and add ops to the graph
graph g(dnnl::engine::kind::cpu);
g.add_op(dequant0);
g.add_op(dequant1);
g.add_op(conv);
g.add_op(relu);
g.add_op(quant);
After adding all the ops, we can get the partitions by calling dnnl::graph::graph::get_partitions().
In this example, the graph will be partitioned into one partition.
auto partitions = g.get_partitions();
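If a graph cannot be fully fused, some of the returned partitions may be unsupported by the backend and, as noted in the assumptions above, must be handled by the user (for example, by falling back to a framework implementation of the contained ops). A minimal sketch of such a check follows; the error handling is illustrative and not part of the original example.
// this example expects the whole graph to fuse into a single partition
assert(partitions.size() == 1);
for (const auto &p : partitions) {
    if (!p.is_supported()) {
        // an unsupported partition contains ops the backend cannot execute;
        // a real integration would run those ops with its own kernels here
        std::cout << "partition " << p.get_id() << " is not supported\n";
        return;
    }
}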
Compile and Execute Partition
In a real integration, the user (for example, a deep learning framework) should provide device information at this stage. In this example, we simply use a self-defined device to simulate the real behavior.
Create a dnnl::engine and attach a user-defined dnnl::graph::allocator to it.
allocator alloc {};
dnnl::engine eng
= make_engine_with_allocator(dnnl::engine::kind::cpu, 0, alloc);
dnnl::stream strm {eng};
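The default-constructed allocator above relies on the library's built-in host memory management. An application that wants to route allocations through its own memory pool can pass host allocate/deallocate hooks to the allocator instead. Below is a hedged sketch, where custom_malloc and custom_free are hypothetical helpers defined at namespace scope (std::aligned_alloc requires <cstdlib> and C++17).
// hypothetical host allocation hooks; the expected signatures are
// void *(size_t size, size_t alignment) and void (void *)
void *custom_malloc(size_t size, size_t alignment) {
    const size_t align = alignment == 0 ? 64 : alignment;
    // std::aligned_alloc requires the size to be a multiple of the alignment
    const size_t rounded = (size + align - 1) / align * align;
    return std::aligned_alloc(align, rounded);
}
void custom_free(void *buf) {
    std::free(buf);
}

// pass the hooks when constructing the allocator
allocator alloc_with_hooks {custom_malloc, custom_free};
dnnl::engine eng_with_hooks
        = make_engine_with_allocator(dnnl::engine::kind::cpu, 0, alloc_with_hooks);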
Compile the partition into a compiled partition using the input and output logical tensors.
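Compilation needs logical tensors that carry concrete shapes and a plain layout, matched by ID to the graph's boundary tensors created earlier. The sketch below assembles them for the single partition obtained above; the shapes (batch 1, 3x3 kernel, 64 input and 64 output channels) are illustrative assumptions for this walkthrough, not values taken from the original example.
// take the single partition produced for this graph
auto partition = partitions[0];

// re-create the partition's boundary logical tensors with the same IDs as in
// the graph, but with concrete shapes and a plain strided layout
logical_tensor dequant0_src_plain {
        0, data_type::u8, {1, 112, 112, 64}, layout_type::strided};
logical_tensor dequant1_src_plain {3, data_type::s8, {3, 3, 64, 64},
        layout_type::strided, property_type::constant};
logical_tensor conv_bias_plain {6, data_type::f32, {64},
        layout_type::strided, property_type::constant};
logical_tensor quant_dst_plain {
        11, data_type::u8, {1, 110, 110, 64}, layout_type::strided};

std::vector<logical_tensor> inputs {
        dequant0_src_plain, dequant1_src_plain, conv_bias_plain};
std::vector<logical_tensor> outputs {quant_dst_plain};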
compiled_partition cp = partition.compile(inputs, outputs, eng);
Execute the compiled partition on the specified stream.
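The compiled partition runs on user-provided buffers, each wrapped into a dnnl::graph::tensor that binds a logical tensor to a data handle on the engine. A minimal sketch, reusing the illustrative shapes chosen at compile time:
// host buffers sized according to the illustrative shapes used above
std::vector<uint8_t> src_data(1 * 112 * 112 * 64);
std::vector<int8_t> wei_data(3 * 3 * 64 * 64);
std::vector<float> bias_data(64);
std::vector<uint8_t> dst_data(1 * 110 * 110 * 64);

// bind each buffer to its logical tensor on the engine
tensor dequant0_src_ts {dequant0_src_plain, eng, src_data.data()};
tensor dequant1_src_ts {dequant1_src_plain, eng, wei_data.data()};
tensor conv_bias_ts {conv_bias_plain, eng, bias_data.data()};
tensor quant_dst_ts {quant_dst_plain, eng, dst_data.data()};

std::vector<tensor> inputs_ts {dequant0_src_ts, dequant1_src_ts, conv_bias_ts};
std::vector<tensor> outputs_ts {quant_dst_ts};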
cp.execute(strm, inputs_ts, outputs_ts);
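Finally, wait on the stream until the execution has finished before reading the results back from the output buffer.
strm.wait();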