This document describes the compute and memory model of the AIR dialect, defines the semantics of its operations, and explains how those operations map onto the two supported hardware backends: AMD NPU (AIE) and AMD GPU (ROCDL/HIP).
The AIR dialect is designed to be architecture-independent at the programming-model
level. A program expressed in terms of air.launch, air.segment, air.herd, and the
associated data-movement operations carries a well-defined meaning that can be compiled
to either backend without changing the source IR.
A detailed description of the dialect and its use for mapping AI workloads onto AMD NPUs can be found in:
E. Wang et al. “From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR”. arXiv:2510.14871, October 2025.
The AIR model defines a three-level execution hierarchy:
air.launch — outermost level; groups co-resident work
air.segment — intermediate level; groups cores with shared on-device memory
air.herd — innermost level; an array of processing elements (PEs)
Each level defines an iteration space (a multi-dimensional index range) and a body
region that executes across that iteration space. The bodies of launch and segment
execute at the host/controller level, while the body of herd executes in data-
parallel fashion across each point of its iteration space.
Execution at each level is spatially parallel: points in the iteration space may execute concurrently on distinct hardware resources.
Operands from the enclosing scope are passed explicitly as kernel operands into each
level. Levels are IsolatedFromAbove, meaning they cannot implicitly capture values from
outer regions.
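Putting the three levels together, a minimal nest might be sketched as follows (sizes, names, and types are illustrative; the grammar follows the op forms given later in this document):

```mlir
// Illustrative sketch only — each level receives outer values
// explicitly through args(...), since levels are IsolatedFromAbove.
%t = air.launch (%lx) in (%n = %c2) args(%a = %buf) : memref<1024xi32> {
  air.segment @seg args(%b = %a) : memref<1024xi32> {
    air.herd @h tile (%x, %y) in (%sx = %c4, %sy = %c4)
        args(%c = %b) : memref<1024xi32> {
      // per-PE body; %x, %y identify this PE
      air.herd_terminator
    }
    air.segment_terminator
  }
  air.launch_terminator
}
```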
The AIR model defines three levels of memory, identified by the air::MemorySpace enum:
| Level | Enum value | Scope | Description |
|---|---|---|---|
| L3 | 0 | System | Off-chip or host-accessible memory (e.g. DDR, host RAM) |
| L2 | 1 | Segment | On-device memory shared across all PEs in a segment (e.g. on-chip SRAM, memory tiles) |
| L1 | 2 | Herd tile | Per-PE scratchpad memory local to each processing element |
Memory at each level must be accessed through the appropriate data-movement operations. Direct reads or writes across levels (e.g., from L1 directly into L3) are not supported; all cross-level data movement is explicit.
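In memref terms, the three levels correspond to address spaces on the allocation type; a hypothetical sketch (sizes assumed):

```mlir
// Address-space suffixes on memref types identify the AIR memory level
// (values from the air::MemorySpace enum).
%l3 = memref.alloc() : memref<1024xi32>      // space 0 (default): L3 system memory
%l2 = memref.alloc() : memref<1024xi32, 1>   // space 1: L2 segment memory
%l1 = memref.alloc() : memref<256xi32, 2>    // space 2: L1 per-PE scratchpad
```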
Data movement between memory levels is expressed with:
air.dma_memcpy_nd — an N-dimensional DMA-style bulk copy between two memrefs. The source and destination may reside at any combination of memory levels.

air.channel / air.channel.put / air.channel.get — point-to-point or broadcast communication channels that decouple producers from consumers.

All data-movement and hierarchy operations may execute asynchronously. An op that produces an !air.token result is dispatched immediately and executes concurrently with subsequent operations. Token values are used to express data, control, affinity, and concurrency constraints (see §1.5). air.wait_all merges multiple tokens into a single synchronization point. air.execute wraps sequential computation in an asynchronous region.
Every AIR hierarchy operation has two distinct lifetimes: a resource lifespan (the interval during which its hardware resources are allocated) and an execution lifespan (the interval during which it actually executes).
The execution lifespan must be a subset of the resource lifespan: resources must be allocated before execution begins and must remain allocated until execution completes.
All token kinds share a single MLIR type: !air.token. The kind is
determined by which attribute list the token appears in at a consumer op, not
by the token’s type. !air.token is the replacement for air.async.token;
existing uses of air.async.token are dependency tokens.
A token is created in one of two ways:
1. Explicit allocation:
%t = air.token.alloc : !air.token
Used to introduce a token that groups a set of ops under a shared affinity or concurrency constraint without one op “owning” the token.
2. Op result:
%t = air.segment @foo { … }
%t = air.herd tile (%x,%y) in (%sx=%Nx,%sy=%Ny) [dependency = [%dep]] { … }
The token becomes signaled when the op and all of its nested operations complete. The result is optional; omitting it gives the synchronous (blocking) form of the op.
Tokens are consumed by including them in named attribute lists on hierarchy and data-movement ops. An op may carry any combination of the three lists:
air.segment @bar [dependency = [%t0, %t1]]
[affinity = [%ta]]
[concurrency = [%tc]]
{ … }
| Attribute list | Constraint imposed | Effect on resource lifespan | Effect on execution lifespan |
|---|---|---|---|
| dependency | Op does not begin until all listed tokens are signaled (happens-before) | Compiler/runtime may reuse resources freely across the edge | Disjoint from the producers of the listed tokens |
| affinity | Op is placed on the same hardware resources as all other ops that list the same token | Disjoint — shared resources used sequentially | Disjoint — operations execute at different times |
| concurrency | Op must have an overlapping execution lifetime with all other ops that list the same token | Overlapping — all resource sets must be live simultaneously | Overlapping |
dependency list: The direct successor of air.async.token dependency
chains. The op waits for all listed tokens to be signaled before acquiring
resources or beginning execution. The compiler may reuse the releasing op’s
resources for the waiting op once the token fires.
affinity list: All ops that share the same !air.token in their
affinity list are bound to the same hardware partition (e.g. the same set
of tile columns on NPU, or the same CU on GPU). Because they share resources
they must execute sequentially; both their resource and execution lifespans
are disjoint in time.
concurrency list: All ops sharing the same token in their concurrency
list are required to have overlapping execution lifetimes. The compiler must
allocate distinct, simultaneously live resource sets for each such op. This
is the mechanism that enforces co-residency (e.g. ensuring two segments are
stamped out on hardware at the same time).
Tokens are SSA values and follow normal MLIR dominance. When a token must be
passed into a hierarchy body it is threaded through the args(…) operand list.
The sole exception to free token passing is air.launch: only tokens used in a
dependency list may be passed into the body of a launch. Affinity and
concurrency tokens may not cross the launch boundary, because imposing shared-
resource or forced-overlap constraints on the may-be-parallel instances of a
launch would contradict its execution model.
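As a sketch of this scope rule (operand names assumed), a dependency token can be threaded into a launch body through args(...), whereas an affinity or concurrency token could not:

```mlir
// Hypothetical: %t_dma is a dependency token produced by an async DMA.
%t_dma = air.dma_memcpy_nd (%dst[] [] [], %src[] [] [])
  : (memref<64xi32, 1>, memref<64xi32>)
air.launch (%x) in (%sx = %c4) args(%t_in = %t_dma) : !air.token {
  // Inside the body, %t_in may appear only in dependency lists.
  air.segment @s [dependency = [%t_in]] {
    air.segment_terminator
  }
  air.launch_terminator
}
```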
air.launch

// Asynchronous form — produces !air.token, non-blocking
[%token =] air.launch (%x₀, …, %xₙ) in (%sx₀=%N₀, …, %sxₙ=%Nₙ)
args(%a₀=%v₀, …) : <types>
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
{
…
air.launch_terminator
}
// Synchronous form — caller blocks until all launch instances complete
air.launch sync (%x₀, …, %xₙ) in (%sx₀=%N₀, …, %sxₙ=%Nₙ)
args(%a₀=%v₀, …) : <types>
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
{
…
air.launch_terminator
}
air.launch defines the outermost scope of an AIR program. It:

- Defines an iteration space of size N₀ × … × Nₙ. Each point (x₀, …, xₙ) in this space constitutes an independent launch instance.
- Provides may-be-parallel semantics: the runtime decides how many instances run concurrently. The concurrency list attribute is therefore not permitted on air.launch; expressing an explicit concurrency bound would impose ordering constraints that contradict may-be-parallel semantics.
- Guarantees that all air.segment operations declared within the body (and their associated L2 allocations) are co-resident on the device when the launch body begins executing, guaranteeing a bounded resource footprint.
- Is IsolatedFromAbove and cannot implicitly capture values from the enclosing scope.

The launch iteration space expresses a pool of independent work units that the runtime should schedule as hardware becomes available. It does not guarantee simultaneous execution of all instances; it expresses that instances are independent and may be run in any order or concurrently at the runtime’s discretion.
Use the launch iteration space when instances are independent work units that need no guarantee of mutual concurrency. Do not use it when you need a guarantee that all instances execute simultaneously — for that, use the air.segment iteration space (§2.2), which provides always-concurrent semantics (all instances are co-resident and active for the duration of the segment).
A common goal is to overlap computation on one output tile with data movement for the next — “software pipelining” across tiles. This cannot be expressed across separate launch instances, because dependency tokens are the only token kind that may cross the launch boundary, and they only express ordering, not overlap.
Instead, express inter-tile pipelining within a single launch body:
Segment-level pipelining: place two air.segment ops inside the launch
body — a compute segment and a prefetch segment — connected by dependency
tokens. Both segments are co-resident (the co-residency guarantee applies to
all segments in the same launch body), so their execution genuinely overlaps:
the compute segment processes tile N while the prefetch segment loads tile N+1.
Loop-based pipelining: use an scf.for loop inside the launch body (or
inside a segment) with iter_args-carried tokens to overlap consecutive tile
iterations, as described in the scf.for section of §2.2. This is the
standard double-buffering pattern and is typically the most efficient approach.
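The segment-level pattern might be sketched as follows (channel and segment names are assumed):

```mlir
// Two co-resident segments inside one launch body: @prefetch loads tile
// N+1 while @compute processes tile N. Per-tile ordering flows through
// the channel handshake rather than a segment-level dependency, so the
// two segments' execution lifetimes genuinely overlap.
air.channel @tiles [] {channel_type = "dma_stream", depth = 2}  // module scope

air.launch (%i) in (%n = %c1) {
  %t_pre = air.segment @prefetch {
    // loop of air.channel.put @tiles[] (…) — one put per tile
    air.segment_terminator
  }
  %t_cmp = air.segment @compute {
    // loop of air.channel.get @tiles[] (…) followed by herd compute
    air.segment_terminator
  }
  air.wait_all [%t_pre, %t_cmp]
  air.launch_terminator
}
```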
The degree of actual concurrency achieved by a launch iteration space depends on the backend:
| Backend | Launch iteration space behaviour |
|---|---|
| GPU (ROCDL) | Maps directly to gpu.launch gridDim; all instances execute in parallel as GPU thread blocks. The GPU scheduler guarantees concurrent dispatch of all grid points subject to occupancy. |
| NPU (AIE) | The runtime schedules instances as hardware partitions become available. Instances may run sequentially if only one hardware partition is available for the launch, or concurrently if multiple are available. No concurrency is guaranteed. |
This asymmetry is intentional: GPU hardware is designed for SIMT-style uniform
parallel dispatch, while NPU execution is more task-queue-like with variable
partition availability. Programs that require a guaranteed minimum concurrency
should use air.segment iteration spaces or concurrency tokens on segments
within a launch body rather than relying on the launch iteration space.
The synchrony of air.launch is determined by whether a result value is bound,
not by a separate async keyword. The sync keyword is an explicit synonym for
the no-result form and is provided for readability.
| Form | Behaviour |
|---|---|
| [%token =] air.launch (…) {…} | Asynchronous: dispatches immediately; returns !air.token that becomes signaled when all instances complete. Caller may overlap subsequent work. |
| air.launch (…) {…} | Synchronous (implicit): blocks the calling thread until all launch instances complete. Equivalent to the sync form. |
| air.launch sync (…) {…} | Synchronous (explicit): identical to the no-result form; the sync keyword makes the blocking intent unambiguous, particularly useful when the op appears alongside async launches. |
| List | Purpose |
|---|---|
| dependency | !air.token values that must be signaled before any instance begins executing. Establishes a data/control dependency. |
| affinity | !air.token values that bind this launch to the same hardware partition as other ops sharing the same token (e.g. a specific column range on NPU or a specific XCD on GPU). |
air.launch must not carry a concurrency list. Concurrency constraints
belong to operations with always-parallel semantics (air.segment, air.herd);
air.launch instead expresses may-be-parallel work whose degree of parallelism
is determined by available hardware resources at runtime. Similarly, affinity
and concurrency tokens from enclosing scopes may not be passed into the launch
body (see §1.5 scope rule).
Nesting constraint: air.segment ops may appear inside air.launch, but not the
reverse.
air.segment

// Asynchronous form — produces !air.token, non-blocking
[%token =] air.segment [@name]
(%x₀, …, %xₙ) in (%sx₀=%N₀, …)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>] [x_size=<cols>] [y_size=<rows>]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.segment_terminator
}
// Synchronous form — caller blocks until the segment (and all its herds) complete
air.segment [@name] sync
(%x₀, …, %xₙ) in (%sx₀=%N₀, …)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>] [x_size=<cols>] [y_size=<rows>]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.segment_terminator
}
air.segment represents a physically contiguous grouping of processing elements together with their associated L2 (on-device) memory. It:

- Contains the air.herd operations, L2 allocations, and inter-level data movement required to implement a coherent kernel.
- Optionally defines an iteration space of size N₀ × … × Nₙ. When present, every point in the space instantiates an independent copy of the segment body. All instances run with overlapping lifetimes: the runtime guarantees that every instance is active simultaneously, so the full set of instances is co-resident on the device for the duration of the segment. This is a “stamp-out” of the segment across independent hardware partitions. Each instance occupies a contiguous rectangular block of compute resources (columns × rows of tiles, L2 banks, DMA engines). The placement of distinct instances across the iteration space is independent: instances need not be adjacent or form a globally contiguous region; each may be placed anywhere in the device array subject only to non-overlap with other live instances.
- Allocates L2 memory (memref.alloc at address space 1) and can access L3 memory through DMA or channel operations.

The body of a segment may contain operations that express both temporal constraints (sequencing via !air.token dependency lists, air.wait_all, scf.for iter_arg-carried ordering, scf.parallel init/reduce token merging) and spatial constraints (placement of air.herd tiles via x_loc/y_loc, channel routing that fixes communication topology).
As with air.launch, synchrony is determined by the presence or absence of a
result value; sync is an explicit synonym for the no-result form.
| Form | Behaviour |
|---|---|
| [%token =] air.segment (…) {…} | Asynchronous: returns !air.token immediately; the enclosing launch body may overlap subsequent work with this segment. Token signals when all herds and data-movement ops within the segment complete. |
| air.segment (…) {…} | Synchronous (implicit): blocks until all herds and data-movement operations within the segment complete. |
| air.segment sync (…) {…} | Synchronous (explicit): identical to the no-result form. |
The compiler must determine the total resource requirements of a segment statically, at compile time. The analysis is recursive and bottom-up: inner segments are fully analysed before the segments that contain them, so that each level can treat its nested segments as opaque resource blocks.
The analysis at each level has two stages.
Stage 1 — per-instance worst-case analysis.
The analysis covers four structural cases: leaf segments (direct air.herd
ops), containing segments (nested air.segment ops), scf.for loops (with
iter_arg-based pipelining), and scf.parallel regions (all instances
concurrent, tokens reduced).
Leaf segment (body contains air.herd ops, no nested air.segment). Stage 1 computes:

- the maximum number of air.herd tiles simultaneously active, derived from the herd iteration spaces and any x_size/y_size placement constraints;
- the maximum number of concurrent air.dma_memcpy_nd or air.channel operations, determined from the async dependency graph.

scf.for and scf.parallel loops within a segment body:
scf.for — loop-carried token dependencies.
scf.for iterations are logically sequential in program order, but the async
token model allows controlled overlap between consecutive iterations. The
mechanism is iter_args: the token yielded by scf.yield at the end of
iteration N becomes the iter_arg received at the start of iteration N+1.
%t_final = scf.for %k = %lb to %ub step %s
iter_args(%t_prev = %t_init) -> !air.token {
// Ops that depend on %t_prev wait for iteration N-1 to reach a known point.
// Ops that do NOT depend on %t_prev may execute concurrently with N-1.
%t_dma = air.dma_memcpy_nd async [%t_prev] …
%t_compute = air.herd async [%t_dma] …
%t_out = air.wait_all async [%t_compute]
scf.yield %t_out : !air.token
}
An op in iteration N+1 that lists %t_prev in its dependency list does not
begin until iteration N has signaled that token. An op that does not depend
on %t_prev may begin before iteration N’s scf.yield fires, creating genuine
concurrency between consecutive iterations.
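A two-stream variant of the example above (a sketch; names assumed) carries the DMA and compute chains as separate iter_args, allowing iteration N+1's DMA to overlap iteration N's compute:

```mlir
// Two independent token chains: the DMA for iteration N waits only on
// iteration N-1's DMA, so it may run while iteration N-1's compute is
// still active (pipeline depth 2 for the DMA resources).
%t_d, %t_c = scf.for %k = %lb to %ub step %s
    iter_args(%t_dma_prev = %t0, %t_cmp_prev = %t0)
    -> (!air.token, !air.token) {
  %t_dma = air.dma_memcpy_nd async [%t_dma_prev] …
  %t_cmp = air.herd async [%t_dma, %t_cmp_prev] …
  scf.yield %t_dma, %t_cmp : !air.token, !air.token
}
```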
The pipeline depth D for a given resource type is the number of independent token chains maintained as separate iter_args for that resource:

- D = 1 (a single iter_arg chaining all activity): strictly sequential — only one iteration’s worth of resources is active at a time.
- D > 1 (multiple independent token streams, each carried by its own iter_args): each stream contributes independently to the concurrent resource count.

Resource bounds inside an scf.for: memref.alloc results that are live across scf.yield (loop-carried allocations) are permanently live for the entire loop and contribute to every iteration’s baseline. Allocations freed before scf.yield require only one slot (reused each iteration).

scf.parallel — broadcast and tree-reduction of tokens.
scf.parallel expresses a set of independent instances that may execute
concurrently. The async token model for scf.parallel has two parts:
%t_out = scf.parallel (%i, %j) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1)
init (%t_in) -> !air.token {
// All instances may start as soon as %t_in is signaled.
%t_instance = air.channel.put async [%t_in] @ch[%i, %j] …
scf.reduce(%t_instance : !air.token) {
^bb0(%a: !air.token, %b: !air.token):
%t_merge = air.wait_all async [%a, %b]
scf.reduce.return %t_merge : !air.token
}
}
// %t_out is signaled only when ALL instances have completed.
- init (%t_in): a single token that all parallel instances must wait for before beginning. It acts as a broadcast dependency — every instance sees the same %t_in as a prerequisite.
- scf.reduce block: performs a pairwise tree-reduction over the tokens produced by each instance. Each call to the reduce block receives two instance tokens (%a, %b) and uses air.wait_all to merge them. The final result %t_out is signaled only when all instances’ tokens have been reduced to one.

Resource bounds inside scf.parallel: all instances execute concurrently, so their resource requirements are summed (not maxed), exactly as for a segment iteration space (Stage 2 below). The aggregate resource demand is the per-instance bound multiplied by the total number of parallel instances.
Containing segment (body contains one or more nested air.segment ops):
Each nested segment is first fully analysed through its own Stage 1 and Stage 2 to produce a total resource footprint — a tuple (tiles, L2 bytes, DMA channels) representing the resources that nested segment holds across all of its iterations simultaneously. This footprint is then treated as an indivisible unit at the outer level. The containing segment’s Stage 1 computes:
- the maximum number of concurrently active nested segments (and direct air.herd ops) in the outer body. “Concurrently active” is determined from the outer body’s async dependency graph: nested segments linked only by dependency tokens are treated as sequential and their resources may be reused; nested segments with no ordering constraint between them, or linked by concurrency tokens, are treated as simultaneous and their resources are summed.

Stage 2 — scaling by the iteration space.
Because all instances of a segment’s iteration space execute with overlapping
lifetimes, the per-instance bounds from Stage 1 are multiplied by the total
iteration count (N₀ × … × Nₙ) to obtain that segment’s total resource
footprint. A segment with no iteration space has an implicit count of 1.
Stage 2 applies independently at each nesting level: the inner segment scales by its own iteration count first; the resulting total footprint is what the outer Stage 1 sees as a unit; the outer segment then scales its own per-instance result by its iteration count. Resources are therefore multiplied by the product of all iteration counts along the nesting path, reflecting the fact that all instances at every level are simultaneously live.
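As a hypothetical worked example (all numbers assumed): an inner segment requiring 4 tiles per instance with inner iteration count 2, nested in an outer segment with iteration count 3, yields

```latex
\text{total tiles} \;=\; \underbrace{4}_{\text{per instance}} \times \underbrace{2}_{\text{inner count}} \times \underbrace{3}_{\text{outer count}} \;=\; 24
```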
The total footprint describes the sum of per-instance blocks, not a single contiguous region: the compiler allocates a contiguous block per instance and may place those blocks anywhere in the device array that fits.
These statically computed totals are used to verify that all segments within the same air.launch can be co-resident without resource conflicts.

Optional physical placement attributes (x_loc, y_loc, x_size, y_size)
annotate the column/row offset and extent of one instance on devices with
2D tile arrays. For a segment with an iteration space, the compiler assigns a
separate (x_loc, y_loc) origin to each iteration point independently; those
per-instance origins need not be adjacent. When these attributes are present
they are taken as authoritative for a single instance’s contiguous footprint;
when absent the compiler derives a per-instance contiguous footprint from the
worst-case analysis above and places each instance freely.
The optional backend-granularity attribute communicates the hardware
granularity that this segment instance should map to on the target backend.
Currently defined values:
| Value | Meaning |
|---|---|
| XCD | The segment maps to one XCD (Accelerator Complex Die) on AMD MI3xx GPUs. Used with a multi-XCD air.launch iteration space to achieve full-device occupancy (see §4.2). |
When backend-granularity is absent the backend applies its default mapping
(e.g., one workgroup / thread block on GPU).
Nesting constraints:
- air.segment must be nested inside air.launch or inside another air.segment (directly or indirectly — the outermost segment must be inside a launch).
- air.herd must be nested inside air.segment. It is the innermost hierarchy level and must not contain air.segment or another air.herd.
- The nesting depth of air.segment is unbounded by the model; backends may impose device-specific limits (e.g. two levels on MI3xx — XCD and CU).

air.herd

// Synchronous form — acquires resources immediately, blocks until all PEs complete
air.herd [@name] tile (%x, %y) in (%sx=%Nx, %sy=%Ny)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>]
[link_with="<object>"]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.herd_terminator
}
// Asynchronous form — produces !air.token; acquires resources when deps met, signals on completion
[%token =] air.herd [@name]
tile (%x, %y) in (%sx=%Nx, %sy=%Ny)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>]
[link_with="<object>"]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.herd_terminator
}
air.herd defines a 1D or 2D array of processing elements (PEs) that all
execute the same body code in data-parallel fashion. It is the innermost level
of the hierarchy and maps directly to physical compute tiles (NPU) or GPU
threads:
- The iteration space %Nx × %Ny determines how many PE instances execute the body and defines a contiguous block of compute resources. All instances run concurrently on distinct hardware resources; this is always-parallel (not may-be-parallel).
- The tile indices %x and %y give each instance its own coordinates, enabling each PE to independently address its portion of L1 or L2 memory.
- L1 memory (address space 2) is per-PE local scratchpad. Data must be brought into L1 via air.dma_memcpy_nd or channel operations before use.

Operations within the herd body may impose both temporal constraints (explicit token dependencies ordering DMA fetches before compute) and spatial constraints (channel indices or tile-coordinate arithmetic that binds each PE to a specific region of a shared buffer).
The meaning of “contiguous” and the constraints imposed on iteration space dimensions depend on the target backend:
AIE (NPU): A herd defines a logical rectangular array of compute units.
The compiler may reshape the iteration space (e.g., collapse a 2D herd
into a 1D arrangement) via the AIRCollapseHerdPass. Reshaping is inhibited
automatically when the herd body uses cascade channels (channel_type =
"cascade"), because cascade connections are topology-dependent and cannot
survive reindexing. Explicit placement attributes (x_loc, y_loc,
x_size, y_size) on the enclosing segment also constrain the legal shapes
by fixing the tile footprint. The pass accepts a max-col-size option to
bound the width of the collapsed arrangement. A dedicated per-herd attribute
to disable reshaping independently of these conditions is not yet
implemented. The resulting contiguous rectangular region of physical tiles is
determined after any reshaping has been applied.
GPU (AMD MI3xx family): A herd executes entirely within a single Compute
Unit (CU), with PE instances mapped to individual warps. The combined
resource requirements (register file, LDS, wavefront slots) of all PE
instances must fit within one CU. On MI3xx devices a CU provides fewer than
32 wavefront slots, so the total number of PE instances (%Nx × %Ny) is
correspondingly limited.
| Form | Behaviour |
|---|---|
| No result (synchronous) | Acquires resources immediately; blocks the enclosing body until all PE instances complete; releases resources at the herd terminator. |
| [%token =] (asynchronous) | Acquires resources when all dependency tokens are signaled; returns !air.token immediately; releases resources when the token fires (all PE instances complete). |
Optional attributes x_loc/y_loc specify the base placement on 2D tile
arrays; each PE at tile (%x, %y) occupies physical column/row
(x_loc + %x, y_loc + %y). The link_with attribute names an external kernel
object to link into the herd at compile time.
air.dma_memcpy_nd

air.dma_memcpy_nd (dst[dst_offsets][dst_sizes][dst_strides],
src[src_offsets][src_sizes][src_strides]) : (type_dst, type_src)
Describes an asynchronous N-dimensional strided bulk copy between two memrefs. The
[offsets][sizes][strides] triples follow the same convention as memref.subview:
offsets is the starting index in each dimension, sizes is the number of elements to
transfer in each dimension, and strides is the stride in units of elements.
The operation is direction-agnostic: either the source or the destination (or both) may be at any memory level. The direction (L3→L2, L2→L1, L1→L2, etc.) is inferred from the address spaces of the operand memrefs and mapped to the appropriate hardware mechanism.
An empty [offsets], [sizes], or [strides] list for a side means the entire memref
is addressed with unit strides.
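For instance, a strided L3→L2 window copy might be written as the following sketch (all shapes and SSA names assumed):

```mlir
// Copy a 32x32 window starting at (0, 32) of a 128x128 L3 buffer into a
// contiguous 32x32 L2 buffer. Offsets/sizes/strides are in elements; the
// empty triples on the destination mean "whole memref, unit strides".
// The L3-to-L2 direction is inferred from the memref address spaces.
%t = air.dma_memcpy_nd (%l2[] [] [],
                        %l3[%c0, %c32] [%c32, %c32] [%c128, %c1])
  : (memref<32x32xi32, 1>, memref<128x128xi32>)
```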
air.channel, air.channel.put, air.channel.get

// Channel declaration — at module scope
air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = <N>}
// Synchronous put/get — block until the transfer completes
air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
air.channel.get @name[indices] (dst[offsets][sizes][strides]) : (type_dst)
// Asynchronous put/get — return !air.token; transfer completes when token signals
[%token =] air.channel.put @name[indices] [dependency = [%t₀, …]]
(src[offsets][sizes][strides]) : (type_src)
[%token =] air.channel.get @name[indices] [dependency = [%t₀, …]]
(dst[offsets][sizes][strides]) : (type_dst)
Channels are declared once (at module scope) and used at distinct producer and consumer sites. They decouple the put and get sides, allowing the compiler to schedule them independently and to introduce double-buffering.
A channel may be an array (e.g., [4, 4] for a 4×4 array). The indices operand on
put/get selects the specific channel within the array.
The channel_type attribute controls the underlying mechanism:
| Value | Mechanism |
|---|---|
| "dma_stream" (default) | DMA engines with streaming (circuit-switched) interconnect |
| "dma_packet" | DMA engines with packet-switched interconnect |
| "cascade" | Core-to-core cascade connections between adjacent tiles |
The broadcast_shape attribute enables one-to-many communication following NumPy
broadcasting rules.
A channel has a finite buffer capacity set by the depth attribute (default 1).
The depth specifies how many transfers may be in-flight simultaneously:
| Depth | Effect |
|---|---|
| 1 (default) | Rendezvous: each put must be consumed by a get before the next put can proceed. |
| 2 | Double-buffering: producer may issue one transfer ahead of the consumer. |
| N | Producer may have up to N transfers outstanding before it must wait. |
Increasing depth relaxes the coupling between producer and consumer at the cost of additional buffer memory (each slot requires storage for one transfer’s worth of data).
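For example (a sketch with assumed names), a depth-2 channel lets a producer run one transfer ahead of its consumer:

```mlir
// depth = 2: the producer may issue a second put before the consumer's
// first get completes, enabling double-buffering across the channel.
air.channel @stage [] {channel_type = "dma_stream", depth = 2}

// producer side (e.g. in one segment or herd)
air.channel.put @stage[] (%src[] [] []) : (memref<64xi32, 1>)

// consumer side (e.g. in another herd)
air.channel.get @stage[] (%dst[] [] []) : (memref<64xi32, 2>)
```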
put and get are blocking at the channel boundary:
- A put issued when the channel already holds depth unread transfers stalls until the consumer issues a get and frees a slot.
- A get issued when the channel holds no data stalls until the producer issues a put.

In the asynchronous form the operation is dispatched immediately and the returned !air.token does not signal until the blocking condition is resolved and the transfer is complete. This allows other work to proceed while the channel handshake is in progress, but does not remove the fundamental flow-control constraint.
For a channel to make progress, every put must be matched by a get on the same
channel index, and vice versa. The compiler enforces the static balance condition:
- At each channel index, the number of put operations and the number of get operations must be equal.
- When put/get sites are replicated by iteration spaces, (puts per producer instance) × (producer instances) must equal (gets per consumer instance) × (consumer instances).

A violation of the balance condition is a compile-time error.
A program deadlocks when a put is waiting for a get that can never execute, or a
get is waiting for a put that can never execute, and no other operation can break
the wait.
Minimal deadlock — put and get in the same sequential scope:
air.channel @C [] {depth = 1}
air.segment @deadlock {
// put blocks: channel is at capacity, waiting for a get to free space
air.channel.put @C[] (%src[][][]) : (memref<…, 1>)
// never reached — sequential control flow cannot reach get while put blocks
air.channel.get @C[] (%dst[][][]) : (memref<…, 1>)
air.segment_terminator
}
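One deadlock-free variant (a sketch, not the only fix) issues the put asynchronously so that sequential control flow can reach the matching get while the put is still in flight:

```mlir
air.channel @C [] {depth = 1}
air.segment @no_deadlock {
  // Binding a result makes the put asynchronous: it is dispatched and
  // control flow continues to the get, which unblocks the handshake.
  %t_put = air.channel.put @C[] (%src[][][]) : (memref<…, 1>)
  air.channel.get @C[] (%dst[][][]) : (memref<…, 1>)
  air.wait_all [%t_put]
  air.segment_terminator
}
```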
The necessary condition for deadlock freedom is that for every channel there exists a
concurrent execution context in which the matching put and get can both
make progress simultaneously. In practice:
put and get on the same channel must appear in different herds, different async
branches, or different segment instances — any context that allows them to execute
concurrently.

air.execute and air.wait_all

air.execute wraps a sequential block of host-visible computation (e.g., memref.alloc,
arithmetic on scalars) in an asynchronous region. The region produces an
!air.token (dependency kind) that becomes ready when all operations inside have completed.
air.wait_all takes a variadic list of !air.token values and produces a single
token that becomes ready only when all of its inputs are ready. It acts as a join node
in the async dependency graph.
On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps directly to physical hardware structures:
| AIR concept | AIE hardware |
|---|---|
| air.launch | Device-level kernel dispatch; L3 = DDR via NOC |
| air.segment | A contiguous rectangle of AIE tiles + associated memory tiles or URAMs (L2) |
| air.herd tile (x, y) | A single AIE compute tile at physical column/row offset (x_loc+x, y_loc+y) |
| L1 memory (space 2) | Per-tile local data memory (32 KB on AIE2) |
| L2 memory (space 1) | Shared memory tiles or URAMs accessible via the tile interconnect |
| L3 memory (space 0) | DDR accessible via AXI4-MM NOC |
| AIR operation | AIE mechanism |
|---|---|
| air.dma_memcpy_nd (L3→L2) | Shim DMA + AXI-S routing + memory tile DMA |
| air.dma_memcpy_nd (L2→L1) | Tile DMA + local AXI-S stream |
| air.channel (dma_stream) | DMA + circuit-switched streaming interconnect |
| air.channel (dma_packet) | DMA + packet-switched overlay network |
| air.channel (cascade) | Core cascade interface (direct register-to-register) |
The air-to-aie pass translates each air.herd into AIE tile operations (aie.core,
aie.buffer, aie.lock), routes channels through the switch boxes, and generates the
SHIM DMA configuration for L3 transfers.
Placement of a segment on physical tiles is guided by the optional x_loc/y_loc
(column/row offset) and x_size/y_size (column/row count) attributes on
air.segment. The herd’s tile indices are added to the segment offset to obtain the
absolute physical tile location.
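For instance, with assumed placement attributes x_loc = 2 and y_loc = 0, the offset arithmetic can be spelled out in a few lines of Python (illustrative only; all concrete values are assumptions):

```python
# Illustrative placement arithmetic for a 2x2 herd inside a segment
# placed at column/row offset (x_loc, y_loc) = (2, 0). Values are assumed.
x_loc, y_loc = 2, 0
herd_cols, herd_rows = 2, 2

placement = {
    (x, y): (x_loc + x, y_loc + y)   # absolute physical tile location
    for x in range(herd_cols)
    for y in range(herd_rows)
}
# Herd tile (1, 1) is placed on physical tile (3, 1).
```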
On AMD GPU targets the AIR hierarchy maps onto the GPU execution model. The
air-to-rocdl and air-gpu-outlining passes perform this translation. The
basic mapping uses three levels (launch → segment → herd); on MI3xx devices a
four-level mapping with nested segments is used to achieve full-device occupancy
(see §4.2).
| AIR concept | GPU concept |
|---|---|
| air.launch (%bx,%by) in (%gbx,%gby) | gpu.launch grid: gridDim = (gbx, gby, 1) |
| air.segment | Workgroup (thread block); the segment body runs within a single gpu.launch |
| air.herd tile (%x,%y) in (%bx,%by) | Thread block dimensions: blockDim = (bx, by, 1) |
| Herd tile index (%x, %y) | (threadIdx.x, threadIdx.y) |
| Launch index (%bx, %by) | (blockIdx.x, blockIdx.y) |
The air.launch iteration space becomes the grid (number of thread blocks), and
the air.herd iteration space becomes the block (number of threads per block).
The air.segment body is the thread-block body: code that runs once per workgroup before
and after the per-thread air.herd body.
After translation the hierarchy is flattened: air.segment and air.herd are erased
and their bodies are moved into the enclosing gpu.launch region. The
air-gpu-outlining pass then extracts the gpu.launch body into a gpu.func kernel
and injects the appropriate gpu.BlockIdOp, gpu.ThreadIdOp, gpu.GridDimOp, and
gpu.BlockDimOp intrinsics.
On AMD MI3xx devices the full device can be occupied using a four-level hierarchy built from a unit-iteration-space launch, two nested segments, and an innermost herd:
| Level | AIR construct | MI3xx granularity |
|---|---|---|
| 1 | air.launch (no iteration space — implicit single instance) | Whole-device dispatch |
| 2 | Outer air.segment (backend-granularity=XCD, iteration space = 8) | All 8 XCDs simultaneously — all instances stamp out concurrently |
| 3 | Inner air.segment (iteration space = CUs per XCD) | One CU within the XCD — all instances stamp out concurrently |
| 4 | air.herd (iteration space ≤ 32) | Concurrent warps on that CU |
- air.launch carries no iteration space (an implicit single-point space).
  Its sole purpose at this level is to anchor the co-residency guarantee for
  everything nested inside.
- The outer air.segment (backend-granularity=XCD) has an iteration space of
  cardinality 8, one instance per XCD on the MI3xx device. Because segment
  instances always execute with overlapping lifetimes, all 8 XCDs are occupied
  simultaneously.
- The inner air.segment is nested inside the outer and iterates over the CUs
  within its XCD. Again all instances run concurrently, so every CU in the XCD
  is active at the same time.
- air.herd is nested inside the inner segment and targets the warps running
  on that single CU. The iteration space must not exceed 32 PE instances (the
  wavefront-slot limit of an MI3xx CU, as described in §2.3). Iterations within
  the herd may communicate through:
  - Shared memory (L2 address space) — per-CU local data store, shared among
    all warps in the herd.
  - Global memory (L3 address space) — device HBM, accessible to all
    levels of the hierarchy.

The backend-granularity=XCD attribute on the outer segment is the only
MI3xx-specific annotation required; all other tiling falls out of the standard
segment and herd iteration-space mechanisms.
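The resulting occupancy can be tallied with a back-of-the-envelope calculation. Only the 8 XCDs and the 32-instance herd limit come from the text above; the CU count per XCD is an assumed device value for illustration:

```python
# Back-of-the-envelope occupancy for the four-level MI3xx mapping.
NUM_XCDS = 8        # outer air.segment iteration space (one per XCD)
CUS_PER_XCD = 38    # inner air.segment iteration space (assumed device value)
HERD_LIMIT = 32     # air.herd iteration space bound (wavefront-slot limit)

# All levels stamp out concurrently, so the concurrent PE instances multiply.
concurrent_pe_instances = NUM_XCDS * CUS_PER_XCD * HERD_LIMIT
```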
| AIR memory space | Enum | GPU address space | GPU scope |
|---|---|---|---|
| L3 (space 0) | MemorySpace::L3 | 0 (generic/global) | Device global memory (HBM) |
| L2 (space 1) | MemorySpace::L2 | 3 (local) | Workgroup (LDS/shared memory) |
| L1 (space 2) | MemorySpace::L1 | 5 (private) | Per-thread private memory (VGPRs/scratch) |
memref.alloc ops in L2 space are hoisted to gpu.launch workgroup attributions
(shared among all threads in the block). memref.alloc ops in L1 space become
private attributions (one copy per thread). Explicit memref.dealloc ops are
removed because GPU attributions have implicit kernel-scoped lifetimes.
air.dma_memcpy_nd operations are lowered to explicit SCF loops containing
memref.load / memref.store pairs. The offsets, sizes, and strides from the DMA
descriptor drive the loop bounds and index arithmetic. Subsequent lowering passes
(ROCDL, LLVM CodeGen) optimise these into coalesced global loads, LDS stores, or
register-to-register moves depending on the inferred address spaces.
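The generated loop nest amounts to the following index arithmetic, sketched here in Python rather than SCF. The flattened-buffer model and the helper name are assumptions for illustration, not compiler output:

```python
import itertools

def dma_memcpy_nd(dst, src, offsets, sizes, strides):
    """Model of an n-D strided copy: `sizes` drives the loop nest, while
    `offsets` and `strides` drive the index arithmetic into the flat
    row-major source buffer `src`."""
    for idx in itertools.product(*(range(s) for s in sizes)):
        src_index = sum((o + i) * st for o, i, st in zip(offsets, idx, strides))
        dst.append(src[src_index])

src = list(range(16))   # a 4x4 row-major buffer
dst = []
# Copy the 2x2 sub-tile starting at (1, 1): offsets=(1, 1), strides=(4, 1).
dma_memcpy_nd(dst, src, offsets=(1, 1), sizes=(2, 2), strides=(4, 1))
# dst is now [5, 6, 9, 10]
```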
gpu.barrier instructions (already present in well-formed AIR GPU programs or inserted
by the compiler) synchronize threads at workgroup boundaries between DMA-equivalent
loads into shared memory and subsequent compute.
The GPU test in test/gpu/4k_4k_mul/air_sync.mlir illustrates the model:
```mlir
// 32×32 grid of segments, one per 128×128 output tile
air.launch (%bx, %by) in (%nbx=%c32, %nby=%c32)
    args(%A=%arg0, %B=%arg1, %C=%arg2) : … {
  air.segment @forward_0 args(%bx=%bx, %by=%by, %A=%A, %B=%B, %C=%C) : … {
    // Segment body: runs once per workgroup
    // L2 allocations — shared memory tiles
    %As = memref.alloc() : memref<128x8xf32, 1>   // L2 → LDS
    %Bs = memref.alloc() : memref<8x128xf32, 1>   // L2 → LDS
    scf.for %k = 0 to 4096 step 8 {
      // Load A and B tiles into shared memory (DMA equivalent: global → LDS)
      // Uses gpu.thread_id to distribute loads across threads
      …
      gpu.barrier
      // 256 herd tiles (threads) compute the outer product
      air.herd @herd_0 tile (%tx, %ty) in (%ntx=%c256, %nty=%c1)
          args(%As=%As, %Bs=%Bs, …) : … {
        // L1 accumulation registers (VGPRs)
        %acc = memref.alloc() : memref<64xf32, 2>  // L1 → private/VGPRs
        …
        gpu.barrier
      }
    }
    // Write accumulated results back to L3 (global memory)
  }
}
```
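As a quick sanity check of the launch geometry in this example, the values from the IR can be tallied in plain Python (illustrative only):

```python
# The 4096x4096 output is tiled into 128x128 blocks, one per segment instance.
grid_dim = (32, 32, 1)    # from the air.launch iteration space
block_dim = (256, 1, 1)   # from the air.herd iteration space

tiles_per_side = 4096 // 128
assert grid_dim[0] == tiles_per_side and grid_dim[1] == tiles_per_side

threads_total = grid_dim[0] * grid_dim[1] * block_dim[0]
```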
The mapping:

- air.launch → gpu.launch with gridDim = (32, 32, 1)
- air.segment → workgroup body (executed once per thread block)
- air.herd tile (%tx, %ty) in (%c256, %c1) → blockDim = (256, 1, 1), threadIdx.x = %tx

The full GPU lowering pipeline is:
1. air-to-rocdl — Flattens air.launch/air.segment/air.herd into a
   gpu.launch, converts L1/L2 allocations to workgroup/private attributions, and
   lowers air.dma_memcpy_nd to SCF loops.
2. air-gpu-outlining — Outlines the gpu.launch body into a gpu.func inside
   a gpu.module, injects gpu.BlockIdOp, gpu.ThreadIdOp, etc., and converts
   the launch site to gpu.launch_func.
3. Standard GPU serialisation — Attaches a ROCDL target (rocdl-attach-target),
   converts gpu.module to ROCDL IR (convert-gpu-to-rocdl), and
   serialises the GPU binary with gpu-module-to-binary.

See buildingGPU.md for build instructions and the complete
aircc.py compilation pipeline.
aircc.py compilation pipeline.
| Concept | NPU (AIE) | GPU (ROCDL) |
|---|---|---|
| air.launch iteration point | Device-level work unit | One GPU thread block |
| air.segment | Rectangle of AIE tiles + memory tiles | Thread block workgroup body |
| air.herd tile | Single AIE compute tile | Single GPU thread |
| L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
| L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
| L3 (space 0) | DDR via NOC | HBM via global memory |
| dma_memcpy_nd | AIE Shim/Tile DMA engines | SCF load/store loops |
| channel (dma_stream) | Streaming AXI-S switch | — (not yet mapped to GPU) |
| Synchronization | AIE locks | gpu.barrier |
| !air.token (dependency) | AIE runtime completion signals | GPU stream/event dependencies |