-air-copy-to-dmaConvert memcpy to air.dma_memcpy_nd
Converts memory operations to optimize data transfer through Direct Memory Access (DMA) operations.
-air-insert-launch-around-herdInsert launch op (and optionally segment op) around herd op
This pass inserts an air.launch over air.herd, where an air.launch is always inserted, and an air.segment is inserted only if insertSegment is set to true.
-insert-segment : Option to insert an air.segment around the herd.
-air-linalg-to-funcConvert the operations from the linalg dialect into the function calls
-link-with : Path to the object file containing the functions that will be called in place of the linalg operations.
-air-par-to-herdConvert parallel loops to air.herd
This pass converts parallel loop operations to air herd operations. The
parallel loops can be scf.parallel or affine.parallel operations with 1
or 2 dimensional iteration spaces. The iteration space of the parallel loop
will be normalized and will become the spacial iteration space of the new
herd. If nested parallel loops are present then the depth option can
to used to specify which loop depth to convert.
-depth : Given a nest of parallel for loops, which depth to map to air.herd. -1 means converting the innermost parallel loop; any other negative value means converting all parallel loops
-first-dim : Which herd dimension to map to first. Can be zero or one. If set to zero, the 0th dimension of the scf.parallel will be mapped to the x dimension of the herd. If set to one, the 0th dimension of the scf.parallel will be mapped to the y dimension of the herd.
-air-par-to-launchConvert parallel loops to air.launch
This pass converts parallel loop operations to air launch operations. The
parallel loops can be scf.parallel or affine.parallel operations. The
iteration space of the parallel loops will be normalized and will become the
iteration space of the new launch. If nested parallel loops are present
then the depth option can to used to specify which loop depth to convert.
An air segment operation can optionally be inserted at the top level of
the generated launch operations with the has-air-segment option.
-depth : Given a nest of parallel for loops, which depth to map to air.launch-1 means converting the innermost parallel loop; any other negative value means converting all parallel loops
-has-air-segment : Whether to create an air.segment op in generated air.launch regions
-air-par-to-segmentConvert parallel loops to air.segment
This pass converts parallel loop operations to air segment operations. The
iteration space of the parallel loops will be normalized and will become the
iteration space of the new segment. If nested parallel loops are present
then the depth option can to used to specify which loop depth to convert.
-depth : Given a nest of parallel for loops, which depth to map to air.segment-1 means converting the innermost parallel loop; any other negative value means converting all parallel loops
-air-split-devicesSplit the input into one output per aie.device op
-output-prefix : File name prefix for split AIE modules. Set to '-' for stdout (default).
-air-to-aieLower air.launch_herd to AIE dialect
This pass converts AIR dialect herd and segment operations into AIE
dialect modules and AIRRt dialect metadata.
One AIE dialect module is generated for each segment in the input
module. Any herd without a parent segment will will also generate
an AIE dialect module as if the herd has an implicit segment.
For each herd in a segment a 2d array of aie.tile operations is
generated. The physical placement of the tiles is specified using the
herd operation placement attributes or with row-offset and col-offset
options to the pass. aie.core operations are generated for each aie.tile
and the herd body is cloned into each core.
After generating aie.core operations, several other conversions are run:
memref.alloc operations returning L1 memory are converted into static
allocations using aie.buffer operations.
dma_memcpy_nd operations in each core are lowered to aie.mem
operations to perform the transfers and aie.locks are allocated to
synchronize between the cores and the tile DMAs. As part of this
conversion tile DMA schedules and channel allocations are generated
for the aie.mem bodies. L3 or L2 DMA channels are allocated for
sending or receiving data to the tile DMAs. aie.flow operations
are allocated to connect the DMAs.
affine.if operations with tile id operands are specialized, as these
are now constants. This allows an upstream user or transformation to
specialize parts of each aie.core according to its location in the herd.
air.execute and air.wait_all operations are optimized away or
transformed into sequential code.
The pass will insert AIRRt metadata into the original module to describe the
segments, herds and DMA allocations that were generated in the AIE dialect
output modules. Runtime code for configuration and control of segments is
generated from the AIRRt metadata by the air-to-std pass.
func.func @f(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>) {
%c1 = arith.constant 1 : index
air.herd @herd_0 tile (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0, %arg7=%arg1) : memref<1024xi32>, memref<1024xi32> {
%alloc = memref.alloc() : memref<1024xi32, 2>
air.dma_memcpy_nd (%alloc[] [] [], %arg6[] [] []) {id = 1 : i32} : (memref<1024xi32, 2>, memref<1024xi32>)
memref.dealloc %alloc : memref<1024xi32, 2>
air.herd_terminator
}
return
}
The AIE resource allocation,
module @aie.segment_0 {
%0 = aie.tile(1, 1)
%1 = aie.tile(2, 0)
%2 = aie.lock(%0, 0)
%3 = aie.buffer(%0) {sym_name = "buf0"} : memref<1024xi32, 2>
aie.flow(%1, DMA : 0, %0, DMA : 0)
the AIE DMA program,
%4 = aie.mem(%0) {
%6 = aie.dma_start(S2MM, 0, ^bb1, ^bb2)
^bb1: // 2 preds: ^bb0, ^bb1
aie.use_lock(%2, Acquire, 0)
aie.dma_bd(%3 : memref<1024xi32, 2>, 0, 1)
aie.use_lock(%2, Release, 1)
cf.br ^bb1
^bb2: // pred: ^bb0
aie.end
}
the AIE Core program,
%5 = aie.core(%0) {
cf.br ^bb1
^bb1: // pred: ^bb0
cf.br ^bb2
^bb2: // pred: ^bb1
aie.use_lock(%2, Acquire, 1)
aie.use_lock(%2, Release, 0)
aie.end
}
and the AIRRt metadata,
airrt.module_metadata{
airrt.segment_metadata attributes {sym_name = "segment_0"}{
airrt.herd_metadata {dma_allocations = [{channel = 2 : i64, col = 0 : i64, id = 1 : i64, location = 2 : i64, row = 0 : i64} ], sym_name = "herd_0"}
}
}
-row-offset : The default start row for any herds without 'y_loc' attribute.
-col-offset : The default start column for any herds without 'x_loc' attribute.
-emit-while-loop : Emit a while(1) around the herd code in generated AIR.core ops.
-emit-herd-lock : Acquire and release a lock at the start and end of herd execution. The default is to acquire lock 0 with value zero and release it with value 0. There is currently no way to override the default behavior.
-test-patterns : Test the given patterns.
-device : AIE device to target.
-use-objectfifo : Choose whether to lower data movement ops to aie.objectFifo, or directly to aie.locks.
-generate-shim-dma : Choose whether to schedule shim data movement via generating AIE shim DMA program, or AIR runtime.
-insert-trace-packet-flow : Create packet routed traces for cores and memtiles
-use-pkt-flow-at-shim-dma : Switch to using packet flows for all data movements at shim DMAs, to enable time-multiplex sharing with control packet flows.
-use-lock-race-condition-fix : Switch to enable a fix for lock race condition, which protects against the risk of race condition, at the cost of inserting extra dummy DMA BDs
-air-to-asyncAIR dialect lowering
-air-to-stdAIR dialect lowering
This pass converts AIR dialect herd launch operations into loop nests representing the host-side control program for the herd. It also converts AIR dialect memcpy operations into AIRRt memcpy operations.
module {
func.func @f(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>) {
%c1 = arith.constant 1 : index
air.herd @herd_0 tile (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0, %arg7=%arg1) : memref<1024xi32>, memref<1024xi32> attributes {x_loc = 1 : i32, y_loc = 1 : i32} {
%alloc = memref.alloc() : memref<1024xi32, 2>
air.dma_memcpy_nd (%alloc[] [] [], %arg6[] [] []) {id = 1 : i32} : (memref<1024xi32, 2>, memref<1024xi32>)
memref.dealloc %alloc : memref<1024xi32, 2>
air.herd_terminator
}
return
}
}
func.func @f(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>) {
%c1 = arith.constant 1 : index
%h = airrt.herd_load "herd_0" : i64
affine.for %arg2 = 0 to 1 {
affine.for %arg3 = 0 to 1 {
%alloc = memref.alloc() : memref<1024xi32, 2>
%c1_i32 = arith.constant 1 : i32
%0 = arith.index_cast %arg3 : index to i64
%1 = arith.index_cast %arg2 : index to i64
%c0_i64 = arith.constant 0 : i64
%c1_i64 = arith.constant 1 : i64
airrt.dma_memcpy_nd(%c1_i32, %0, %1, %arg0[%c0_i64, %c0_i64, %c0_i64, %c0_i64], [%c1_i64, %c1_i64, %c1_i64, %c1_i64], [%c0_i64, %c0_i64, %c0_i64]) : (i32, i64, i64, memref<1024xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64])
memref.dealloc %alloc : memref<1024xi32, 2>
} {air.herd = "inner"}
} {air.herd = "outer"}
return
}
-air-wrap-func-with-parallelWraps the body of a given func.func operation inside an scf.parallel loop.
Wraps the body of a given func.func operation inside an scf.parallel loop. The pass assumes that: (1) The function arguments consist of: M memref arguments, N loop upper bounds, N loop induction variable indices. (2) The scf.parallel loop is constructed using the N upper bounds and induction variable indices. (3) The scf.parallel loop is inserted at the beginning of the function, wrapping all existing operations.
-loop-bounds : Specify upper bounds for scf.parallel loops
-airrt-to-llvmLower AIRRt dialect to LLVM dialect
This pass lowers AIRRt dialect to function calls and data structures matching those found in air_host.h.
AIRRt static metadata is transformed to LLVM dialect data structures. The data is generated as a number of globals with external linkage. The data layout is closely tied the AIR runtime and the definitions in air_host.h. Any changes to this pass must be reflected there.
-airrt-to-npuLower AIRRt dialect to AIEX.npu dialect
Converts the runtime program, described in AIRRt dialect, into instruction sequence specific to the SHIM DMA controllers on Ryzen AI platform.
Example:
Input:
module {
aie.device(npu1_1col) {
...
aie.shim_dma_allocation @airMemcpyId78(S2MM, 0, 0)
memref.global "public" @airMemcpyId78 : memref<32x128xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId19(MM2S, 0, 0)
memref.global "public" @airMemcpyId19 : memref<32x256xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId15(MM2S, 0, 2)
memref.global "public" @airMemcpyId15 : memref<256x32xi32, 1>
...
} {sym_name = "segment_0"}
...
func.func @matmul_512x512_1024xi32__dispatch_0_matmul_512x512x1024_i32() {
...
affine.for %arg0 = affine_map<(d0) -> (d0)>(%c0) to affine_map<(d0) -> (d0 + 4)>(%c0) {
affine.for %arg1 = affine_map<(d0) -> (d0)>(%c0_0) to affine_map<(d0) -> (d0 + 4)>(%c0_0) {
...
%25 = airrt.dma_memcpy_nd(%c17_i32, %15, %16, %0[%c0_i64, %17, %18, %19], [%c1_i64, %22, %23, %24], [%c0_i64, %20, %21]) {metadata = @airMemcpyId19} : (i32, i64, i64, memref<512x1024xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64]) : !airrt.event
...
%74 = airrt.dma_memcpy_nd(%c13_i32, %67, %68, %3[%c0_i64_15, %c0_i64_15, %69, %70], [%c1_i64_16, %c1_i64_16, %72, %73], [%c0_i64_15, %c0_i64_15, %71]) {metadata = @airMemcpyId15} : (i32, i64, i64, memref<1024x512xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64]) : !airrt.event
...
%111 = airrt.dma_memcpy_nd(%c78_i32, %104, %105, %6[%c0_i64_26, %c0_i64_26, %106, %107], [%c1_i64_27, %c1_i64_27, %109, %110], [%c0_i64_26, %c0_i64_26, %108]) {metadata = @airMemcpyId78} : (i32, i64, i64, memref<512x512xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64]) : !airrt.event
...
}
}
return
}
}
Output:
module {
aie.device(npu1_1col) {
...
aie.shim_dma_allocation @airMemcpyId78(S2MM, 0, 0)
memref.global "public" @airMemcpyId78 : memref<32x128xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId19(MM2S, 0, 0)
memref.global "public" @airMemcpyId19 : memref<32x256xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId15(MM2S, 0, 2)
memref.global "public" @airMemcpyId15 : memref<256x32xi32, 1>
...
func.func @matmul_512x512_1024xi32__dispatch_0_matmul_512x512x1024_i32() {
...
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][4, 4, 32, 256][0, 256, 1024]) {id = 0 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 128, 0][4, 4, 32, 256][0, 256, 1024]) {id = 1 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 256, 0][4, 4, 32, 256][0, 256, 1024]) {id = 2 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 384, 0][4, 4, 32, 256][0, 256, 1024]) {id = 3 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
...
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 0 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 1 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 2 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 3 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
...
aiex.npu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][4, 4, 32, 128][65536, 128, 512]) {id = 8 : i64, metadata = @airMemcpyId78} : memref<512x512xi32>
...
return
}
} {sym_name = "segment_0"}
}
-trace-size : Trace buffer size for cores and memtiles (in bytes)
-trace-offset : Trace buffer offset appended to ddr_id=2