-air-copy-to-dma
Convert memcpy to air.dma_memcpy_nd
Converts memory copy operations into air.dma_memcpy_nd operations so that the data is transferred through Direct Memory Access (DMA).
-air-insert-launch-and-segment-around-herd
Insert segment and launch ops around herd op
This pass inserts launch and segment operations around a herd op if the herd op does not have a parent launch or segment operation.
-air-linalg-to-func
Convert linalg dialect operations into function calls
-link-with : Path to the object file containing the functions that will be called in place of the linalg operations.
-air-par-to-herd
Convert parallel loops to air.herd
This pass converts parallel loop operations to air herd operations. The parallel loops can be scf.parallel or affine.parallel operations with 1 or 2 dimensional iteration spaces. The iteration space of the parallel loop will be normalized and will become the spatial iteration space of the new herd. If nested parallel loops are present then the depth option can be used to specify which loop depth to convert.
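The normalization step can be illustrated with a small Python model. This is a sketch of the general idea only, not the pass's implementation: it assumes that normalizing a loop means rewriting it to start at 0 with step 1, and the helper names `trip_count` and `original_index` are hypothetical.

```python
def trip_count(lb, ub, step):
    """Number of iterations of a loop running from lb (inclusive) to ub (exclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Ceiling division using integers only; an empty range gives 0.
    return max(0, -(-(ub - lb) // step))


def original_index(lb, step, i):
    """Recover the original induction value from the normalized index i."""
    return lb + i * step


# A 2D parallel loop described as (lower bound, upper bound, step) per dimension.
bounds = [(4, 20, 4), (0, 6, 2)]

# After normalization each dimension runs 0..trip_count-1; in this model
# that shape is what becomes the size of the herd.
herd_shape = tuple(trip_count(lb, ub, step) for lb, ub, step in bounds)
```

In this model the loop over 4, 8, 12, 16 becomes a herd dimension of size 4, and `original_index(4, 4, 3)` recovers the last original value, 16.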
-depth : Given a nest of parallel for loops, which depth to map to air.herd
-first-dim : Which herd dimension to map to first. Can be zero or one. If set to zero, the 0th dimension of the scf.parallel will be mapped to the x dimension of the herd. If set to one, the 0th dimension of the scf.parallel will be mapped to the y dimension of the herd.
-air-par-to-launch
Convert parallel loops to air.launch
This pass converts parallel loop operations to air launch operations. The parallel loops can be scf.parallel or affine.parallel operations. The iteration space of the parallel loops will be normalized and will become the iteration space of the new launch. If nested parallel loops are present then the depth option can be used to specify which loop depth to convert.
An air segment operation can optionally be inserted at the top level of the generated launch operations with the has-air-segment option.
-depth : Given a nest of parallel for loops, which depth to map to air.launch
-has-air-segment : Whether to create an air.segment op in generated air.launch regions
-air-split-devices
Split the input into one output per aie.device op
-output-prefix : File name prefix for split AIE modules. Set to '-' for stdout (default).
-air-to-aie
Lower air.launch_herd to AIE dialect
This pass converts AIR dialect herd and segment operations into AIE dialect modules and AIRRt dialect metadata.
One AIE dialect module is generated for each segment in the input module. Any herd without a parent segment will also generate an AIE dialect module as if the herd has an implicit segment.
For each herd in a segment a 2d array of aie.tile operations is generated. The physical placement of the tiles is specified using the herd operation placement attributes or with row-offset and col-offset options to the pass. aie.core operations are generated for each aie.tile and the herd body is cloned into each core.
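The tile placement described above can be pictured with a short Python sketch. It assumes the herd occupies a contiguous rectangle of tiles that starts at the given column and row offsets. The function name `place_herd` and that contiguity assumption are illustrative, not taken from the pass.

```python
def place_herd(cols, rows, col_offset, row_offset):
    """Return a 2D array of (column, row) tile coordinates for a herd.

    Assumption: herd position (x, y) maps to tile
    (col_offset + x, row_offset + y).
    """
    return [
        [(col_offset + x, row_offset + y) for y in range(rows)]
        for x in range(cols)
    ]


# A 1x1 herd placed at column 1, row 1 occupies the single tile (1, 1).
single = place_herd(1, 1, col_offset=1, row_offset=1)

# A 2x2 herd placed at column 3, row 2.
square = place_herd(2, 2, col_offset=3, row_offset=2)
```

Under this model, one core would be generated per coordinate pair, each receiving its own copy of the herd body.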
After generating aie.core operations, several other conversions are run:
memref.alloc operations returning L1 memory are converted into static allocations using aie.buffer operations.
dma_memcpy_nd operations in each core are lowered to aie.mem operations to perform the transfers and aie.locks are allocated to synchronize between the cores and the tile DMAs. As part of this conversion tile DMA schedules and channel allocations are generated for the aie.mem bodies. L3 or L2 DMA channels are allocated for sending or receiving data to the tile DMAs. aie.flow operations are allocated to connect the DMAs.
affine.if operations with tile id operands are specialized, as these are now constants. This allows an upstream user or transformation to specialize parts of each aie.core according to its location in the herd.
air.execute and air.wait_all operations are optimized away or transformed into sequential code.
The pass will insert AIRRt metadata into the original module to describe the segments, herds and DMA allocations that were generated in the AIE dialect output modules. Runtime code for configuration and control of segments is generated from the AIRRt metadata by the air-to-std pass.
Input:
func.func @f(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>) {
%c1 = arith.constant 1 : index
air.herd @herd_0 tile (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0, %arg7=%arg1) : memref<1024xi32>, memref<1024xi32> {
%alloc = memref.alloc() : memref<1024xi32, 2>
air.dma_memcpy_nd (%alloc[] [] [], %arg6[] [] []) {id = 1 : i32} : (memref<1024xi32, 2>, memref<1024xi32>)
memref.dealloc %alloc : memref<1024xi32, 2>
air.herd_terminator
}
return
}
Output:
The AIE resource allocation,
module @aie.segment_0 {
%0 = aie.tile(1, 1)
%1 = aie.tile(2, 0)
%2 = aie.lock(%0, 0)
%3 = aie.buffer(%0) {sym_name = "buf0"} : memref<1024xi32, 2>
aie.flow(%1, DMA : 0, %0, DMA : 0)
the AIE DMA program,
%4 = aie.mem(%0) {
%6 = aie.dma_start(S2MM, 0, ^bb1, ^bb2)
^bb1: // 2 preds: ^bb0, ^bb1
aie.use_lock(%2, Acquire, 0)
aie.dma_bd(%3 : memref<1024xi32, 2>, 0, 1)
aie.use_lock(%2, Release, 1)
cf.br ^bb1
^bb2: // pred: ^bb0
aie.end
}
the AIE Core program,
%5 = aie.core(%0) {
cf.br ^bb1
^bb1: // pred: ^bb0
cf.br ^bb2
^bb2: // pred: ^bb1
aie.use_lock(%2, Acquire, 1)
aie.use_lock(%2, Release, 0)
aie.end
}
and the AIRRt metadata,
airrt.module_metadata{
airrt.segment_metadata attributes {sym_name = "segment_0"}{
airrt.herd_metadata {dma_allocations = [{channel = 2 : i64, col = 0 : i64, id = 1 : i64, location = 2 : i64, row = 0 : i64} ], sym_name = "herd_0"}
}
}
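The lock handshake between the tile DMA and the core in the example above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: an acquire succeeds only when the lock holds the requested value, and a release stores the new value. The `Lock` class and `simulate` function are hypothetical. The core side is modeled as consuming repeatedly, which is closer to what the emit-while-loop option describes than to the single pass through the core body shown above.

```python
class Lock:
    """Toy lock holding an integer value."""

    def __init__(self, value=0):
        self.value = value

    def can_acquire(self, expected):
        # Assumption: an acquire succeeds only once the lock holds `expected`.
        return self.value == expected

    def release(self, new_value):
        # Assumption: a release stores the new value in the lock.
        self.value = new_value


def simulate(transfers):
    """Interleave a DMA producer and a core consumer sharing one buffer."""
    lock = Lock(0)
    buffer = None
    pending = list(transfers)
    log = []
    while pending or lock.value == 1:
        if pending and lock.can_acquire(0):
            # DMA side: Acquire 0, fill the buffer, Release 1.
            buffer = pending.pop(0)
            lock.release(1)
            log.append(("dma", buffer))
        elif lock.can_acquire(1):
            # Core side: Acquire 1, use the buffer, Release 0.
            log.append(("core", buffer))
            lock.release(0)
    return log


log = simulate(["a", "b"])
```

The resulting log alternates strictly between the DMA and the core, so the core never reads a buffer that is still being written and the DMA never overwrites data the core has not yet used.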
-row-offset : The default start row for any herds without 'y_loc' attribute.
-col-offset : The default start column for any herds without 'x_loc' attribute.
-emit-while-loop : Emit a while(1) around the herd code in generated aie.core ops.
-emit-herd-lock : Acquire and release a lock at the start and end of herd execution. The default is to acquire lock 0 with value 0 and release it with value 0. There is currently no way to override the default behavior.
-test-patterns : Test the given patterns.
-device : AIE device to target.
-use-objectfifo : Choose whether to lower data movement ops to aie.objectFifo, or directly to aie.locks.
-generate-shim-dma : Choose whether to schedule shim data movement via generating AIE shim DMA program, or AIR runtime.
-insert-trace-packet-flow : Create packet routed traces for cores and memtiles
-use-pkt-flow-at-shim-dma : Switch to using packet flows for all data movements at shim DMAs, to enable time-multiplex sharing with control packet flows.
-air-to-async
AIR dialect lowering
-air-to-std
AIR dialect lowering
This pass converts AIR dialect herd launch operations into loop nests representing the host-side control program for the herd. It also converts AIR dialect memcpy operations into AIRRt memcpy operations.
Input:
module {
func.func @f(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>) {
%c1 = arith.constant 1 : index
air.herd @herd_0 tile (%arg2, %arg3) in (%arg4=%c1, %arg5=%c1) args(%arg6=%arg0, %arg7=%arg1) : memref<1024xi32>, memref<1024xi32> attributes {x_loc = 1 : i32, y_loc = 1 : i32} {
%alloc = memref.alloc() : memref<1024xi32, 2>
air.dma_memcpy_nd (%alloc[] [] [], %arg6[] [] []) {id = 1 : i32} : (memref<1024xi32, 2>, memref<1024xi32>)
memref.dealloc %alloc : memref<1024xi32, 2>
air.herd_terminator
}
return
}
}
Output:
func.func @f(%arg0: memref<1024xi32>, %arg1: memref<1024xi32>) {
%c1 = arith.constant 1 : index
%h = airrt.herd_load "herd_0" : i64
affine.for %arg2 = 0 to 1 {
affine.for %arg3 = 0 to 1 {
%alloc = memref.alloc() : memref<1024xi32, 2>
%c1_i32 = arith.constant 1 : i32
%0 = arith.index_cast %arg3 : index to i64
%1 = arith.index_cast %arg2 : index to i64
%c0_i64 = arith.constant 0 : i64
%c1_i64 = arith.constant 1 : i64
airrt.dma_memcpy_nd(%c1_i32, %0, %1, %arg0[%c0_i64, %c0_i64, %c0_i64, %c0_i64], [%c1_i64, %c1_i64, %c1_i64, %c1_i64], [%c0_i64, %c0_i64, %c0_i64]) : (i32, i64, i64, memref<1024xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64])
memref.dealloc %alloc : memref<1024xi32, 2>
} {air.herd = "inner"}
} {air.herd = "outer"}
return
}
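The shape of the generated host-side control program can be sketched in Python. The model below only mirrors the structure visible in the output above: two nested loops over the herd dimensions, with one DMA call per iteration that receives the DMA id, the inner induction variable and the outer induction variable, in that order. The name `host_program` and the tuple representation of a call are hypothetical.

```python
def host_program(outer_size, inner_size, dma_id):
    """List the DMA calls issued by a loop nest over a herd's iteration space."""
    calls = []
    for outer in range(outer_size):      # loop tagged air.herd = "outer"
        for inner in range(inner_size):  # loop tagged air.herd = "inner"
            # Same operand order as in the lowered call: id, inner, outer.
            calls.append((dma_id, inner, outer))
    return calls


# The 1x1 herd from the example issues a single call with id 1.
one_by_one = host_program(1, 1, dma_id=1)

# A hypothetical 2x3 herd issues six calls, one per tile.
two_by_three = host_program(2, 3, dma_id=1)
```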
-airrt-to-llvm
Lower AIRRt dialect to LLVM dialect
This pass lowers AIRRt dialect to function calls and data structures matching those found in air_host.h.
AIRRt static metadata is transformed to LLVM dialect data structures. The data is generated as a number of globals with external linkage. The data layout is closely tied to the AIR runtime and the definitions in air_host.h. Any changes to this pass must be reflected there.
-airrt-to-npu
Lower AIRRt dialect to AIEX.npu dialect
Converts the runtime program, described in the AIRRt dialect, into an instruction sequence specific to the shim DMA controllers on the Ryzen AI platform.
Example:
Input:
module {
aie.device(npu1_1col) {
...
aie.shim_dma_allocation @airMemcpyId78(S2MM, 0, 0)
memref.global "public" @airMemcpyId78 : memref<32x128xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId19(MM2S, 0, 0)
memref.global "public" @airMemcpyId19 : memref<32x256xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId15(MM2S, 0, 2)
memref.global "public" @airMemcpyId15 : memref<256x32xi32, 1>
...
} {sym_name = "segment_0"}
...
func.func @matmul_512x512_1024xi32__dispatch_0_matmul_512x512x1024_i32() {
...
affine.for %arg0 = affine_map<(d0) -> (d0)>(%c0) to affine_map<(d0) -> (d0 + 4)>(%c0) {
affine.for %arg1 = affine_map<(d0) -> (d0)>(%c0_0) to affine_map<(d0) -> (d0 + 4)>(%c0_0) {
...
%25 = airrt.dma_memcpy_nd(%c17_i32, %15, %16, %0[%c0_i64, %17, %18, %19], [%c1_i64, %22, %23, %24], [%c0_i64, %20, %21]) {metadata = @airMemcpyId19} : (i32, i64, i64, memref<512x1024xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64]) : !airrt.event
...
%74 = airrt.dma_memcpy_nd(%c13_i32, %67, %68, %3[%c0_i64_15, %c0_i64_15, %69, %70], [%c1_i64_16, %c1_i64_16, %72, %73], [%c0_i64_15, %c0_i64_15, %71]) {metadata = @airMemcpyId15} : (i32, i64, i64, memref<1024x512xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64]) : !airrt.event
...
%111 = airrt.dma_memcpy_nd(%c78_i32, %104, %105, %6[%c0_i64_26, %c0_i64_26, %106, %107], [%c1_i64_27, %c1_i64_27, %109, %110], [%c0_i64_26, %c0_i64_26, %108]) {metadata = @airMemcpyId78} : (i32, i64, i64, memref<512x512xi32>, [i64, i64, i64, i64], [i64, i64, i64, i64], [i64, i64, i64]) : !airrt.event
...
}
}
return
}
}
Output:
module {
aie.device(npu1_1col) {
...
aie.shim_dma_allocation @airMemcpyId78(S2MM, 0, 0)
memref.global "public" @airMemcpyId78 : memref<32x128xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId19(MM2S, 0, 0)
memref.global "public" @airMemcpyId19 : memref<32x256xi32, 1>
...
aie.shim_dma_allocation @airMemcpyId15(MM2S, 0, 2)
memref.global "public" @airMemcpyId15 : memref<256x32xi32, 1>
...
func.func @matmul_512x512_1024xi32__dispatch_0_matmul_512x512x1024_i32() {
...
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][4, 4, 32, 256][0, 256, 1024]) {id = 0 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 128, 0][4, 4, 32, 256][0, 256, 1024]) {id = 1 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 256, 0][4, 4, 32, 256][0, 256, 1024]) {id = 2 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 384, 0][4, 4, 32, 256][0, 256, 1024]) {id = 3 : i64, metadata = @airMemcpyId19} : memref<512x1024xi32>
...
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 0 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 1 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 2 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][4, 2, 512, 32][128, 262144, 512]) {id = 3 : i64, metadata = @airMemcpyId15} : memref<1024x512xi32>
...
aiex.npu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][4, 4, 32, 128][65536, 128, 512]) {id = 8 : i64, metadata = @airMemcpyId78} : memref<512x512xi32>
...
return
}
} {sym_name = "segment_0"}
}
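One way to read the offset, size and stride lists in the output above is as a four-dimensional access pattern over a flat buffer. The Python sketch below models that reading. It rests on assumptions about the notation rather than on the op's definition: the lists are ordered outermost dimension first, the innermost stride is implicitly 1 (so only three strides are written), and each offset is scaled by the stride of its own dimension. The names `base_offset` and `access_offsets` are hypothetical.

```python
from itertools import product


def base_offset(offsets, strides):
    """Linear element offset of the pattern's starting point."""
    full_strides = list(strides) + [1]  # assumed implicit innermost stride
    return sum(o * s for o, s in zip(offsets, full_strides))


def access_offsets(offsets, sizes, strides):
    """List the linear element offsets touched, in traversal order."""
    full_strides = list(strides) + [1]
    base = base_offset(offsets, strides)
    return [
        base + sum(i * s for i, s in zip(idx, full_strides))
        for idx in product(*(range(n) for n in sizes))
    ]


# Small example: a 4x6 row-major buffer, read as two 2x3 tiles
# taken from rows 1 and 2.
pattern = access_offsets([0, 0, 1, 0], [1, 2, 2, 3], [0, 3, 6])

# Under the same assumptions, the second transfer in the output above
# would start 128 rows into the 512x1024 buffer.
start = base_offset([0, 0, 128, 0], [0, 256, 1024])
```

In the small example the pattern first reads the left tile (elements 6 to 8, then 12 to 14) and then the right tile (9 to 11, then 15 to 17). A stride of 0 in the outermost dimension would repeat the same data in this model.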
-trace-size : Trace buffer size for cores and memtiles (in bytes)
-trace-offset : Trace buffer offset appended to ddr_id=2