transform.air.herd_vectorize (transform::AIRHerdVectorizeOp)
Vectorize operations inside air.herd operations
Syntax:
operation ::= `transform.air.herd_vectorize` $target attr-dict `:` functional-type(operands, results)
This transform takes a handle to air.herd operations and vectorizes the operations inside their bodies using the same logic as the AIRHerdVectorizePass. It walks the body of each herd operation and applies vectorization patterns to linalg operations and other vectorizable operations.
The transform supports the same options as the AIRHerdVectorizePass; they are exposed as the attributes listed in the attribute table below.
Example:
%herd = transform.structured.match ops{["air.herd"]} in %f : (!transform.any_op) -> !transform.any_op
%vectorized = transform.air.herd_vectorize %herd {
vectorize_nd_extract = false,
flatten_1d_depthwise_conv = false,
vectorize_padding = true
} : (!transform.any_op) -> !transform.any_op
Returns a handle to the transformed air.herd operations.
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| vectorize_nd_extract | ::mlir::BoolAttr | bool attribute |
| flatten_1d_depthwise_conv | ::mlir::BoolAttr | bool attribute |
| disable_transfer_permutation_map_lowering_patterns | ::mlir::BoolAttr | bool attribute |
| disable_multi_reduction_to_contract_patterns | ::mlir::BoolAttr | bool attribute |
| vectorize_padding | ::mlir::BoolAttr | bool attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.hoist_static_alloc (transform::AIRHoistStaticAllocOp)
Hoist static allocations.
Syntax:
operation ::= `transform.air.hoist_static_alloc` $target attr-dict `:` functional-type(operands, results)
Moves certain statically-sized memref.alloc operations from inner blocks
to the entry block of the target function. This shortens and unifies buffer
lifetimes, which can unlock reuse and downstream optimizations.
Applies to memref.alloc buffers with static shapes. Uses in terminators (scf.yield, func.return) are not rewritten; such allocations are skipped.
Before:
func.func @foo(%arg0: memref<64xi32>) {
scf.for %i = %c0 to %c4 step %c1 {
%tmp = memref.alloc() : memref<64xi32>
linalg.fill ins(%cst : i32) outs(%tmp : memref<64xi32>)
memref.dealloc %tmp : memref<64xi32>
}
return
}
After:
func.func @foo(%arg0: memref<64xi32>) {
%tmp.hoisted = memref.alloc() : memref<64xi32>
scf.for %i = %c0 to %c4 step %c1 {
linalg.fill ins(%cst : i32) outs(%tmp.hoisted : memref<64xi32>)
}
memref.dealloc %tmp.hoisted : memref<64xi32>
return
}
Example usage:
transform.sequence %arg0 : !transform.any_op failures(propagate) {
^bb0(%f: !transform.any_op):
transform.air.hoist_static_alloc %f
: (!transform.any_op) -> ()
}
Traits: ReportTrackingListenerFailuresOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

transform.air.broadcast_before_unary (transform::BroadcastBeforeUnaryOp)
Move vector.broadcast before element-wise unary operations to enable vector execution
Syntax:
operation ::= `transform.air.broadcast_before_unary` $target attr-dict `:` functional-type(operands, results)
This transform identifies patterns where an element-wise unary operation operates on a single-element vector or a scalar and its result is immediately broadcast to a larger vector. It rewrites them to broadcast first and then apply the unary operation, so the operation executes on the full vector, which can be more efficient on vector engines.
Pattern matched (vector<1xT>):
%unary_result = unary_op %x : vector<1xT>
%result = vector.broadcast %unary_result : vector<1xT> to vector<NxT>
Pattern matched (scalar):
%unary_result = unary_op %x : T
%result = vector.broadcast %unary_result : T to vector<NxT>
Transformed to:
%broadcast = vector.broadcast %x : vector<1xT> (or T) to vector<NxT>
%result = unary_op %broadcast : vector<NxT>
This is mathematically valid for element-wise operations where op(broadcast(x)) == broadcast(op(x)).
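As a concrete instance of the vector<1xT> pattern, the sketch below uses math.rsqrt on f32 elements with a 16-lane result. The value names and the vector width are chosen for illustration only and do not come from the op's implementation.

```mlir
// Before: rsqrt runs on a single-element vector, then the result is broadcast.
%r = math.rsqrt %x : vector<1xf32>
%b = vector.broadcast %r : vector<1xf32> to vector<16xf32>

// After: the input is broadcast first, so rsqrt runs on the full vector.
%xb = vector.broadcast %x : vector<1xf32> to vector<16xf32>
%b = math.rsqrt %xb : vector<16xf32>
```

Every lane of the result holds rsqrt(x) in both forms, which is the op(broadcast(x)) == broadcast(op(x)) property the rewrite relies on.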
By default (when op_name is not specified), the transform uses trait-based checking to automatically support all Pure, single-operand, single-result, element-wise operations in math/arith dialects. If op_name is specified, only operations with that exact name are transformed.
Safety conditions checked:
This optimization is particularly beneficial for hardware accelerators like AMD AIEs that can execute certain operations only on vector engines, not on scalar units. The pattern is common in layer normalization and other neural network operations.
Example usage (all qualifying unary ops):
%func = transform.structured.match ops{["func.func"]} in %arg0
: (!transform.any_op) -> !transform.any_op
%result = transform.air.broadcast_before_unary %func : (!transform.any_op) -> !transform.any_op
Example usage (specific operation only):
%func = transform.structured.match ops{["func.func"]} in %arg0
: (!transform.any_op) -> !transform.any_op
%result = transform.air.broadcast_before_unary %func {op_name = "math.rsqrt"} : (!transform.any_op) -> !transform.any_op
Returns a handle to the transformed operation.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| op_name | ::mlir::StringAttr | string attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.convert_divf_sqrt_to_rsqrt (transform::ConvertDivfSqrtToRsqrtOp)
Convert arith.divf(1.0, math.sqrt(x)) pattern to math.rsqrt(x)
Syntax:
operation ::= `transform.air.convert_divf_sqrt_to_rsqrt` $target attr-dict `:` functional-type(operands, results)
This transform walks through operations with the IsolatedFromAbove trait (such as func.func, air.herd, air.segment, etc.) and identifies the pattern of dividing a constant 1.0 by the result of math.sqrt, converting it to a single math.rsqrt operation.
The transformation looks for:
%sqrt = math.sqrt %x : vector<NxT>
%rsqrt = arith.divf %cst_1.0, %sqrt : vector<NxT>
And replaces it with:
%rsqrt = math.rsqrt %x : vector<NxT>
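A fully typed version of the pattern may make the rewrite easier to picture. The sketch below assumes the constant 1.0 is supplied as a splat arith.constant and uses a 16-lane f32 vector; both choices are illustrative assumptions, not a statement of exactly which constant forms the transform matches.

```mlir
// Before: reciprocal square root written as 1.0 / sqrt(x).
%cst = arith.constant dense<1.000000e+00> : vector<16xf32>
%sqrt = math.sqrt %x : vector<16xf32>
%r = arith.divf %cst, %sqrt : vector<16xf32>

// After: a single math.rsqrt replaces the sqrt + divf pair.
%r = math.rsqrt %x : vector<16xf32>
```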
Safety conditions checked:
This optimization is beneficial for hardware accelerators that have native rsqrt instructions, as it’s typically faster and more accurate than separate sqrt + division operations. This is a common pattern in layer normalization and other neural network operations.
Example usage:
%herd = transform.structured.match ops{["air.herd"]} in %arg0
: (!transform.any_op) -> !transform.any_op
%result = transform.air.convert_divf_sqrt_to_rsqrt %herd : (!transform.any_op) -> !transform.any_op
Returns a handle to the transformed operation.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.convert_memref_copy_to_linalg_copy (transform::ConvertMemrefCopyToLinalgCopyOp)
Convert memref.copy operations to linalg.copy operations
Syntax:
operation ::= `transform.air.convert_memref_copy_to_linalg_copy` $target attr-dict `:` functional-type(operands, results)
This transform converts memref.copy operations to linalg.copy operations.
This can be useful for enabling further linalg-based optimizations and transformations.
The transformation replaces:
memref.copy %source, %dest : memref<...> to memref<...>
With:
linalg.copy ins(%source : memref<...>) outs(%dest : memref<...>)
Returns a handle to the modified operation containing the transformed copies.
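A possible invocation, following the syntax above, is sketched below. It assumes the target is a func.func containing the copies and that %arg0 is the block argument of the enclosing transform sequence; both are assumptions made for illustration.

```mlir
// Match the function(s) whose memref.copy ops should be rewritten.
%func = transform.structured.match ops{["func.func"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
// Rewrite every memref.copy inside the target into linalg.copy.
%converted = transform.air.convert_memref_copy_to_linalg_copy %func
    : (!transform.any_op) -> !transform.any_op
```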
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.convert_size1_vector_to_scalar (transform::ConvertSize1VectorToScalarOp)
Convert size-1 vector operations to scalar operations following LLVM’s approach
Syntax:
operation ::= `transform.air.convert_size1_vector_to_scalar` $target attr-dict `:` functional-type(operands, results)
This transform converts operations on size-1 vectors (e.g., vector<1xf32>, vector<1x1xi32>) to use scalar operations instead. Following LLVM’s VectorDropLeadUnitDim pattern, it uses rewrite patterns (NOT a TypeConverter) and inserts extract/broadcast operations at boundaries.
Key transformations:
Important: Function signatures and block argument types remain as vector<1xT>. Only the operations inside are converted to scalars, with extract/broadcast at boundaries. Canonicalization then folds redundant extract(broadcast(x)) → x pairs.
Example with memory operations:
// Before:
%v = vector.transfer_read %mem[%i, %j], %pad : memref<8x8xf32>, vector<1xf32>
%result = arith.mulf %v, %v : vector<1xf32>
vector.transfer_write %result, %mem[%i, %j] : vector<1xf32>, memref<8x8xf32>
// After:
%scalar = memref.load %mem[%i, %j] : memref<8x8xf32>
%result = arith.mulf %scalar, %scalar : f32
memref.store %result, %mem[%i, %j] : memref<8x8xf32>
Example with function arguments (extract/broadcast at boundaries):
// Before:
func.func @foo(%a: vector<1xf32>, %b: vector<1xf32>) -> vector<1xf32> {
%add = arith.addf %a, %b : vector<1xf32>
%mul = arith.mulf %add, %a : vector<1xf32>
return %mul : vector<1xf32>
}
// After:
func.func @foo(%a: vector<1xf32>, %b: vector<1xf32>) -> vector<1xf32> {
%0 = vector.extract %a[0] : f32 from vector<1xf32>
%1 = vector.extract %b[0] : f32 from vector<1xf32>
%2 = arith.addf %0, %1 : f32
%3 = vector.extract %a[0] : f32 from vector<1xf32>
%4 = arith.mulf %2, %3 : f32
%5 = vector.broadcast %4 : f32 to vector<1xf32>
return %5 : vector<1xf32>
}
Example with loops (boundaries preserved):
// Before:
%result = scf.for %i = %c0 to %c4 step %c1 iter_args(%arg = %init) -> (vector<1xf32>) {
%add = arith.addf %arg, %cst : vector<1xf32>
scf.yield %add : vector<1xf32>
}
// After:
%result = scf.for %i = %c0 to %c4 step %c1 iter_args(%arg = %init) -> (vector<1xf32>) {
%0 = vector.extract %arg[0] : f32 from vector<1xf32>
%1 = arith.addf %0, %cst_scalar : f32
%2 = vector.broadcast %1 : f32 to vector<1xf32>
scf.yield %2 : vector<1xf32>
}
Result: Clean scalar operations on memref.load/store with extract/broadcast only at region boundaries. Canonicalization folds adjacent extract/broadcast pairs automatically.
Usage:
%func = transform.structured.match ops{["func.func"]} in %arg0
: (!transform.any_op) -> !transform.any_op
%result = transform.air.convert_size1_vector_to_scalar %func : (!transform.any_op) -> !transform.any_op
Returns a handle to the transformed operation.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.copy_to_dma (transform::CopyToDmaOp)
Syntax:
operation ::= `transform.air.copy_to_dma` $target attr-dict `:` functional-type(operands, results)
Transform a memref.copy operation into an air.dma_memcpy_nd operation.
Returns the new air.dma_memcpy_nd operation.
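A minimal usage sketch, following the syntax above: the memref.copy operations are matched first and the resulting handle is passed as the target. The handle names are illustrative, and %arg0 is assumed to be the block argument of the enclosing transform sequence.

```mlir
// Collect the memref.copy ops to convert.
%copies = transform.structured.match ops{["memref.copy"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
// Each matched copy is replaced by an air.dma_memcpy_nd; %dmas refers to the new ops.
%dmas = transform.air.copy_to_dma %copies
    : (!transform.any_op) -> !transform.any_op
```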
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.eliminate_cascade_memcpy (transform::EliminateCascadeMemcpyOp)
Eliminate intermediate memref buffers in cascaded DMA operations
Syntax:
operation ::= `transform.air.eliminate_cascade_memcpy` $target attr-dict `:` functional-type(operands, results)
This transform identifies and eliminates intermediate memref buffers in cascaded air.dma_memcpy_nd operations. It looks for the pattern where an intermediate buffer is used exactly twice: once as the destination of a DMA operation and once as the source of another DMA operation, with both operations using default access patterns (empty offsets, sizes, and strides).
The transformation replaces:
air.dma_memcpy_nd (%intermediate[] [] [], %source[] [] []) : (memref<...>, memref<...>)
air.dma_memcpy_nd (%dest[] [] [], %intermediate[] [] []) : (memref<...>, memref<...>)
With:
air.dma_memcpy_nd (%dest[] [] [], %source[] [] []) : (memref<...>, memref<...>)
This optimization eliminates unnecessary intermediate memory allocations and reduces memory traffic, which is particularly beneficial for cascade patterns in AIR programs.
Returns a handle to the modified operation.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.eliminate_redundant_vector_transfers (transform::EliminateRedundantVectorTransfersOp)
Eliminate redundant vector.transfer_read operations
Syntax:
operation ::= `transform.air.eliminate_redundant_vector_transfers` $target attr-dict `:` functional-type(operands, results)
This transform identifies and eliminates redundant vector.transfer_read operations within the target operation. Two vector.transfer_read operations are considered redundant when:
The transformation walks through all vector.transfer_read operations in the target, compares each pair, and when a redundant read is found, replaces all uses of the second read with the result of the first read, then erases the redundant operation.
This optimization is particularly useful after loop unrolling or other transformations that may duplicate read operations unnecessarily, reducing memory traffic and register pressure.
Example:
// Before:
%0 = vector.transfer_read %memref[%i, %j], %pad : memref<8x8xi32>, vector<4xi32>
%1 = arith.addi %0, %cst : vector<4xi32>
%2 = vector.transfer_read %memref[%i, %j], %pad : memref<8x8xi32>, vector<4xi32> // Redundant!
%3 = arith.muli %2, %other : vector<4xi32>
// After:
%0 = vector.transfer_read %memref[%i, %j], %pad : memref<8x8xi32>, vector<4xi32>
%1 = arith.addi %0, %cst : vector<4xi32>
%3 = arith.muli %0, %other : vector<4xi32> // Uses %0 instead of redundant %2
Returns a handle to the transformed operation.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.flatten_for_iter_args (transform::FlattenForIterArgsOp)
Flatten vector-typed iter_args of an scf.for loop using vector.shape_cast
Syntax:
operation ::= `transform.air.flatten_for_iter_args` $target attr-dict `:` functional-type(operands, results)
This transform takes a handle to an scf.for loop and flattens all vector-typed iter_args by inserting vector.shape_cast operations. The transformation:
This is useful for ensuring that loop-carried dependencies use flattened vector types, which can be required by certain backend lowerings or optimization passes.
Example:
// Before:
%result:4 = scf.for %i = %c0 to %c4 step %c1
iter_args(%arg0 = %v0, %arg1 = %v1, %arg2 = %v2, %arg3 = %v3)
-> (vector<1x1x8x8xi16>, vector<1x1x8x8xi16>, vector<1x1x8x8xi16>, vector<1x1x8x8xi16>) {
// ... computation ...
scf.yield %r0, %r1, %r2, %r3 : vector<1x1x8x8xi16>, vector<1x1x8x8xi16>, vector<1x1x8x8xi16>, vector<1x1x8x8xi16>
}
// After:
%v0_flat = vector.shape_cast %v0 : vector<1x1x8x8xi16> to vector<64xi16>
%v1_flat = vector.shape_cast %v1 : vector<1x1x8x8xi16> to vector<64xi16>
%v2_flat = vector.shape_cast %v2 : vector<1x1x8x8xi16> to vector<64xi16>
%v3_flat = vector.shape_cast %v3 : vector<1x1x8x8xi16> to vector<64xi16>
%result:4 = scf.for %i = %c0 to %c4 step %c1
iter_args(%arg0 = %v0_flat, %arg1 = %v1_flat, %arg2 = %v2_flat, %arg3 = %v3_flat)
-> (vector<64xi16>, vector<64xi16>, vector<64xi16>, vector<64xi16>) {
%arg0_shaped = vector.shape_cast %arg0 : vector<64xi16> to vector<1x1x8x8xi16>
%arg1_shaped = vector.shape_cast %arg1 : vector<64xi16> to vector<1x1x8x8xi16>
%arg2_shaped = vector.shape_cast %arg2 : vector<64xi16> to vector<1x1x8x8xi16>
%arg3_shaped = vector.shape_cast %arg3 : vector<64xi16> to vector<1x1x8x8xi16>
// ... computation using %arg0_shaped, %arg1_shaped, etc. ...
%r0_flat = vector.shape_cast %r0 : vector<1x1x8x8xi16> to vector<64xi16>
%r1_flat = vector.shape_cast %r1 : vector<1x1x8x8xi16> to vector<64xi16>
%r2_flat = vector.shape_cast %r2 : vector<1x1x8x8xi16> to vector<64xi16>
%r3_flat = vector.shape_cast %r3 : vector<1x1x8x8xi16> to vector<64xi16>
scf.yield %r0_flat, %r1_flat, %r2_flat, %r3_flat : vector<64xi16>, vector<64xi16>, vector<64xi16>, vector<64xi16>
}
Returns a handle to the transformed loop.
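The rewrite shown above could be requested from a transform script roughly as follows. This sketch assumes the payload contains a single scf.for loop of interest and that %arg0 is the block argument of the enclosing transform sequence; a real script would typically narrow the match further.

```mlir
// Obtain a handle to the scf.for loop whose iter_args should be flattened.
%loop = transform.structured.match ops{["scf.for"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
// Flatten all vector-typed iter_args; %flat_loop refers to the rewritten loop.
%flat_loop = transform.air.flatten_for_iter_args %loop
    : (!transform.any_op) -> !transform.any_op
```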
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.fold_unit_extent_dims (transform::FoldUnitExtentDimsOp)
Fold unit-extent dimensions with bounded greedy iterations
Syntax:
operation ::= `transform.air.fold_unit_extent_dims` $target attr-dict `:` functional-type(operands, results)
Applies linalg fold_unit_extent_dims_via_reshapes patterns to the target function with a bounded number of greedy rewrite iterations. This is needed because LLVM 23’s populateFoldUnitExtentDimsPatterns doesn’t converge in the greedy driver on IR with air.herd ops containing unit-extent memref dimensions. This op runs the patterns with a limited iteration count and ignores non-convergence, preserving the partial results.
Also applies canonicalization, tiling canonicalization, and scf.for loop canonicalization patterns alongside fold_unit_extent to ensure proper pattern interactions.
Returns a handle to the modified function.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.forall_with_reduce_to_parallel (transform::ForallWithReduceToParallelOp)
Converts a pattern of scf.forall and linalg.reduce to scf.parallel
Syntax:
operation ::= `transform.air.forall_with_reduce_to_parallel` $target attr-dict `:` functional-type(operands, results)
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| transformed | variadic of TransformHandleTypeInterface instance |

transform.air.fuse_elementwise_linalg (transform::FuseElementwiseLinalgOp)
Apply linalg elementwise fusion patterns to a func.func operation
Syntax:
operation ::= `transform.air.fuse_elementwise_linalg` $target attr-dict `:` functional-type(operands, results)
This transform walks the body of a func.func operation and applies the linalg elementwise fusion patterns, which include:
This is the transform dialect equivalent of running the -linalg-fuse-elementwise-ops
pass on the target function.
Example:
// Before fusion:
%0 = linalg.generic {
indexing_maps = [#map0, #map1],
iterator_types = ["parallel"]
} ins(%input : tensor<16xf32>) outs(%temp1 : tensor<16xf32>) {
^bb0(%arg0: f32, %arg1: f32):
%1 = arith.mulf %arg0, %cst : f32
linalg.yield %1 : f32
} -> tensor<16xf32>
%result = linalg.generic {
indexing_maps = [#map0, #map2, #map1],
iterator_types = ["parallel"]
} ins(%0, %input2 : tensor<16xf32>, tensor<16xf32>) outs(%temp2 : tensor<16xf32>) {
^bb0(%arg0: f32, %arg1: f32, %arg2: f32):
%2 = arith.addf %arg0, %arg1 : f32
linalg.yield %2 : f32
} -> tensor<16xf32>
// After fusion:
%result = linalg.generic {
indexing_maps = [#map0, #map2, #map1],
iterator_types = ["parallel"]
} ins(%input, %input2 : tensor<16xf32>, tensor<16xf32>) outs(%temp2 : tensor<16xf32>) {
^bb0(%arg0: f32, %arg1: f32, %arg2: f32):
%1 = arith.mulf %arg0, %cst : f32
%2 = arith.addf %1, %arg1 : f32
linalg.yield %2 : f32
} -> tensor<16xf32>
Returns a handle to the modified function.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.fuse_into_containing_op (transform::FuseIntoContainingMemrefOp)
Fuse a producer into a containing operation.
Syntax:
operation ::= `transform.air.fuse_into_containing_op` $producer_op `into` $containing_op attr-dict `:` functional-type(operands, results)
Fuses the producer_op into the containing_op.
Returns a handle to the fused ops.
The producer is a subview slice of a tiled op. This transform computes the accessed producer slice inside of the containing op (“tile and fuse”).
The containing op handle must be associated with exactly one payload op. The producer op handle may be associated with multiple payload ops. This transform fuses exactly one producer.
If the producer could not be fused, this operation fails silently. This is the case when tiling fails or when the producer op has zero uses within the containing op. I.e., “producers” that are not consumed within the containing op are rejected by this operation.
This operation reads and frees the producer handle. This operation reads the containing op handle.
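The custom `into` syntax can be illustrated with a short sketch. The payload op kinds matched here (a linalg.fill producer and an scf.parallel containing op) are hypothetical choices for illustration, and %arg0 is assumed to be the block argument of the enclosing transform sequence.

```mlir
// Hypothetical payload: a linalg.fill producer consumed inside one scf.parallel.
%producer = transform.structured.match ops{["linalg.fill"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
// The containing-op handle must map to exactly one payload op.
%containing = transform.structured.match ops{["scf.parallel"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
%fused = transform.air.fuse_into_containing_op %producer into %containing
    : (!transform.any_op, !transform.any_op) -> !transform.any_op
// %producer is consumed (freed) here and must not be used afterwards.
```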
Interfaces: MemoryEffectOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| producer_op | TransformHandleTypeInterface instance |
| containing_op | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| fused_op | TransformHandleTypeInterface instance |

transform.air.fuse_multi_op_linalg (transform::FuseMultiOpLinalgOp)
Fuse a linalg operation containing multiple element-wise ops with its consumer
Syntax:
operation ::= `transform.air.fuse_multi_op_linalg` $first_op `,` $second_op attr-dict `:` functional-type(operands, results)
This transform fuses two linalg operations where:
The second operation may have any iterator types (parallel or reduction).
This is a generalization of FuseExtfLinalgOp that supports multiple operations. The fusion is performed by:
All operations in the first op’s body must be:
Example with reduction in second op:
// Before fusion:
%0 = linalg.generic {
indexing_maps = [#map0, #map1],
iterator_types = ["parallel", "parallel"]
} ins(%input : tensor<16x8xf16>) outs(%temp : tensor<16x8xf32>) {
^bb0(%arg0: f16, %arg1: f32):
%1 = arith.extf %arg0 : f16 to f32
%2 = arith.mulf %1, %cst : f32
linalg.yield %2 : f32
} -> tensor<16x8xf32>
%result = linalg.generic {
indexing_maps = [#map2, #map3],
iterator_types = ["parallel", "reduction"]
} ins(%0 : tensor<16x8xf32>) outs(%output : tensor<16xf32>) {
^bb0(%arg0: f32, %arg1: f32):
%3 = arith.addf %arg0, %arg1 : f32
linalg.yield %3 : f32
} -> tensor<16xf32>
// After fusion:
%result = linalg.generic {
indexing_maps = [#map4, #map3],
iterator_types = ["parallel", "reduction"]
} ins(%input : tensor<16x8xf16>) outs(%output : tensor<16xf32>) {
^bb0(%arg0: f16, %arg1: f32):
%1 = arith.extf %arg0 : f16 to f32
%2 = arith.mulf %1, %cst : f32
%3 = arith.addf %2, %arg1 : f32
linalg.yield %3 : f32
} -> tensor<16xf32>
Returns a handle to the fused operation (the second operation after modification).
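One way to obtain the two handles and invoke the op is sketched below. It assumes the payload contains exactly two linalg.generic ops, in producer-then-consumer order, so that transform.split_handle yields one handle for each; that assumption, the handle names, and %arg0 (the block argument of the enclosing transform sequence) are illustrative.

```mlir
// Match both linalg.generic ops, then split the handle into one handle per op.
%generics = transform.structured.match ops{["linalg.generic"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
%first, %second = transform.split_handle %generics
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
// Fuse the element-wise body of %first into %second.
%fused = transform.air.fuse_multi_op_linalg %first, %second
    : (!transform.any_op, !transform.any_op) -> !transform.any_op
```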
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| first_op | TransformHandleTypeInterface instance |
| second_op | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| fused_op | TransformHandleTypeInterface instance |

transform.air.fuse_truncf_linalg (transform::FuseTruncfLinalgOp)
Fuse a linalg operation containing only arith.truncf into its producer
Syntax:
operation ::= `transform.air.fuse_truncf_linalg` $truncf_op `,` $producer_op attr-dict `:` functional-type(operands, results)
This transform fuses two linalg operations where:
The fusion is performed by:
This optimization folds the arithmetic truncations into the producer linalg ops, enabling the use of native narrow-datatype intrinsics on targets such as AMD AIEs, and reducing intermediate memory storage requirements.
Example:
// Before fusion:
%0 = linalg.generic {
^bb0(%arg0: f32, %arg1: f32):
%1 = arith.addf %arg0, %arg1 : f32
linalg.yield %1 : f32
} ins(%input1, %input2 : tensor<16xf32>, tensor<16xf32>) outs(%temp : tensor<16xf32>)
%result = linalg.generic {
^bb0(%arg0: f32):
%2 = arith.truncf %arg0 : f32 to f16
linalg.yield %2 : f16
} ins(%0 : tensor<16xf32>) outs(%output : tensor<16xf16>)
// After fusion:
%result = linalg.generic {
^bb0(%arg0: f32, %arg1: f32):
%1 = arith.addf %arg0, %arg1 : f32
%2 = arith.truncf %1 : f32 to f16
linalg.yield %2 : f16
} ins(%input1, %input2 : tensor<16xf32>, tensor<16xf32>) outs(%output : tensor<16xf16>)
Returns a handle to the fused operation (the producer operation after modification).
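Note that the syntax takes the truncf op first and the producer second. A sketch of an invocation follows; it assumes the payload contains exactly two linalg.generic ops, with the producer appearing before the truncf op, and that %arg0 is the block argument of the enclosing transform sequence. The handle names are illustrative.

```mlir
// Match both linalg.generic ops and split them in payload order.
%generics = transform.structured.match ops{["linalg.generic"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
%producer, %truncf = transform.split_handle %generics
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
// Operand order follows the op syntax: truncf op first, producer second.
%fused = transform.air.fuse_truncf_linalg %truncf, %producer
    : (!transform.any_op, !transform.any_op) -> !transform.any_op
```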
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| truncf_op | TransformHandleTypeInterface instance |
| producer_op | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| fused_op | TransformHandleTypeInterface instance |

transform.air.get_segment_for (transform::GetSegmentForOp)
Gets a handle to the parent ‘air.segment’ of the given operation
Syntax:
operation ::= `transform.air.get_segment_for` $target attr-dict `:` functional-type(operands, results)
Produces a handle to the parent air.segment op for each payload IR
operation associated with the operand. Fails if a segment cannot be found.
The list of operations associated with the handle contains
parent operations in the same order as the list associated with the operand,
except for operations that are parents to more than one input which are only
present once.
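For illustration, the sketch below navigates from air.herd ops to their enclosing segments. It assumes every matched herd is nested inside an air.segment (otherwise the op fails, as stated above) and that %arg0 is the block argument of the enclosing transform sequence.

```mlir
// Match the herds, then navigate to the air.segment enclosing each one.
%herds = transform.structured.match ops{["air.herd"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
%segments = transform.air.get_segment_for %herds
    : (!transform.any_op) -> !transform.any_op
```

If several herds share one segment, that segment appears only once in %segments.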
Traits: NavigationTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| parent | TransformHandleTypeInterface instance |

transform.air.hoist_cast_pair (transform::HoistCastPairOp)
Hoist extension/truncation operation pairs out of a loop
Syntax:
operation ::= `transform.air.hoist_cast_pair` $extension_op `,` $truncation_op `,` $loop_op attr-dict `:` functional-type(operands, results)
This transform takes handles to an extension operation (arith.extsi, arith.extui, or arith.extf), a truncation operation (arith.trunci or arith.truncf), and their parent scf.for loop. It hoists the extension/truncation pair out of the loop by:
Supports the following extension/truncation pairs:
This optimization is beneficial when accumulator values are repeatedly extended to a wider type for computation and then truncated back to a narrow type at each iteration. By keeping the accumulator in the wide type throughout all loop iterations, we eliminate redundant extend/truncate operations.
Example (Integer):
// Before:
%init = ... : vector<64xi16>
%result = scf.for %i = %c0 to %c4 step %c1 iter_args(%arg = %init) -> (vector<64xi16>) {
%arg_shaped = vector.shape_cast %arg : vector<64xi16> to vector<1x1x8x8xi16>
%arg_ext = arith.extsi %arg_shaped : vector<1x1x8x8xi16> to vector<1x1x8x8xi32>
// ... computation using %arg_ext ...
%result_i32 = vector.contract ... : ... into vector<1x1x8x8xi32>
%result_i16 = arith.trunci %result_i32 : vector<1x1x8x8xi32> to vector<1x1x8x8xi16>
%result_flat = vector.shape_cast %result_i16 : vector<1x1x8x8xi16> to vector<64xi16>
scf.yield %result_flat : vector<64xi16>
}
// After:
%init = ... : vector<64xi16>
%init_shaped = vector.shape_cast %init : vector<64xi16> to vector<1x1x8x8xi16>
%init_ext = arith.extsi %init_shaped : vector<1x1x8x8xi16> to vector<1x1x8x8xi32>
%init_flat = vector.shape_cast %init_ext : vector<1x1x8x8xi32> to vector<64xi32>
%loop_result = scf.for %i = %c0 to %c4 step %c1 iter_args(%arg = %init_flat) -> (vector<64xi32>) {
%arg_shaped = vector.shape_cast %arg : vector<64xi32> to vector<1x1x8x8xi32>
// ... computation using %arg_shaped directly (no extsi needed) ...
%result_i32 = vector.contract ... : ... into vector<1x1x8x8xi32>
%result_flat = vector.shape_cast %result_i32 : vector<1x1x8x8xi32> to vector<64xi32>
scf.yield %result_flat : vector<64xi32>
}
%result_shaped = vector.shape_cast %loop_result : vector<64xi32> to vector<1x1x8x8xi32>
%result_i16 = arith.trunci %result_shaped : vector<1x1x8x8xi32> to vector<1x1x8x8xi16>
%result = vector.shape_cast %result_i16 : vector<1x1x8x8xi16> to vector<64xi16>
Example (Floating-point):
// Before:
%init = ... : vector<64xbf16>
%result = scf.for %i = %c0 to %c4 step %c1 iter_args(%arg = %init) -> (vector<64xbf16>) {
%arg_ext = arith.extf %arg : vector<64xbf16> to vector<64xf32>
// ... computation using %arg_ext ...
%result_f32 = vector.fma ... : vector<64xf32>
%result_bf16 = arith.truncf %result_f32 : vector<64xf32> to vector<64xbf16>
scf.yield %result_bf16 : vector<64xbf16>
}
// After:
%init = ... : vector<64xbf16>
%init_ext = arith.extf %init : vector<64xbf16> to vector<64xf32>
%loop_result = scf.for %i = %c0 to %c4 step %c1 iter_args(%arg = %init_ext) -> (vector<64xf32>) {
// ... computation using %arg directly (no extf needed) ...
%result_f32 = vector.fma ... : vector<64xf32>
scf.yield %result_f32 : vector<64xf32>
}
%result = arith.truncf %loop_result : vector<64xf32> to vector<64xbf16>
Requirements:
Returns a handle to the transformed loop.
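A possible invocation for the integer example is sketched below. It assumes the payload holds a single scf.for loop containing exactly one arith.extsi and one arith.trunci; the handle names are illustrative and %arg0 is assumed to be the block argument of the enclosing transform sequence.

```mlir
// Handle to the loop, then to the cast pair nested inside it.
%loop = transform.structured.match ops{["scf.for"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
%ext = transform.structured.match ops{["arith.extsi"]} in %loop
    : (!transform.any_op) -> !transform.any_op
%trunc = transform.structured.match ops{["arith.trunci"]} in %loop
    : (!transform.any_op) -> !transform.any_op
// Operand order follows the op syntax: extension, truncation, loop.
%new_loop = transform.air.hoist_cast_pair %ext, %trunc, %loop
    : (!transform.any_op, !transform.any_op, !transform.any_op) -> !transform.any_op
```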
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectOpInterface, MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| extension_op | TransformHandleTypeInterface instance |
| truncation_op | TransformHandleTypeInterface instance |
| loop_op | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.hoist_loop_invariant_transfers (transform::HoistLoopInvariantTransfersOp)
Discover and hoist all loop-invariant vector transfer read/write pairs
Syntax:
operation ::= `transform.air.hoist_loop_invariant_transfers` $scope_op `,` $loop_op attr-dict `:` functional-type(operands, results)
This transform takes handles to a scope operation and an scf.for loop inside it. It automatically discovers all vector.transfer_read/write pairs in the loop that:
Each discovered pair is hoisted out of the loop: the read is moved before the loop (with an iter_arg), and the write is moved after the loop. All necessary operand-producing operations (constants, affine.apply, etc.) are also hoisted to maintain SSA dominance.
Index equivalence is checked using areEquivalentIndices(), which handles direct SSA value equality, affine.apply ops with the same map and operands, and constant index equality.
This eliminates the need for fragile split_handle patterns that depend on the exact number and ordering of transfer operations, which can change with different unroll factors, tile sizes, or data types.
The op works across all matmul variants (BF16, I8, I16) and any unroll factor.
Example usage:
%herd = transform.structured.match ops{["air.herd"]} attributes{compute_herd}
in %arg0 : (!transform.any_op) -> !transform.any_op
%loop = ... // innermost scf.for loop
%updated_loop = transform.air.hoist_loop_invariant_transfers %herd, %loop
: (!transform.any_op, !transform.any_op) -> !transform.any_op
Requirements:
Returns a handle to the transformed loop.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectOpInterface, MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| scope_op | TransformHandleTypeInterface instance |
| loop_op | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.hoist_vector_transfer_pointers (transform::HoistVectorTransferPointersOp)
Optimize vector transfers by hoisting pointer computations out of loops
Syntax:
operation ::= `transform.air.hoist_vector_transfer_pointers` $target attr-dict `:` functional-type(operands, results)
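This transform takes a handle to an scf.for loop and optimizes the vector transfer operations (vector.transfer_read and vector.transfer_write) inside it by flattening the accessed memref to one dimension, computing a linear base pointer before the loop, and carrying that pointer through the loop as an iter_arg that advances by a constant stride.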
This optimization converts expensive multi-dimensional address calculations inside loops into simple “pointer + constant” arithmetic with iter_args, which is particularly beneficial for hardware accelerators with limited address computation capabilities.
Example with IV-dependent indices:
// Before:
scf.for %i = %c0 to %c8 step %c1 {
%val = vector.transfer_read %mem[%c0, %i], %pad
: memref<32x32xi16>, vector<8x8xi16>
// ... computation ...
vector.transfer_write %result, %mem[%c0, %i]
: vector<8x8xi16>, memref<32x32xi16>
}
// After:
%flat_mem = memref.collapse_shape %mem [[0, 1]] : memref<32x32xi16> into memref<1024xi16>
%base_ptr = affine.apply affine_map<(d0, d1) -> (d0 * 32 + d1)>(%c0, %c0)
%stride = arith.constant 1 : index
scf.for %i = %c0 to %c8 step %c1 iter_args(%ptr = %base_ptr) -> (index) {
%val_1d = vector.transfer_read %flat_mem[%ptr], %pad : memref<1024xi16>, vector<64xi16>
%val = vector.shape_cast %val_1d : vector<64xi16> to vector<8x8xi16>
// ... computation ...
%result_1d = vector.shape_cast %result : vector<8x8xi16> to vector<64xi16>
vector.transfer_write %result_1d, %flat_mem[%ptr] : vector<64xi16>, memref<1024xi16>
%next_ptr = arith.addi %ptr, %stride : index
scf.yield %next_ptr : index
}
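To see why the rewrite preserves addresses, the index arithmetic of the example above can be modelled in plain Python. This is a sketch that assumes a row-major, contiguous memref<32x32xi16>; `linear_offset`, `offsets_recomputed`, and `offsets_hoisted` are illustrative names and not part of mlir-air.

```python
def linear_offset(indices, shape):
    """Row-major offset of `indices` into a buffer with the given static shape."""
    offset = 0
    for idx, dim in zip(indices, shape):
        offset = offset * dim + idx
    return offset


def offsets_recomputed(shape, trip_count):
    """Before: the full address of %mem[0, i] is recomputed every iteration."""
    return [linear_offset((0, i), shape) for i in range(trip_count)]


def offsets_hoisted(shape, trip_count):
    """After: the base is computed once and carried like an iter_arg."""
    ptr = linear_offset((0, 0), shape)            # plays the role of %base_ptr
    stride = linear_offset((0, 1), shape) - ptr   # plays the role of %stride
    visited = []
    for _ in range(trip_count):
        visited.append(ptr)   # address used by the flattened read and write
        ptr += stride         # plays the role of %next_ptr
    return visited


print(offsets_recomputed((32, 32), 8) == offsets_hoisted((32, 32), 8))
```

Both functions visit the same sequence of flat offsets, but the hoisted form needs only one addition per iteration.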
Requirements:
Returns a handle to the transformed loop.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.linalg_promote (transform::LinalgPromoteOp)
Syntax:
operation ::= `transform.air.linalg_promote` $target attr-dict `:` functional-type(operands, results)
Promotes the specified operands of the target into a separate memory buffer
using the mlir::linalg::promoteSubViews utility.
This operation applies to Linalg ops that satisfy the
mlir::linalg::promoteSubviewsPrecondition, otherwise it fails.
When successful, several optimization passes are run on the resulting IR.
The returned handle points to the target operation, which is modified
in place.
The operation accepts as attributes the fields of
mlir::linalg::LinalgPromotionOptions. In addition, the memory space of the
allocated buffers can be specified with the memory_space attribute as
"L1", "L2" or "L3". The default memory space is L1.
Example:
%0 = transform.structured.match ops{["linalg.matmul"]} in %code : (!transform.any_op) -> !transform.any_op
%1 = transform.air.linalg_promote %0 {memory_space="L2", operands_to_promote=[0]} : (!transform.any_op) -> !transform.any_op
The group_size attribute is used to apply promotion to multiple
linalg ops. When group_size=N, the operands_to_promote attribute refers to
N payload operations at a time and the operand indices apply to the
operands of the N operations in the order they appear in the target handle.
For example,
%m = transform.structured.match ops{["linalg.matmul"]} in %f : (!transform.any_op) -> !transform.any_op
%fill = transform.structured.match ops{["linalg.fill"]} in %f : (!transform.any_op) -> !transform.any_op
%h = transform.merge_handles %fill, %m : !transform.any_op
// promote the output of the fill operation and the output of the matmul operation to L1 memory
transform.air.linalg_promote %h {"group_size"=2, "operands_to_promote"=[1,4], "memory_space"="L1"} : (!transform.any_op) -> !transform.any_op
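The way operands_to_promote indexes into a group can be illustrated with a small Python model. It assumes that indices count across the concatenated operand lists of the N ops in handle order, which is one reading of the description above; `resolve_group_operands` is a hypothetical helper and not an mlir-air function.

```python
def resolve_group_operands(operand_counts, operands_to_promote):
    """Map flat indices over a group of ops to (op position, operand index).

    operand_counts: number of operands of each op in the group, in handle order.
    operands_to_promote: flat indices into the concatenated operand lists.
    """
    resolved = []
    for flat in operands_to_promote:
        remaining = flat
        for op_pos, count in enumerate(operand_counts):
            if remaining < count:
                resolved.append((op_pos, remaining))
                break
            remaining -= count
        else:
            raise IndexError(f"operand index {flat} is out of range for the group")
    return resolved


# A two-op group whose first op has 2 operands and whose second has 3.
print(resolve_group_operands([2, 3], [1, 4]))  # [(0, 1), (1, 2)]
```

Under this assumption, index 1 selects operand 1 of the first op in the group and index 4 selects operand 2 of the second op.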
Interfaces: MemoryEffectOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| operands_to_promote | ::mlir::ArrayAttr | 64-bit integer array attribute |
| group_size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| use_full_tile_buffers | ::mlir::ArrayAttr | 1-bit boolean array attribute |
| use_full_tiles_by_default | ::mlir::UnitAttr | unit attribute |
| use_alloca | ::mlir::UnitAttr | unit attribute |
| alignment | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::StringAttr | string attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| transformed | TransformHandleTypeInterface instance |

transform.air.linalg_tile (transform::LinalgTileOp)
Tile a linalg operation with the given sizes. The new linalg
operation and the generated loop are returned. Tiling is
performed with transform::tileToForallOpImpl so that an
scf.forall loop is generated whenever possible.
This is a variant of transform.structured.tile_using_forall.
Interfaces: MemoryEffectOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| static_sizes | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |
| dynamic_sizes | variadic of TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| tiled_linalg_op | TransformHandleTypeInterface instance |
| loops | TransformHandleTypeInterface instance |

transform.air.linalg_to_library_call (transform::LinalgToLibraryCallOp)
Convert a linalg op to a function call (library call)
Syntax:
operation ::= `transform.air.linalg_to_library_call` $target attr-dict `:` functional-type(operands, results)
Replaces a linalg op with a call to a function. If the function_name
attribute is provided, it is used as the function name. Otherwise, the
linalg op’s library_call attribute is used. The function is created if
it does not exist. If the link_with attribute is provided, it is used
to link the function call to a prebuilt object that contains the
implementation of the function. If the linalg op is inside a herd, the
link_with attribute is propagated to the herd.
Example:
%matmul = transform.structured.match ops{["linalg.matmul"]} in %f : (!transform.any_op) -> !transform.any_op
%call = transform.air.linalg_to_library_call %matmul { function_name = "my_matmul", link_with = "extern_func.o" } : (!transform.any_op) -> !transform.any_op
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| function_name | ::mlir::StringAttr | string attribute |
| link_with | ::mlir::StringAttr | string attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.normalize_for_bounds (transform::NormalizeForBoundsOp)
Normalize scf.for loop bounds by folding affine.apply on induction variable
Syntax:
operation ::= `transform.air.normalize_for_bounds` $target attr-dict `:` functional-type(operands, results)
This transform normalizes an scf.for loop by folding affine.apply operations that multiply the induction variable by a constant factor into the loop bounds.
The transformation looks for patterns where the induction variable is multiplied by a constant via affine.apply, and folds this multiplication into the loop bounds to eliminate the affine.apply operation.
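For example, if a loop iterates with bounds (0, 64, step=8) and has an affine.apply that scales the induction variable by 8 (i.e., affine_map<(d0) -> (d0 * 8)>), the transformation multiplies the lower bound, upper bound, and step by 8, giving (0, 512, step=64), and uses the induction variable directly in place of the affine.apply result.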
Example:
// Before:
scf.for %i = %c0 to %c64 step %c8 {
%scaled = affine.apply affine_map<(d0) -> (d0 * 8)>(%i)
// ... uses of %scaled ...
}
// After:
scf.for %i = %c0 to %c512 step %c64 {
// ... uses of %i directly (scaled bounds) ...
}
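The arithmetic behind the example above can be checked with a short Python sketch. It assumes a pure multiplication by a positive constant and models the loop with `range`; `normalize_for_bounds` here is an illustrative function, not the transform's implementation.

```python
def normalize_for_bounds(lb, ub, step, factor):
    """Fold `iv * factor` into the loop bounds by scaling all three of them."""
    return lb * factor, ub * factor, step * factor


# Before: the loop body sees %scaled = %i * 8 for %i in (0, 64, step=8).
before = [i * 8 for i in range(0, 64, 8)]

# After: the induction variable itself takes the scaled values.
new_lb, new_ub, new_step = normalize_for_bounds(0, 64, 8, 8)
after = list(range(new_lb, new_ub, new_step))

print((new_lb, new_ub, new_step))  # (0, 512, 64)
print(before == after)             # True
```

Scaling the lower bound, upper bound, and step by the same factor keeps the trip count unchanged, so the body observes exactly the values it previously obtained from the affine.apply.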
The transform supports:
(d0) -> (d0 * constant)
Returns a handle to the transformed loop.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.override_memref_memory_space (transform::OverrideMemRefMemorySpaceOp)
Override memref memory spaces within a target operation scope
Syntax:
operation ::= `transform.air.override_memref_memory_space` $target attr-dict `:` functional-type(operands, results)
Overrides the memory space of all memref.alloc operations within the target
operation to the specified memory_space value. The scope is inferred from
the target operation type:
air.herd -> overrides allocs inside the herd (L1)
air.segment -> overrides allocs inside the segment but NOT inside herds (L2)
air.launch -> overrides allocs inside the launch but NOT inside segments/herds
func.func -> overrides allocs inside the function but NOT inside launch/segment/herds
This exclusive scoping allows assigning different memory spaces at different hierarchy levels by invoking the op multiple times with different targets.
After overriding alloc types, the op also:
This is the transform dialect equivalent of the air-override-memref-memory-space pass.
Example:
// Override herd allocs to L1 (memory_space 2)
%herd = transform.structured.match ops{["air.herd"]} in %arg1
: (!transform.any_op) -> !transform.any_op
transform.air.override_memref_memory_space %herd {memory_space = 2}
: (!transform.any_op) -> !transform.any_op
// Override func-level allocs to L2 (memory_space 1), excluding herds
%func = transform.structured.match ops{["func.func"]} in %arg1
: (!transform.any_op) -> !transform.any_op
transform.air.override_memref_memory_space %func {memory_space = 1}
: (!transform.any_op) -> !transform.any_op
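The exclusive scoping rule can be modelled as a tree walk that stops at any deeper level of the AIR hierarchy. The sketch below is a simplified model, not the op's implementation: the `Op` class, the `HIERARCHY` list, and `allocs_in_scope` are hypothetical names, and allocations are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import List

# Hierarchy levels from outermost to innermost, as listed in the description.
HIERARCHY = ["func.func", "air.launch", "air.segment", "air.herd"]


@dataclass
class Op:
    kind: str
    allocs: List[str] = field(default_factory=list)
    children: List["Op"] = field(default_factory=list)


def allocs_in_scope(target: Op) -> List[str]:
    """Collect allocs under `target`, skipping deeper hierarchy levels."""
    excluded = set(HIERARCHY[HIERARCHY.index(target.kind) + 1:])

    def walk(op: Op) -> List[str]:
        found = list(op.allocs)
        for child in op.children:
            if child.kind not in excluded:
                found.extend(walk(child))
        return found

    return walk(target)


herd = Op("air.herd", allocs=["a_herd"])
loop = Op("scf.for", allocs=["a_loop"])
segment = Op("air.segment", allocs=["a_seg"], children=[loop, herd])
launch = Op("air.launch", allocs=["a_launch"], children=[segment])
func = Op("func.func", allocs=["a_func"], children=[launch])

print(allocs_in_scope(segment))  # ['a_seg', 'a_loop']
```

Because each target sees a disjoint set of allocations, invoking the op once per hierarchy level assigns every allocation exactly one memory space in this model.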
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| memory_space | ::mlir::IntegerAttr | 32-bit signless integer attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.par_to_herd (transform::ParToHerdOp)
Syntax:
operation ::= `transform.air.par_to_herd` $target attr-dict `:` functional-type(operands, results)
Transform an scf.parallel operation into an air.herd operation.
If the scf.parallel operation has more than two dimensions, then only
the last two are used and a new scf.parallel is created outside of the
herd. Returns the new air.herd operation.
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| first_dim | ::mlir::IntegerAttr | 64-bit signless integer attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.par_to_launch (transform::ParToLaunchOp)
Syntax:
operation ::= `transform.air.par_to_launch` $target attr-dict `:` functional-type(operands, results)
Transform an scf.parallel operation into an air.launch operation.
Returns the new air.launch operation.
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| has_air_segment | ::mlir::BoolAttr | bool attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.par_to_segment (transform::ParToSegmentOp)
Syntax:
operation ::= `transform.air.par_to_segment` $target attr-dict `:` functional-type(operands, results)
Transform an scf.parallel operation into an air.segment operation.
Returns the new air.segment operation.
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| has_air_segment | ::mlir::BoolAttr | bool attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.pipeline_reduce (transform::PipelineReduceOp)
Syntax:
operation ::= `transform.air.pipeline_reduce` $target attr-dict `:` functional-type(operands, results)
Experimental
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| tile_size | ::mlir::ArrayAttr | 64-bit integer array attribute |
| pipeline_depth | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| direction | ::mlir::StringAttr | string attribute |
| promote | ::mlir::UnitAttr | unit attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.remove_uninitialized_copy (transform::RemoveUninitializedCopyOp)
Remove copy operations that copy from uninitialized memrefs
Syntax:
operation ::= `transform.air.remove_uninitialized_copy` $target attr-dict `:` functional-type(operands, results)
This transform walks through a func.func operation and identifies memref.copy and linalg.copy operations where the source is an uninitialized memref (allocated but not written to). Such copy operations are erased as they copy undefined data.
The transform detects the pattern where:
Returns a handle to the modified function.
Examples:
// memref.copy case
%alloc = memref.alloc() : memref<2x16x8xi32, 1>
%subview = memref.subview %alloc[0, 0, 0] [1, 16, 8] [1, 1, 1] : ...
%target = memref.alloc() : memref<1x16x8xi32, 2>
memref.copy %subview, %target // <- This copy will be erased
// linalg.copy case
%alloc2 = memref.alloc() : memref<16x8xi32, 1>
%target2 = memref.alloc() : memref<16x8xi32, 2>
linalg.copy ins(%alloc2 : memref<16x8xi32, 1>) outs(%target2 : memref<16x8xi32, 2>) // <- This copy will be erased
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.segment_to_aie (transform::SegmentToAIEOp)
Syntax:
operation ::= `transform.air.segment_to_aie` $target attr-dict `:` functional-type(operands, results)
Lower air.segment operations to mlir-aie modules.
Traits: FunctionalStyleTransformOpTrait, TransformEachOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| transformed | TransformHandleTypeInterface instance |

transform.air.transpose_reduce (transform::TransposeReduceOp)
Transpose inputs of linalg.reduce ops to make reduction dimensions innermost
Syntax:
operation ::= `transform.air.transpose_reduce` $target attr-dict `:` functional-type(operands, results)
This transform takes a handle to linalg.reduce operations and checks if the reduction dimensions are at the innermost (last/lowest) dimensions. If any reduction dimension has non-reduction dimensions to the right, it transposes the corresponding inputs to ensure all reduction dimensions are innermost.
For example, if a linalg.reduce operation reduces along dimension 1 in a 3D tensor (shape [M, N, K] reducing along N), this transform will transpose the input to [M, K, N] so that the reduction dimension N becomes innermost.
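The permutation in this example can be derived mechanically. The following Python sketch assumes that non-reduction dimensions keep their relative order and that reduction dimensions are appended in ascending order; this matches the [M, N, K] example but is not confirmed for every case. The function names are illustrative only.

```python
def reduction_innermost_permutation(rank, reduction_dims):
    """Permutation that moves the reduction dims to the innermost positions."""
    reduction = sorted(set(reduction_dims))
    kept = [d for d in range(rank) if d not in reduction]
    return kept + reduction


def needs_transpose(rank, reduction_dims):
    """True if some reduction dim has a non-reduction dim to its right."""
    return reduction_innermost_permutation(rank, reduction_dims) != list(range(rank))


def permute(shape, perm):
    """Apply a permutation to a shape: result[i] = shape[perm[i]]."""
    return [shape[p] for p in perm]


perm = reduction_innermost_permutation(3, [1])
print(perm)                             # [0, 2, 1]
print(permute(["M", "N", "K"], perm))   # ['M', 'K', 'N']
```

When the reduction dimensions are already innermost, the computed permutation is the identity and no transpose is needed.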
This optimization is beneficial for hardware accelerators that perform more efficient reductions when the reduction dimensions are contiguous and innermost.
The transformation:
Returns a handle to the transformed linalg.reduce operations.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |

transform.air.vector_type_cast (transform::VectorTypeCastOp)
Cast vector operands and results of vector operations to a user-provided datatype
Syntax:
operation ::= `transform.air.vector_type_cast` $target attr-dict `:` functional-type(operands, results)
This transform takes a handle to vector dialect operations and casts input operands and/or results of vector type to a user-provided datatype. By default, if none of input_indices or output_indices are specified, all vector operands and results are cast.
The transformation works by:
This optimization is useful for hardware accelerators that can perform vector operations natively on specific data types (e.g., bf16, f16) while maintaining compatibility with the original precision through selective casting.
Special handling for single-element vectors: Operations where ALL vector operands and results have exactly one element are skipped to avoid unnecessary type conversions that provide no performance benefit. Note that operations with mixed vector sizes (e.g., vector.multi_reduction with a multi-element input and single-element accumulator) are still transformed, as they contain at least one multi-element vector.
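The skip rule for single-element vectors amounts to a simple predicate over the shapes of an operation's vector operands and results. The Python sketch below models only that rule; `is_skipped` is a hypothetical name, and representing each vector type by its shape tuple is an assumption of the model.

```python
from math import prod


def is_skipped(vector_shapes):
    """Return True if every vector operand and result has exactly one element.

    vector_shapes: shapes of all vector-typed operands and results of one op,
    for example [(8,), (8,), (8,), (8,)] for a vector.fma on vector<8xf32>.
    """
    return all(prod(shape) == 1 for shape in vector_shapes)


print(is_skipped([(1,), (1, 1)]))   # True: only single-element vectors
print(is_skipped([(8,), (1,)]))     # False: mixed sizes are still transformed
```

A single multi-element vector is enough to keep the operation eligible, which corresponds to the mixed-size vector.multi_reduction case described above.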
Example 1 - Cast all inputs and outputs (default behavior):
// Before:
%result = vector.fma %a, %b, %c : vector<8xf32>
// After (with target_element_type = f16):
%a_cast = arith.truncf %a : vector<8xf32> to vector<8xf16>
%b_cast = arith.truncf %b : vector<8xf32> to vector<8xf16>
%c_cast = arith.truncf %c : vector<8xf32> to vector<8xf16>
%result_f16 = vector.fma %a_cast, %b_cast, %c_cast : vector<8xf16>
%result = arith.extf %result_f16 : vector<8xf16> to vector<8xf32>
Example 2 - Cast only specific inputs:
// Before:
%result = vector.fma %a, %b, %c : vector<8xf32>
// After (with target_element_type = f16, input_indices = [0, 1]):
%a_cast = arith.truncf %a : vector<8xf32> to vector<8xf16>
%b_cast = arith.truncf %b : vector<8xf32> to vector<8xf16>
%result_f16 = vector.fma %a_cast, %b_cast, %c : vector<8xf16, f32, f32>
%result = arith.extf %result_f16 : vector<8xf16> to vector<8xf32>
Example 3 - Cast only outputs:
// Transform only the output
transform.air.vector_type_cast %op {
target_element_type = f16,
output_indices = [0]
} : (!transform.any_op) -> !transform.any_op
Attributes:
Returns a handle to the modified operations containing the transformed vector operations.
Traits: FunctionalStyleTransformOpTrait
Interfaces: MemoryEffectsOpInterface, TransformOpInterface
| Attribute | MLIR Type | Description |
|---|---|---|
| target_element_type | ::mlir::TypeAttr | any type attribute |
| input_indices | ::mlir::ArrayAttr | 64-bit integer array attribute |
| output_indices | ::mlir::ArrayAttr | 64-bit integer array attribute |

| Operand | Description |
|---|---|
| target | TransformHandleTypeInterface instance |

| Result | Description |
|---|---|
| result | TransformHandleTypeInterface instance |