In-place transform. Internally uses separate input and output ObjectFifos,
but fills from and drains to the same tensor.
Args:
func: Function to apply, either a lambda/callable or ExternalFunction.
For ExternalFunction, arg_types should be [input_tile, output_tile, *params]
tensor: The tensor to transform in place.
*params: Additional parameters for ExternalFunction only.
Scalar dtypes (np.int32, etc.) are passed as MLIR constants;
array types are transferred via ObjectFifos.
tile_size: Size of each tile processed by a worker (default: 16)
Example:
# The kernel takes separate in/out tile buffers, but only one tensor is passed in.
scale = ExternalFunction("scale", arg_types=[tile_ty, tile_ty, scalar_ty, np.int32], ...)
for_each(scale, tensor, factor, tile_size)
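Reference semantics:
The in-place behavior described above can be modeled on the host with NumPy alone. This is only a sketch of the intended data flow, not the actual implementation: `for_each_reference` and the Python `scale` kernel are hypothetical stand-ins, no ObjectFifos or workers are involved, and the tensor is assumed to be C-contiguous with a size that is a multiple of tile_size.

```python
import numpy as np


def for_each_reference(func, tensor, *params, tile_size=16):
    """Hypothetical host-side model of the in-place tiling (not the real API)."""
    # Assumption: contiguous input, so reshape(-1) yields a view, not a copy.
    assert tensor.flags.c_contiguous, "sketch assumes a C-contiguous tensor"
    flat = tensor.reshape(-1)
    assert flat.size % tile_size == 0, "sketch assumes whole tiles only"
    for start in range(0, flat.size, tile_size):
        stop = start + tile_size
        in_tile = flat[start:stop].copy()   # separate input buffer (fill side)
        out_tile = np.empty_like(in_tile)   # separate output buffer (drain side)
        func(in_tile, out_tile, *params)    # kernel sees [in_tile, out_tile, *params]
        flat[start:stop] = out_tile         # drain back into the same tensor
    return tensor


def scale(in_tile, out_tile, factor, n):
    """Stand-in kernel: multiply the first n elements of the tile by factor."""
    out_tile[:n] = in_tile[:n] * factor


data = np.arange(32, dtype=np.int32)
for_each_reference(scale, data, 3, 16, tile_size=16)
```

After the call, `data` holds the scaled values. The kernel never writes to its input tile, yet the caller passes and receives a single tensor.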