====================
GGUF Export
====================

GGML and GGUF have established themselves as a popular library and format to quantize and export LLMs, with several runtimes able to read GGUF and run optimized inference. In its current state, GGUF provides a lot of flexibility in terms of representation, with several quantization options, but it also has some limitations, such as:

* No graph representation during export
* Mostly focused on weight-only quantization
* Limited optimization possibilities during QAT and/or PTQ

In Brevitas, we are progressively adding better support for GGUF export. Currently the supported formats are:

* Q8_0
* Q4_0
* Q4_1
* Q4_K

The first three formats can be obtained through our LLM entrypoint (`brevitas_ptq_llm`), while all of them can be targeted through direct quantization. The specification for these formats can be found `here `_.

LLM Entrypoint
==============

Brevitas' LLM entrypoint allows the user to load, quantize, test, and export many of the LLMs available on HuggingFace by simply passing a series of command line arguments that control, among other things:

* Weight and activation bit widths
* Weight and activation quantization format (int vs. float, asymmetric vs. symmetric, etc.)
* PTQ algorithms to apply and their options
* and much more...

We have recently added the possibility to export directly to GGUF after quantization, targeting Q4_0 and Q4_1. In terms of command line arguments, this corresponds to the following configurations:

.. code-block:: bash

    brevitas_ptq_llm --model org/model --weight-bit-width 4 --weight-quant-type sym \
        --weight-quant-granularity per_group --weight-group-size 32 --export-target gguf:q4_0

.. code-block:: bash

    brevitas_ptq_llm --model org/model --weight-bit-width 4 --weight-quant-type asym \
        --weight-quant-granularity per_group --weight-group-size 32 --export-target gguf:q4_1

If the activation bit width is not specified, weight-only quantization is performed. These commands produce quantized models without any extra pre-processing, but several PTQ algorithms are compatible with weight-only quantization. Among these:

* GPTQ
* AWQ
* Learned Round
* QuaRot/SpinQuant (without any online Hadamard rotation)
* MagR

Once the model is exported, it can be used like any other GGUF model, for example with `llama-cpp-python` as sketched below.
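As a minimal sketch (assuming `llama-cpp-python` is installed separately; the GGUF path and prompt are placeholders, since the exact output filename depends on your export settings):

.. code-block:: python

    from llama_cpp import Llama

    # Placeholder path: point this at the GGUF file produced by the export step above
    llm = Llama(model_path="/path/to/exported/model.gguf")

    # Run a short completion to sanity-check the quantized model
    out = llm("The quick brown fox", max_tokens=16)
    print(out["choices"][0]["text"])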
Direct export and quantized scales/zero_points
==============================================

Although our LLM entrypoint is highly customizable, it still does not expose all the flexibility that Brevitas offers. In particular, Brevitas makes it easy to quantize scales and zero points, matching the newer GGUF formats like Q4_K. This can be done by defining three custom quantizers. The first two specify how the scales and the zero points must be quantized (i.e., for the super-blocks in the _K GGUF formats). The third one is the base quantizer combining everything together. This looks like:

.. code-block:: python

    import torch

    # The Brevitas symbols used below (ExtendedInjector, value, this, ScalingPerOutputType,
    # SCALAR_SHAPE, Int8WeightPerTensorFloat, ShiftedUint8WeightPerTensorFloat,
    # RescalingIntQuant, FloatRestrictValue, QuantRestrictValue,
    # GroupwiseWeightQuantProxyFromInjector, _ScaleShiftQuantZeroPoint) are assumed to be
    # imported from their respective Brevitas modules.


    class ShapeMixin(ExtendedInjector):

        @value
        def scaling_shape(
                scaling_per_output_type,
                scaling_per_output_channel_shape,
                expanded_groupwise_shape,
                group_dim):
            if scaling_per_output_type == ScalingPerOutputType.TENSOR:
                scaling = SCALAR_SHAPE
            elif scaling_per_output_type == ScalingPerOutputType.CHANNEL:
                scaling = scaling_per_output_channel_shape
            elif scaling_per_output_type == ScalingPerOutputType.GROUP:
                # Scaling shape is like expanded_groupwise_shape but has 1 in position group_dim + 1
                assert expanded_groupwise_shape is not None, "Per Group scaling not correctly configured"
                assert group_dim is not None, "Per Group scaling not correctly configured"
                size = list(expanded_groupwise_shape)
                size[group_dim + 1] = 1
                return tuple(size)

            return scaling


    # 6-bit unsigned quantizer applied to the scales of the weight quantizer (super-block scales)
    class QuantScalingInt(Int8WeightPerTensorFloat, ShapeMixin):
        bit_width = 6
        module = (this << 1).module
        rescaling_int_quant = RescalingIntQuant
        group_size = 8
        scaling_per_output_type = ScalingPerOutputType.GROUP
        upstream_shape = (this << 1).scaling_shape
        signed = False

        @value
        def tracked_parameter_list(upstream_shape):
            return [torch.empty(upstream_shape)]


    # 6-bit unsigned quantizer applied to the zero points of the weight quantizer
    class QuantZPInt(Int8WeightPerTensorFloat, ShapeMixin):
        module = (this << 1).module
        rescaling_int_quant = RescalingIntQuant
        restrict_threshold_impl = FloatRestrictValue
        bit_width = 6
        scaling_per_output_type = ScalingPerOutputType.GROUP
        group_size = 8
        upstream_shape = (this << 1).zero_point_shape
        signed = False

        @value
        def tracked_parameter_list(upstream_shape):
            return [torch.empty(upstream_shape)]


    # 4-bit asymmetric group-wise weight quantizer with quantized scales and zero points
    class QuantScaleQuantZPInt8WeightPerTensorFloat(ShiftedUint8WeightPerTensorFloat):
        proxy_class = GroupwiseWeightQuantProxyFromInjector
        scaling_quant = QuantScalingInt
        zp_quant = QuantZPInt
        restrict_scaling_impl = QuantRestrictValue
        scaling_per_output_type = ScalingPerOutputType.GROUP
        restrict_threshold_impl = FloatRestrictValue
        scale_shift_zero_point_impl = _ScaleShiftQuantZeroPoint
        group_size = 32
        bit_width = 4

        @value
        def restrict_value_float_to_int_impl():
            return this.scaling_quant.rescaling_int_quant

        @value
        def zp_int_quant():
            return this.zp_quant.rescaling_int_quant

        @value
        def scale_dequantized_shape(scaling_per_output_type, scaling_shape):
            if scaling_per_output_type in (ScalingPerOutputType.TENSOR, ScalingPerOutputType.CHANNEL):
                return None
            elif scaling_per_output_type == ScalingPerOutputType.GROUP:
                return scaling_shape

        @value
        def zero_point_dequantized_shape(scaling_per_output_type, zero_point_shape):
            if scaling_per_output_type in (ScalingPerOutputType.TENSOR, ScalingPerOutputType.CHANNEL):
                return None
            elif scaling_per_output_type == ScalingPerOutputType.GROUP:
                return zero_point_shape

The intuition behind these quantizers is as follows: `QuantScaleQuantZPInt8WeightPerTensorFloat` is the baseline quantizer, with asymmetric group-wise quantization at 4 bits. This quantizer specifies two classes used for scale and zero-point quantization:

* `restrict_scaling_impl`, set to `QuantRestrictValue`, which is responsible for the scale
* `scale_shift_zero_point_impl`, set to `_ScaleShiftQuantZeroPoint`, which is responsible for the zero point

In order to construct these two classes through dependency injection, we need to define `restrict_value_float_to_int_impl` and `zp_int_quant`, which is done through two `value` functions, a detail of the dependency injection package we use in Brevitas (for more info, check our `Anatomy of a quantizer tutorial`).
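To make the mechanism a bit more concrete, here is a minimal, standalone sketch of a `value` function. It is not part of the GGUF quantizers above, and it only assumes that `ExtendedInjector` and `value` are importable from `brevitas.inject`, as in Brevitas' own quantizers: an attribute is computed from other attributes of the same injector at resolution time.

.. code-block:: python

    from brevitas.inject import ExtendedInjector, value


    class ToyQuantizer(ExtendedInjector):
        bit_width = 4

        @value
        def max_int(bit_width):
            # Derived attribute: the injector resolves `bit_width` and passes it in,
            # so overriding `bit_width` in a subclass automatically updates `max_int`.
            # Here it is the largest positive integer of a signed bit_width-bit quantizer.
            return (1 << (bit_width - 1)) - 1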
In the main quantizer, these two value functions select the object to instantiate from two other variables, `scaling_quant` and `zp_quant`. These two variables contain the scale and zero-point quantizers, respectively:

* `QuantScalingInt`
* `QuantZPInt`

For all practical purposes, these two quantizers behave exactly like any other Brevitas quantizer. The main exception is that they are not attached directly to a layer, but rather to another quantizer. Starting from a standard 8-bit integer quantizer, some parameters are re-defined to match the `Q4_K recipe `_, in particular:

* Group-wise quantization
* Group size equal to 8 (i.e., the super-block is composed of 8 blocks of 32 elements)
* Bit width set to 6
* Unsigned quantization, as we assume that both scales and zero points are positive

The other elements are needed for a correct definition of a Brevitas quantizer through dependency injection, and can be left as they are.

Once these quantizers have been defined, it is possible to manually create a quant layer as follows:

.. code-block:: python

    import brevitas.nn as qnn

    qnn.QuantLinear(
        IN_CHANNEL, OUT_CHANNEL, weight_quant=QuantScaleQuantZPInt8WeightPerTensorFloat)

Alternatively, it is possible to programmatically quantize your network with these quantizers:

.. code-block:: python

    model = ...  # the float model to quantize

    layer_map = {}
    layer_map[torch.nn.Linear] = (
        qnn.QuantLinear,
        {'weight_quant': QuantScaleQuantZPInt8WeightPerTensorFloat})
    model = layerwise_quantize(
        model=model, compute_layer_map=layer_map)

After quantization is applied, all the considerations made above for PTQ still hold, and QAT is also a possibility. Changing the `weight_scaling_impl_type` in the scale and zero-point quantizers to `parameter_from_stats` should also make it possible to learn the scale factors of the scale and zero point with QAT, although this is not tested.

After the model is quantized, it is possible to export it with the following:

.. code-block:: python

    from brevitas_examples.llm.gguf_export.export import save_quantized_as_gguf

    save_quantized_as_gguf(
        "/path/to/exported/model",
        model=model.cpu(),
        backend="gguf:q4_k_s",
        tokenizer=tokenizer)

FAQ
===

* *How to export in GGUF format X?*

  If you want to quantize and export to a GGUF format that is not currently supported, feel free to open an issue. In general, the indications above, combined with the export code itself, should provide a solid blueprint to add new export formats, especially similar ones like Q3_K.

* *How to export in Q4_K but still do PTQ?*

  We plan to expand the options available in our LLM entrypoint, but introducing scale and zero-point quantization could limit the readability and usability of the script. If you want to target Q4_K and apply one or more of the algorithms we propose, follow the same style used in the entrypoint and write your own quantization script, focusing on the configurations that are of interest to you.

* *The accuracy/quality of the models seems worse compared to other GGUF models. Is this normal?*

  Generally speaking, different quantizers have different ways of achieving the same target format. If the quality you get is not up to your expectations, feel free to try some of the PTQ algorithms suggested above. If that still does not help, please open an issue and we will be more than happy to look into it.