Machine Learning/Quantization ROCm
Quantization Overview
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows models to run on embedded devices, which sometimes only support integer data types. Source: Hugging Face Quantization Documentation
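As a minimal illustration of the idea (not tied to any of the libraries below), the following sketch quantizes a small float32 tensor to int8 with a single symmetric scale and then dequantizes it; the tensor values are made up for the example:

import torch

# Toy float32 weights (made-up values, purely for illustration)
w = torch.tensor([0.42, -1.30, 0.07, 2.15], dtype=torch.float32)

# Symmetric int8 quantization: a single scale maps the largest magnitude to 127
scale = w.abs().max() / 127
w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)

# Dequantize to inspect the rounding error introduced by the lower precision
w_dequant = w_int8.float() * scale
print(w_int8)                        # int8 storage: 4x smaller than float32
print((w - w_dequant).abs().max())   # maximum quantization error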
Goals of Quantization
In our case, we have two main goals when incorporating quantization:
- Better resource utilization
- Faster inference
Methods
We have three primary ways of performing quantization using ROCm software and AMD GPUs, as explored by the ML team in [1]. These methods focus on text models:
- AWQ
- GPTQ
- bitsandbytes
From the above methods, only bitsandbytes performs on-the-fly quantization, while the other two methods require quantizing and calibrating the weights of the model to minimize the error compared to the original (vanilla) version. This calibration process uses a dataset, with the C4 dataset being the most widely used. The choice of samples for calibration is critical, particularly for multilingual models, as a diverse set of samples leads to better quantization results.
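As a hedged sketch of what calibration data selection can look like, the snippet below pulls a small mix of English and French samples from allenai/c4; the per-language configurations ("en", "fr") and the sample counts are illustrative assumptions, not a recommendation from [1]:

from datasets import load_dataset

# Illustrative only: mix English and French samples from allenai/c4,
# assuming the per-language (mC4) configurations are available on the Hub
en_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
fr_stream = load_dataset("allenai/c4", "fr", split="train", streaming=True)

calibration_texts = (
    [sample["text"] for _, sample in zip(range(512), en_stream)]
    + [sample["text"] for _, sample in zip(range(512), fr_stream)]
)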
GPTQModel
The AutoGPTQ library has been the de facto method for applying GPTQ quantization. However, as noted in its documentation, development has stopped. Instead, GPTQModel should be used.
GPTQModel has been tested on ROCm 6.2+ versions.
Setting Up the Environment
You will need a PyTorch version built for ROCm. Install it using the following command:
pip install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/rocm6.2/
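To confirm that the ROCm build is actually the one in your environment, a quick check (output values will vary by setup) is:

import torch

print(torch.__version__)          # should show a +rocm suffix for ROCm wheels
print(torch.version.hip)          # ROCm/HIP version string; None on CUDA-only builds
print(torch.cuda.is_available())  # True when an AMD GPU is visible via the HIP backend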
Alternatively, you can build the GPTQModel package from source:
git clone https://github.com/ModelCloud/GPTQModel && cd GPTQModel
ROCM_VERSION=6.2 python -m build --no-isolation --wheel .
This process will also produce a Python wheel file that can be reused in similar environments.
Quantization Process
Here is an example of how to quantize a model using GPTQModel:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "/srv/hf-cache/hub/models--CohereForAI--aya-expanse-32b/snapshots/edd25f1107cd806b3fb779c61c210d804202f5ce/"  # Example model path
quant_path = "aya-expanse-32b-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id_or_path=model_id, quantize_config=quant_config)

# Increase batch_size based on GPU/VRAM specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)

model.save(quant_path)
Using the Quantized Model
The quantized model can then be loaded and used for generation:
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0]
# Decode the generated token ids back to text
print(model.tokenizer.decode(result))
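Because the GPTQ settings are stored in the saved checkpoint's configuration, the same quantized directory can usually also be loaded through the standard Transformers API. The sketch below assumes optimum and gptqmodel are installed in the environment and reuses the quant_path defined above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

inputs = tokenizer("Uncovering deep insights begins with", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))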
Example Output
During quantization, the process logs information about the model size and layer-wise loss. An example summary for a 32B model (e.g., aya-expanse-32b) is shown below:
INFO - Pre-Quantized model size: 61600.68MB, 60.16GB
INFO - Quantized model size: 18974.52MB, 18.53GB
INFO - Size difference: 42626.16MB, 41.63GB - 69.20%
INFO - Effective Quantization BPW (bits per weight): 4.25 bpw, based on [bits: 4, group_size: 128]
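The 4.25 bpw figure is consistent with the 4 quantized bits per weight plus per-group metadata: assuming a 16-bit scale and a 16-bit zero point are stored for every group of 128 weights, the overhead is 32 / 128 = 0.25 extra bits per weight, giving 4 + 0.25 = 4.25 bpw.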
AWQ
bitsandbytes
bitsandbytes quantizes weights on the fly at model load time, so unlike AWQ and GPTQ it does not require a separate calibration step.
Installation for ROCm
There are two options for using bitsandbytes with ROCm:
- Build from source
- Install the multi-backend-refactor release that supports ROCm
# Note: Add --no-deps to avoid reinstalling dependencies
pip install --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.45.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
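Once a ROCm-capable bitsandbytes build is installed, on-the-fly quantization is typically driven through the Transformers integration. A minimal sketch follows; the model name and 4-bit settings are illustrative, not a recommendation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CohereForAI/aya-expanse-32b"  # illustrative; any causal LM works

# Example 4-bit NF4 configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Weights are quantized on the fly while the model loads; no calibration dataset is needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)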