Machine Learning/Quantization ROCm
Quantization Overview
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows models to run on embedded devices, which sometimes only support integer data types. Source: Hugging Face Quantization Documentation
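As a minimal illustration of the idea (not tied to any of the libraries below), the following sketch quantizes a small float32 tensor to int8 with a single symmetric scale and then dequantizes it; the tensor values are made up for the example:

import torch

# Toy float32 weights (made-up values, purely for illustration)
w = torch.tensor([0.42, -1.30, 0.07, 2.15], dtype=torch.float32)

# Symmetric int8 quantization: a single scale maps the largest magnitude to 127
scale = w.abs().max() / 127
w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)

# Dequantize to inspect the rounding error introduced by the lower precision
w_dequant = w_int8.float() * scale
print(w_int8)                        # int8 storage: 4x smaller than float32
print((w - w_dequant).abs().max())   # maximum quantization error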
Goals of Quantization
In our case, we have two main goals when incorporating quantization:
- Better resource utilization
- Faster inference
Methods
We have three primary ways of performing quantization using ROCm software and AMD GPUs, as explored by the ML team in [1]. These methods focus on text models:
- AWQ
- GPTQ
- bitsandbytes
From the above methods, only bitsandbytes performs on-the-fly quantization, while the other two methods require quantizing and calibrating the weights of the model to minimize the error compared to the original (vanilla) version. This calibration process uses a dataset, with the C4 dataset being the most widely used. The choice of samples for calibration is critical, particularly for multilingual models, as a diverse set of samples leads to better quantization results.
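As a hedged sketch of what calibration data selection can look like, the snippet below pulls a small mix of English and French samples from allenai/c4; the per-language configurations ("en", "fr") and the sample counts are illustrative assumptions, not a recommendation from [1]:

from datasets import load_dataset

# Illustrative only: mix English and French samples from allenai/c4,
# assuming the per-language (mC4) configurations are available on the Hub
en_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
fr_stream = load_dataset("allenai/c4", "fr", split="train", streaming=True)

calibration_texts = (
    [sample["text"] for _, sample in zip(range(512), en_stream)]
    + [sample["text"] for _, sample in zip(range(512), fr_stream)]
)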
GPTQModel
The AutoGPTQ library has been the de facto method for applying GPTQ quantization. However, as noted in its documentation, development has stopped. Instead, GPTQModel should be used.
GPTQModel has been tested on ROCm 6.2+ versions.
Setting Up the Environment
You will need a PyTorch version built for ROCm. Install it using the following command:
pip install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/rocm6.2/
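To confirm that the ROCm build is actually the one in your environment, a quick check (output values will vary by setup) is:

import torch

print(torch.__version__)          # should show a +rocm suffix for ROCm wheels
print(torch.version.hip)          # ROCm/HIP version string; None on CUDA-only builds
print(torch.cuda.is_available())  # True when an AMD GPU is visible via the HIP backend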
Alternatively, you can build the GPTQModel package from source:
git clone https://github.com/ModelCloud/GPTQModel && cd GPTQModel
ROCM_VERSION=6.2 python -m build --no-isolation --wheel .
This process will also produce a Python wheel file that can be reused in similar environments.
Quantization Process
Here is an example of how to quantize a model using GPTQModel:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "/srv/hf-cache/hub/models--CohereForAI--aya-expanse-32b/snapshots/edd25f1107cd806b3fb779c61c210d804202f5ce/"  # Example model path
quant_path = "aya-expanse-32b-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id_or_path=model_id, quantize_config=quant_config)

# Increase batch_size based on GPU/VRAM specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)

model.save(quant_path)
Using the Quantized Model
The quantized model can then be loaded and used for generation:
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0]
# Decode the generated token ids back to text
print(model.tokenizer.decode(result))
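Because the GPTQ settings are stored in the saved checkpoint's configuration, the same quantized directory can usually also be loaded through the standard Transformers API. The sketch below assumes optimum and gptqmodel are installed in the environment and reuses the quant_path defined above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

inputs = tokenizer("Uncovering deep insights begins with", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))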
Example Output
During quantization, the process logs information about the model size and layer-wise loss. An example summary for a 32B model (e.g., aya-expanse-32b) is shown below:
INFO - Pre-Quantized model size: 61600.68MB, 60.16GB
INFO - Quantized model size: 18974.52MB, 18.53GB
INFO - Size difference: 42626.16MB, 41.63GB - 69.20%
INFO - Effective Quantization BPW (bits per weight): 4.25 bpw, based on [bits: 4, group_size: 128]
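The 4.25 bpw figure is consistent with the 4 quantized bits per weight plus per-group metadata: assuming a 16-bit scale and a 16-bit zero point are stored for every group of 128 weights, the overhead is 32 / 128 = 0.25 extra bits per weight, giving 4 + 0.25 = 4.25 bpw.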
AWQ
bitsandbytes
bitsandbytes quantizes weights on the fly at model load time, so unlike AWQ and GPTQ it does not require a separate calibration step.
Installation for ROCm
There are two options for using bitsandbytes with ROCm:
- Build from source
- Install the multi-backend-refactor release that supports ROCm
# Note: Add --no-deps to avoid reinstalling dependencies
pip install --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.45.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
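Once a ROCm-capable bitsandbytes build is installed, on-the-fly quantization is typically driven through the Transformers integration. A minimal sketch follows; the model name and 4-bit settings are illustrative, not a recommendation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CohereForAI/aya-expanse-32b"  # illustrative; any causal LM works

# Example 4-bit NF4 configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Weights are quantized on the fly while the model loads; no calibration dataset is needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)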