Half-Quadratic Quantization of large machine learning models
This republished Mobius Labs blog post introduces Half-Quadratic Quantization (HQQ), a calibration-free, weight-only quantization method that models weight outliers with a sparsity-promoting l_p loss and solves the resulting problem via half-quadratic splitting, which yields closed-form updates. HQQ runs in half precision on the GPU without autograd, quantizes very large models (e.g., Llama-2-70B) in minutes, and reports competitive or better perplexity/accuracy than GPTQ, AWQ, and bitsandbytes across LLM and ViT benchmarks; code and models are provided.
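To make the alternating updates concrete, here is a minimal sketch of an HQQ-style solver under stated assumptions: the per-row min-max scale is kept fixed and only the zero-point is optimized, the l_p term is handled with a generalized soft-thresholding step, and the function names and defaults for `p`, `beta`, `kappa`, and the iteration count are illustrative rather than the official `mobiusml/hqq` API.

```python
# Illustrative HQQ-style quantizer sketch (not the official hqq package API).
import torch

def shrink_lp(x, beta, p):
    # Closed-form proximal step for the sparsity-promoting l_p loss (p < 1):
    # a generalized soft-thresholding of the quantization error.
    return torch.sign(x) * torch.relu(torch.abs(x) - (p / beta) * torch.abs(x) ** (p - 1))

def hqq_style_quantize(W, n_bits=4, iters=20, p=0.7, beta=1.0, kappa=1.01):
    q_max = 2 ** n_bits - 1
    # Fixed per-row min-max scale and initial zero-point (asymmetric quantization).
    w_min = W.min(dim=1, keepdim=True).values
    w_max = W.max(dim=1, keepdim=True).values
    s = (w_max - w_min).clamp(min=1e-8) / q_max
    z = -w_min / s
    W_e = torch.zeros_like(W)
    for _ in range(iters):
        # Quantize/dequantize with the current zero-point.
        W_q = torch.clamp(torch.round(W / s + z), 0, q_max)
        W_r = (W_q - z) * s
        # Sub-problem 1: update the outlier/error variable via the l_p proximal step.
        W_e = shrink_lp(W - W_r, beta, p)
        # Sub-problem 2: closed-form zero-point update (simple per-row mean).
        z = torch.mean(W_q - (W - W_e) / s, dim=1, keepdim=True)
        # Anneal the half-quadratic penalty weight.
        beta *= kappa
    return W_q, s, z

# Usage: quantize a random weight matrix and check the reconstruction error.
W = torch.randn(256, 512)
W_q, s, z = hqq_style_quantize(W)
print(((W_q - z) * s - W).abs().mean())
```

Because every step is element-wise and closed-form, the loop needs no calibration data and no autograd, which is what lets this style of solver run in half precision on the GPU and finish even 70B-scale models in minutes.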