LLM Quantization
The article covers quantization in the Transformers library: techniques that reduce memory and compute demands by storing weights in lower-precision data types such as int8. It highlights built-in support for the AWQ and GPTQ algorithms, integration with bitsandbytes for 8-bit and 4-bit quantization, and the option to implement custom quantization methods by subclassing HfQuantizer or to configure existing backends through classes such as QuantoConfig and AqlmConfig.
- Quantization reduces memory and computational costs by using lower-precision data types such as 8-bit integers.
- Transformers supports the AWQ and GPTQ quantization algorithms and enables 8-bit and 4-bit quantization via bitsandbytes (see the loading sketch after this list).
- The HfQuantizer class allows integration of quantization techniques not natively supported in Transformers (a skeleton follows below).
- QuantoConfig and AqlmConfig provide customizable options for quantizing model weights and activations with specific data types and excluded modules (illustrated in the last sketch below).
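As a concrete starting point, the sketch below loads a model in 4-bit NF4 precision through the bitsandbytes integration. The BitsAndBytesConfig parameters are the library's documented ones; the checkpoint name is only illustrative, and a CUDA GPU with the bitsandbytes package installed is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes: weights are stored in 4 bits
# while matrix multiplications run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",              # illustrative checkpoint; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",                # place layers on available devices
)
```

Swapping in `BitsAndBytesConfig(load_in_8bit=True)` gives the 8-bit variant mentioned in the summary.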
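For methods Transformers does not ship natively, a custom quantizer subclasses HfQuantizer. The skeleton below is a minimal sketch: the hook names follow the documented base class, but exact signatures vary across Transformers versions, the class itself is hypothetical, and a real integration also needs a matching quantization config class.

```python
import torch
from transformers.quantizers import HfQuantizer

# Hypothetical quantizer skeleton plugged into the HfQuantizer interface.
class MyCustomQuantizer(HfQuantizer):
    requires_calibration = False  # set True if a calibration pass is needed

    def validate_environment(self, *args, **kwargs):
        # Verify required packages / hardware before weights are loaded.
        if not torch.cuda.is_available():
            raise RuntimeError("This hypothetical quantizer needs a GPU.")

    def _process_model_before_weight_loading(self, model, **kwargs):
        # Replace nn.Linear modules with quantized equivalents here.
        return model

    def _process_model_after_weight_loading(self, model, **kwargs):
        # Post-load fixups (e.g. packing weights) go here.
        return model

    @property
    def is_trainable(self):
        return False

    @property
    def is_serializable(self):
        return True
```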
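Finally, a minimal sketch of weight-only quantization with QuantoConfig, assuming the optimum-quanto backend is installed; the int8 weights dtype and the excluded lm_head module mirror the options the summary mentions, while the checkpoint is again illustrative. AqlmConfig is configured analogously but expects checkpoints already quantized with AQLM.

```python
from transformers import AutoModelForCausalLM, QuantoConfig

# Quantize weights to int8 with the Quanto backend, keeping the LM head
# in full precision via modules_to_not_convert.
quanto_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=["lm_head"],  # modules left unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative checkpoint
    quantization_config=quanto_config,
    device_map="auto",
)
```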
Full article: the Quantization guide in the Hugging Face Transformers documentation.