By Dr Vishnupriya Raghavan, Business Head, StackRoute, NIIT Ltd
AI is everywhere. Today, large language models like ChatGPT and advanced vision systems are deployed across functions covering everything from customer interactions to real-time defect detection on manufacturing lines. However, greater sophistication and accuracy also mean that demand for computing resources is growing at an unprecedented pace. Running these AI systems requires powerful, specialised hardware that used to be available only in top research labs. For many businesses, the cost of operating state-of-the-art AI can reach thousands of dollars per day, creating a formidable barrier to scaling AI in a sustainable and commercially viable way.
AI involves two main phases: model training and model inference (deployment). Training refers to building the model using large datasets, which typically incurs much higher hardware, energy, and time costs. Inference is when the trained model is actually used in production, providing predictions or insights from new data. It’s crucial to understand that the two phases have distinct cost and hardware profiles: training is mostly a one-time (but heavy) investment, while inference costs scale with usage and deployment footprint.
The real cost of AI is high, and in certain contexts, even unsustainable. Every query consumes real resources: data-centre space, GPUs, electricity, and natural resources. As per Google’s latest data, its electricity consumption has risen 27% and carbon emissions are up 51%, fueled by growth in AI. Some estimates also peg the daily running cost of ChatGPT between $100,000 and $700,000, with per-query expenses averaging 36 cents.
Why quantisation matters now
The Stanford AI Index 2025 underlines why such solutions matter for modern organisations. According to the report, AI model training costs have grown exponentially over the past few years, putting state-of-the-art models out of reach for many organisations unless more efficient approaches emerge. McKinsey’s research on AI adoption similarly suggests that organisations capture AI’s full value only when functional teams have access to tools and models that are both powerful and practical. Quantisation is a key method for bridging that gap: by reducing cost and simplifying model deployment, it enables organisations to adopt AI at a larger scale and in a more inclusive manner.
Quantisation 101
Quantisation has established itself as a powerful answer to this problem. It simplifies the arithmetic inside AI models by reducing numerical precision. Most models store their parameters as 32-bit floating-point numbers, yet they rarely need such high precision for effective performance. By converting these numbers to 8-bit or even 4-bit formats, quantisation cuts both memory footprint and computational load. Implemented thoughtfully, the technique preserves high accuracy levels while delivering these savings.
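The core mechanics can be shown in a few lines. The sketch below is a minimal, illustrative example (not taken from any particular library) of affine int8 quantisation: the observed float range is mapped onto the 256 values an 8-bit integer can hold, and a scale plus zero point let us recover an approximation of the original numbers.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantisation of a float32 tensor to int8.

    scale and zero_point map the observed float range onto [-128, 127].
    """
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Invert the mapping: shift by the zero point, rescale to float.
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)   # stand-in for model weights
q, s, z = quantize_int8(w)
w_hat = dequantize_int8(q, s, z)

# Memory drops 4x (float32 -> int8); reconstruction error stays within
# about one quantisation step.
max_err = float(np.abs(w - w_hat).max())
```

The same idea scales from this toy 4×4 matrix to the billions of weights in a large model, which is where the 4x memory saving becomes decisive.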
These benefits are already measurable across industries. In manufacturing, a quantised vision transformer cut processing delays by more than 40%, allowing advanced AI capabilities to run on cost-effective edge devices instead of expensive GPUs. In healthcare, quantised models have enabled fast, trustworthy diagnostic imaging on portable devices. In fraud detection and customer service, quantisation allows models to respond in real time with minimal infrastructure expense.
Different approaches for different needs
Different quantisation approaches serve the business and technical needs of different organisations. The most straightforward technique, Post-Training Quantisation, takes a fully trained model and converts it to a lower-precision version without any retraining. This proves valuable for initial implementations, when speed of deployment and reduced operational expense take precedence over squeezing out the last fraction of accuracy.
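In practice, post-training quantisation also needs to handle the activations flowing through the model, whose range is estimated from a small calibration set. The sketch below is an illustrative NumPy example (the layer, calibration data, and scale choices are all assumptions, not any specific library's API): scales are picked once, after training, and inference then runs as an integer matrix multiply that is rescaled back to float.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "trained" linear layer: float32 weights standing in for a real model.
W = rng.normal(scale=0.1, size=(16, 8)).astype(np.float32)

def symmetric_scale(t, bits=8):
    """Per-tensor symmetric scale: maps the max |value| onto the int range."""
    return max(float(np.abs(t).max()), 1e-8) / (2 ** (bits - 1) - 1)

# Post-training step: pick scales from the weights and a small
# calibration batch of representative inputs -- no retraining involved.
calib = rng.normal(size=(32, 16)).astype(np.float32)
w_scale = symmetric_scale(W)
a_scale = symmetric_scale(calib)
W_q = np.clip(np.round(W / w_scale), -127, 127).astype(np.int8)

def int8_linear(x):
    """Inference path: quantise the input, integer matmul, rescale output."""
    x_q = np.clip(np.round(x / a_scale), -127, 127).astype(np.int8)
    # Accumulate in int32, as real int8 kernels do, then rescale to float.
    acc = x_q.astype(np.int32) @ W_q.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * w_scale)

x = rng.normal(size=(4, 16)).astype(np.float32)
y_fp = x @ W                 # full-precision reference
y_q = int8_linear(x)         # quantised path
rel_err = float(np.abs(y_q - y_fp).max() / (np.abs(y_fp).max() + 1e-8))
```

The quality of the calibration set matters: if production inputs fall outside the range it captured, values get clipped and accuracy suffers, which is why PTQ results are always validated against held-out data.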
Quantisation-Aware Training instead builds the precision limitations into the model’s training process. The resulting models maintain stable performance even when heavily optimised, which makes the approach suitable for healthcare and automotive industries that require strict accuracy standards.
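The key trick in quantisation-aware training is "fake quantisation": the forward pass rounds weights to low precision so the loss reflects quantised behaviour, while gradients update the full-precision copy (the straight-through estimator). The toy training loop below is a hedged, self-contained sketch of that idea on a linear regression problem; the data, learning rate, and 4-bit choice are illustrative assumptions.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Quantise-then-dequantise: the round trip stays in float32 so the
    rest of the maths is unchanged, but the values the model actually
    uses are restricted to the low-precision grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Toy QAT loop: fit y = X @ w_true while the forward pass sees 4-bit weights.
rng = np.random.default_rng(7)
w_true = rng.normal(size=(8, 1)).astype(np.float32)
X = rng.normal(size=(256, 8)).astype(np.float32)
y = X @ w_true

w = np.zeros((8, 1), dtype=np.float32)   # full-precision "shadow" weights
lr = 0.05
for _ in range(300):
    w_q = fake_quant(w)                  # forward pass uses quantised weights
    err = X @ w_q - y
    grad = X.T @ err / len(X)
    w -= lr * grad                       # straight-through: update float copy

# Final loss is evaluated with the quantised weights the deployed model uses.
loss = float(np.mean((X @ fake_quant(w) - y) ** 2))
```

Because the optimiser sees quantisation noise throughout training, it settles on weights that remain accurate after rounding, which is exactly the stability property regulated industries care about.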
GPTQ, a post-training quantisation method designed for generative pre-trained transformer models, has gained special importance when working with large language models. Through GPTQ, organisations can reduce their models to extremely low precision levels without compromising their advanced text understanding and generation capabilities. This enables large models to operate on basic hardware, which makes enterprise-level AI accessible for locations without access to expensive infrastructure.
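A flavour of why 4-bit schemes shrink LLM weights so dramatically: weights are quantised in small groups, each group sharing one stored scale. The sketch below illustrates only this group-wise low-bit storage idea; GPTQ proper additionally uses second-order error compensation while quantising, which is omitted here. All sizes and the group length of 64 are illustrative assumptions.

```python
import numpy as np

def quantize_4bit_groups(w_row, group_size=64):
    """Group-wise 4-bit quantisation of one weight row.

    Each group of 64 weights shares one float16 scale, so storage is
    roughly 4 bits per weight plus a small per-group overhead.
    """
    groups = w_row.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)  # 4-bit range
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(1)
row = rng.normal(scale=0.02, size=4096).astype(np.float32)  # one weight row
q, scales = quantize_4bit_groups(row)
row_hat = dequantize(q, scales)

fp32_bytes = row.nbytes                      # 4096 weights * 4 bytes
packed_bytes = q.size // 2 + scales.nbytes   # two 4-bit values per byte + scales
# Roughly 7.5x smaller, with per-weight error bounded by its group's scale.
```

Multiplied across billions of weights, that compression is what lets a model that once demanded a multi-GPU server fit on a single commodity card.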
Navigating the trade-offs
Quantisation provides many benefits, yet its implementation involves trade-offs. Reducing numerical precision too aggressively degrades accuracy, which hits complex systems and rare edge cases that depend on fine-grained values hardest. Older hardware platforms may lack support for low-precision computation, preventing traditional systems from realising quantisation’s full benefits. Mitigating these challenges requires careful calibration and thorough testing, together with precise knowledge of how much precision each workload actually needs.
The tooling ecosystem around quantisation has developed rapidly. SmoothQuant and bitsandbytes, along with ONNX Runtime, allow engineering teams to integrate quantisation methods into their workflows with little friction. These tools let organisations create tailored quantisation strategies that suit their application needs, whether they focus on private on-premise deployments or edge-based AI service delivery.
Is quantisation the only path?
Quantisation looks set to establish itself as a fundamental component of AI scalability strategies. It is a proven method for maximising machine learning benefits at affordable cost in environments where the speed, efficiency, and scalability of AI deployment determine competitive advantage.
While quantisation is a leading technique, other, often more sophisticated, model optimisation methods such as model pruning, knowledge distillation, and low-rank factorisation can contribute even more to overall optimisation. These methods typically deliver significant reductions in model size and computation, sometimes surpassing quantisation, but they require a deeper technical understanding to use effectively. Together, these techniques form a toolkit for enabling scalable, production-ready AI, and readers interested in a broader perspective on model optimisation should explore them further to appreciate the full landscape of cost- and efficiency-saving possibilities.
Organisations that adopt this approach will future-proof their AI capabilities while making innovation accessible to teams across different industries. Quantisation will remain a crucial mechanism for developing AI models that are practical to deploy in production yet still powerful.