
We're introducing Online Quantization — a new feature in Dedicated Endpoints that lowers inference costs while maintaining accuracy without requiring any model prep.

This feature automatically quantizes your models as they load, so there’s no need to store separate quantized versions. With online quantization, Dedicated Endpoints can run your workloads on fewer GPUs, reducing infrastructure costs and speeding up inference while keeping results reliable.

What Is Online Quantization?

Online quantization automatically converts model weights and activations from their original precision (such as FP16) to lower-precision formats like FP8 or FP4 during model loading. Unlike traditional quantization methods that require preprocessing, retraining, or model changes, online quantization adjusts precision on the fly. The key benefits are:

Zero setup: No calibration data, retraining, or model changes required.
Faster inference: Improves both Time-to-First-Token (TTFT) and Time-Per-Output-Token (TPOT).
Lower GPU usage: Cuts GPU needs by 2-4x, significantly reducing costs.
On-the-fly conversion: Quantization happens automatically during model load.
Preserved accuracy: Accuracy remains nearly identical to the original.
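
To make "on-the-fly conversion" concrete, here is a minimal sketch of quantizing weights while a checkpoint is loaded. It is an illustration only, not Friendli's implementation: it uses symmetric per-tensor absmax scaling to signed 8-bit integers as a stand-in for FP8 (plain Python has no FP8 type), and the names quantize_tensor, dequantize_tensor, and load_with_quantization are hypothetical.

```python
from typing import Dict, List, Tuple

QMAX = 127  # largest magnitude of a signed 8-bit integer (stand-in for FP8)


def quantize_tensor(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to 8-bit integer codes plus one scale, using only the tensor itself."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(-QMAX, min(QMAX, round(v / scale))) for v in values]
    return codes, scale


def dequantize_tensor(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


def load_with_quantization(
    checkpoint: Dict[str, List[float]]
) -> Dict[str, Tuple[List[int], float]]:
    """Quantize each tensor as it is loaded; no calibration data or retraining."""
    return {name: quantize_tensor(values) for name, values in checkpoint.items()}


# Hypothetical toy checkpoint with a single flattened weight tensor.
checkpoint = {"layer0.weight": [0.12, -0.5, 0.33, 1.27, -0.02]}
loaded = load_with_quantization(checkpoint)
codes, scale = loaded["layer0.weight"]
restored = dequantize_tensor(codes, scale)
max_error = max(abs(a - b) for a, b in zip(checkpoint["layer0.weight"], restored))
print(codes, scale, max_error)
```

Because the scale is derived from the tensor's own largest value, nothing beyond the weights is needed at load time, and the rounding error per weight stays within half a quantization step.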

Powered by Friendli Inference, online quantization further optimizes compute efficiency alongside our cutting-edge technologies like iteration batching (a.k.a. continuous batching), multimodal caching, multi-LoRA, speculative decoding, and more.

How Online Quantization Cuts Costs

Quantization reduces precision (e.g., FP16 → FP8), cutting compute requirements and memory bandwidth during inference.
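
The memory-bandwidth effect can be estimated with back-of-the-envelope arithmetic. The sketch below assumes decoding is bandwidth-bound, so every weight is read once per generated token, which gives a rough lower bound on time per output token. The parameter count (72e9) and bandwidth (3.35e12 bytes/s) are illustrative assumptions, not measurements from Dedicated Endpoints, and the function names are hypothetical.

```python
# Storage cost of one parameter at each precision, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_bytes(num_params: float, precision: str) -> float:
    """Total bytes needed to store the model weights at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


def min_seconds_per_token(
    num_params: float, precision: str, bandwidth_bytes_per_s: float
) -> float:
    """Lower bound on decode time per token if each weight is read once per token."""
    return weight_bytes(num_params, precision) / bandwidth_bytes_per_s


NUM_PARAMS = 72e9     # assumption: roughly a 72B-parameter model
BANDWIDTH = 3.35e12   # assumption: aggregate memory bandwidth in bytes/s

for precision in BYTES_PER_PARAM:
    gigabytes = weight_bytes(NUM_PARAMS, precision) / 1e9
    millis = min_seconds_per_token(NUM_PARAMS, precision, BANDWIDTH) * 1e3
    print(f"{precision}: {gigabytes:.0f} GB of weights, >= {millis:.1f} ms per token")
```

Under these assumptions, halving the bytes per parameter halves both the weight footprint and the bandwidth-bound floor on per-token latency. Real speedups depend on batch size, kernels, and hardware support for the lower-precision format.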

With online quantization now integrated into Dedicated Endpoints:

Models run with less GPU memory and compute power per request
Endpoint throughput increases, enabling more requests per GPU
You can serve the same workload with fewer GPUs, reducing your cloud or data center costs

Fast Inference. Lower Costs. No Retraining.

Skip the slowdowns of traditional quantization. Our online approach runs automatically at initialization—no retraining, no calibration—delivering minimal accuracy loss with maximum speed.

Accelerate Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT) while cutting GPU costs instantly. No need to manage multiple model versions or pipelines. Speed meets simplicity with our online quantization technology.

Getting Started

To enable Online Quantization, just turn it on in your Dedicated Endpoint configuration — no model changes or pipeline updates required. It supports all major model formats and GPU types, and integrates directly into your existing AI workflows.

After selecting an eligible model on the endpoint creation page, you can toggle "Online Quantization" on or off under the "Endpoint features" section.

Figure 1: Create endpoint overview.

For instance, Qwen/Qwen2.5-72B-Instruct normally requires 4x NVIDIA H100 GPUs. With Online Quantization, it can run on only 2x NVIDIA H100 GPUs, cutting the GPU cost in half.
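
The drop from four GPUs to two is consistent with a simple capacity estimate. The sketch below is a rough sizing heuristic under stated assumptions, not the rule Dedicated Endpoints uses: it assumes about 72e9 parameters, 80 GB of memory per GPU, GPU counts that are powers of two, and a hypothetical budget of 60% of GPU memory for weights (the rest left for KV cache and activations). The function name gpus_needed is also hypothetical.

```python
def gpus_needed(
    num_params: float,
    bytes_per_param: float,
    gpu_memory_gb: float = 80.0,   # assumption: 80 GB per GPU
    weight_budget: float = 0.6,    # assumption: share of memory available for weights
    max_gpus: int = 8,
) -> int:
    """Smallest power-of-two GPU count whose weight budget fits the model weights."""
    weights_gb = num_params * bytes_per_param / 1e9
    gpus = 1
    while gpus <= max_gpus:
        if weights_gb <= gpus * gpu_memory_gb * weight_budget:
            return gpus
        gpus *= 2
    raise ValueError("model does not fit within max_gpus under these assumptions")


NUM_PARAMS = 72e9  # assumption: approximate parameter count of a 72B model

fp16_gpus = gpus_needed(NUM_PARAMS, bytes_per_param=2.0)
fp8_gpus = gpus_needed(NUM_PARAMS, bytes_per_param=1.0)
print(f"FP16: {fp16_gpus} GPUs, FP8: {fp8_gpus} GPUs")
```

With these inputs the estimate gives four GPUs at FP16 and two at FP8. A different weight budget or GPU memory size would shift the result, so treat the numbers as illustrative.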

Figure 2: Quantization off.

Figure 3: Quantization on.

To see whether online quantization is enabled for an endpoint, simply check the endpoint overview.

Figure 4: Endpoint overview with online quantization enabled.

To learn more about how to configure your Dedicated Endpoints, please refer to our docs.