We’re excited to introduce N-gram Speculative Decoding, a new feature in Dedicated Endpoints that speeds up LLM responses for structured and predictable tasks — such as code generation, legal drafting, or templated writing.
This is the first in a series of speculative decoding techniques coming to Dedicated Endpoints. N-gram speculative decoding accelerates inference by leveraging common N-gram patterns — with no changes required to your model or pipelines. It’s now available as a free-to-try feature for all plans.
What Is N-gram Speculative Decoding?
N-gram speculative decoding uses known token patterns (N-grams) to look ahead in the generation process and predict likely next tokens. Because the model can check several predicted tokens in a single step, the system can emit multiple tokens at once, which significantly reduces latency for deterministic or structured outputs.
Unlike draft-model speculative decoding, which relies on a separate lightweight model to propose future tokens, N-gram speculative decoding leverages the model’s own output patterns, making it faster to initialize and simpler to deploy.
With N-gram speculative decoding, you get:
- Faster outputs: Reduces Time-per-Output-Token (TPOT)
- No draft model to train: It works without a separate draft model, which simplifies setup
- Seamless integration: Just toggle it on during endpoint creation
- Optimized performance: Especially powerful when combined with Friendli Inference
This makes N-gram speculative decoding ideal for applications like:
- Code generation
- Formatted emails or reports
- Legal contracts
- Structured JSON generation
- Templated summaries
Getting Started
To enable N-gram speculative decoding:
- Create a new Dedicated Endpoint
- Toggle on “N-gram speculative decoding” under “Endpoint features”
- (Optional) Set minimum and maximum N-gram size for customization
- Deploy — no model or code changes needed

Figure 1: Create an endpoint with N-gram speculative decoding enabled.
To check whether N-gram speculative decoding is enabled for an endpoint, open its overview page.

Figure 2: An endpoint overview with N-gram speculative decoding enabled.
When enabled, N-gram speculative decoding automatically detects frequently occurring token sequences and uses those patterns to pre-generate likely continuations. The model then verifies these predictions in parallel, skipping unnecessary steps and speeding up generation.
If the predicted N-grams are correct, they're committed instantly. If not, the model falls back to standard decoding. Because the model verifies every predicted token before it is committed, this yields faster inference with minimal overhead and no accuracy tradeoff.
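To make the propose-then-verify flow described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Friendli Inference implementation: the function names, the `min_n`, `max_n`, and `num_draft` parameters, and the toy model are all hypothetical. The sketch assumes greedy decoding, and it simulates verification with sequential calls, whereas a real engine would score all draft positions in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = str
NextTokenFn = Callable[[Sequence[Token]], Token]


def propose_from_ngrams(tokens: Sequence[Token], min_n: int = 1,
                        max_n: int = 3, num_draft: int = 4) -> List[Token]:
    """Draft tokens by finding an earlier copy of the trailing n-gram.

    Hypothetical helper: tries the longest n-gram first, scans for its most
    recent earlier occurrence, and returns the tokens that followed it.
    """
    for n in range(max_n, min_n - 1, -1):
        if len(tokens) <= n:
            continue
        suffix = list(tokens[-n:])
        # Scan right to left, skipping the trailing n-gram itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + num_draft])
    return []  # no match: the caller falls back to standard decoding


def speculative_step(tokens: Sequence[Token], next_token: NextTokenFn,
                     **ngram_opts: int) -> List[Token]:
    """Run one decoding step and return the tokens it commits."""
    draft = propose_from_ngrams(tokens, **ngram_opts)
    context = list(tokens)
    committed: List[Token] = []
    for guess in draft:
        target = next_token(context)  # what the model itself would emit
        committed.append(target)
        context.append(target)
        if target != guess:
            break  # first mismatch: keep the model's token, drop the rest
    else:
        # Every draft token matched (or there was no draft), so the same
        # step also yields one more token from the model.
        committed.append(next_token(context))
    return committed


def generate(prompt: Sequence[Token], next_token: NextTokenFn,
             max_new_tokens: int, **ngram_opts: int) -> Tuple[List[Token], int]:
    """Decode with n-gram drafts; returns (tokens, number of steps taken)."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(tokens, next_token, **ngram_opts))
        steps += 1
    return tokens[:len(prompt) + max_new_tokens], steps


def greedy(prompt: Sequence[Token], next_token: NextTokenFn,
           max_new_tokens: int) -> List[Token]:
    """Baseline: standard decoding, one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


# Toy stand-in for a model with highly repetitive output (an assumption
# made only so the example runs without a real LLM).
PATTERN = ["a", "b", "c", "d"]


def toy_model(context: Sequence[Token]) -> Token:
    return PATTERN[len(context) % len(PATTERN)]


prompt = ["a", "b", "c", "d", "a"]
output, steps = generate(prompt, toy_model, max_new_tokens=12)
print("same as greedy:", output == greedy(prompt, toy_model, 12))
print("steps taken:", steps)
```

In this sketch every committed token comes from the model itself, so the output matches standard greedy decoding by construction; the drafts only change how many tokens each step commits. On repetitive text like the toy pattern, the loop finishes in fewer steps than the 12 that token-by-token decoding would need. The hypothetical `min_n` and `max_n` arguments play the role of the minimum and maximum N-gram size mentioned in the setup steps: longer matches are tried first because they tend to be more reliable predictors.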
N-gram speculative decoding further accelerates generation with lookahead techniques, integrating seamlessly with our other advanced technologies powered by Friendli Inference.
To learn more about N-gram speculative decoding, check out our documentation!