
We’re excited to introduce N-gram Speculative Decoding, a new feature in Dedicated Endpoints that speeds up LLM responses for structured and predictable tasks — such as code generation, legal drafting, or templated writing.

This is the first in a series of speculative decoding techniques coming to Dedicated Endpoints. N-gram speculative decoding accelerates inference by leveraging common N-gram patterns — with no changes required to your model or pipelines. It’s now available as a free-to-try feature for all plans.

What Is N-gram Speculative Decoding?

N-gram speculative decoding uses known token patterns (N-grams) to look ahead in the generation process and predict likely next tokens. This lets the system produce multiple tokens per decoding step instead of one, which reduces latency for deterministic or structured outputs.

Unlike draft-model speculative decoding, which relies on a separate lightweight model to propose future tokens, N-gram speculative decoding leverages the model’s own output patterns, making it faster to initialize and simpler to deploy.
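To make the lookahead idea concrete, here is a minimal Python sketch of one common way to propose tokens from N-grams: match the most recent tokens against earlier context and reuse whatever followed the match. This is an illustration under stated assumptions, not a description of how Dedicated Endpoints implements the feature. The function name `propose_from_ngrams` and its parameters `min_n`, `max_n`, and `num_draft` are hypothetical; `min_n` and `max_n` stand in for the minimum and maximum N-gram sizes.

```python
from typing import List, Sequence


def propose_from_ngrams(
    tokens: Sequence[int],
    min_n: int = 1,
    max_n: int = 3,
    num_draft: int = 4,
) -> List[int]:
    """Propose draft tokens by matching the trailing N-gram against earlier context.

    Hypothetical helper for illustration only. Longer N-grams are tried first
    because a longer match is a stronger hint about what comes next.
    """
    for n in range(min(max_n, len(tokens) - 1), min_n - 1, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                follow = start + n
                # Reuse the tokens that followed the earlier occurrence.
                return list(tokens[follow:follow + num_draft])
    return []  # no repeated pattern found: nothing to speculate on


# The sequence 1, 2, 3 appeared earlier and was followed by 4, 1, 2, 3.
print(propose_from_ngrams([1, 2, 3, 4, 1, 2, 3]))
```

When no earlier match exists, the function returns an empty draft, which corresponds to decoding one token at a time as usual. Repetitive outputs such as code or templated text tend to produce matches more often, which is why those workloads benefit most.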

With N-gram speculative decoding, you get:

  • Faster outputs: Reduces Time-per-Output-Token (TPOT)
  • No draft models to train: It works without a separate draft model, which simplifies setup
  • Seamless integration: Just toggle it on during endpoint creation
  • Optimized performance: Especially powerful when combined with Friendli Inference

This makes N-gram speculative decoding ideal for applications like:

  • Code generation
  • Formatted emails or reports
  • Legal contracts
  • Structured JSON generation
  • Templated summaries

Getting Started

To enable N-gram speculative decoding:

  1. Create a new Dedicated Endpoint
  2. Toggle on “N-gram speculative decoding” under “Endpoint features”
  3. (Optional) Set minimum and maximum N-gram size for customization
  4. Deploy — no model or code changes needed


Figure 1: Create an endpoint with N-gram speculative decoding enabled.

To check whether N-gram speculative decoding is enabled for an endpoint, open its overview page.


Figure 2: An endpoint overview with N-gram speculative decoding enabled.

When enabled, N-gram speculative decoding automatically detects frequently occurring token sequences and uses those patterns to pre-generate likely continuations. The model then verifies these predictions in parallel, skipping unnecessary steps and speeding up generation.

If the predicted N-grams are correct, they're committed instantly. If not, the model falls back to standard decoding. This yields faster inference with minimal overhead and no accuracy tradeoff.
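The commit-or-fall-back behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual implementation: `verify_draft` and `toy_next_token` are hypothetical names, the toy "model" is a deterministic function standing in for greedy decoding, and the sequential loop stands in for what a real system would check in a single batched forward pass.

```python
from typing import Callable, List, Sequence

NextTokenFn = Callable[[Sequence[int]], int]


def verify_draft(
    next_token: NextTokenFn,
    context: Sequence[int],
    draft: Sequence[int],
) -> List[int]:
    """Return the tokens committed in one speculative step (hypothetical helper).

    Every committed token is the model's own choice, so the result matches
    standard decoding; the draft only decides how many tokens land per step.
    """
    committed: List[int] = []
    prefix = list(context)
    for guess in draft:
        target = next_token(prefix)  # what standard decoding would emit here
        committed.append(target)
        prefix.append(target)
        if target != guess:
            # Mismatch: keep the model's token and discard the rest of the draft.
            return committed
    # Whole draft accepted: the same pass also yields one extra token.
    committed.append(next_token(prefix))
    return committed


def toy_next_token(prefix: Sequence[int]) -> int:
    """Toy stand-in for a model: counts upward modulo 5."""
    return (prefix[-1] + 1) % 5


print(verify_draft(toy_next_token, [0, 1, 2], [3, 4, 0]))  # draft fully correct
print(verify_draft(toy_next_token, [0, 1, 2], [3, 9, 0]))  # second guess wrong
print(verify_draft(toy_next_token, [0, 1, 2], []))         # no draft available
```

In the first call all three guesses are accepted and one more token is gained in the same step. In the second, the step stops at the wrong guess and keeps the model's own token instead. With an empty draft the step reduces to ordinary one-token decoding. In all three cases the committed tokens are exactly what the model would have produced on its own, which is why the technique does not trade away accuracy.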

N-gram speculative decoding further accelerates generation with lookahead techniques, integrating seamlessly with our other advanced technologies powered by Friendli Inference.

To learn more about N-gram speculative decoding, check out our documentation!