https://pytorch.org/blog/torchspec-speculative-decoding-training-at-scale/
TorchSpec: Speculative Decoding Training at Scale – PyTorch
https://arxiv.org/abs/2308.04623
[2308.04623] Accelerating LLM Inference with Staged Speculative Decoding
https://openreview.net/forum?id=vo9t20wsmd
Faster Cascades via Speculative Decoding | OpenReview
Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches interleave two models, but via...
https://docs.sglang.io/docs/advanced_features/speculative_decoding
Speculative Decoding - SGLang Documentation
https://huggingface.co/papers/2601.23180
Paper page - TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
https://www.together.ai/blog/customized-speculative-decoding
Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding
https://redis.io/blog/speculative-decoding-llm/
Speculative decoding: how it works & when to use it
Apr 23, 2026 - Learn how speculative decoding speeds up LLM responses, when batch size works against it, and how it pairs with semantic caching in a layered inference stack.
https://www.d-matrix.ai/how-speculative-decoding-supercharged-ai-inference-in-disaggregated-pipelines/
How speculative decoding supercharged AI inference in disaggregated pipelines - d-Matrix
May 12, 2026 - AI inference pipelines using multiple different kinds of accelerators are providing a snappier, lower-latency experience.
https://virtual.aistats.org/virtual/2026/poster/13343
AISTATS Poster DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
https://www.together.ai/blog/distribution-aware-speculative-decoding
Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding
Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.
https://www.together.ai/blog/sequoia
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
https://www.together.ai/blog/speculative-decoding-for-high-throughput-long-context-inference
Speculative decoding for high-throughput long-context inference
https://sleepingrobots.com/dreams/speculative-decoding-gemma4-strix-halo/
Speculative Decoding on Strix Halo: 2x Faster Gemma 4 31B Token Generation | Sleeping Robots
Apr 12, 2026 - Benchmarking speculative decoding with Gemma 4 E2B as a draft model for Gemma 4 31B on AMD Strix Halo, a bandwidth-bound setup where the optimal draft-max...
https://arxiv.org/abs/2602.06036
[2602.06036] DFlash: Block Diffusion for Flash Speculative Decoding
https://rocm.blogs.amd.com/artificial-intelligence/fly/README.html
FLy: A New Paradigm for Speculative Decoding — Accepting Semantically Correct Drafts Beyond Exact...
This blog explores a new training-free, loosely speculative decoding method that can accept semantically valid mismatches and speed up original SPD...
https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
An Introduction to Speculative Decoding for Reducing Latency in AI Inference | NVIDIA Technical Blog
Oct 8, 2025 - Generating text with large language models (LLMs) often involves running into a fundamental bottleneck. GPUs offer massive compute, yet much of that power sits…