Robuta

https://pytorch.org/blog/torchspec-speculative-decoding-training-at-scale/
TorchSpec: Speculative Decoding Training at Scale – PyTorch

https://arxiv.org/abs/2308.04623
[2308.04623] Accelerating LLM Inference with Staged Speculative Decoding
Abstract page for arXiv paper 2308.04623: Accelerating LLM Inference with Staged Speculative Decoding

https://openreview.net/forum?id=vo9t20wsmd
Faster Cascades via Speculative Decoding | OpenReview
Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches interleave two models, but via...

https://docs.sglang.io/docs/advanced_features/speculative_decoding
Speculative Decoding - SGLang Documentation

https://huggingface.co/papers/2601.23180
Paper page - TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
Join the discussion on this paper page

https://www.together.ai/blog/customized-speculative-decoding
Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

https://redis.io/blog/speculative-decoding-llm/
Speculative decoding: how it works & when to use it
Apr 23, 2026 - Learn how speculative decoding speeds up LLM responses, when batch size works against it, and how it pairs with semantic caching in a layered inference stack.

https://www.d-matrix.ai/how-speculative-decoding-supercharged-ai-inference-in-disaggregated-pipelines/
How speculative decoding supercharged AI inference in disaggregated pipelines - d-Matrix
May 12, 2026 - AI inference pipelines using multiple different kinds of accelerators are providing a more snappy, low-latency experience.

https://virtual.aistats.org/virtual/2026/poster/13343
AISTATS Poster DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

https://www.together.ai/blog/distribution-aware-speculative-decoding
Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding
Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding: up to 50% faster, zero degradation in reward quality.

https://www.together.ai/blog/sequoia
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

https://www.together.ai/blog/speculative-decoding-for-high-throughput-long-context-inference
Speculative decoding for high-throughput long-context inference

https://sleepingrobots.com/dreams/speculative-decoding-gemma4-strix-halo/
Speculative Decoding on Strix Halo: 2x Faster Gemma 4 31B Token Generation | Sleeping Robots
Apr 12, 2026 - Benchmarking speculative decoding with Gemma 4 E2B as a draft model for Gemma 4 31B on AMD Strix Halo, a bandwidth-bound setup where the optimal draft-max...

https://arxiv.org/abs/2602.06036
[2602.06036] DFlash: Block Diffusion for Flash Speculative Decoding
Abstract page for arXiv paper 2602.06036: DFlash: Block Diffusion for Flash Speculative Decoding

https://rocm.blogs.amd.com/artificial-intelligence/fly/README.html
FLy: A New Paradigm for Speculative Decoding - Accepting Semantically Correct Drafts Beyond Exact...
This blog explores a new training-free loosely speculative decoding method, that can accept mismatches that are semantically valid and speedup original SPD...

https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
An Introduction to Speculative Decoding for Reducing Latency in AI Inference | NVIDIA Technical Blog
Oct 8, 2025 - Generating text with large language models (LLMs) often involves running into a fundamental bottleneck. GPUs offer massive compute, yet much of that power sits…