https://openreview.net/forum?id=RXPofAsL8F&referrer=%5Bthe%20profile%20of%20Zihao%20Ye%5D(%2Fprofile%3Fid%3D~Zihao_Ye1)
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels...
llm inference, efficient, customizable, attention, engine
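For orientation, here is a minimal sketch of the scaled dot-product attention such GPU kernels compute, written in plain PyTorch; real engines like the one above fuse these steps into a single optimized kernel, so this is a reference baseline, not their implementation.

```python
# Minimal scaled dot-product attention in plain PyTorch, for reference only;
# production attention engines fuse these steps into one GPU kernel.
import math
import torch

def attention(q, k, v, causal=True):
    # q, k, v: [batch, heads, seq, head_dim]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        seq = q.size(-2)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 16, 64)
out = attention(q, k, v)  # shape [1, 8, 16, 64]
```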
https://www.rubrik.com/blog/ai/25/llm-inference-benchmarks-predibase-fireworks-vllm
Discover how Predibase delivers up to 4x faster LLM inference vs vLLM & Fireworks using speculative decoding, chunked prefill, and managed AI infrastructure.
real world, llm inference, benchmarks, built, fastest
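As a rough illustration of the speculative decoding idea mentioned above (not Predibase's or vLLM's implementation), the toy loop below has a cheap draft model propose a few tokens and the target model verify them; `draft_next` and `target_logits` are hypothetical stand-in callables.

```python
# Toy draft-and-verify loop illustrating speculative decoding.
# `draft_next` and `target_logits` are hypothetical stand-ins, not real APIs.
from typing import Callable, List

def speculative_step(tokens: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_logits: Callable[[List[int]], List[float]],
                     k: int = 4) -> List[int]:
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. The target model verifies the proposal. A real implementation scores
    #    all k positions in one forward pass; here we check greedily, token by token.
    accepted, ctx = [], list(tokens)
    for t in draft:
        logits = target_logits(ctx)
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != t:
            accepted.append(best)  # replace the first mismatch and stop
            break
        accepted.append(t)
        ctx.append(t)
    return tokens + accepted
```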
https://www.pugetsystems.com/labs/hpc/exploring-hybrid-cpu-gpu-llm-inference/
A brief look into using a hybrid GPU/VRAM + CPU/RAM approach to LLM inference with the KTransformers inference library.
cpu gpu, llm inference, puget systems, exploring, hybrid
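A generic PyTorch sketch of the hybrid split (this is not the KTransformers API, and it assumes a CUDA device is available): keep the first N transformer blocks in VRAM, run the rest in system RAM, and move activations across the boundary once per forward pass.

```python
# Generic hybrid GPU/CPU layer split, assuming a CUDA device is available.
# Not the KTransformers API; only a sketch of the offloading idea.
import torch
import torch.nn as nn

class HybridStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList, gpu_layers: int):
        super().__init__()
        self.gpu_blocks = nn.ModuleList(list(blocks[:gpu_layers])).to("cuda")
        self.cpu_blocks = nn.ModuleList(list(blocks[gpu_layers:]))  # stays in RAM

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to("cuda")
        for blk in self.gpu_blocks:   # fast path in VRAM
            x = blk(x)
        x = x.to("cpu")
        for blk in self.cpu_blocks:   # slower path in system RAM
            x = blk(x)
        return x
```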
https://www.redhat.com/en/blog/evaluating-llm-inference-performance-red-hat-openshift-ai
This article introduces the methodology and results of performance testing the Llama-2 models deployed on the model serving stack included with Red Hat...
red hat openshift, llm inference, evaluating, performance, ai
https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
Nov 7, 2023 - Large language models (LLMs) offer incredible new capabilities, expanding the frontier of what is possible with AI. However, their large size and unique…
large language model, nvidia, tensorrt, llm, supercharges
https://jatevo.id/
The Distributed GPU LLM Inference Network. Route your prompts across a decentralized network of GPUs for low-latency, cost-efficient LLM inference.
llm inference, decentralized, gpu
https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/
May 29, 2025 - The Forge Reasoning API contains some of our latest advancements in inference-time AI research, building on our journey from the original Hermes model.
nous chat, introducing, forge, reasoning, api
https://www.amazon.science/publications/order-of-magnitude-speedups-for-llm-membership-inference
Large Language Models (LLMs) have the promise to revolutionize computing broadly, but their complexity and extensive training data also expose significant...
amazon science, order, magnitude, llm, membership
https://openreview.net/forum?id=n3rZJrWPLE&referrer=%5Bthe%20profile%20of%20Genghan%20Zhang%5D(%2Fprofile%3Fid%3D~Genghan_Zhang1)
Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context...
attention spans, llm inference, mixture, optimizing, efficiency
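The core mechanism is a causal mask that limits each query to the most recent keys, which bounds the KV cache. A minimal sketch follows; the paper presumably varies the window per head or layer, whereas this uses a single fixed window.

```python
# Sketch of a sliding-window attention mask: each query position may only
# attend to the previous `window` keys, bounding the KV cache size.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # query index minus key index
    return (dist >= 0) & (dist < window)        # True where attention is allowed

print(sliding_window_mask(6, 3).int())
```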
https://www.arxiv.org/abs/2601.12904
Abstract page for arXiv paper 2601.12904: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation
prefix, cache, fusion, rag, accelerating
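The underlying idea of prefix caching can be sketched as memoizing the prefill work for a shared prompt prefix (for example, the system prompt plus retrieved documents in a RAG pipeline); `encode_prefix` below is a hypothetical stand-in for the model's prefill pass, not the paper's method.

```python
# Toy prefix cache: compute state for a shared prompt prefix once, reuse it after.
# `encode_prefix` is a hypothetical stand-in for the model's prefill pass.
import hashlib

class PrefixCache:
    def __init__(self, encode_prefix):
        self._encode = encode_prefix
        self._store = {}

    def get_state(self, prefix: str):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._encode(prefix)  # expensive prefill, done once
        return self._store[key]                      # cheap cache hit afterwards

cache = PrefixCache(encode_prefix=lambda p: f"<kv-state for {len(p)} chars>")
state = cache.get_state("system prompt + retrieved docs")
```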
https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/use-the-arms-agent-for-python-to-monitor-llm-applications
Connect LLM applications or inference services to ARMS (Application Real-Time Monitoring Service): the Python agent is an observability data collector for Python...
llm applications, connect, inference, services, arms
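To illustrate the kind of signals such an agent collects, here is a generic timing wrapper around an LLM call; this is not the ARMS agent API, just a plain-Python sketch of request-level instrumentation.

```python
# Generic latency/size wrapper around an LLM call, illustrating the signals an
# observability agent might collect. This is not the ARMS API.
import time
from functools import wraps

def instrumented(fn):
    @wraps(fn)
    def wrapper(prompt: str, **kw):
        start = time.perf_counter()
        reply = fn(prompt, **kw)
        latency_ms = (time.perf_counter() - start) * 1000
        print({"latency_ms": round(latency_ms, 1),
               "prompt_chars": len(prompt),
               "reply_chars": len(reply)})
        return reply
    return wrapper

@instrumented
def call_llm(prompt: str) -> str:  # stand-in for a real inference call
    return prompt.upper()

call_llm("hello inference")
```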
https://predibase.com/blog/guide-how-to-serve-llms-faster-inference
Learn how to accelerate and optimize deployments for open-source models with our blueprint for fast, reliable, and cost-efficient LLM serving. Deep dive on GPU...
serving guide, build faster, llm, inference, open
https://inferencepriceindex.com/
Daily benchmark tracking LLM inference costs across OpenAI, Anthropic, Google and more. Free API for token pricing data.
price index, inference, track, llm, api
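Per-token pricing reduces to simple arithmetic once input and output rates are known; the sketch below shows the calculation with placeholder prices, not figures quoted from the index above.

```python
# Back-of-the-envelope cost estimate from per-million-token prices.
# The prices below are placeholders, not quotes from the index above.
def request_cost(prompt_tokens: int, completion_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    return (prompt_tokens * usd_per_m_input +
            completion_tokens * usd_per_m_output) / 1_000_000

# e.g. 2,000 prompt tokens and 500 completion tokens at $3 / $15 per million:
print(f"${request_cost(2000, 500, 3.0, 15.0):.4f}")  # $0.0135
```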
https://bentoml.com/llm/
A practical handbook for engineers building, optimizing, scaling and operating LLM inference systems in production.
llm inference handbook
https://resources.nvidia.com/en-us-run-ai/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer
Deploying large language models (LLMs) poses a challenge in optimizing inference efficiency. In particular, cold start delays—where models take significant...
cold start, llm inference, reducing, latency, nvidia
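The general pattern for cutting cold-start time is to overlap reading weight shards from storage with copying them to the GPU. The sketch below uses a background reader thread and a bounded queue; it is a generic producer/consumer illustration, not the Run:ai Model Streamer API.

```python
# Generic sketch: stream weight shards to the GPU while the next shard is still
# being read from storage. Not the Run:ai Model Streamer API.
import queue
import threading
import torch

def load_streamed(shard_paths, device="cuda"):
    q = queue.Queue(maxsize=2)

    def reader():
        for path in shard_paths:
            q.put(torch.load(path, map_location="cpu"))  # I/O-bound stage
        q.put(None)                                      # end-of-stream marker

    threading.Thread(target=reader, daemon=True).start()
    state = {}
    while (shard := q.get()) is not None:
        # Host-to-device copy overlaps with the reader fetching the next shard.
        state.update({k: v.to(device, non_blocking=True) for k, v in shard.items()})
    return state
```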