https://github.com/vllm-project/vllm
GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for...
A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm
https://vllm.ai/
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). Deploy AI models faster with state-of-the-art...
https://docs.vllm.ai/en/latest/features/batch_invariance/
Batch Invariance - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/engine/logprobs/
logprobs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/renderers/mistral/
mistral - vLLM
https://docs.vllm.ai/en/latest/examples/features/reset_kv/
Reset Kv - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/qutlass_utils/
qutlass_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/engine/core_client/
core_client - vLLM
https://docs.vllm.ai/en/latest/features/quantization/inc/
Intel Quantization Support - vLLM
https://vllm.ai/releases
Previous vLLM Releases | vLLM
Browse all vLLM releases. Find installation commands for any version and access release notes.
https://docs.vllm.ai/en/v0.9.1/serving/openai_compatible_server.html
OpenAI-Compatible Server - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/backends/linear_attn/
linear_attn - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/openai/realtime/connection/
connection - vLLM
https://docs.vllm.ai/en/v0.9.1/training/rlhf.html
Reinforcement Learning from Human Feedback - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/mamba/ops/ssd_chunk_state/
ssd_chunk_state - vLLM
https://docs.vllm.ai/en/latest/community/meetups/
Meetups - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/glm_ocr/
glm_ocr - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/model_states/whisper/
whisper - vLLM
https://docs.vllm.ai/en/stable/api/vllm/entrypoints/openai/responses/
responses - vLLM
https://ingero.io/11-second-time-to-first-token-healthy-vllm-server/
11-Second TTFT on Healthy vLLM: Tracing Latency Spikes
Apr 21, 2026 - We traced vLLM latency spikes causing 11-second time to first token despite healthy endpoints, using eBPF kernel tracing on GPU scheduling.
https://docs.vllm.ai/en/latest/design/p2p_nccl_connector/
P2P NCCL Connector - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fla/ops/chunk_o/
chunk_o - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/pooling/base/protocol/
protocol - vLLM
https://docs.vllm.ai/en/latest/cli/serve/?h=kv+events+config
vllm serve - vLLM
https://docs.vllm.ai/en/latest/api/vllm/lora/worker_manager/
worker_manager - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/nemotron_nas/
nemotron_nas - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/backends/mla/xpu_mla_sparse/
xpu_mla_sparse - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/openai/parser/responses_parser/
responses_parser - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/kernels/linear/scaled_mm/pytorch/
pytorch - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/kv_offload/cpu/gpu_worker/
gpu_worker - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/transformers/base/
base - vLLM
https://cellularstockpile.com/amd-rivals-nvidia-in-ai-mi300x-doubles-speed-in-vllm-and-outperforms-h100-by-30-in-tensorrt-llm/
AMD Rivals NVIDIA in AI: MI300X Doubles Speed in vLLM and Outperforms H100 by 30% in TensorRT-LLM |...
https://developersdigest.tech/tools/vllm
vLLM - Review & Setup Guide - Developers Digest
High-throughput inference server for LLMs. PagedAttention memory management. The go-to for serious local or self-hosted serving. Full review, videos, and setup...
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/serve/instrumentator/
instrumentator - vLLM
https://docs.vllm.ai/en/stable/api/vllm/scalar_type/
scalar_type - vLLM
https://docs.vllm.ai/en/latest/contributing/dockerfile/dockerfile/
Dockerfile - vLLM
https://docs.vllm.ai/en/latest/api/vllm/assets/base/
base - vLLM
https://www.analyticsvidhya.com/blog/2024/06/guide-to-vllm-using-gemma-7b-it/
Guide to vLLM using Gemma-7b-it
Jun 24, 2024 - Learn how to use vLLM for efficient LLM inference with the Gemma-7b-it model, featuring KV Cache and PagedAttention.
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/backend/
backend - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/fused_moe/experts/trtllm_fp8_moe/
trtllm_fp8_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/kernels/linear/
linear - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/rotary_embedding/ernie45_vl_rope/
ernie45_vl_rope - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/spec_decode/utils/
utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/bee/
bee - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/core/block_pool/
block_pool - vLLM
https://docs.vllm.ai/en/latest/usage/troubleshooting/
Troubleshooting - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/pooling/utils/
utils - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/mamba/ops/ssu_dispatch/
ssu_dispatch - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_int8/
compressed_tensors_moe_w8a8_int8 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/phi4mm/
phi4mm - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/pool/metadata/
metadata - vLLM
https://docs.vllm.ai/en/latest/models/hardware_supported_models/xpu/
XPU - Intel® GPUs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/kv_offload/cpu/policies/base/
base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/streaming/
streaming - vLLM
https://docs.vllm.ai/en/stable/api/vllm/compilation/monitor/
monitor - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/opt/
opt - vLLM
https://vllm.ooos.top/api/vllm/tool_parsers/seed_oss_tool_parser/
seed_oss_tool_parser - vLLM
https://docs.vllm.ai/en/latest/features/nixl_connector_usage/
NixlConnector Usage Guide - vLLM
https://docs.vllm.ai/en/latest/api/vllm/logprobs/
logprobs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/pythonic_tool_parser/
pythonic_tool_parser - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/online/moe_base/
moe_base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/mamba/ops/cpu/causal_conv1d/
causal_conv1d - vLLM
https://docs.vllm.ai/en/latest/deployment/frameworks/autogen/
AutoGen - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/router/fused_topk_router/
fused_topk_router - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/llava/
llava - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/sample/sampler/
sampler - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/structured_outputs/
structured_outputs - vLLM
https://docs.vllm.ai/en/latest/deployment/frameworks/chatbox/
Chatbox - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/openai/parser/
parser - vLLM
https://docs.vllm.ai/en/latest/examples/tool_calling/openai_chat_completion_client_with_tools/
OpenAI Chat Completion Client With Tools - vLLM
https://docs.vllm.ai/en/latest/api/vllm/scalar_type/
scalar_type - vLLM
https://github.com/defilantech/llmkube
GitHub - defilantech/LLMKube: Kubernetes operator for local LLM inference with llama.cpp, vLLM,...
Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-server — multi-GPU NVIDIA + Apple Silicon Metal, autoscaling, air-gapped,...
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/base_config/
base_config - vLLM
https://docs.vllm.ai/en/latest/api/vllm/lora/ops/triton_ops/lora_shrink_op/
lora_shrink_op - vLLM
https://docs.vllm.ai/en/stable/training/trl/
Transformers Reinforcement Learning - vLLM
https://docs.vllm.ai/en/stable/api/vllm/engine/llm_engine/
llm_engine - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/backends/mla/prefill/flash_attn/
flash_attn - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/router/custom_routing_router/
custom_routing_router - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/llama_eagle/
llama_eagle - vLLM
https://www.codiste.com/vllm-vs-ollama
vLLM vs Ollama: Choosing the Right LLM Framework
Compare vLLM vs Ollama for LLM inference. vLLM delivers 2,300 tokens/sec for production APIs while Ollama excels at local deployment and CPU systems.
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/arcee/
arcee - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/models/exaone4_5_mtp/
exaone4_5_mtp - vLLM
https://docs.vllm.ai/en/stable/api/vllm/entrypoints/mcp/tool/
tool - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/attention/kv_transfer_utils/
kv_transfer_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/compilation/passes/
passes - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/compressed_tensors/transform/linear/
linear - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/prepare_finalize/
prepare_finalize - vLLM
https://recipes.vllm.ai/XiaomiMiMo/MiMo-V2.5-Pro
XiaomiMiMo/MiMo-V2.5-Pro | vLLM Recipes
Xiaomi's flagship MoE reasoning model (1.02T total / 42B active) with hybrid attention, native FP8 weights, and Multi-Token Prediction
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/mamba/mamba_mixer2/
mamba_mixer2 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/rotary_embedding/mrope_interleaved/
mrope_interleaved - vLLM
https://docs.vllm.ai/en/latest/examples/features/batch_invariance/
Batch Invariance - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/structured_output/backend_types/
backend_types - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/openai/responses/
responses - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/deepseek_mtp/
deepseek_mtp - vLLM
https://bobweb.ai/t/vllm/
vLLM Archives - bobweb.ai
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/spec_decode/eagle/eagle3_utils/
eagle3_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/modular_kernel/
modular_kernel - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fla/ops/
ops - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/ocp_mx_emulation_moe/
ocp_mx_emulation_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/core/sched/scheduler/
scheduler - vLLM