https://github.com/vllm-project/vllm
GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for...
A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm
https://vllm.ai/
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). Deploy AI models faster with state-of-the-art...
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/processor/
processor - vLLM
https://docs.vllm.ai/en/latest/examples/speech_to_text/openai/
OpenAI - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/spec_decode/eagle/utils/
utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/distributed/kv_transfer/kv_connector/v1/nixl/
nixl - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/chat_templates/registry/
registry - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/kv_offload/cpu/policies/base/
base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/compilation/compiler_interface/
compiler_interface - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/cli/benchmark/
benchmark - vLLM
https://docs.vllm.ai/en/v0.9.1/usage/usage_stats.html
Usage Stats Collection - vLLM
https://docs.vllm.ai/en/latest/api/vllm/config/reasoning/
reasoning - vLLM
https://docs.vllm.ai/en/latest/deployment/frameworks/dstack/
dstack - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/modular_kernel/
modular_kernel - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_nvfp4/
compressed_tensors_w4a16_nvfp4 - vLLM
https://discuss.vllm.ai/t/vllm-hangs-during-worker-initialization-on-blackwell-pcie-gpus-unless-disable-custom-all-reduce-is-used/2540
vLLM hangs during worker initialization on Blackwell PCIe GPUs unless --disable-custom-all-reduce...
Apr 11, 2026 - Description When deploying a large model with tensor parallelism on a multi-GPU server, vLLM hangs during worker initialization. The logs repeatedly show:...
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/prepare_finalize/no_dp_ep/
no_dp_ep - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/kernels/linear/scaled_mm/deep_gemm/
deep_gemm - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/models/ernie45_vl_moe/
ernie45_vl_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/distributed/kv_transfer/kv_connector/v1/metrics/
metrics - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/phi4mm_audio/
phi4mm_audio - vLLM
https://docs.vllm.ai/en/latest/api/vllm/multimodal/processing/inputs/
inputs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/reasoning/abs_reasoning_parsers/
abs_reasoning_parsers - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe/
trtllm_mxfp4_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/flashinfer_cutedsl_moe/
flashinfer_cutedsl_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/utils/humming_utils/
humming_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/fp_quant/
fp_quant - vLLM
https://docs.vllm.ai/en/latest/usage/troubleshooting/
Troubleshooting - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/ops/triton_attention_helpers/
triton_attention_helpers - vLLM
https://docs.vllm.ai/en/stable/api/vllm/entrypoints/pooling/scoring/io_processor/
io_processor - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tokenizers/registry/
registry - vLLM
https://docs.vllm.ai/en/latest/api/vllm/renderers/base/
base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/medusa/
medusa - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/llama_tool_parser/
llama_tool_parser - vLLM
https://eesungkim.com/reviews/triton_vs_vllm_comparison/
qwen3_asr_triton vs speechLLM (vLLM): Performance & Architecture Comparison | Log
Notes of research, experiments, and troubleshooting.
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_fp8/
compressed_tensors_moe_w4a8_fp8 - vLLM
https://docs.vllm.ai/en/latest/design/lora_resolver_plugins/
LoRA Resolver Plugins - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/engine/utils/
utils - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/fla/ops/fused_sigmoid_gating/
fused_sigmoid_gating - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_int8/
compressed_tensors_moe_w4a8_int8 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/utils/mistral/
mistral - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/qwen2_5_omni_thinker/
qwen2_5_omni_thinker - vLLM
https://skillnav.dev/articles/vllm-v0-to-v1-correctness-before-corrections-in-rl
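vLLM V0 to V1 Migration: Fix Inference Correctness First, Then Change the RL Objective | SkillNav
This article documents the training-inference mismatch encountered while migrating from vLLM V0 to V1, and how backend correctness was restored by fixing logprob semantics, runtime defaults, the weight update path, and the fp32 lm_head. The author stresses that the inference backend should output correct logprobs before the RL objective is adjusted.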
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/pooling/scoring/protocol/
protocol - vLLM
https://docs.vllm.ai/en/latest/api/vllm/multimodal/media/image/
image - vLLM
https://docs.vllm.ai/en/latest/api/vllm/logging_utils/lazy/
lazy - vLLM
https://docs.vllm.ai/en/latest/api/vllm/compilation/passes/fusion/attn_quant_fusion/
attn_quant_fusion - vLLM
https://docs.vllm.ai/en/latest/examples/tool_calling/openai_responses_client_with_tools/
OpenAI Responses Client With Tools - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/opt/
opt - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/processors/granite4_vision/
granite4_vision - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/runai_utils/
runai_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/granite4_tool_parser/
granite4_tool_parser - vLLM
https://llmkube.com/blog/vllm-swift-turboquant-m5-max
vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data - LLMKube Blog
TheTom asked us to run his vllm-swift TurboQuant+ work through the same kind of sweep we did on the llama.cpp fork. 36 cells later: fp16 wins decode at every...
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/sarvam/
sarvam - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/configs/funaudiochat/
funaudiochat - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/gritlm/
gritlm - vLLM
https://www.datacamp.com/hi/tutorial/llama-4-vllm
Llama 4 With vLLM: A Guide With Demo Project | DataCamp
Learn how to deploy and use Meta's LLaMA 4 Scout with vLLM on RunPod for both text completion and multimodal inference.
https://docs.vllm.ai/en/latest/api/vllm/utils/profiling/
profiling - vLLM
https://docs.vllm.ai/en/latest/api/vllm/renderers/grok2/
grok2 - vLLM
https://docs.pruna.ai/en/stable/setup/vllm.html
vLLM | Pruna documentation
https://docs.vllm.ai/en/latest/api/vllm/v1/simple_kv_offload/
simple_kv_offload - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/config_parser_base/
config_parser_base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/kv_offload/cpu/manager/
manager - vLLM
https://docs.vllm.ai/en/latest/api/vllm/beam_search/
beam_search - vLLM
https://docs.vllm.ai/en/stable/features/prompt_embeds/
Prompt Embedding Inputs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fla/ops/cumsum/
cumsum - vLLM
https://docs.vllm.ai/en/latest/getting_started/installation/gpu/
GPU - vLLM
https://docs.vllm.ai/en/stable/api/vllm/distributed/nixl_utils/
nixl_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/awq_marlin/
awq_marlin - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/configs/eagle/
eagle - vLLM
https://docs.vllm.ai/en/latest/api/vllm/utils/mem_constants/
mem_constants - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/nemotron_vl/
nemotron_vl - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/rotary_embedding/dynamic_ntk_alpha_rope/
dynamic_ntk_alpha_rope - vLLM
https://docs.vllm.ai/en/latest/deployment/docker/
Using Docker - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/bitsandbytes/
bitsandbytes - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/batched_deep_gemm_moe/
batched_deep_gemm_moe - vLLM
https://docs.vllm.ai/en/stable/api/vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler/
scheduler - vLLM
https://createaiagent.net/tools/vllm/
vLLM: High-Performance Inference Engine
Jan 22, 2026 - Explore vLLM, a GPU-optimized inference engine for self-hosted clusters, enhancing throughput and reducing latency for production workloads.
https://docs.vllm.ai/en/latest/api/vllm/v1/sample/logits_processor/state/
state - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/cutlass_moe/
cutlass_moe - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/attention_layer_base/
attention_layer_base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/kernels/linear/scaled_mm/cpu/
cpu - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/processors/kimi_k25/
kimi_k25 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/distributed/nixl_utils/
nixl_utils - vLLM
https://docs.vllm.ai/en/latest/examples/pooling/reward/
Reward - vLLM
https://docs.vllm.ai/en/stable/api/vllm/distributed/device_communicators/flashinfer_all_reduce/
flashinfer_all_reduce - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/backends/mla/flashinfer_mla/
flashinfer_mla - vLLM
https://docs.vllm.ai/en/latest/design/vllm_ir/
vLLM IR: Functional Intermediate Representation - vLLM
https://docs.vllm.ai/en/latest/features/quantization/index.html
Quantization - vLLM
https://docs.vllm.ai/en/stable/deployment/frameworks/litellm/
LiteLLM - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/experts_int8/
experts_int8 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/engine/protocol/
protocol - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/model_loader/
model_loader - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/nemotron_parse/
nemotron_parse - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/mistral_large_3_eagle/
mistral_large_3_eagle - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/mistral_eagle/
mistral_eagle - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/rotary_embedding/linear_scaling_rope/
linear_scaling_rope - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/kv_connector/
kv_connector - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/lfm2_tool_parser/
lfm2_tool_parser - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/siglip2navit/
siglip2navit - vLLM