Robuta

https://github.com/vllm-project/vllm GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for... A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm
https://vllm.ai/ vLLM vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). Deploy AI models faster with state-of-the-art...
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/processor/ processor - vLLM
https://docs.vllm.ai/en/latest/examples/speech_to_text/openai/ OpenAI - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/spec_decode/eagle/utils/ utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/distributed/kv_transfer/kv_connector/v1/nixl/ nixl - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/chat_templates/registry/ registry - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/kv_offload/cpu/policies/base/ base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/compilation/compiler_interface/ compiler_interface - vLLM
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/cli/benchmark/ benchmark - vLLM
https://docs.vllm.ai/en/v0.9.1/usage/usage_stats.html Usage Stats Collection - vLLM
https://docs.vllm.ai/en/latest/api/vllm/config/reasoning/ reasoning - vLLM
https://docs.vllm.ai/en/latest/deployment/frameworks/dstack/ dstack - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/modular_kernel/ modular_kernel - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_nvfp4/ compressed_tensors_w4a16_nvfp4 - vLLM
https://discuss.vllm.ai/t/vllm-hangs-during-worker-initialization-on-blackwell-pcie-gpus-unless-disable-custom-all-reduce-is-used/2540 vLLM hangs during worker initialization on Blackwell PCIe GPUs unless --disable-custom-all-reduce... Apr 11, 2026 - Description When deploying a large model with tensor parallelism on a multi-GPU server, vLLM hangs during worker initialization. The logs repeatedly show:...
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/prepare_finalize/no_dp_ep/ no_dp_ep - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/kernels/linear/scaled_mm/deep_gemm/ deep_gemm - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/models/ernie45_vl_moe/ ernie45_vl_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/distributed/kv_transfer/kv_connector/v1/metrics/ metrics - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/phi4mm_audio/ phi4mm_audio - vLLM
https://docs.vllm.ai/en/latest/api/vllm/multimodal/processing/inputs/ inputs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/reasoning/abs_reasoning_parsers/ abs_reasoning_parsers - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe/ trtllm_mxfp4_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/flashinfer_cutedsl_moe/ flashinfer_cutedsl_moe - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/utils/humming_utils/ humming_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/fp_quant/ fp_quant - vLLM
https://docs.vllm.ai/en/latest/usage/troubleshooting/ Troubleshooting - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/ops/triton_attention_helpers/ triton_attention_helpers - vLLM
https://docs.vllm.ai/en/stable/api/vllm/entrypoints/pooling/scoring/io_processor/ io_processor - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tokenizers/registry/ registry - vLLM
https://docs.vllm.ai/en/latest/api/vllm/renderers/base/ base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/medusa/ medusa - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/llama_tool_parser/ llama_tool_parser - vLLM
https://eesungkim.com/reviews/triton_vs_vllm_comparison/ qwen3_asr_triton vs speechLLM (vLLM): Performance & Architecture Comparison | Log Notes of research, experiments, and troubleshooting.
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_fp8/ compressed_tensors_moe_w4a8_fp8 - vLLM
https://docs.vllm.ai/en/latest/design/lora_resolver_plugins/ LoRA Resolver Plugins - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/engine/utils/ utils - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/fla/ops/fused_sigmoid_gating/ fused_sigmoid_gating - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_int8/ compressed_tensors_moe_w4a8_int8 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/utils/mistral/ mistral - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/qwen2_5_omni_thinker/ qwen2_5_omni_thinker - vLLM
https://skillnav.dev/articles/vllm-v0-to-v1-correctness-before-corrections-in-rl vLLM V0 to V1 Migration: Fix Inference Correctness First, Then Change the RL Objective | SkillNav This article documents the training-inference mismatch encountered while migrating from vLLM V0 to V1, and how backend consistency was restored by fixing logprob semantics, runtime defaults, the weight-update path, and an fp32 lm_head. The author stresses that the inference backend should be confirmed to output correct logprobs before the RL objective is adjusted.
https://docs.vllm.ai/en/latest/api/vllm/entrypoints/pooling/scoring/protocol/ protocol - vLLM
https://docs.vllm.ai/en/latest/api/vllm/multimodal/media/image/ image - vLLM
https://docs.vllm.ai/en/latest/api/vllm/logging_utils/lazy/ lazy - vLLM
https://docs.vllm.ai/en/latest/api/vllm/compilation/passes/fusion/attn_quant_fusion/ attn_quant_fusion - vLLM
https://docs.vllm.ai/en/latest/examples/tool_calling/openai_responses_client_with_tools/ OpenAI Responses Client With Tools - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/opt/ opt - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/processors/granite4_vision/ granite4_vision - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/runai_utils/ runai_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/granite4_tool_parser/ granite4_tool_parser - vLLM
https://llmkube.com/blog/vllm-swift-turboquant-m5-max vllm-swift on M5 Max: A/B'ing TurboQuant+ against the llama.cpp data - LLMKube Blog TheTom asked us to run his vllm-swift TurboQuant+ work through the same kind of sweep we did on the llama.cpp fork. 36 cells later: fp16 wins decode at every...
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/sarvam/ sarvam - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/configs/funaudiochat/ funaudiochat - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/gritlm/ gritlm - vLLM
https://www.datacamp.com/hi/tutorial/llama-4-vllm Llama 4 With vLLM: A Guide With Demo Project | DataCamp Learn how to deploy and use Meta's LLaMA 4 Scout with vLLM on RunPod for both text completion and multimodal inference.
https://docs.vllm.ai/en/latest/api/vllm/utils/profiling/ profiling - vLLM
https://docs.vllm.ai/en/latest/api/vllm/renderers/grok2/ grok2 - vLLM
https://docs.pruna.ai/en/stable/setup/vllm.html vLLM | Pruna documentation
https://docs.vllm.ai/en/latest/api/vllm/v1/simple_kv_offload/ simple_kv_offload - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/config_parser_base/ config_parser_base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/kv_offload/cpu/manager/ manager - vLLM
https://docs.vllm.ai/en/latest/api/vllm/beam_search/ beam_search - vLLM
https://docs.vllm.ai/en/stable/features/prompt_embeds/ Prompt Embedding Inputs - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fla/ops/cumsum/ cumsum - vLLM
https://docs.vllm.ai/en/latest/getting_started/installation/gpu/ GPU - vLLM
https://docs.vllm.ai/en/stable/api/vllm/distributed/nixl_utils/ nixl_utils - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/awq_marlin/ awq_marlin - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/configs/eagle/ eagle - vLLM
https://docs.vllm.ai/en/latest/api/vllm/utils/mem_constants/ mem_constants - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/nemotron_vl/ nemotron_vl - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/rotary_embedding/dynamic_ntk_alpha_rope/ dynamic_ntk_alpha_rope - vLLM
https://docs.vllm.ai/en/latest/deployment/docker/ Using Docker - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/bitsandbytes/ bitsandbytes - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/batched_deep_gemm_moe/ batched_deep_gemm_moe - vLLM
https://docs.vllm.ai/en/stable/api/vllm/distributed/kv_transfer/kv_connector/v1/nixl/scheduler/ scheduler - vLLM
https://createaiagent.net/tools/vllm/ vLLM: High-Performance Inference Engine Jan 22, 2026 - Explore vLLM, a GPU-optimized inference engine for self-hosted clusters, enhancing throughput and reducing latency for production workloads.
https://docs.vllm.ai/en/latest/api/vllm/v1/sample/logits_processor/state/ state - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fused_moe/experts/cutlass_moe/ cutlass_moe - vLLM
https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/attention_layer_base/ attention_layer_base - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/kernels/linear/scaled_mm/cpu/ cpu - vLLM
https://docs.vllm.ai/en/latest/api/vllm/transformers_utils/processors/kimi_k25/ kimi_k25 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/distributed/nixl_utils/ nixl_utils - vLLM
https://docs.vllm.ai/en/latest/examples/pooling/reward/ Reward - vLLM
https://docs.vllm.ai/en/stable/api/vllm/distributed/device_communicators/flashinfer_all_reduce/ flashinfer_all_reduce - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/attention/backends/mla/flashinfer_mla/ flashinfer_mla - vLLM
https://docs.vllm.ai/en/latest/design/vllm_ir/ vLLM IR: Functional Intermediate Representation - vLLM
https://docs.vllm.ai/en/latest/features/quantization/index.html Quantization - vLLM
https://docs.vllm.ai/en/stable/deployment/frameworks/litellm/ LiteLLM - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/experts_int8/ experts_int8 - vLLM
https://docs.vllm.ai/en/latest/api/vllm/engine/protocol/ protocol - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/model_loader/ model_loader - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/nemotron_parse/ nemotron_parse - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/mistral_large_3_eagle/ mistral_large_3_eagle - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/mistral_eagle/ mistral_eagle - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/rotary_embedding/linear_scaling_rope/ linear_scaling_rope - vLLM
https://docs.vllm.ai/en/latest/api/vllm/v1/worker/gpu/kv_connector/ kv_connector - vLLM
https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/lfm2_tool_parser/ lfm2_tool_parser - vLLM
https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/siglip2navit/ siglip2navit - vLLM