https://www.together.ai/blog/cache-aware-disaggregated-inference
Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver up to 40% faster long-context LLM serving.
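A minimal sketch of what cache-aware routing can look like, assuming a prefix-hash index over workers (all names and the routing policy here are hypothetical, not Together AI's CPD implementation): warm requests, whose prompt prefix already has a KV cache on some worker, are sent to that worker, while cold requests go to dedicated prefill workers.

```python
import hashlib

# Hypothetical cache-aware router in the spirit of prefill-decode
# disaggregation; not Together AI's actual CPD implementation.

PREFILL_WORKERS = ["prefill-0", "prefill-1"]  # cold traffic: compute-bound prefill

# Maps a hash of a prompt's prefix to the worker already holding its KV cache.
kv_cache_index: dict[str, str] = {}

def prefix_hash(prompt: str, prefix_chars: int = 1024) -> str:
    """Hash a fixed-size prompt prefix as a stand-in for tokenized cache blocks."""
    return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

def route(prompt: str) -> str:
    """Warm requests reuse an existing KV cache; cold ones get a prefill worker."""
    key = prefix_hash(prompt)
    if key in kv_cache_index:
        return kv_cache_index[key]                     # warm: skip redundant prefill
    worker = PREFILL_WORKERS[int(key, 16) % len(PREFILL_WORKERS)]
    kv_cache_index[key] = worker                       # record where the cache lands
    return worker
```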
https://haoailab.com/blogs/distserve/
Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation
Mar 17, 2024 - TL;DR: LLM apps today have diverse latency requirements. For example, a chatbot may require a fast initial response (e.g., under 0.2 seconds) but moderate decoding speed for the rest of the reply.
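The post's central distinction can be made concrete: throughput counts every completed request, while goodput counts only requests that meet both phase-level latency SLOs, a fast time-to-first-token (TTFT) for prefill and a bounded time-per-output-token (TPOT) for decode. A toy calculation; the 0.2 s TTFT target comes from the snippet above, while the TPOT target is an assumed value for illustration, not DistServe's number:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft: float   # time to first token, seconds (prefill latency)
    tpot: float   # time per output token, seconds (decode latency)

TTFT_SLO = 0.2    # fast first response, per the snippet above
TPOT_SLO = 0.1    # assumed decode SLO for illustration

def throughput(requests: list[Request], duration_s: float) -> float:
    """Completed requests per second, regardless of latency."""
    return len(requests) / duration_s

def goodput(requests: list[Request], duration_s: float) -> float:
    """Requests per second that meet BOTH latency SLOs."""
    ok = sum(1 for r in requests if r.ttft <= TTFT_SLO and r.tpot <= TPOT_SLO)
    return ok / duration_s
```

Disaggregating prefill and decode onto separate workers lets each phase be provisioned for its own SLO instead of the two interfering in one batch.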
https://vllm.ai/blog/vllm
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog
LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware.
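The key mechanism behind vLLM's speedups is PagedAttention: each sequence's KV cache is split into fixed-size blocks that need not be contiguous, with a block table mapping logical block numbers to physical ones, much like virtual-memory paging. A simplified sketch of that bookkeeping (a toy version, not vLLM's actual data structures):

```python
BLOCK_SIZE = 16   # tokens per KV-cache block (16 is vLLM's common default)

class ToyPagedKVCache:
    """Simplified block-table bookkeeping in the spirit of PagedAttention."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its physical block."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:
            # Last block is full (or this is the first token): allocate a new,
            # not-necessarily-contiguous block. Raises IndexError when the pool
            # is exhausted; a real engine would preempt or swap instead.
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Allocating on demand wastes at most one partially filled block per sequence, which is where the memory savings over contiguous pre-allocation come from.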
https://hashnode.com/posts/optimizing-llm-serving-vllm-nvlink/69d8b3ae075944a59151beac
Discussion on "Optimizing LLM Serving: The Engineering Truth of vLLM & NVLink" | Hashnode
https://huggingface.co/papers/2405.19888
Paper page - Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Parrot introduces Semantic Variable, an abstraction that annotates the inputs and outputs of LLM requests so the serving system can see application-level dataflow across requests and optimize them end to end.
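As a rough illustration of the idea in the title (hypothetical types, not Parrot's API): a Semantic Variable names an input or output of an LLM request, so the server can see that one request's output feeds another and schedule the pipeline jointly rather than treating each request in isolation.

```python
# Hypothetical illustration of the Semantic-Variable idea; not Parrot's API.
from dataclasses import dataclass, field

@dataclass
class SemanticVariable:
    name: str
    value: str | None = None   # filled in when the producing request finishes

@dataclass
class LLMRequest:
    prompt_template: str                                  # e.g. "Summarize: {article}"
    inputs: dict[str, SemanticVariable] = field(default_factory=dict)
    output: SemanticVariable | None = None

# Two chained requests: the summary produced by step 1 feeds step 2.
article = SemanticVariable("article", value="...")        # source text elided
summary = SemanticVariable("summary")

step1 = LLMRequest("Summarize: {article}", inputs={"article": article}, output=summary)
step2 = LLMRequest("Translate to French: {summary}", inputs={"summary": summary})

# Because the step1 -> step2 dependency is visible to the server, it could
# co-locate the two requests and stream step1's output straight into step2.
```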