Robuta

https://www.together.ai/blog/cache-aware-disaggregated-inference
Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40%...

https://haoailab.com/blogs/distserve/
Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode...
Mar 17, 2024 - TL;DR: LLM apps today have diverse latency requirements. For example, a chatbot may require a fast initial response (e.g., under 0.2 seconds) but moderate...

https://vllm.ai/blog/vllm
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog
LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even...

https://hashnode.com/posts/optimizing-llm-serving-vllm-nvlink/69d8b3ae075944a59151beac
Discussion on "Optimizing LLM Serving: The Engineering Truth of vLLM & NVLink" | Hashnode

https://huggingface.co/papers/2405.19888
Paper page - Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
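The Together AI and DistServe posts above both make the same core argument for prefill–decode disaggregation: when one GPU interleaves compute-heavy prompt prefills with latency-sensitive token decoding, a long incoming prompt stalls the token streams of in-flight requests. A toy timing sketch of that effect (all constants, names, and the single-worker model are illustrative assumptions, not taken from either post):

```python
# Toy comparison of per-token latency with colocated vs. disaggregated
# prefill and decode. Timings are made-up illustrative constants.

def colocated(decode_step, n_tokens, prefill_cost, prefill_arrives_at_token):
    """One shared GPU: a newly arriving long-prompt prefill preempts
    in-flight decoding, so the output-token stream stalls mid-response."""
    clock, token_times = 0.0, []
    for t in range(n_tokens):
        if t == prefill_arrives_at_token:
            clock += prefill_cost  # decode waits out the long prefill
        clock += decode_step
        token_times.append(round(clock, 2))
    return token_times


def disaggregated(decode_step, n_tokens):
    """Dedicated decode GPU: prefills land on separate workers, so
    inter-token latency stays flat regardless of arriving prompts."""
    return [round((t + 1) * decode_step, 2) for t in range(n_tokens)]


if __name__ == "__main__":
    # 20 ms/token decode; a 4 s long-context prefill arrives at token 2.
    print("colocated    :", colocated(0.02, 5, 4.0, 2))
    print("disaggregated:", disaggregated(0.02, 5))
```

With the colocated worker, tokens 3–5 arrive roughly four seconds late; on the dedicated decode worker they keep their steady 20 ms cadence, which is the "goodput" framing the DistServe post argues for.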