https://github.com/ggml-org/llama.cpp
GitHub - ggml-org/llama.cpp: LLM inference in C/C++ · GitHub
LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub.
https://llama-cpp.com/
Llama.cpp - Run LLM Inference in C/C++
Mar 19, 2026 - Llama.cpp (LLaMA C++) allows you to run efficient Large Language Model Inference in pure C/C++. Download llama.cpp for Windows, Linux and Mac.
https://epoch.ai/data-insights/llm-inference-price-trends
LLM inference prices have fallen rapidly but unequally across tasks | Epoch AI
Epoch AI is a research institute investigating key trends and questions that will shape the trajectory and governance of Artificial Intelligence.
https://arxiv.org/abs/2308.04623
[2308.04623] Accelerating LLM Inference with Staged Speculative Decoding
Abstract page for arXiv paper 2308.04623: Accelerating LLM Inference with Staged Speculative Decoding
https://jetstream-cloud.org/news-events/news/4-11-25_llm-inference-service.html
Jetstream2 launches large language model (LLM) inference service
Jetstream2 recently unveiled a Large Language Model (LLM) inference service tailored to Jetstream2 users.
https://www.redhat.com/en/blog/evaluating-llm-inference-performance-red-hat-openshift-ai
Evaluating LLM inference performance on Red Hat OpenShift AI
This article introduces the methodology and results of performance testing the Llama-2 models deployed on the model serving stack included with Red Hat...
https://devtune.ai/verticals/llm-inference-serverless-gpu/modal-labs/alternatives
Modal Labs Alternatives in LLM Inference & Serverless GPU | DevTune
https://www.vllm.ch/
vLLM Experts Switzerland – LLM Inference Hosting | VSHN
Your vLLM experts in Switzerland. We deploy, scale, and operate high-throughput LLM inference on Kubernetes across APPUiO, OpenShift, and private cloud setups.
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Defeating Nondeterminism in LLM Inference - Thinking Machines Lab
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. For example,...
https://pyshine.com/InferSim/
InferSim: LLM Inference Simulation by Alibaba | PyShine
Apr 27, 2026 - Learn how InferSim by Alibaba performs simulation. This guide covers installation, architecture, and real-world applications for VLSI design.
https://forum.lazarus.freepascal.org/index.php?topic=72801.15
PasLLM - LLM Inference Engine in Pure Pascal
https://ranjankumar.in/choosing-the-right-llm-inference-framework-a-practical-guide
Choosing the Right LLM Inference Framework: A Practical Guide | Ranjan Kumar
Dec 24, 2025 - *Performance benchmarks, cost analysis, and decision framework for developers worldwide* Here's something nobody tells you about "open source" AI: the model ...
https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
LLM Inference guide for Android | Google AI Edge | Google AI for Developers
https://ar5iv.labs.arxiv.org/html/2412.15803
[2412.15803] WebLLM: A High-Performance In-Browser LLM Inference Engine
Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and...
https://deeplearn.org/arxiv/744958/power-softmax:-towards-secure-llm-inference-over-encrypted-data
Power-Softmax: Towards Secure LLM Inference over Encrypted Data - Paper Detail
Things happening in deep learning: arxiv, twitter, reddit
https://www.kog.ai/
Kog | 30x Faster LLM Inference
Sequential generation is the bottleneck. Kog couples a low-latency engine with parallel architecture to deliver 30x faster LLM inference. Request API access.
https://arxiv.org/abs/2505.09598
[2505.09598] How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
Abstract page for arXiv paper 2505.09598: How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
https://budecosystem.alwaysdata.net/reducing-llm-operational-costs-through-hybrid-inference-with-slms-on-intel-cpus-and-cloud-llms/
Reducing LLM Ops Costs through Hybrid Inference with SLMs on Intel CPUs and Cloud LLMs –...
Despite the transformative potential of generative AI, its adoption in enterprises is lagging significantly. One major reason for this slow uptake is that many...
https://www.storagereview.com/review/pliops-xdp-lightningai-supercharges-kv-cache-to-optimize-llm-inference-with-nvidia-dynamo
Pliops XDP LightningAI Supercharges KV Cache to Optimize LLM Inference with NVIDIA Dynamo -...
May 21, 2025 - Pliops XDP LightningAI boosts LLM inference by offloading KV cache, enabling faster, scalable AI with NVIDIA Dynamo integration
https://friendli.ai/models/singtan/solvrays-llm-pdf
singtan/solvrays-llm-pdf - Fast, Reliable, and Scalable Inference on FriendliAI
Run singtan/solvrays-llm-pdf with fast, reliable, and scalable inference on FriendliAI. Get low-latency performance with advanced quantization (FP4, FP8, INT4,...
https://www.spheron.network/blog/batch-llm-inference-gpu-cloud/
Batch LLM Inference on GPU Cloud: Offline Processing Pipelines for 10x Lower Cost vs Real-Time...
Batch LLM inference cuts costs 5-10x vs real-time serving for document summarization, classification, and embedding workloads. This guide covers queuing...
https://papers.nips.cc/paper_files/paper/2025/hash/0907335ecf28faf15be54485dbcbe70e-Abstract-Conference.html
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in...
https://joshuaberkowitz.us/blog/news-1/speculative-cascades-the-hybrid-solution-driving-smarter-faster-llm-inference-1092
Speculative Cascades: The Hybrid Solution Driving Smarter, Faster LLM Inference | Joshua Berkowitz
Unlocking LLM Performance: Why Efficiency Matters
https://arxiv.org/abs/2507.08523
[2507.08523] InferLog: Accelerating LLM Inference for Online Log Parsing via ICL-oriented Prefix...
Abstract page for arXiv paper 2507.08523: InferLog: Accelerating LLM Inference for Online Log Parsing via ICL-oriented Prefix Caching
https://www.eletimes.ai/gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-less-than-in-2025
Gartner Predicts That by 2030, Performing Inference on an LLM With 1 Trillion Parameters Will Cost...
By 2030, performing inference on a large language model (LLM) with one trillion parameters will cost GenAI providers over 90% less than it did in 2025,...
https://www.dbasolved.com/2026/01/llm-inference-performance-tuning-guide-for-dbas/
Understanding LLM Inference: A DBA's Guide to Performance Tuning AI Models - DBASolved
Jan 25, 2026 - Learn how LLM inference works and apply your database tuning expertise to optimize AI model performance and reduce costs.
https://www.blocksandfiles.com/flash/2026/02/16/sk-hynix-proposes-hbm-and-hbf-hybrid-for-llm-inference/4091326
SK Hynix proposes HBM and HBF hybrid for LLM inference
https://llmkube.com/about
About LLMKube - Kubernetes Operator for Self-Hosted LLM Inference
Learn about LLMKube, the open source Kubernetes operator for deploying and managing local LLM workloads. Apache 2.0 licensed, community-driven,...
https://www.blog.brightcoding.dev/2025/11/24/the-dataframe-revolution-how-fenic-is-transforming-llm-inference-for-production-ai/
The DataFrame Revolution: How Fenic is Transforming LLM Inference for Production AI - BrightCoding
Why 90% of LLM Projects Fail at Scale (And How DataFrame Frameworks Are Changing Everything) Large Language Models are powerful but deploying them in...
https://www.roofline.ai/news/dynamic-shape-support-a-key-enabler-for-on-device-llm-inference
Dynamic shape support: A key enabler for on-device LLM inference
https://pytorch.org/blog/unleashing-ai-mobile/
Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch...
https://inference.net/
Inference.net | Full-Stack LLM Lifecycle Platform
Train, deploy, observe, and evaluate LLMs from a single platform. Lower cost, faster latency, and dedicated support from Inference.net.
https://developers.llamaindex.ai/python/framework/integrations/llm/heroku/
Heroku LLM Managed Inference | Developer Documentation
https://developers.googleblog.com/supercharging-llm-inference-on-google-tpus-achieving-3x-speedups-with-diffusion-style-speculative-decoding/
Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative...
Researchers at UCSD have achieved a breakthrough in AI serving efficiency by integrating DFlash, a block-diffusion speculative decoding framework, into the...
https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.4/html/deploy_models_using_distributed_inference_with_llm-d/index
Deploy models using Distributed Inference with llm-d | Red Hat OpenShift AI Self-Managed | 3.4 |...
Deploy models using Distributed Inference with llm-d | Red Hat OpenShift AI Self-Managed | 3.4 | Red Hat Documentation
https://arxiv.org/abs/2404.15420v3
[2404.15420v3] XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
Abstract page for arXiv paper 2404.15420v3: XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
https://mvpfactory.io/blog/webgpu-compute-shaders-for-on-device-llm-inference-in-android-webviews-the-gpu/
WebGPU Compute Shaders for On-Device LLM Inference in Android WebViews: The GPU Pipeline That...
Apr 24, 2026 - Using WebGPU compute shaders via Android WebView to run quantized LLM matrix multiplications on mobile GPUs, bypassing NNAPI's operator coverage gaps and vendor
https://arxiv.org/abs/2507.05228
[2507.05228] Cascade: Token-Sharded Private LLM Inference
Abstract page for arXiv paper 2507.05228: Cascade: Token-Sharded Private LLM Inference