https://github.com/confident-ai/deepeval
GitHub - confident-ai/deepeval: The LLM Evaluation Framework
The LLM Evaluation Framework. Contribute to confident-ai/deepeval development by creating an account on GitHub.
https://www.evidentlyai.com/courses
Evidently AI - LLM Evaluation and AI Observability Courses
Learn about LLM evaluation and AI observability with our free hands-on courses.
https://deepeval.com/docs/metrics-hallucination
Hallucination | DeepEval by Confident AI - The LLM Evaluation Framework
The hallucination metric uses LLM-as-a-judge to determine whether your LLM generates factually correct information by comparing the actual_output to the…
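As a minimal sketch of the metric described above, assuming DeepEval's documented `HallucinationMetric` and `LLMTestCase` API (the sample input, output, and threshold are illustrative only):

```python
# Minimal sketch: checking an output against its source context with DeepEval's
# hallucination metric (LLM-as-a-judge). Assumes an evaluation model is
# configured, e.g. OPENAI_API_KEY in the environment.
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

context = ["The Eiffel Tower is located in Paris, France, and opened in 1889."]

test_case = LLMTestCase(
    input="Where is the Eiffel Tower and when did it open?",
    actual_output="The Eiffel Tower is in Paris and opened in 1889.",
    context=context,  # the source text the actual_output is judged against
)

metric = HallucinationMetric(threshold=0.5)  # lower score = fewer contradictions
metric.measure(test_case)
print(metric.score, metric.reason)
```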
https://www.evidentlyai.com/llm-evaluation-course-practice
Evidently AI - LLM evaluation for builders: applied course
Free video course with 10 hands-on code tutorials. From designing custom LLM judges to RAG evaluations and adversarial testing. Sign up to save your seat.
https://deepchecks.com/
Deepchecks LLM Evaluation | Evaluate AI Progress with Know Your Agent | Deepchecks
Apr 20, 2026 - Deepchecks LLM Evaluation is an enterprise-grade AI testing, observability and monitoring platform that provides visibility, control, and trust across AI...
https://www.evidentlyai.com/llm-testing
LLM Evaluation and Testing Platform | Evidently AI
Catch hallucinations, safety risks, and quality issues in LLM products before they impact users. Automate, customize, and track AI testing at scale.
https://openfabric.ai/blog/llm-evaluation-methodologies-a-deep-dive-into-llm-evals
LLM Evaluation methodologies: A Deep Dive into LLM Evals
LLM evals are important for the long-term continuity and improvement of LLMs. Read this article for a deeper look into LLM evaluation methodologies.
https://langwatch.ai/
LangWatch: AI Agent Testing and LLM Evaluation Platform
LangWatch is an AI agent testing, LLM evaluation, and LLM observability platform. Test agents with simulated users, prevent regressions, and debug issues.
https://deepeval.com/
DeepEval by Confident AI - The LLM Evaluation Framework
DeepEval is the open-source LLM evaluation framework for testing and benchmarking LLM applications — 50+ plug-and-play metrics for AI agents, RAG, chatbots,...
https://galileo.ai/blog/metrics-first-approach-to-llm-evaluation
A Metrics-First Approach to LLM Evaluation
Aug 12, 2025 - Learn about different types of LLM evaluation metrics
https://www.evidentlyai.com/
Evidently AI - AI Evaluation & LLM Observability Platform
Ensure your AI is production-ready. Test LLMs and monitor performance across AI applications, RAG systems, and multi-agent workflows. Built on open-source.
https://www.getplum.ai/about/
Plum AI - LLM Quality Evaluation & Improvement
Plum AI is a tool that evaluates and improves the quality of large-language model applications
https://app.ragmetrics.ai/
RagMetrics | LLM Application Evaluation
RagMetrics helps LLM builders prove ROI and optimize performance through tailored, scalable and automated evaluations
https://arxiv.org/abs/2412.01778
[2412.01778] HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing
Abstract page for arXiv paper 2412.01778: HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing
https://deepeval.com/docs/metrics-introduction
Introduction to LLM Metrics | DeepEval by Confident AI - The LLM Evaluation Framework
deepeval offers 50+ SOTA, ready-to-use metrics for you to quickly get started with. Essentially, while a test case represents the thing you're trying to…
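To make the test-case/metric pairing mentioned above concrete, here is a minimal sketch using one of DeepEval's ready-made metrics; `AnswerRelevancyMetric`, the example data, and the threshold are illustrative choices, not prescribed by the docs:

```python
# Minimal sketch: a DeepEval test case paired with a plug-and-play metric and
# run through evaluate(). Assumes an evaluation model is configured (e.g.
# OPENAI_API_KEY); the example input/output and threshold are illustrative.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="Standard orders ship within 3-5 business days.",
)

# evaluate() runs every metric against every test case and reports pass/fail.
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```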
https://www.getplum.ai/privacy-policy/
Plum AI - LLM Quality Evaluation & Improvement
Plum AI is a tool that evaluates and improves the quality of large-language model applications
https://www.evidentlyai.com/llm-evaluation-benchmarks-datasets
Evidently AI - 250 LLM benchmarks and evaluation datasets
How can you evaluate different LLMs? We put together a database of 250 LLM benchmarks and publicly available datasets to evaluate the performance of LLMs.
https://langfuse.com/docs/evaluation/overview
Evaluation of LLM Applications - Langfuse
With Langfuse you can capture all your LLM evaluations in one place. You can combine a variety of different evaluation metrics like model-based evaluations...
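As a minimal sketch of capturing an evaluation result in Langfuse, assuming the v2-style low-level Python SDK (`langfuse.trace()` / `langfuse.score()`); method names may differ in newer SDK versions, and the trace contents and score name are illustrative:

```python
# Minimal sketch: attaching a custom evaluation score to a Langfuse trace.
# Uses the v2-style low-level SDK; credentials are read from the
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables.
from langfuse import Langfuse

langfuse = Langfuse()

# Record a trace for one LLM interaction (inputs/outputs are illustrative).
trace = langfuse.trace(
    name="support-question",
    input="How do I reset my password?",
    output="Go to Settings > Security and click 'Reset password'.",
)

# Attach a model-based or human evaluation result as a score on that trace.
langfuse.score(
    trace_id=trace.id,
    name="answer_correctness",  # hypothetical metric name
    value=0.9,
    comment="Matches the documented reset flow.",
)

langfuse.flush()  # make sure queued events are sent before the script exits
```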
https://agenta.ai/
Agenta - Prompt Management, Evaluation, and Observability for LLM apps
Agenta is an open-source platform for building robust LLM applications. It provides tools for prompt engineering, evaluation, debugging, and monitoring of...
https://arize.com/
LLM Observability & Evaluation Platform
Unified LLM Observability and Agent Evaluation Platform for AI Applications—from development to production.
https://github.com/Giskard-AI/giskard-oss?locale=en-US
GitHub - Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents
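A minimal sketch of what testing with Giskard OSS looks like, assuming its wrap-and-scan workflow (`giskard.Model`, `giskard.Dataset`, `giskard.scan`); the predict function, column name, and sample question are hypothetical, and exact parameters may vary by version:

```python
# Minimal sketch: scanning a text-generation app with Giskard OSS for issues
# such as hallucination and prompt injection. The scan itself typically needs
# an LLM client configured (e.g. OPENAI_API_KEY) to generate probes.
import pandas as pd
import giskard

def predict(df: pd.DataFrame) -> list[str]:
    # Call your own LLM or agent here; return one answer per input row.
    return [f"Answer to: {q}" for q in df["question"]]

model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Support bot",
    description="Answers customer support questions.",
    feature_names=["question"],
)
dataset = giskard.Dataset(pd.DataFrame({"question": ["How do I reset my password?"]}))

report = giskard.scan(model, dataset)
report.to_html("scan_report.html")  # save the findings for review
```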
https://www.getplum.ai/
Plum AI - LLM Quality Evaluation & Improvement
Plum AI is a tool that evaluates and improves the quality of large-language model applications
https://model.aibase.com/llm
Domestic and Foreign AI Large Model Price Comparison and Evaluation Platform
A professional large-model price comparison and evaluation platform covering OpenAI GPT-4, Claude, Wenxin Yiyan, Tongyi Qianwen, and 100+ other domestic and foreign...
https://towardsdatascience.com/beware-of-unreliable-data-in-model-evaluation-a-llm-prompt-selection-case-study-with-flan-t5-88cfd469d058/
Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5 |...
Jan 8, 2025 - You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data