https://www.infoq.com/podcasts/tiger-teams-evals-agents/
Tiger Teams, Evals and Agents: The New AI Engineering Playbook - InfoQ
https://developers.openai.com/api/docs/guides/evals
Working with evals | OpenAI API
Learn how to test and improve AI model outputs through evaluations.
https://github.com/Margin-Lab/evals
GitHub - Margin-Lab/evals: Fast, robust, configurable agent evals
Fast, robust, configurable agent evals. Contribute to Margin-Lab/evals development by creating an account on GitHub.
https://developers.openai.com/learn/evals
Evals | OpenAI Developers
https://docs.docker.com/ai/docker-agent/evals/
Evals | Docker Docs
Mar 10, 2026 - Test your agents with saved conversations
https://developers.openai.com/cookbook/topic/evals
Evals • Cookbook
Improve your LLM integrations with evals.
https://app.evals.net/login
EVALS
https://pydantic.dev/docs/ai/evals/evals/
Pydantic Evals
https://deepmind.google/research/evals/
Evals — Google DeepMind
https://lovable.dev/careers/engineer-agents-and-evals-9f4963
Engineer - Agents & Evals - Lovable Careers
https://www.ycombinator.com/companies/respan
Respan: Self-driving observability, evals, and gateway for AI agents | Y Combinator
Self-driving observability, evals, and gateway for AI agents. Founded in 2023 by Raymond Huang and Andy Li, Respan has 10 employees based in San Francisco, CA,...
https://towardsdatascience.com/tds-newsletter-how-to-design-evals-metrics-and-kpis-that-work/
TDS Newsletter: How to Design Evals, Metrics, and KPIs That Work | Towards Data Science
Dec 6, 2025 - On the challenges of producing reliable insights and avoiding common mistakes
https://www.langchain.com/langsmith/evaluation
LangSmith - LLM & AI Agent Evals Platform: Continuously improve agents
https://humanloop.com/docs/v5/getting-started/overview
Humanloop is the LLM Evals Platform for Enterprises | Humanloop Docs
Learn how to use Humanloop for prompt engineering, evaluation and monitoring. Comprehensive guides and tutorials for LLMOps.
https://arxiv.org/abs/2411.00640
[2411.00640] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Abstract page for arXiv paper 2411.00640: Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
https://www.langchain.com/langsmith-platform
LangSmith: AI Agent & LLM Observability and Evals Platform
LangSmith is the complete framework agnostic AI agent and LLM observability, evaluation, and deployment platform.
https://exa.ai/evals
Evals at Exa | Search Quality Benchmarks & Evaluation
How Exa measures and maintains state-of-the-art search quality for LLMs through rigorous evaluation and benchmarking.