https://www.ycombinator.com/companies/respan
Respan: Self-driving observability, evals, and gateway for AI agents | Y Combinator
Self-driving observability, evals, and gateway for AI agents. Founded in 2023 by Raymond Huang and Andy Li, Respan has 10 employees based in San Francisco, CA,...
https://budecosystem.alwaysdata.net/from-genai-pilot-to-production-best-practices-and-evals-that-matter/
From GenAI Pilot to Production: Best Practices and Evals That Matter – BudEcosystem
Many GenAI initiatives shine in the pilot phase but struggle when scaled to production. A common reason is that teams often focus narrowly on metrics like...
https://ai-in-the-am.com/episodes/cheap-search-gpt-55-evals-ai-takeoff-and-analog-inference/
Episode 2026-04-24: Cheap Search, GPT-5.5 Evals, AI Takeoff and Analog Inference | AI:AM
A morning briefing on cheaper agent retrieval, GPT-5.5 benchmark behavior, takeoff forecasts, and energy-efficient AI hardware.
https://app.evals.net/login
EVALS
https://aligneval.com/
AlignEval: Making Evals Easy, Fun, and Semi-Automated
A prototype tool/game to help you look at your data, label it, evaluate output, and optimize evaluators.
https://humanloop.com/home
Humanloop: LLM evals platform for enterprises
Humanloop is an enterprise-grade AI evaluation platform with best-in-class prompt management and LLM observability.
https://www.wix.engineering/blog/tags/ai-agents-evals
AI Agents Evals | Wix Engineering
https://www.psglearning.com/blog/videos/2021/02/03/fisdap-student-tutorial-entering-skills-on-a-lab-shift
Fisdap Student Tutorial: How do evals work?
You can fill out many evals (short for "evaluation") for each of your shifts. Fisdap provides evaluation forms for team lead, preceptors, and sites. Evals...
https://alexcarlin.bearblog.dev/evals-for-structure-prediction-models/
Evals for structure-prediction models | Alex Carlin
Evaluating discriminative models is relatively straightforward. In contrast, evaluating generative models is difficult. We can't just hold out a test set and...
https://clickgems.clickhouse.com/dashboard/rogerdpack-remembered_evals
rogerdpack-remembered_evals RubyGem - Download Analytics, Stats & Trends | ClickGems
Comprehensive analytics for rogerdpack-remembered_evals RubyGem. By Roger Pack. library to save away eval'ed code to a file first, so that it can be... View...
https://evals.agentsteer.ai/runs/eval-v151-gptoss120b/3072
AgentSteer Evals
Evaluation results for AgentSteer security monitor
https://www.distributedthoughts.org/2025-10-06-what-are-we-measuring/
What the F*ck Are We Even Measuring? The Definition Problem in AI Evals
Jan 23, 2026 - A critical examination of how the AI industry's obsession with benchmarks and evals has created a measurement validity crisis - we're optimizing for test...
https://axiom.co/changelog/offline-evals-alerting
Evals for AI engineering
Offline evaluations for AI engineering and better alerting for data availability.
https://www.wrightslaw.com/nltr/05/nl.1121.htm
Special Ed Advocate: Independent Educational Evals; Trusting the System; Free Boot Camp
Super Savings from Wrightslaw - Summer Sale from July 30-August 15, 2002
https://jobs.thrivecap.com/companies/openai/jobs/57627393-research-engineer-frontier-evals-environments-finance
Research Engineer, Frontier Evals & Environments - Finance @ OpenAI | Thrive Capital Job Board
Search job openings across the Thrive Capital network.
https://logic.inc/resources/best-tools-multi-llm-applications
Multi-LLM Tools for Production: Routing, Evals, and Failover in 2026 | Logic
May 6, 2026 - Routing across providers is the easy part. Keeping agents reliable when models drift, providers go down, and schemas shift is the hard part. Logic, StackAI,...
https://arize.com/docs/ax/evaluate/run-evals-on-experiments
Run offline evals on experiments - Arize AX Docs
Run offline evals on datasets and experiments before you ship. Ideal for CI/CD and regression checks.
https://brainstation.io/workshops/ai-evals/new-york
AI Evals Workshop NYC | BrainStation®
Learn essential AI eval skills with this expert-led workshop. Apply structured evaluation techniques to improve AI performance and reliability.
https://evals.agentsteer.ai/runs/eval-v151-haiku-full/19304
AgentSteer Evals
Evaluation results for AgentSteer security monitor
https://ednotesonline.blogspot.com/2015/04/what-about-ratings-of-principals.html
Ed Notes Online: What about ratings of principals? Chalkbeat gets it wrong on teacher evals in low...
Ed Notes defends public education and promotes democratic teacher unionism with a focus on the UFT.
https://www.plurai.ai/?ref=devtoolsacademy.com
AI Agent Trust Platform | Simulation, Evals & Guardrails
Production-ready AI agents with simulation, evaluation, and protection. Trusted by Microsoft, Google, NVIDIA. 15x edge-case coverage, 7x faster deployment.
https://cvfolder.com/cv-page.php?id=3160
George Knox: EDU401 Student Evals
CVFolder is a web application to help users create online portfolios and share with others
https://www.arthur.ai/solution/engine-evaluation
Arthur Evals Engine
The Arthur Evaluation Engine is a free, open-source toolkit for evaluating AI models.
https://www.productmanagercourses.com/courses/category/ai-product-management/tag/ai-evals
Best AI Evals AI Product Management Courses for Product Managers (2026) | PMC - Product Manager...
Browse 8 ai evals courses in ai product management, plus related articles and instructors. Compare providers and formats on PMC.
https://realevals.xyz/
REAL Evals - Realistic Evaluations for Agents Leaderboard
REAL Evals offers realistic evaluations for agents on complex, modern websites. Evaluate AI systems on tasks mirroring real-world web usage.
https://evals.agentsteer.ai/runs/eval-v151-gptoss120b/3100
AgentSteer Evals
Evaluation results for AgentSteer security monitor
https://www.navywriter.com/aviation-program-team.htm
Aviation Program Team Evals
Aviation Program Team Eval Examples
https://beyondmarketintelligence.com/post/ai-evals-are-becoming-the-new-compute-bottleneck-cmom5h6ox00hbjfqbcpdcn1qc
AI Evals Are Becoming the New Compute Bottleneck | Beyond Market Intelligence
As AI technology continues to evolve, the demand for efficient evaluation processes is becoming increasingly critical. In the insightful post by user...
https://camplineman.com/
Home - Camp Lineman - Offensive Lineman and Defensive Lineman Camps, News, Training and Evals
https://www.braintrust.dev/blog/collaborative-evals-loop
Evals are a team sport: How we built Loop - Blog - Braintrust
How we debugged Loop's prompt optimization workflow by combining manual review, Loop analysis, and cross-functional collaboration.
https://workshops.de/seminare-schulungen-kurse/ki-dev-modul-2?event_id=1418
KI Software Engineer: Modul 2 - Evals, Multi-Agentic-Workflows Intensiv-Schulung | workshops.de
https://evals.agentsteer.ai/
AgentSteer Evals
Evaluation results for AgentSteer security monitor
https://forum.navyadvancement.com/topic/10103-cflacfls-5-feb-25/
CFL/ACFLs - 5 FEB 25 - Navy Evals, Awards, PRT, Uniform & Grooming - Navy Forum for Enlisted,...
CFL/ACFLs - Navy Noom Weight-Loss Program - From 1 Feb 25 to 31 Jan 26, Navy will offer access to the commercial weight-loss program Noom for a one-year...
https://promptbuilder.cc/blog/prompt-testing-versioning-ci-cd-2025
Prompt Testing in CI/CD (2025): Versioning, Evals + Regression Suites | Prompt Builder
Dec 6, 2025 - A practical guide to prompt testing in CI/CD: semantic versioning, automated evals, A/B tests, and safe rollbacks.
https://www.lesswrong.com/posts/tJEhqyDc8qRmeauDn/blind-deep-deployment-evals-for-control-and-sabotage
Blind deep-deployment evals for control & sabotage — LessWrong
Thanks to Ezra Newman for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect…
https://developers.openai.com/cookbook/examples/evaluation/getting_started_with_openai_evals
Getting Started with OpenAI Evals
**Note: OpenAI now has a hosted evals product with an API! We recommend you use this instead. See Evals** The OpenAI Evals framework consis
https://www.plurai.ai/pricing
Pricing - Plurai Evals, Guardrails & Simulation
Compare Plurai's AI evaluation and guardrails pricing. Start free with 1M tokens. SLMs at $0.15/1K tokens—20% cheaper and more accurate than GPT-4.
https://opper.ai/observability-and-evaluations
AI Observability, LLM Evals & Tracing | Opper AI
Tracing, LLM-as-a-judge scoring, custom evals, and guardrails for every AI call. EU-hosted observability platform for production AI agents and applications.
https://www.playnsports.com/event/memphis-baseball-prospect-camp-w-player-evals-4/
Memphis Baseball - Prospect Camp w/ Player Evals - Register Today
Nov 24, 2025 - Look no further than the Memphis Baseball Program Prospect Camp! This is an exclusive opportunity for baseball prospects in grades 9th – 12th who are...
https://maven.com/parlance-labs/evals?ref=producttalk.org
AI Evals For Engineers & PMs by Hamel Husain and Shreya Shankar on Maven
Learn proven approaches for quickly improving AI applications. Build AI that works better than the competition, regardless of the use-case.
https://www.k12dive.com/news/new-york-legislature-moves-to-separate-student-test-scores-from-teacher-eva/547693/
New York legislature moves to separate student test scores from teacher evals | K-12 Dive
The move joins a growing trend of teacher unions and majority Democratic state legislatures pushing away from "teaching to the test."
https://docs.evidentlyai.com/examples/LLM_rag_evals
RAG evals - Documentation
Metrics to evaluate a RAG system.
https://claude.com/code-with-claude/session/sf-ext-eval-driven-agent-development
Evals for Taste: Hill-Climbing a Slide-Generation Agent | Session | Code w/ Claude 2026
"Build better evals" is the most repeated advice in AI engineering. The hard part is doing it when the output is a slide deck. In 45 minutes you'll wire up a...
https://www.navywriter.com/CF02.htm
CF02 Workcenter Evals
CF02 Workcenter Eval Examples
https://marginlab.ai/
Margin Lab — Robust and Reproducible Evals for Agents | Marginlab
Open-source evaluation runtime for testing CLI-based coding agents. Measure accuracy, tokens, duration, and capture full execution traces.
https://ghevals.meandahq.com/event/book-signing/
Book Signing - GH EvaLS
https://docs.statsig.com/ai-evals/overview
AI Evals Overview - Statsig Documentation
Overview of Statsig AI Evals for evaluating prompts and models with offline and online graders, currently available in private beta for AI applications.
https://www.allaccessfootball.com/p/scout-notebook-raiders-secure-top
Scout Notebook: Raiders Secure Top Pick, Falcons Axe GM & HC, New Evals Added & More
All Access Football counts you down to the 2026 NFL Draft with the latest news and notes from this past weekend sure to have draft ramifications.
https://braintrust-onprq3jlz.preview.braintrust.dev/
Braintrust - The evals and observability platform for building reliable AI agents
https://itinai.com/openai-evals-api-streamlined-model-evaluation-for-developers/
OpenAI Evals API: Enhancing Model Evaluation for Businesses
May 25, 2025 - Introduction to the Evals API. OpenAI has
https://www.technomanagers.com/p/ai-evals-part-3
AI Evals - Part 3 - by Shailesh Sharma and Apoorva Mittal
Mastering LLM as Judge
https://satyaborg.com/blog/healthbench-physician-disagreement
Physician Disagreement in Healthcare Evals | Satya's Blog
Mar 15, 2026 - When you ask two doctors to grade the same AI response they disagree almost a quarter of the time. We wanted to know why.
https://oyoball.org/news-and-announcements/spring-2025-player-evals
Player Evals Must be Completed by Tuesday June 3 - Oaklandon Youth Organization
May 29, 2025 - Player evaluations must be completed by Tuesday, June 3. This applies for all coaches in all divisions, including Tee Ball. The online process of coaches...
https://www.braintrust.dev/blog/measuring-what-matters
Measuring what matters: An intro to AI evals - Blog - Braintrust
Learn how to build effective evals for your AI products with datasets, tasks, and scores.
https://www.braintrust.dev/blog/stakeholder-trust-evals-observability
How to earn stakeholder trust with evals and observability - Blog - Braintrust
How PMs can use Braintrust dashboards, custom trace views, and Loop to turn AI evals and production behavior into something stakeholders can read.
https://pjay.in/writings/anthropic-infrastructure-bugs/
Anthropic's rough month: Infrastructure bugs and the importance of evals | Priyanshu Jain
https://community.arize.com/x/arize-ax-support/1yuslwzmkz5o/error-importing-llmevalbinary-from-phoenixexperime
Error Importing `llm_eval_binary` from `phoenix.experimental.evals` | Arize AI Community
`from phoenix.experimental.evals import llm_eval_binary` Currently, when I execute this code, I get the error: cannot import name `llm_eval_binary` from...