https://featuretechnology.com/ai-safety-tools-evaluating-models-for-beneficial-behavior/
AI Safety Tools: Evaluating Models For Beneficial Behavior - Feature Technology
Jan 30, 2024 - Learn about AI safety tools and Constitutional AI, CLIP, and other innovative methods for evaluating language models to ensure they behave helpfully,...
ai safety, evaluating models, tools, beneficial, behavior
https://insiderspirit.com/transfer-learning-domain-adaptation-metrics-evaluating-models-when-distributions-diverge/
Transfer Learning Domain Adaptation Metrics: Evaluating Models When Distributions Diverge
Jan 27, 2026 - Transfer learning and domain adaptation exist because training data (the source domain) and deployment data (the target domain) rarely match.
transfer learning, domain adaptation, evaluating models, metrics, distributions
https://salilab.org/archives/modeller_usage/2013/msg00066.html
Re: [modeller_usage] Evaluating models with procheck
evaluating models, modeller, usage
https://arxiv.org/abs/2107.03374
[2107.03374] Evaluating Large Language Models Trained on Code
Abstract page for arXiv paper 2107.03374: Evaluating Large Language Models Trained on Code
large language models, evaluating, trained, code
https://arxiv.org/abs/2403.13793
[2403.13793] Evaluating Frontier Models for Dangerous Capabilities
Abstract page for arXiv paper 2403.13793: Evaluating Frontier Models for Dangerous Capabilities
frontier models, evaluating, dangerous, capabilities
https://paperium.net/article/en/17036/do-thought-streams-matter-evaluating-reasoning-in-gemini-vision-language-modelsfor-video-scene-under
Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene...
Quick breakdown of the 'Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding' paper. Methods
https://openreview.net/forum?id=tc90LV0yRL
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models |...
Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have potential to cause...
https://awesome.facts.dev/awesome/pkunlp-icler/pca-eval
pkunlp-icler/PCA-EVAL: [ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in...
[ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
https://www.longwoods.com/content/23524/healthcarepapers/the-importance-of-evaluating-new-models-of-care-to-better-meet-patient-needs
The Importance of Evaluating New Models of Care to Better Meet Patient
A population-needs based focus on health workforce planning is critically important, as is acknowledging that population health needs are best addressed through
the importance, new models, evaluating
https://research.utwente.nl/en/publications/evaluating-prediction-models-for-mapping-canopy-chlorophyll-conte/
Evaluating prediction models for mapping canopy chlorophyll content across biomes - University of...
https://soar.wichita.edu/items/dbe184a2-9f04-4fc7-b170-9ae7663ac21a
Evaluating discriminating power of single-criteria and multi-criteria models towards inventory...
Single-criteria and multi-criteria models are both used for inventory classification. In this paper, we evaluated single-criteria and...
multi models, evaluating, power, single
https://proceedings.neurips.cc/paper_files/paper/2024/hash/a13ff984831deea39e6132bafdfdd6d5-Abstract-Datasets_and_Benchmarks_Track.html
Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models
hidden in plain sight, shape recognition, evaluating
https://arxiv.org/abs/2410.05262
[2410.05262] TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Abstract page for arXiv paper 2410.05262: TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
https://submissions.ewtec.org/proc-ewtec/article/view/165
Evaluating the performance of turbulence closure models for tidal stream resource characterization...
the performance
https://www.catalyzex.com/paper/tcmbench-a-comprehensive-benchmark-for
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese...
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine: Paper and Code. Large language models (LLMs) have...
large language models
https://sajim.co.za/index.php/sajim/article/view/1798
Live Healthcare Console: Evaluating digital health design models, a South African perspective |...
The South African Journal of Information Management explores the latest developments and trends in information and knowledge management to offer research that...
digital health
https://pubmed.ncbi.nlm.nih.gov/39694472/
Evaluating Chemical Transport and Machine Learning Models for Wildfire Smoke PM2.5: Implications...
Growing wildfire smoke represents a substantial threat to air quality and human health. However, the impact of wildfire smoke on human health remains...
machine learning models
https://www.mdpi.com/2072-4292/11/6/638
Evaluating GIS-Based Multiple Statistical Models and Data Mining for Earthquake and...
Landslides are typically triggered by earthquakes or rainfall, and occasionally by a rainfall event followed by an earthquake or vice versa. Yet, most of the works...
models and data, evaluating, gis, based, multiple
https://crese.univ-fcomte.fr/publication/evaluating-voting-systems-with-probability-models-essays-by-and-in-honor-of-william-v-gehrlein-and-dominique-lepelley/
Evaluating Voting Systems with Probability Models, Essays by and in honor of William V. Gehrlein...