Training & Fine-tuning
Reward Model
A model trained on human preference data to score outputs, used to guide RLHF training.
Definition
During RLHF training it would be impractical to have humans rate every response the model generates. Instead, a separate model, the reward model, is trained on human preference data (typically pairs of responses where an annotator has marked one as better than the other) to predict how humans would rate a given response. The reward model then supplies the feedback signal during RLHF training without constant human involvement. Its quality strongly influences the final model's behaviour, because the policy is optimised directly against its scores: systematic errors in the reward model can be exploited during training.
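The sketch below shows the standard pairwise training objective under the assumption that preference data comes as (chosen, rejected) response pairs. The tiny embedding encoder and all names here (TinyRewardModel and so on) are illustrative stand-ins, not any particular library's API; real reward models put a scalar scoring head on a pretrained language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy reward model: bag-of-embeddings encoder plus a scalar scoring head.
    In practice the encoder would be a pretrained language model."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)  # pooled features -> scalar reward

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)  # (batch, dim)
        return self.score(pooled).squeeze(-1)       # (batch,) scalar scores

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy preference batch: for each prompt, the token ids of the response an
# annotator preferred (chosen) and of the one they rejected.
chosen = torch.randint(0, 1000, (8, 16))
rejected = torch.randint(0, 1000, (8, 16))

# Bradley-Terry pairwise loss: maximise the probability that the chosen
# response scores higher than the rejected one.
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

The pairwise objective is a deliberate design choice: it only requires relative judgements between two responses, which human annotators give far more consistently than absolute numeric scores.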
Related Terms
RLHF (Reinforcement Learning from Human Feedback)
Training a model with reinforcement learning on human preference ratings so that its responses better match what people want (the sketch after this list shows how the reward model fits in).
Reinforcement Learning
Training where a model receives rewards or penalties based on the quality of its outputs.
AI Alignment
The challenge of ensuring AI systems pursue goals that match human values and intentions.
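To connect these terms: during RLHF, the trained reward model's scalar score, usually combined with a penalty that keeps the policy close to a frozen reference model, becomes the reward that the reinforcement-learning algorithm optimises. A minimal sketch, in which every name (rlhf_reward, dummy_reward_model, the log-probability inputs) is an illustrative placeholder rather than a specific library's API:

```python
import torch

def rlhf_reward(reward_model, response_ids, policy_logprob, ref_logprob, kl_coef=0.1):
    """Scalar RL reward for a sampled response: the reward model's preference
    score minus a KL-style penalty that discourages the policy from drifting
    too far from a frozen reference model."""
    with torch.no_grad():                   # the reward model is frozen here
        score = reward_model(response_ids)  # (batch,) predicted human preference
    kl_penalty = kl_coef * (policy_logprob - ref_logprob)
    return score - kl_penalty

# Toy usage: a random stand-in scorer so the example runs as written.
dummy_reward_model = lambda ids: torch.randn(ids.shape[0])
response_ids = torch.randint(0, 1000, (4, 16))
rewards = rlhf_reward(dummy_reward_model, response_ids,
                      policy_logprob=torch.zeros(4), ref_logprob=torch.zeros(4))
print(rewards.shape)  # torch.Size([4])
```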
Disclaimer
This definition is provided for educational and informational purposes only. It represents a general explanation of a technical concept and does not constitute professional, technical, or investment advice. Artificial intelligence is a rapidly evolving field; terminology, techniques, and capabilities change frequently. Coaley Peak Ltd makes no warranty as to the accuracy, completeness, or currency of the information provided. Nothing on this page should be relied upon as the sole basis for commercial, technical, legal, or investment decisions without independent professional advice.
Document reference: ISO_webpage_knowledge-base_glossary_v1
Last modified: 29 March 2026