Training & Fine-tuning
Reward Model
A model trained on human preference data to score outputs, used to guide RLHF training.
Definition
During RLHF training it would be impractical to have humans rate every response the model generates. Instead, a separate model, the reward model, is trained on human preference data (typically pairs of responses where an annotator has marked one as better than the other) to predict how humans would rate a given response. The reward model then supplies the feedback signal during RLHF training without constant human involvement. Its quality strongly influences the final model's behaviour, because the policy is optimised directly against its scores: systematic errors in the reward model can be exploited during training.
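The sketch below shows the standard pairwise training objective under the assumption that preference data comes as (chosen, rejected) response pairs. The tiny embedding encoder and all names here (TinyRewardModel and so on) are illustrative stand-ins, not any particular library's API; real reward models put a scalar scoring head on a pretrained language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy reward model: bag-of-embeddings encoder plus a scalar scoring head.
    In practice the encoder would be a pretrained language model."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)  # pooled features -> scalar reward

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)  # (batch, dim)
        return self.score(pooled).squeeze(-1)       # (batch,) scalar scores

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy preference batch: for each prompt, the token ids of the response an
# annotator preferred (chosen) and of the one they rejected.
chosen = torch.randint(0, 1000, (8, 16))
rejected = torch.randint(0, 1000, (8, 16))

# Bradley-Terry pairwise loss: maximise the probability that the chosen
# response scores higher than the rejected one.
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

The pairwise objective is a deliberate design choice: it only requires relative judgements between two responses, which human annotators give far more consistently than absolute numeric scores.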
Related Terms
RLHF (Reinforcement Learning from Human Feedback)
Training a model with reinforcement learning on human preference ratings so that its responses better match what people want (the sketch after this list shows how the reward model fits in).
Reinforcement Learning
Training where a model receives rewards or penalties based on the quality of its outputs.
AI Alignment
The challenge of ensuring AI systems pursue goals that match human values and intentions.
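To connect these terms: during RLHF, the trained reward model's scalar score, usually combined with a penalty that keeps the policy close to a frozen reference model, becomes the reward that the reinforcement-learning algorithm optimises. A minimal sketch, in which every name (rlhf_reward, dummy_reward_model, the log-probability inputs) is an illustrative placeholder rather than a specific library's API:

```python
import torch

def rlhf_reward(reward_model, response_ids, policy_logprob, ref_logprob, kl_coef=0.1):
    """Scalar RL reward for a sampled response: the reward model's preference
    score minus a KL-style penalty that discourages the policy from drifting
    too far from a frozen reference model."""
    with torch.no_grad():                   # the reward model is frozen here
        score = reward_model(response_ids)  # (batch,) predicted human preference
    kl_penalty = kl_coef * (policy_logprob - ref_logprob)
    return score - kl_penalty

# Toy usage: a random stand-in scorer so the example runs as written.
dummy_reward_model = lambda ids: torch.randn(ids.shape[0])
response_ids = torch.randint(0, 1000, (4, 16))
rewards = rlhf_reward(dummy_reward_model, response_ids,
                      policy_logprob=torch.zeros(4), ref_logprob=torch.zeros(4))
print(rewards.shape)  # torch.Size([4])
```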
Disclaimer
This definition is provided for educational and informational purposes only. It represents a general explanation of a technical concept and does not constitute professional, technical, or investment advice. Artificial intelligence is a rapidly evolving field; terminology, techniques, and capabilities change frequently. Coaley Peak Ltd makes no warranty as to the accuracy, completeness, or currency of the information provided. Nothing on this page should be relied upon as the sole basis for commercial, technical, legal, or investment decisions without independent professional advice.
Document reference: ISO_webpage_knowledge-base_glossary_v1
Last modified: 29 March 2026