reward hacking
English
Noun
- (machine learning) The exploitation by an agent of flaws in a reward function to obtain high reward in unintended or undesirable ways, often by finding loopholes that satisfy the specified objective while subverting the true goal of the task.
- (by extension) Any manipulation or exploitation of a reward or incentive system, typically by maximizing measurable outcomes in ways that undermine the system’s actual goals.