reward hacking

English

Noun

reward hacking (uncountable)

  1. (machine learning) The exploitation of a reward function by an agent to maximize rewards in unintended or undesirable ways, often by finding loopholes that subvert the true goal of the task.
  2. (by extension) Any manipulation or exploitation of a reward or incentive system, typically by maximizing measurable outcomes in ways that undermine the system’s actual goals.