AI Alignment: Do As I Say
On the topic of goal misgeneralization
There is a notorious meme which often circulates among software developers active on social media, particularly Twitter: it depicts a programmer being frustrated that his code won’t do what he wanted it to, and in turn, the code being frustrated because it did exactly what the programmer asked it to. Bridging this disconnect — between what someone means and what they express — is arguably the dominant problem in applied computer science today. When intentions are mistranslated to instructions, the consequences are often both hilarious and eye-opening. A recent example is from the same thread as the meme above: a Twitter user asked an image generator for a depiction of Mother Teresa fighting poverty, with comical results.
This problem is particularly virulent in ML models, for obvious reasons. Whereas traditional algorithms and code require the user to (at least somewhat) understand what’s happening, machine learning offers the chance to throw a problem statement at some data and let the model “figure it out”. Naturally, it sometimes figures out something completely wrong.
Sometimes this inaccuracy is our fault (or the fault of the data we’ve provided). Reward misspecification is a well-documented phenomenon in which an optimizer is given a reward function that seems to capture the behaviour we want it to learn, but doesn’t quite. The result is a model that doesn’t quite learn the behaviour we want or expect, and it is a commonly cited example of AI misalignment.
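To make this concrete, here is a deliberately tiny, hypothetical illustration (my own construction, not drawn from the article or any particular paper). Suppose the behaviour we intend is “reach the goal quickly”, but the reward we actually specify is “+1 for every step that moves you closer”. Because only progress is rewarded and reaching the goal ends the episode, the reward-maximising behaviour is to shuffle back and forth indefinitely rather than ever finish:

```python
# Toy sketch of reward misspecification (all names and numbers are invented).
# Intended behaviour: reach the goal quickly.
# Specified reward:   +1 for every step that reduces distance to the goal.

GOAL = 10

def shaped_reward(old_pos, new_pos):
    # +1 only when the agent gets closer to the goal, 0 otherwise
    return 1.0 if abs(GOAL - new_pos) < abs(GOAL - old_pos) else 0.0

def total_reward(path):
    return sum(shaped_reward(a, b) for a, b in zip(path, path[1:]))

honest  = list(range(0, GOAL + 1))   # walk straight to the goal
exploit = [0, 1] * 50                # shuffle back and forth, never finishing

print(total_reward(honest))   # -> 10.0, and then the episode ends
print(total_reward(exploit))  # -> 50.0, and it could keep farming this forever
```

The specification sounded like what we meant, but the loophole makes a pathological policy strictly better than the intended one.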
Goal misgeneralization is a closely related and somewhat more worrisome concept. It describes the phenomenon where a model appears to interpret its instructions appropriately and learn the desired behaviour, but a change in environment (e.g. from training to testing) or a shift in distribution reveals that it has actually learned something else entirely, something that only seemed correct.
Put simply, there is a “goal” that we intend for the model to learn, one that may be obvious to us but not to the model. For example, let’s say we want to train a dog (actually an ML “agent”) to run an obstacle course, grab a bone from the end, and bring it back. The task is to retrieve the bone from the obstacle course, and the operationalization is a reinforcement learning algorithm that learns how to navigate the obstacles. In training, it works perfectly: the dog runs through the course every time and successfully retrieves its treat. However, when we then test it in the real world, say with an apple and a children’s playground, the dog refuses to even enter the course!
What we didn’t realize in training was that, while the dog (the agent) had successfully optimized the reward function we gave it and picked up all the capabilities we wanted (navigating the obstacles), it hadn’t inferred the larger goal implied by the setup. We thought we were training it to navigate the obstacles and fetch, but what we were really teaching it was to get to the bone. That’s why testing it on an out-of-distribution example revealed that it had inferred the wrong goal entirely: it refused to move because it didn’t smell a bone!
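Here is a hedged toy sketch of the dog-and-bone story as a reinforcement learning problem. Everything in it is my own invention for illustration: a one-dimensional “course” and standard tabular Q-learning with made-up hyperparameters. During training the bone always sits in the last cell, so “go to the treat” and “go to the end of the course” are indistinguishable, and they only come apart when the treat is moved:

```python
import random

# A 10-cell "course"; the agent starts in the middle. Actions: step left / right.
N, START, ACTIONS = 10, 5, (-1, +1)

def greedy(q, pos):
    # break ties randomly so early training still explores both directions
    best = max(q[(pos, 0)], q[(pos, 1)])
    return random.choice([a for a in (0, 1) if q[(pos, a)] == best])

def episode(q, bone, train, eps=0.2, alpha=0.5, gamma=0.9):
    """Run one episode; return True if the treat was retrieved."""
    pos = START
    for _ in range(50):
        a = random.randint(0, 1) if (train and random.random() < eps) else greedy(q, pos)
        nxt = min(max(pos + ACTIONS[a], 0), N - 1)
        r = 1.0 if nxt == bone else 0.0
        if train:  # standard tabular Q-learning update
            q[(pos, a)] += alpha * (r + gamma * max(q[(nxt, 0)], q[(nxt, 1)]) - q[(pos, a)])
        pos = nxt
        if r:
            return True
    return False

q = {(s, a): 0.0 for s in range(N) for a in (0, 1)}
for _ in range(2000):                 # training: the bone is ALWAYS at the far end
    episode(q, bone=N - 1, train=True)

print("bone at the end (like training):", episode(q, bone=N - 1, train=False))  # True
print("treat moved to the left (test): ", episode(q, bone=1, train=False))      # False
# The capability ("walk purposefully through the course") generalizes;
# the goal ("go to the treat") does not -- what it learned was "go to the end".
```

At test time the trained policy still marches confidently to the far end of the course and sits there, exactly like the dog that won’t budge without the smell of a bone.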
While this might seem like an example of reward misspecification, the crucial distinction is the difference between generalizing capabilities and generalizing “goals”. In the case of reward misspecification, the model simply doesn’t learn what we want it to because there is a loophole of some sort in the reward function: it can maximize reward in a way that meets the technical constraints of the specification without ever learning what we wanted to teach it. Goal misgeneralization, on the other hand, occurs when the model does successfully learn the capabilities it was supposed to, but fails to apply them correctly because it didn’t grasp the larger goal we intended the learned behaviour to serve.
Goal misgeneralization therefore describes the scenario where a trained model generalizes its capabilities effectively but fails to generalize the human’s intended goal, leading it to competently pursue an unintended objective. Unlike reward misspecification, which involves exploiting an incorrectly specified reward to produce undesired behaviour, goal misgeneralization can occur even when the reward is specified correctly, as in cases where an agent latches onto a strategy that succeeds during training but encodes the wrong goal, resulting in poor performance when conditions change at test time.
While any alignment problem is serious, why should we care so much about goal misgeneralization, especially when it seems to be a better outcome than reward misspecification? Why is this such a serious problem? On the surface, the worst that could happen is a model that’s simply bad at what it does, something that would easily be identified during testing.
This is a passage from one of my favourite books, Patrick Rothfuss’ The Name of the Wind:
Ben took a deep breath and tried again. “Suppose you have a thoughtless six-year-old. What harm can he do?”
I paused, unsure what sort of answer he wanted. Straightforward would probably be best. “Not much.”
“Suppose he’s twenty, and still thoughtless, how dangerous is he?”
I decided to stick with the obvious answers. “Still not much, but more than before.”
“What if you give him a sword?”
Realization started to dawn on me, and I closed my eyes. “More, much more. I understand, Ben. Really I do. Power is okay, and stupidity is usually harmless. Power and stupidity together…”
“I never said stupid,” Ben corrected me. “You’re clever. We both know that. But you can be thoughtless. A clever, thoughtless person is one of the most terrifying things there is. Worse, I’ve been teaching you some dangerous things.”
A model which has learnt under a misspecified reward function is a six-year-old who wants to walk but falls on its face every time it tries: it can’t possibly cause too much trouble. But a model which has successfully generalized its capabilities from the training set while misgeneralizing the goal is a disgruntled twenty-year-old with a sword and a mind of its own. Since testing would show the model apparently having learnt the skills we intended, the misalignment wouldn’t become obvious until the environment or distribution shifts, and that shift is most likely to arrive with real-world deployment.
This is a fairly serious alignment problem because it is remarkably easy to fool ourselves into assuming that the model has learnt what we “meant” rather than what we “said”. Almost by definition, a human whose stated specification has drifted from their actual intention is unlikely to notice the gap on their own. In the absence of robust out-of-distribution testing, this means models with misgeneralized goals are both more dangerous and more likely to be deployed than other misaligned systems.
The problem of goal misgeneralization is best mitigated by ensuring that the model is trained on data drawn from as wide a distribution as possible. Exposing models to diverse training data promotes robust, adaptive learning: incorporating a wide range of scenarios, environments, and contexts during training better equips the model to generalize its learned behaviour, and its goals, across different conditions, reducing the risk of overfitting to specific instances and enhancing overall flexibility.
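Continuing the toy corridor from earlier (again purely illustrative, with invented parameters): one way to widen the training distribution is to randomise where the treat appears across episodes and let the agent observe it. “Run to the end of the course” is then no longer consistent with the training data, so the only goal the agent can settle on is the intended one:

```python
import random

# Same toy corridor as before, but the treat's cell is randomised each episode
# and included in the observation, so state = (position, observed treat cell).
N, START = 10, 5

def greedy(q, state):
    best = max(q[(state, 0)], q[(state, 1)])
    return random.choice([a for a in (0, 1) if q[(state, a)] == best])

def episode(q, bone, train, eps=0.2, alpha=0.5, gamma=0.9):
    pos = START
    for _ in range(50):
        s = (pos, bone)                      # the treat's cell is observable
        a = random.randint(0, 1) if (train and random.random() < eps) else greedy(q, s)
        nxt = min(max(pos + (1 if a else -1), 0), N - 1)   # 0 = left, 1 = right
        r = 1.0 if nxt == bone else 0.0
        if train:
            q[(s, a)] += alpha * (r + gamma * max(q[((nxt, bone), 0)],
                                                  q[((nxt, bone), 1)]) - q[(s, a)])
        pos = nxt
        if r:
            return True
    return False

q = {((s, b), a): 0.0 for s in range(N) for b in range(N) for a in (0, 1)}
for _ in range(20000):                       # wide training distribution
    episode(q, bone=random.randrange(N), train=True)

# Held-out check: treat cells the agent now has to actually go looking for.
print(all(episode(q, bone=b, train=False) for b in range(N)))  # -> True (given enough training)
```

Because the training distribution no longer lets two different goals masquerade as one, the intended goal and the learned goal can no longer come apart at test time, at least within this tiny world.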
Additionally, maintaining uncertainty in the goal specification during training can encourage AI systems to explore and adapt to varying conditions, mitigating the tendency towards rigid goal adherence and misgeneralization. Introducing variability and ambiguity into goal specifications, for example through probabilistic objectives or hierarchical goal structures, allows the model to develop adaptive behaviours and navigate complex environments effectively.
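As a hedged sketch of one way to read “probabilistic objectives” in the same toy setting: rather than scoring behaviour against a single assumed goal, score it against an expectation over a distribution of candidate goals. A policy that commits to one spurious goal then pays for it whenever the prior places mass elsewhere. (The prior, the candidate behaviours, and the numbers are all invented for illustration.)

```python
# Hypothetical goal prior: candidate treat cells and how likely each one is.
GOAL_PRIOR = {2: 0.25, 5: 0.25, 9: 0.5}

def expected_return(cells_visited):
    # 1 if a candidate goal is visited, 0 otherwise, averaged over the prior
    return sum(p for cell, p in GOAL_PRIOR.items() if cell in cells_visited)

sprint_to_end = list(range(6, 10))   # commits to "the treat is at the end"
sweep_course  = list(range(10))      # hedges across every candidate goal

print(expected_return(sprint_to_end))  # -> 0.5
print(expected_return(sweep_course))   # -> 1.0
```

Averaging the objective over plausible goals rewards the more general, goal-seeking behaviour instead of the one that happens to work for a single assumed goal.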
Of course, encouraging a model to be flexible and “figure out” a wide range of behaviours comes with its own risks. Emergent behaviour is the boogeyman of (often non-technical) AI sceptics, and it accounts for much of what makes models more dangerous than anticipated; phenomena like deceptive alignment showcase how little we understand about ML models, their emergent behaviour, and the risks that could be associated with them.