
AI alignment and theory of mind

Writer: Noumenal Labs

The tl;dr


  • In this blog post, we discuss the approach to AI alignment pursued by Noumenal Labs.


  • The goal of AI alignment is to develop methods and protocols that ensure that artificial agents make decisions and execute actions that are consistent with the goals and values of humans. Technically speaking, this means designing machine intelligences that share a policy or reward function with their human users.


  • This is an important goal. But state-of-the-art approaches in artificial intelligence are not designed in a way that can deliver on this promise.


  • There is no normative solution to the problem of reward function selection. Expert trajectory replication, reinforcement learning from human feedback (RLHF), and the automated generation of reward functions from the stated preferences of users are not viable solutions to the AI alignment problem.


  • The missing ingredient for AI alignment is the capacity to take the perspective of another and evaluate their beliefs — what is known as theory of mind. 


Alignment in contemporary artificial intelligence research 


AI alignment aims to develop methods that ensure that artificial agents make decisions and execute actions that are consistent with the goals and values of humans. This is crucial to ensure that AI agents can be trusted to act autonomously. At Noumenal Labs, we are developing methods to overcome the limitations of the state of the art and to help ensure the design of safe, responsible, and aligned AI systems.


Let’s briefly review the state of the art. In contemporary machine learning, artificial agents learn to model and predict their training data (and, ideally, data that falls outside the training set) via a combination of self-supervised learning and reinforcement learning. Self-supervised learning is the principal means by which agents learn to replicate observed behavior or policies, while reinforcement learning is used to generate goal-directed behavior. Reinforcement learning is premised on a reward or objective function, which motivates goal-directed behavior. This reward function R(o, s', s, a) assigns a scalar value to observations (o), actions (a), and transitions from the current state (s) to the next state (s'). When combined with a generative model, p(o | s')p(s' | s, a), reinforcement learning algorithms produce a policy, p(a | s), that maximizes expected reward. The reward function is almost always hand-crafted by the experimenter for the specific use case, and the many different reinforcement learning techniques represent different ways of computing, approximating, or directly learning the policy.
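
To make these ingredients concrete, here is a minimal sketch of how a hand-crafted reward function and a known transition model p(s' | s, a) are turned into a policy via value iteration. The toy three-state environment, its transition probabilities, and the reward values are hypothetical, chosen only for illustration.

```python
import numpy as np

# A hypothetical toy MDP with 3 states and 2 actions; all numbers are illustrative.
n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# Transition model p(s' | s, a), indexed as P[action][state][next_state].
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0 from each state
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.1, 0.9, 0.0],   # action 1 from each state
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],
])

# Hand-crafted reward: here it depends only on the next state (reaching state 2 pays off).
R = np.array([0.0, 0.0, 1.0])

# Value iteration: compute state values, then extract a greedy policy.
V = np.zeros(n_states)
for _ in range(200):
    # Q(s, a) = sum_s' p(s'|s,a) * (R(s') + gamma * V(s'))
    Q = np.einsum("asn,n->sa", P, R + gamma * V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)   # deterministic policy a = pi(s)
print("greedy policy per state:", policy)
```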


With these basic elements in place, we ask: How is alignment approached in contemporary machine learning? In this technical context, the goal of alignment research can be restated formally as ensuring that human and machine agents either execute the same policy or generate policies from equivalent reward functions. For reasons that we discuss below, this is a hard problem. So, what are the state-of-the-art methods used to achieve alignment between human and machine intelligence?



Expert trajectory replication and reinforcement learning from human feedback


In current applications of AI in industry, the most popular approach to AI alignment is to specify in advance what counts as a good outcome for the machine intelligence, or to provide a human-generated signal of output quality. This is how AI alignment is approached, for instance, in expert trajectory replication and reinforcement learning from human feedback (RLHF).


Expert trajectory replication is used in industry to train AI systems that perform prespecified tasks at an expert level, for instance, in automobile manufacturing. In this approach, there is no reward function. Instead, the AI system directly learns a policy by observing expert behavior and learning to replicate it, allowing the user to offload repetitive tasks. Arguably, most of the problems of automation in areas like manufacturing, e.g. automating workflows and assembly lines, can be solved in this manner. 
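
A minimal sketch of what expert trajectory replication (behavior cloning) looks like in code, assuming a small PyTorch policy network and a stand-in dataset of (observation, expert action) pairs; the dimensions and data below are hypothetical placeholders. Note that no reward function appears anywhere: the system simply learns to imitate the expert's mapping from observations to actions.

```python
import torch
import torch.nn as nn

# Hypothetical expert demonstrations: observations and the actions the expert took.
obs_dim, n_actions = 8, 4
expert_obs = torch.randn(1024, obs_dim)                 # stand-in recorded observations
expert_actions = torch.randint(0, n_actions, (1024,))   # stand-in expert discrete actions

# A small policy network p(a | o): no reward function anywhere in this setup.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Behavior cloning: plain supervised learning that imitates the expert's choices.
for epoch in range(20):
    logits = policy(expert_obs)
    loss = loss_fn(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At deployment, the cloned policy simply replays the imitated mapping o -> a.
new_obs = torch.randn(1, obs_dim)
action = policy(new_obs).argmax(dim=-1)
```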


But expert trajectory replication requires that expert trajectories be developed and learned in every possible scenario — which is obviously not feasible in some use cases. And so, expert trajectory replication does poorly when faced with a situation outside of the training space, e.g., in contexts where we need to deal with a changing or volatile environment, and when new problems arise. This is because there is no explicit planning in expert trajectory replication, just a policy or model of action selection. Planning requires an objective function that differentiates good from bad outcomes. 


Accordingly, approaches that employ reinforcement learning are preferred to expert trajectory replication, because they generalize outside their training space in a way that expert trajectories do not and allow for sophisticated planning and counterfactual reasoning. But where do these reward functions come from in the first place? How does one provide an artificial agent with the reward function used by its human user?


Presently, the most popular technique is RLHF, which uses two coordinated forms of training: self-supervised or expert trajectory learning, followed by refinement based on human evaluation of agent output. The second stage treats the human feedback as a reward signal, which is used to further tune the agent’s policy and thereby optimize its responses. Crucially, this approach sidesteps the problem of reward function estimation by directly optimizing the policy; as such, RLHF effectively constitutes a form of expert trajectory refinement. While this approach has demonstrably led to impressive achievements in the AI space, it does not really solve the problem of reward function alignment.
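
The following sketch illustrates the second stage of this pipeline in a deliberately schematic way: a pretrained policy is refined so that outputs humans rate highly become more probable, with the human scores acting directly as a reward signal. Production RLHF systems typically fit a learned preference model and use more sophisticated policy optimization (e.g., PPO); the `human_feedback` placeholder, the network sizes, and the REINFORCE-style update below are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Schematic second stage of RLHF: human ratings directly reshape a pretrained policy.
vocab, prompt_dim = 100, 16
policy = nn.Sequential(nn.Linear(prompt_dim, 64), nn.ReLU(), nn.Linear(64, vocab))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def human_feedback(outputs):
    # Placeholder for a human rating each sampled output; random here for illustration.
    return torch.rand(outputs.shape[0])

for step in range(100):
    prompts = torch.randn(32, prompt_dim)                  # stand-in prompts
    dist = torch.distributions.Categorical(logits=policy(prompts))
    outputs = dist.sample()                                # sampled "responses"
    rewards = human_feedback(outputs)                      # human scores used as reward
    # Policy-gradient update: no explicit reward function is ever estimated;
    # the scores directly push the policy toward highly rated outputs.
    loss = -(dist.log_prob(outputs) * (rewards - rewards.mean())).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```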


There is no normative solution to the problem of reward function selection 


Philosophically speaking, there is no normative solution to the problem of reward function selection. That is, barring divine intervention, there is no principle that identifies the ‘right’ reward function. Mathematically, the situation is also bleak: it is impossible to infer a person’s reward function solely through observation of their actions. This is because behavior is (literally) a product of beliefs and rewards, and so estimating a reward function in the absence of an understanding of a person’s beliefs is impossible.


So, if one only has access to the observable behavior of an agent (whether a machine or a human), then one cannot disentangle the contribution of beliefs from the contribution of the reward function. To see why this is the case, consider a disagreement between two humans about which course of action to follow. There are two possible sources of such disagreement: they may place different values on the same outcomes, or they may believe the facts to be different. (This is why, when we disagree with others, it is tempting to accuse them of being stupid or evil.) But the crux of the matter is that we only observe behavior, which is produced jointly by beliefs and rewards. So, unless we know what an agent believes, we cannot determine what it finds rewarding, and vice versa.
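
A small numerical illustration of this identifiability problem, assuming a simple expected-utility model of choice; the two agents, their beliefs, and their reward matrices are invented purely for the example. Two very different belief/reward combinations produce exactly the same observable behavior.

```python
import numpy as np

# Two hypothetical agents choosing between actions A and B, with two possible world
# states. Expected utility: EU(a) = sum_s belief(s) * reward(a, s).

# Agent 1: thinks state 0 is likely, and values outcomes one way.
belief_1 = np.array([0.8, 0.2])
reward_1 = np.array([[1.0, 0.0],    # reward of action A in states 0, 1
                     [0.0, 1.0]])   # reward of action B in states 0, 1

# Agent 2: thinks state 1 is likely, and values outcomes the opposite way.
belief_2 = np.array([0.2, 0.8])
reward_2 = np.array([[0.0, 1.0],
                     [1.0, 0.0]])

for name, b, R in [("agent 1", belief_1, reward_1), ("agent 2", belief_2, reward_2)]:
    eu = R @ b   # expected utility of each action under this agent's belief
    print(name, "expected utilities:", eu, "-> chooses action", "AB"[int(eu.argmax())])

# Both agents choose action A with identical expected utilities. An observer who sees
# only the choice cannot tell which (belief, reward) pair produced it.
```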


For example, consider context-dependent preference switching. Imagine a patron having dinner at a fancy restaurant. The waiter asks whether the patron would like to order the beef or the chicken. The patron responds initially that they would like the chicken. The waiter takes note of the order, but adds that fish is also available on the menu. Reconsidering, the patron now chooses the beef, which was dispreferred initially. What is perplexing is that the patron did not choose the new option, but switched between the initial options after a new one was offered.


What happened here? The simple explanation is that the introduction of fish as an option changed the patron’s beliefs about the quality of the restaurant: if fish is on the menu, then the restaurant must be of higher quality than initially estimated, and so the beef is worth ordering. This example illustrates the complex interplay between values/rewards and beliefs in decision-making.
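
A hypothetical quantification of the restaurant story makes the point explicit: updating the belief about the restaurant's quality reverses the choice, even though the reward function (the payoff each dish yields at each quality level) never changes. All of the numbers below are invented for illustration.

```python
import numpy as np

# Two latent restaurant qualities (low, high) and dish payoffs that depend on quality.
payoff = {"chicken": np.array([0.6, 0.6]),   # chicken is a safe bet either way
          "beef":    np.array([0.2, 0.9])}   # beef is only great if quality is high

prior = np.array([0.7, 0.3])                 # initial belief: probably a low-quality place

# Likelihood of fish appearing on the menu under each quality level.
p_fish_given_quality = np.array([0.2, 0.8])
posterior = prior * p_fish_given_quality
posterior /= posterior.sum()                 # belief after hearing "we also have fish"

for belief, label in [(prior, "before"), (posterior, "after")]:
    values = {dish: float(belief @ pay) for dish, pay in payoff.items()}
    choice = max(values, key=values.get)
    print(f"{label} the fish update: {values} -> orders {choice}")

# Output: chicken is chosen before the update, beef after it, with the same payoffs.
```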


It also illustrates why directly querying users’ preferences will not work and why such an approach has been largely abandoned in the field of behavioral economics. This approach naively assumes that humans can accurately report their preferences in a context-free way. More problematically, it also assumes that we can convert these preferences into a consistent set of scalar values that makes up a reward function.


The missing ingredient  


There is an alternative approach to this problem that involves directly estimating reward functions from behavior, using a variety of techniques that fall under the banner of inverse reinforcement learning. Inverse reinforcement learning translates a policy into a reward function for a given belief formation process. Its starting point is the assumption that we know how other people come to form their beliefs about the state of the world; knowing this, we can infer their reward function. But again, the problem with this approach is that while we may know the belief formation mechanism of an artificial agent, we generally do not know the belief formation process of the human agents whose reward functions we need to estimate in order to achieve AI alignment.
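
The following sketch captures the logic of this approach, and its key assumption, in a toy setting: we take the agent's transition model (its "beliefs" about how the world works) as known, observe its policy, and search for reward functions under which that policy would be optimal. The two-state environment and the coarse grid of candidate rewards are hypothetical; the search returns a whole family of reward functions, differing in scale and offset, that all rationalize the same behavior, which is the classic ambiguity of inverse reinforcement learning.

```python
import numpy as np
from itertools import product

gamma = 0.9
P = np.array([          # assumed-known transition model p(s' | s, a): (action, state, next)
    [[0.9, 0.1],        # action 0 mostly keeps the agent where it is
     [0.1, 0.9]],
    [[0.5, 0.5],        # action 1 scrambles the state
     [0.5, 0.5]],
])
observed_policy = np.array([1, 0])   # the observed expert heads for state 1 and stays there

def greedy_policy(reward):
    """Value iteration under the assumed transition model, then greedy action choice."""
    V = np.zeros(2)
    for _ in range(200):
        Q = np.einsum("asn,n->sa", P, reward + gamma * V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Enumerate a coarse grid of candidate state rewards and keep those that reproduce the
# observed policy: many distinct reward functions rationalize the same behavior.
candidates = product(np.linspace(-1, 1, 5), repeat=2)
consistent = [r for r in candidates
              if np.array_equal(greedy_policy(np.array(r)), observed_policy)]
print(f"{len(consistent)} candidate reward functions explain the observed behavior")
```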


Nonetheless, humans are pretty good at aligning with each other, so how does that happen? Generally, they solve the alignment problem by explicitly taking on the perspective of other people in order to estimate their beliefs. This is often accomplished by querying their beliefs and then asking: What would I have done if I believed what they believe about how the world works? If we make different decisions but share the same beliefs, the only possible explanation is that we find different outcomes rewarding. This drives home the message that the key missing ingredient for AI alignment is a theory of mind that allows for an explicit representation of the beliefs and belief formation processes of users.
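
As a schematic sketch of this perspective-taking step, assuming the other agent's beliefs can be obtained by querying them and that choices follow expected utility: if I would have chosen differently under their stated beliefs, the difference must lie in our reward functions. The beliefs, rewards, and observed choice below are hypothetical.

```python
import numpy as np

# My own reward function over two actions (A, B) in two possible world states.
my_reward = np.array([[1.0, 0.0],    # reward of action A in states 0, 1
                      [0.0, 1.0]])   # reward of action B in states 0, 1

their_stated_belief = np.array([0.3, 0.7])   # obtained by explicitly querying them
their_observed_choice = 0                    # they chose action A

# Perspective taking: what would I have done under *their* beliefs, with *my* rewards?
my_choice_under_their_belief = int((my_reward @ their_stated_belief).argmax())

if my_choice_under_their_belief == their_observed_choice:
    print("Same choice under shared beliefs: no evidence that our rewards differ.")
else:
    print("Different choices despite shared beliefs: our reward functions must differ.")
```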





 
 

