The meaning of alignment

August 21, 2025

I wrote the following article on threats from AI misalignment while completing the course material for AGI Strategy, a course offered by BlueDot Impact.

In thinking about the ways in which AI might harm people, it helps to be concrete and to detail the actual pathways by which a threat might unfold. Much of the discussion around AI risk identifies four threat actors at play in these scenarios:

Once AGI has been reached, all four threat actors will be at play simultaneously, whether in coordination or as separate threats operating independently. In some ways the demarcations are an artifice. Humans will use AI to achieve ends. Those ends will benefit some humans at the expense of others, even if only in the mundane sense of leaving one party better positioned than before. The idea that we can “align” AI systems to do only “good” things lacks both scope and definition. After all, we adult humans are not in any meaningful “alignment” with each other, nor are we capable of agreeing on what “good” or “bad” mean in any consistent and robust way.

A more productive avenue would be to think in terms of regulating AI production, design, and deployment. This would best be done by governmental agencies and would reflect some rough consensus on what constraints make sense. This approach bypasses the messy topic of trying to align AI to some standard of “good.” If we achieve a consensus around a regulatory framework and build robust mechanisms to enforce that framework, the AI doesn’t have to be “aligned.”

There are three concepts intermingled here that I want to parse. First, there is the engineering challenge of getting AIs to do or not do certain things. Second, there is the ethical and philosophical challenge of defining alignment. Third, there is the notion that we can somehow direct the behavior of a sentient, superintelligent entity of the future through design choices made today.

The first concept, the need to direct AI, is key to its basic utility. For example, we need to be able to stop an AI from telling children how to build bombs. But we also need AI, under certain circumstances, to help us build bombs. This is not an “alignment” problem, because it has nothing to do with the “desires” or “intentions” of the AI. It is a control problem: we need an AI to do something in one context and not do that same thing in another.

The problem of alignment, however, implies something more fundamental than mere controls. It suggests some sort of impulse or desire on the part of the AI to be “good” and not “harm” us. While alignment research is interesting and important in helping us learn the science driving these systems, I think it is the less salient and productive avenue for addressing the immediate risk posed by non-human superintelligence.

The third concept I see buried in the alignment discussion is the idea that if we can build aligned AI systems today, they will then go on to build aligned superintelligences in the future. It’s the idea that good babies grow up to be good grownups. Frankly, this assumption seems to me entirely unsupported. The widely held view is that superintelligence will emerge once a somewhat inferior generation of AIs is able to conduct PhD-level research and iterate through production cycles toward new systems, each improving on the prior. Why would the end result of that process be in any way bound, or even influenced, by the constraints imposed on its ancestral systems?

In reviewing the expressed thoughts of prominent actors in the field of AI, whether enthusiasts or skeptics, it is hard to avoid the conclusion that a clear, or even functional, definition of alignment is lacking. Given that the concept is often seen as key to safe progress toward AGI, this seems paradoxical at best, or at worst, indicative of how little we understand the future we are racing toward.