So you want to build an AI safety plan. Congratulations - you’ve chosen to tackle a wicked problem with limited strategic clarity, extremely high stakes, and a constantly, rapidly shifting landscape.
This unfortunately leads to most AI safety plans being pretty bad. They’re often wildly utopian ("everyone will just cooperate!"), hopelessly vague ("solve alignment somehow"), or they simply ignore the hard parts. Sometimes they manage all three at once!
What even is a plan?
An AI safety plan is a set of proposed actions people could take to make AI go well - one that answers “who does what, when, how and why?” rather than vaguely gesturing at desired outcomes. (The ‘why’ covers both how each action contributes to the wider goal and why the actor taking it would plausibly want to.)
In other literature, this is sometimes referred to as a focused and coordinated field-level agenda. This doesn’t mean everyone has to be doing the same thing - but the efforts do need to complement each other and be bound by an overarching direction.
Crucially, what we’re after is not:
High-level desired characteristics
A set of goals
We avoid using the word “strategy” because people often misinterpret it as the above. “Plan” encourages concreteness.
The SAFE plan checklist
Some basic SAFE properties for this plan are (the mnemonic is not too forced, I promise):
Sufficient: If all the actions are carried out, the world will be in a good state. Just solving the alignment problem, or tackling near-term risks, is not sufficient. This should probably cover misuse, authoritarianism and loss of control risks at a minimum.
Action-oriented: The plan is a set of actions that enable people to contribute, and does not just describe a list of events that happen.
Feasible: It’s reasonably plausible the plan can be executed. This generally excludes plans that require actors to take significant actions against their own interests.
Encompassing: The plan covers all actions needed, not just some for a short time frame or one jurisdiction.
Being CRISP: the nice-to-haves
The SAFE criteria are a good base for a plan. To properly battle-test a plan, you might also want to make it CRISP (okay, maybe this mnemonic is a bit forced); a rough self-check sketch in code follows the list.
Clear: Good plans should probably be communicated clearly (and likely publicly), to allow contributors to get onboarded quickly. This is because ~all feasible plans will require multiple people to do things - so you’ll need to communicate it to others.
Robust: What happens when parts inevitably break? The best plans are fault-tolerant, and assume Murphy’s Law applies to AI safety interventions too.
Informed: The best plans rely not just on theory, but on evidence from the world. Even better is being able to tell whether it’s working as it unfolds - ideally from early on, so we can adjust if we realise the plan won’t work.
Specific: Detailed enough for people to evaluate whether and how to contribute to the plan. This should probably exclude a bunch of people currently working in AI safety.
Positive: Sure, not dying is the minimum bar, but what about actually making the future awesome? It’s also easier to rally people around a positive vision.
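To make these criteria easier to apply in practice, here’s a minimal sketch of the SAFE and CRISP lists as a self-scoring checklist in Python. The criterion names and questions are condensed from this post; the `Criterion` dataclass, the 0-2 scoring scale, and the example notes are illustrative assumptions, not part of any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    question: str
    essential: bool           # True for SAFE criteria, False for CRISP
    score: int | None = None  # 0 = missing, 1 = partial, 2 = solid
    notes: str = ""


# Condensed versions of the SAFE and CRISP criteria described above.
CRITERIA = [
    Criterion("Sufficient", "If every action is carried out, is the world in a good state?", True),
    Criterion("Action-oriented", "Is it actions people can take, not just events that happen?", True),
    Criterion("Feasible", "Can it be executed without actors working against their own interests?", True),
    Criterion("Encompassing", "Does it cover all needed actions, actors, and time frames?", True),
    Criterion("Clear", "Could a newcomer quickly understand their potential role?", False),
    Criterion("Robust", "Does it survive key assumptions breaking?", False),
    Criterion("Informed", "Is it grounded in evidence, with signals that show whether it's working?", False),
    Criterion("Specific", "Is it detailed enough for contributors to self-select?", False),
    Criterion("Positive", "Does it paint a future worth aiming for, not just survival?", False),
]


def report(criteria: list[Criterion]) -> str:
    """Summarise scores, listing the essential SAFE criteria first."""
    lines = []
    for c in sorted(criteria, key=lambda c: (not c.essential, c.name)):
        tier = "SAFE " if c.essential else "CRISP"
        score = "unscored" if c.score is None else f"{c.score}/2"
        lines.append(f"[{tier}] {c.name}: {score}  {c.notes}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: partially score one criterion, then print the checklist.
    CRITERIA[0].score, CRITERIA[0].notes = 1, "Covers loss of control; thin on misuse."
    print(report(CRITERIA))
```

Nothing here is clever - the point is simply that forcing yourself to write a score and a note next to each criterion tends to surface the gaps faster than reading the list and nodding along.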
An AI prompt for evaluating your plan
You are evaluating an AI safety plan to mitigate catastrophic AI risks. Evaluate each criterion and provide specific feedback on gaps, weaknesses, and missing elements.
SAFE Criteria (Essential Requirements)
Sufficient: If all the actions are carried out, the world will be in a good state. Just solving the alignment problem, or tackling near-term risks, is not sufficient. This should probably cover misuse, authoritarianism and loss of control risks at a minimum.
Does the plan address misuse, authoritarianism, and loss of control risks comprehensively?
Are there major risk categories or failure modes left unaddressed?
Action-oriented: The plan is a set of actions that enable people to contribute, and does not just describe a list of events that happen.
Are concrete actions specified with clear owners and timelines?
Can someone reading this plan identify specific next steps they could take?
Does it avoid vague recommendations like “increase research” without specifying how?
Feasible: It’s reasonably plausible the plan can be executed. This generally excludes plans that require actors to take significant actions against their own interests.
Are the proposed actions realistic given current incentives and constraints?
Does it account for competitive pressures and economic realities?
Do any major actors have to work against their own incentives?
Are the required resources (funding, talent, political capital) plausibly available?
Encompassing: The plan covers all actions needed, not just some for a short time frame or one jurisdiction.
Does it cover actors globally, from now until we’re in a safe state?
Does it address how to handle actors who don’t participate or actively oppose the plan?
CRISP Criteria (Quality Enhancements)
Clear: Good plans should probably be communicated clearly (and likely publicly), to allow contributors to get onboarded quickly. This is because ~all feasible plans will require multiple people to do things - so you’ll need to communicate it to others.
Could newcomers quickly understand their potential role?
Is the overall logic flow easy to follow?
Is it posted somewhere publicly, beginning with a short summary of no more than a few hundred words?
Robust: What happens when parts inevitably break? The best plans are fault-tolerant, and assume Murphy’s Law applies to AI safety interventions too.
What happens if key assumptions prove wrong or major components fail?
Are there backup approaches or alternative pathways?
How does it adapt to rapid technological or geopolitical changes?
Informed: The best plans rely on not just theory, but evidence from the world. Even better is to be able to tell if it’s working as it goes on - ideally from early on, so we can adjust it if we realise the plan won’t work.
Are any tenuous claims backed by empirical evidence?
Would you know if the plan is working or needs adjustment?
Specific: Detailed enough for people to evaluate whether and how to contribute to the plan. This should probably exclude a bunch of people currently working in AI safety.
Does it provide enough detail for potential contributors to self-select appropriately?
Are roles and responsibilities defined clearly enough to be actionable?
Will some people currently working on AI safety be excluded by this plan (they should be if it’s genuinely specific)?
Positive: Sure, not dying is the minimum bar, but what about actually making the future awesome? It’s also easier to rally people around a positive vision.
Does it articulate an inspiring vision beyond just avoiding catastrophe?
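If you want to actually run this rubric, here’s a minimal sketch that feeds a plan document plus the criteria to an LLM via the OpenAI Python client (v1+). The model name, the plan file path, and the abbreviated prompt text are illustrative assumptions - in practice you’d paste in the full rubric above and swap in whichever provider and model you use.

```python
# Rough sketch: send a plan and the SAFE/CRISP rubric to an LLM for critique.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY in the environment.
from pathlib import Path

from openai import OpenAI

# Abbreviated stand-in for the full evaluation prompt above.
EVALUATION_PROMPT = """You are evaluating an AI safety plan to mitigate catastrophic AI risks.
Evaluate each criterion and provide specific feedback on gaps, weaknesses, and missing elements.
SAFE criteria (essential): Sufficient, Action-oriented, Feasible, Encompassing.
CRISP criteria (quality): Clear, Robust, Informed, Specific, Positive.
For each criterion, answer its questions and cite concrete passages from the plan."""


def evaluate_plan(plan_path: str, model: str = "gpt-4o") -> str:
    """Return the model's critique of the plan at `plan_path`."""
    plan_text = Path(plan_path).read_text()
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,  # placeholder; use whatever model you have access to
        messages=[
            {"role": "system", "content": EVALUATION_PROMPT},
            {"role": "user", "content": f"Here is the plan to evaluate:\n\n{plan_text}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(evaluate_plan("my_ai_safety_plan.md"))  # hypothetical file name
```

Treat the output as a first-pass red team, not a verdict - the model will happily miss the same hard parts your plan does.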
Building on this
The uncomfortable truth is that most AI safety plans aren’t really plans at all—they’re lists of goals or vague principles. But the clock is ticking and the stakes couldn’t be higher.
It might be up to you to build something that actually works.




