AI safety needs better field-level coordination. Part of this is building a good plan - but how do you go about this?
By default, you might find yourself sitting in front of an empty Google Doc, a bit confused about where to start. Or you’ll have lots of ideas for things to research, but no overall approach guiding them - landing you at the same confused point, just ~2 weeks later and with a long page of notes. (To be fair, researching existing work can be good before diving in! You might want to read my short summary of the key buckets of existing plans.)
In January, I spent 100+ hours thinking hard about AI safety strategy. One of the outputs of this was some thoughts on how to think about strategy. This document is a tidied-up version of internal notes on approaches for building a plan, and is released alongside some notes about what a good plan looks like.
What follows is broken into 5 distinct approach areas, with a total of 11 concrete methods to develop a plan.
1. Systematic risk analysis
1.1. Bottom-up risk analysis
Start with a comprehensive set of risks. We started with MIT’s AI risk repository, but found this wasn’t actually very good (see learnings below). After reviewing 30+ resources, the best one we found for this purpose was:
BlueDot’s AI risks article (what we ended up using)
For each risk:
Clarify the risk: Define it clearly, in a way that separates it from other risks.
Map scenarios: Generate plausible manifestations under different conditions. For bioterrorism, scenarios might vary based on AI capability levels, regulatory environments, actor sophistication, and international cooperation levels. A basic scenario: expert-level AI accessible to individuals + minimal biosecurity controls + motivated bad actor.
Break scenarios into steps: Break each scenario into a causal chain. Bioterrorism example: (1) Radicalized individual gains access to capable AI system → (2) AI provides detailed bioweapon construction guidance → (3) Individual acquires necessary materials and equipment → (4) Individual successfully synthesizes pathogen → (5) Individual releases weapon effectively → (6) Pathogen spreads and causes mass casualties.
Identify interventions: For each step, brainstorm interventions that could block progression. Step 2 interventions might include: AI refusal training, output monitoring systems, capability restrictions on public models, or mandatory human oversight for dangerous queries. Step 3 might involve: supply chain monitoring, controlled substance tracking, equipment licensing, or suspicious purchase flagging.
Aggregate intervention portfolios: After mapping interventions across all major risks and scenarios, identify the minimal set that provides comprehensive coverage. Look for interventions that block multiple risks simultaneously (compute governance affects both misuse and development speed) versus narrow single-purpose interventions.
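To make the aggregation step concrete, here is a minimal sketch in Python that treats portfolio selection as a greedy set-cover problem: repeatedly pick whichever intervention blocks the most still-unblocked scenario steps. The step and intervention names are hypothetical placeholders, and a real version would also weight cost, tractability, and conflicts rather than coverage alone.

```python
# Minimal sketch: aggregate intervention portfolios as greedy set cover.
# All step and intervention names below are hypothetical placeholders.

# Which scenario steps each candidate intervention blocks (illustrative only)
coverage = {
    "compute governance": {"bio-2", "loss-of-control-1", "misuse-3"},
    "AI refusal training": {"bio-2"},
    "supply chain monitoring": {"bio-3"},
    "output monitoring": {"bio-2", "misuse-3"},
    "international verification": {"loss-of-control-1", "loss-of-control-2"},
}

def greedy_portfolio(coverage: dict[str, set[str]]) -> list[str]:
    """Greedily pick interventions until every step is covered (approximate minimal set cover)."""
    uncovered = set().union(*coverage.values())
    portfolio = []
    while uncovered:
        # Pick the intervention that blocks the most still-uncovered steps
        best = max(coverage, key=lambda name: len(coverage[name] & uncovered))
        if not coverage[best] & uncovered:
            break  # nothing left can cover the remaining steps
        portfolio.append(best)
        uncovered -= coverage[best]
    return portfolio

print(greedy_portfolio(coverage))
# ['compute governance', 'supply chain monitoring', 'international verification']
```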
Learnings from this approach:
Finding a good initial list of risks is surprisingly hard. Most lists feel incomplete, are poorly organised, and underdefine the risks. Even the best resources seem to miss large areas of ‘we developed aligned AI systems, and actors who aren’t directly malicious use them’ (things like gradual disempowerment, the intelligence curse, and AI-enabled oligarchic behaviour).
Clarifying harms is a surprisingly important step - we initially didn’t do this, and making this change massively helped clarify what the scenarios are targeting both for humans and LLMs.
Generating scenarios is hard when there are significant interactions between them. E.g. for bioterrorism you can largely think about each scenario independently. But for gradual loss of control it’s much more about the interactions (e.g. things go wrong kind of everywhere, including in the systems that are supposed to keep things on track - any one system going wrong would usually be fine on its own). These interactions mean there are many more combinations of problems to think about, and even if you pick one combination, the causal chain is much less clear about where to intervene. I suspect this also extends to the more general case where entire risks interact with each other, rather than just scenarios.
The analysis explodes quickly: even focusing on ~5 major risks generates ~25 scenarios (~5 per risk), ~125 steps (~5 per scenario), and potentially thousands of candidate interventions (several per step). This makes it very time-consuming and hard to do manually.
Luckily, LLMs are okay at doing a first pass of this analysis with enough nudges, although I struggled to get perfect output from an LLM for the interventions step. The Claude prompts we used are below; a rough sketch of running them via the API follows after the last prompt.
I am exploring a potential risk from transformative AI.
First, summarise your understanding of the risk.
Then, list the concrete harms that would result from the risk. Be specific about who or what would be harmed and how, rather than just describing the mechanism of the risk itself. Try to focus on the most significant and likely impacts.
For example, with AI-enabled terrorism, concrete harms include:
Direct loss of human life and physical injury from attacks
Psychological trauma to survivors and witnesses
Economic damage from destruction of infrastructure
The risk I want you to clarify the harms of is: [risk name]
You are a strategy analyst helping analyze potential risk scenarios from transformative AI. You should list brief, concrete, plausible scenarios for how a given harm or risk could manifest.
Focus on concrete, specific details rather than vague possibilities
Consider different actors, methods, and circumstances (like “what if TAI comes really soon?” or “what if TAI is super hard to align?”)
Stay within the defined scope of the risk
Make sure each scenario is distinct from the others
Include relevant enabling conditions or technologies
Here’s an example bioterrorism scenario:
A well-funded terrorist organization gets access to advanced AI that can analyze scientific papers and fill knowledge gaps
The AI helps them understand complex biochemistry and genetic engineering techniques
They use this knowledge to modify an existing virus (like influenza) to be more lethal and transmissible
They release it in major transit hubs for maximum spread
I want you to list scenarios for the following risk: [risk name]
This is a risk scenario analysis task that looks at risks from transformative AI. Please:
Take the provided scenario and break it down into sequential steps that would need to happen for the risk to materialize. Each step should represent a necessary component or action.
For each step, propose specific interventions that could prevent or mitigate the risk, marked with “**Intervention:**”. Consider:
Technical controls
Policy/regulatory measures
Social/cultural interventions
Detection and response capabilities
Where relevant, note uncertainties or questions about the step or intervention with “Uncertainty:” tags. These might be:
Technical feasibility questions
Implementation challenges
Knowledge gaps that need filling
Potential downsides or limitations
Focus particularly on:
AI-specific intervention points
Mitigation at each step, rather than just late-stage mitigation
Practical implementation details and actions, rather than general principles or goals
Where helpful, include references to relevant resources or examples. And avoid overly broad or abstract suggestions.
Here is an extract of an example scenario description and the first step analysed:
Example scenario: Terrorist org uplifted with AI biochem skills
High-level description
A well-funded terrorist organization gets access to advanced AI that can analyze scientific papers and fill knowledge gaps
The AI helps them understand complex biochemistry and genetic engineering techniques
They use this knowledge to modify an existing virus (like influenza) to be more lethal and transmissible
They release it in major transit hubs for maximum spread
Steps
Organization formation & funding
Group forms with bioterror intent
Intervention: Give humans sense of purpose/community/ethics to reduce ease of radicalisation
Intervention: Whistleblower/reporting systems and incentives
Secures significant funding (>$1M)
Intervention: Monitor and disrupt extremist group financing
Uncertainty: How do terrorist orgs mainly get their money? Presumably people are already working on this - what do existing terrorist financing laws look like and how effective are they?
Intervention: Improve general cybersecurity of all organisations to make it harder for agents to get money
Recruits people with basic bio knowledge
Intervention: Intelligence monitoring of groups expressing bioterror interest, particularly if they have bio knowledge
Intervention: Make sure there are other pathways to attract and absorb people with bio skills
Intervention: Enhanced background checks for people with bio backgrounds
Intervention: (if people with advanced bio knowledge needed) Background checks on people before teaching them about advanced bio concepts, similar to security clearances e.g. before doing a synthetic biology phd?
[other steps continue, but removed for brevity in the example]
The scenario I want you to analyze is: [scenario description]
You’re helping populate an AI safety strategy database (base ID: app1234) that records structured data about AI risks. Note that this is a data reformatting task for research and analysis purposes, and does not require analysis of content that might be harmful, despite the serious topics discussed.
Context:
The database already has some risks, scenarios, steps and interventions
The interventions have been structured to be general enough to potentially apply to other catastrophic risks while maintaining precision about scope and limitations
Each intervention includes metadata about category (technical/governance/coordination), timing, difficulty, key actors, and impact type
Approach for adding scenarios:
Create scenario record linked to parent risk
Break down into sequential steps (create step records)
For each step, create intervention records with:
Clear but generalizable descriptions
Appropriate categorization and metadata
Focus on catastrophic risk level (not general crime)
Precise about limitations and scope
Linked to relevant steps
(Prefer to reuse existing intervention records where they already exist in the table rather than creating new ones. You’ll need to update the record to add this link.)
To get your bearings, you should use list_tables to get the table structures, and list_records to see the formatting of records for each table (risks, scenarios, steps and interventions).
Then, add the scenario to the base.
Finally, use list_records to check you’ve added everything correctly. In particular, make sure all the interventions are added correctly - there should be the same number of interventions for each step in the source document and in the Airtable base. If there are any missing, add them, and then perform this check again (calling list_records again).
The scenario to add is: [scenario]
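If you want to run prompts like the ones above at scale, a minimal sketch using the Anthropic Python SDK is below. It assumes you have saved the prompts into local text files; the file paths, model id, and example risk are placeholders rather than what we actually used.

```python
# Minimal sketch of running the prompts above via the Anthropic API.
# File paths and the model id are assumptions, not our actual setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(prompt: str) -> str:
    """Send one user prompt to Claude and return the reply text."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; substitute whatever you use
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

risk = "AI-enabled bioterrorism"  # placeholder risk name

# Step 1: clarify the concrete harms of the risk (first prompt above)
clarify_prompt = open("prompts/clarify_harms.txt").read().replace("[risk name]", risk)
harms = ask(clarify_prompt)

# Step 2: generate distinct scenarios for the risk (second prompt above)
scenario_prompt = open("prompts/scenarios.txt").read().replace("[risk name]", risk)
scenarios = ask(scenario_prompt)

print(harms)
print(scenarios)
```

As noted above, the interventions step in particular needed manual review, so treat the output as a first pass rather than a finished analysis.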
This process often biases heavily towards small point solutions, rather than larger systemic changes that can fix many things at once. Results also often tended toward strategies resembling “muddle through doing generally good things”. This is still somewhat useful, but doesn’t offer much strategic direction about priorities, sequencing, or trade-offs.
Aggregating interventions into a strategy is extremely difficult. Many interventions conflict (open-weights restrictions vs. democratization), operate at different timescales (technical research vs. immediate policy), or require incompatible institutional structures. Additionally, it’s not clear how different interventions should be prioritised.
1.2. Top-down risk analysis
Identify major risk categories (misalignment, misuse, concentration of power).
Then suggest top-down measures that would give us confidence these risks would be reduced, progressing from less specific to more specific. Stress-test these against specific instances of the risks.
This can result in better insight into higher-level actions than the bottom-up risk analysis approach, and it seems easier to move from high-level actions to low-level actions. (Although be cautious of ending up with only high-level principles or goals, which are easy traps to fall into with this approach.)
1.3. Protecting fundamental rights
Identify non-negotiable features of a good society. The EU fundamental rights provide a useful template: human dignity, liberty, equality, solidarity, citizens’ rights, justice.
Analyze how advanced AI might threaten each right—superintelligent systems could undermine human agency, concentrate power, enable mass surveillance, or eliminate meaningful human participation in important decisions.
Draw out clear threat models for how AI actually does this, and then what interventions could be put in place to break this chain.
This is somewhat similar to the bottom-up approach and shares many of its problems. However, the twist of focusing on protecting rights makes it more likely you will end up with a positive vision of a future society, and it gets around the ‘there isn’t a great list of risks’ problem to some extent.
1.4. Existential security targets
Define what “existential security” looks like in an advanced AI world—the conditions under which humanity can flourish long-term with powerful AI systems. This might include: human agency preservation, meaningful economic participation, protection against catastrophic risks, preservation of human values and culture, and maintained capacity for course correction.
Then describe steps for getting to this state (similar to golden path milestones), or risks to achieving it (similar to ‘fundamental rights’ approach).
2. Backcasting from success
2.1. Success stories
Brainstorm many different future stories of how AI development goes well. The FLI Worldbuilding competition has examples (particularly the answers to ‘AGI has existed for at least five years but the world is not dystopian and humans are still alive! Given the risks of very high-powered AI systems, how has your world ensured that AGI has at least so far remained safe and controlled?’), although many are quite vague.
These stories might look like:
AI Pause World: Global agreement not to build AGI, enforced through international institutions. Society tackles current challenges (climate, poverty, disease) through human ingenuity over extended timelines. Resembles current world but with more time and coordination for solving collective problems.
Managed Transition: Human-level AI systems deployed gradually with strong institutional oversight. Unemployment reaches ~20% as 60% lose jobs but 1/3 find new manual/specialized work for a while. Robust UBI systems maintain social stability despite some friction. AI systems and philosophers collaborate on governance frameworks during extended “reflection period.”
Unipolar Democracy: Single advanced AI developer operates under strong democratic institutions (“Norway for the world”). High value placed on individual freedom and autonomy within stable social framework. Universal basic income enables diverse forms of human flourishing even without economic productivity.
Multipolar Slow Takeoff: 2-3 major AI powers develop capabilities gradually, with strategic “safe sabotage” preventing any single actor from achieving decisive advantage. Competitive dynamics encourage safety investments while preventing concentration of power.
These stories should span different technological trajectories, international dynamics, and timeline assumptions.
Finally, identify common elements and enabling conditions across scenarios - and figure out how to make them happen.
2.2. Milestone mapping
Break the path to AI safety into key checkpoints that must be navigated successfully. This simplifies the overwhelming challenge of AI safety into achieving a sequence of smaller milestones that we can then generate actions for. These milestones might be:
Milestone 1: Technical controllability - It becomes technically possible to control/align TAI systems (solving the core alignment problem)
Milestone 2: Deployment coordination - We only deploy controlled/aligned TAI (solving the coordination problem between developers)
Milestone 3: Power distribution - TAI deployers don’t use their systems to dominate everyone else (preventing AI-enabled oligarchy or totalitarian lock-in)
Milestone 4: Economic transition - Most humans retain meaningful power and aren’t discarded during automation (managing the intelligence curse and economic disruption)
Milestone 5: Abundance and purpose - Humans live in resource abundance and use AI systems to tackle remaining major challenges (global poverty, aging, climate change, finding meaning)
Design milestone-specific interventions: Each milestone requires different approaches. Milestone 1 needs technical research and safety standards. Milestone 2 requires international coordination mechanisms and verification systems. Milestone 3 might need international governance or some interventions described here. Milestone 4 could involve gradual deployment policies or mechanisms to retain people’s economic relevance.
Account for ongoing background risks: Some threats persist across all milestones rather than being solved sequentially - such as people using AI for bioterrorism, or international conflicts escalating due to AI capabilities.
Learnings from this approach:
We explored this approach after bottom-up risk analysis proved unwieldy. The milestone structure provides clearer strategic direction and allows different groups to “own” specific milestones rather than everyone working on everything simultaneously. However, the milestones aren’t as cleanly sequential as they initially appear—power distribution problems (Milestone 3) might emerge during the coordination phase (Milestone 2), or as a result of economic disruption (Milestone 4).
We found that generally this approach resulted in us getting better insights on robustly good directions to push in across multiple milestones, such as:
buying more time (slowing deployment, preventing extreme recursive improvement)
building wisdom (research, spreading good ideas)
developing human capital (upskilling, field-building)
accumulating political capital (relationship-building, communication)
enabling learning from smaller mistakes rather than catastrophic failures
3. Strategic landscape mapping
3.1. Preferred parameters
Identify key parameters affecting AI development outcomes:
timeline to transformative AI
takeoff speed (gradual vs. sudden capability jumps)
international cooperation levels
speed of diffusion of AI capabilities
how far behind the frontier secondary actors are
how far behind the frontier open-source models are
Evaluate how different parameter values affect different categories of AI risk. For example, compared to slow takeoff, very rapid takeoff might:
exacerbate loss of control risk (less time to do empirical safety research on similar models; things move quickly and are hard to govern); while
reducing misuse risks (because it shortens the period between ‘AI is dangerous in the hands of the wrong people’ and ‘AI can patch all the vulnerabilities’)
(Your assessment of different parameter changes might be different!)
Given this, you can probably identify parts of this parameter space that are better or worse. Then identify interventions that push toward better regions of this space.
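As a toy illustration, the sketch below enumerates a small parameter space and ranks combinations by a crude risk score. The parameter values and scores are illustrative assumptions, not our actual assessments; the point is just that making the scoring explicit forces you to state which regions of the space you prefer and why.

```python
# Toy sketch: enumerate and rank regions of the strategic parameter space.
# Parameter values and risk scores are illustrative assumptions only.
from itertools import product

parameters = {
    "takeoff_speed": ["gradual", "rapid"],
    "international_cooperation": ["low", "high"],
    "open_source_gap": ["small", "large"],
}

def risk_score(combo: dict[str, str]) -> int:
    """Higher = worse. Crude hand-set scores standing in for real analysis."""
    score = 0
    score += 3 if combo["takeoff_speed"] == "rapid" else 1            # loss of control
    score += 2 if combo["international_cooperation"] == "low" else 0  # coordination failure
    score += 2 if combo["open_source_gap"] == "small" else 1          # misuse
    return score

# Enumerate every combination and print from best (lowest risk) to worst
combos = [dict(zip(parameters, values)) for values in product(*parameters.values())]
for combo in sorted(combos, key=risk_score):
    print(risk_score(combo), combo)
```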
3.2. Exploring strategic dimensions
Most big-picture AI safety strategies operate within three key axes that shape their fundamental assumptions:
Centralization vs. Distribution: Are you building toward concentrated AI development (few major labs, government control) or widespread access (open weights, democratized capabilities)?
Cooperation vs. Competition: Do you expect coordination and agreements to keep AI safe, or competitive pressures and deterrence between actors?
Individual vs. International: Does success primarily hinge on unilateral action (e.g. one country builds and uses aligned superintelligence) or international coordination (global governance bodies, treaty frameworks)?
In this approach, you should first explore the various possible combinations. Then later, build this up into multi-quadrant strategies - for example, you might use cooperative agreements, but with the threat of competitive dynamics to encourage adoption and as a backup.
You can also use this as a tool to expand your thinking. Most people gravitate toward familiar areas based on their background or ideological priors. Evaluate where you usually end up thinking in this space, then force yourself to explore areas you’re less familiar with. What would be the strongest version of plans in that space?
4. Leverage identification
4.1. Applying strengths against weaknesses
This approach borrows from business strategy—identify your advantages and your competition or opponent’s vulnerabilities, then find where they intersect.
We do need to stretch this a bit for field strategy. For AI safety, “strengths” means resources and capabilities that make success more achievable, while “weaknesses” are reasons AI harms might be preventable or why harmful actors might fail.
Examples of strengths might be:
$100M+ annual philanthropic funding
Talent pipeline able to place people in key positions
Safety experts tend to have more accurate predictions of AI futures than others
Existing public concern about AI risk
While weaknesses might include:
Reckless AI companies vulnerable to losing staff over safety concerns
Compute supply chains have extreme concentration, so harmful actors could be cut off easily
The public generally distrusts actors, like big tech, that lobby against regulation
Once we have a long list of strengths and weaknesses, we can look for opportunities - by combining strengths, exploiting vulnerabilities, or both at the same time.
This can result in very high-impact ideas. However, this does tend to generate point solutions or contributions to “muddle through doing generally good things” rather than comprehensive plans.
5. Building on existing (non-plan!) work
5.1. Collate arguments AI goes well - and make them true
Collect existing arguments for why AI development might be fine - engaging with some of AI safety’s fiercest critics. Steelman (make the best case possible for) these arguments. Some of the best reasons we found included:
Economic incentives align developers with safety
Path dependency from democratic governance might mean this just continues
Technical alignment ends up being easier than expected
We have longer than we expect before advanced AI, and we use this time to prepare
Warning shots/actual harms closer to the time will spur government action
Identify what conditions or interventions make these arguments more likely to hold true, then design strategies to strengthen those conditions. For example, asking questions like:
How could we shape market incentives to demand safety/alignment even more? Or make sure these market incentives hold in worlds with more internal deployment/nationalised AI?
How do we prevent weakening of democratic institutions going forwards? What international coordination mechanisms could ensure that democratic governance wins globally, not just domestically?
What system designs could we promote so that alignment is more likely to be easy?
How can we be more sure that advanced AI is far away? And how should we encourage good preparation if this is the case?
How could we prepare so that when warning shots occur, we leverage this window of opportunity to ensure that actions taken are helpful and well-informed rather than chaotic knee-jerk responses?
5.2. Priority paths analysis
Examine existing high-impact career paths and organizations (the 80,000 Hours database provides useful data).
Reverse-engineer why certain interventions are considered high-priority by analyzing funding patterns, talent allocation, and stated organizational theories of change. Extract generalizable principles about what makes interventions valuable, then apply these principles systematically.
Note that the market may well be inefficient - I think a lot of people don’t have very clear insights on AI safety strategy and go by broad heuristics, so I wouldn’t index super hard on this.
Bonus: Advice for getting started
Get feedback early and often. Share rough outlines, not polished drafts. Find someone willing to interrupt your flow (or use the Pomodoro technique to do this yourself) and challenge your assumptions—this prevents attachment to doomed approaches. If you can’t find collaborators, simulate them: constantly ask (yourself or Claude) what an informed critic would say about your work.
Speak with experts. Target three types: domain experts who know specific risks deeply, field strategy experts from other fields who understand coordination problems (not general corporate strategy people!), and implementation experts who’ve actually tried to influence AI development. Ask them what you’re missing, or what work is needed, not whether they agree with your conclusions. Make it clear you’re happy being told your approach is fundamentally flawed - by default many people fear giving real criticism to people they don’t know well.
Design for criticism. The best plans are those that others can easily critique, adapt, or build upon. Vague plans are criticism-proof but also useless. Specific plans get attacked but also improved.
Timebox aggressively. Strategy construction has infinite tar-pit potential, and every approach we tried suffered from this. Set a deadline, stick to it, and remember that an implemented mediocre plan usually beats a perfect plan gathering dust.
Expect iteration. Your first attempt will be wrong in important ways. That’s fine—the goal is building something concrete enough that the ways it’s wrong become visible and fixable.
The uncomfortable truth is that we’re running out of time for leisurely strategic contemplation. The field needs more groups willing to stick their necks out with concrete, falsifiable plans rather than eternally refining their thinking. It might be up to you to build something that might actually work.