Anthropic

Weak 1.8/5

Rating scale: very weak, weak, moderate, substantial, strong

Risk Identification: 41%
Risk Analysis and Evaluation: 19%
Risk Treatment: 30%
Risk Governance: 50%

Up to date as of October 2025

Introduction

Starting from Version 3.0 of their Responsible Scaling Policy (RSP), Anthropic have been splitting up their framework into a number of documents serving different functions, including the RSP, Frontier Compliance Framework, RSP Noncompliance Reporting and Anti-Retaliation Policy, Risk Report, and Frontier Safety Roadmap. Their new approach is characterized by a split between unilateral and competitor-contingent commitments, in-depth safety cases for their identified risk domains, and the regular creation, updating, and publishing of risk reports.

Overview

Documents analyzed

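For Anthropic, the following documents were taken into account:

Responsible Scaling Policy (RSP)
Frontier Compliance Framework (FCF)
RSP Noncompliance Reporting and Anti-Retaliation Policy
Risk Report
Frontier Safety Roadmap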

While most of the other companies publish their Frontier Safety Framework as one discrete document, Anthropic’s policies are spread out across several. We treated the five documents as a comprehensive description of Anthropic’s risk management practices.

Highlights relative to others

Specifies the role of a dedicated risk officer in the form of the Responsible Scaling Officer (4.2.1), and uniquely has an additional governance body sitting outside the Board with its Long-Term Benefit Trust (4.4.2).
Strong speak-up culture (4.5.3) and extensive detail on external reporting of risks (4.6.1) and their governance structure (4.6.2).
Strongest on assurance processes, providing the most extensive information on the assumptions made for their sabotage safety case (3.1.3.3).

Weaknesses relative to others

Lacks a risk tolerance (2.1.1.1, 2.1.1.2, 2.1.1.3).
Little evidence that their risk identification methodology is adequate (1.3.2.1). They do not specify a process for identifying novel risks (1.2.1, 1.2.2, 3.2.4.1).
Lacks a central risk team (4.2.6) and a risk management committee (4.1.2).
No detailed information on the content and addressee of information that will be shared with industry peers and government bodies (4.6.3).

Changes

Compared to the second version of the Responsible Scaling Policy, they:

Weakened involvement of external parties in evaluating risks (3.2.1.5) and mitigations (3.2.2.3), as recipients of shared information (3.2.3.1, 4.6.3), and as external auditors (4.3.2).
Weakened KCI activation triggers (2.2.2) and commitments to halt development (2.2.4).
Weakened KRI monitoring mechanisms (3.2.1).

1. Risk Identification

Moderate 41%
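
The percentages in parentheses in the headings below read as weights, and the percentage after them as the score for that criterion. The following is a minimal sketch of how the section score could be reproduced, assuming each parent score is a weighted arithmetic mean of its children and that displayed values are rounded half up. Both the aggregation rule and the helper names (`weighted_score`, `round_half_up`) are assumptions made for illustration; the source does not state its formula. The weights and leaf scores are taken from the headings in this section.

```python
import math


def weighted_score(children):
    """Weighted mean of (weight_percent, score_percent) pairs.

    Assumed aggregation rule: weights are percentages that sum to 100.
    """
    total_weight = sum(weight for weight, _ in children)
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    return sum(weight * score for weight, score in children) / 100


def round_half_up(value):
    """Round to the nearest integer, with .5 rounding up (assumed display rule)."""
    return math.floor(value + 0.5)


# Leaf scores as shown in the headings of section 1.
s_1_1 = weighted_score([(50, 75), (50, 50)])              # 1.1.1, 1.1.2
s_1_2 = weighted_score([(70, 10), (30, 10)])              # 1.2.1, 1.2.2
s_1_3_2 = weighted_score([(70, 10), (15, 10), (15, 25)])  # 1.3.2.1 to 1.3.2.3
s_1_3 = weighted_score([(40, 50), (40, s_1_3_2), (20, 50)])
s_1 = weighted_score([(40, s_1_1), (20, s_1_2), (40, s_1_3)])

for label, score in [("1.1", s_1_1), ("1.2", s_1_2), ("1.3.2", s_1_3_2),
                     ("1.3", s_1_3), ("1", s_1)]:
    print(f"{label}: {round_half_up(score)}%")
```

Under these assumptions the computed values match the displayed 63%, 10%, 12%, 35% and 41%. Python's built-in `round` uses round-half-to-even and would turn 62.5 into 62, which is why the sketch uses an explicit half-up helper.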

1.1 Classification of Applicable Known Risks (40%) 63%

1.1.1 Risks from literature and taxonomies are well covered (50%) 75%

The FCF covers cyber offense, CBRN, harmful manipulation, and sabotage/loss of control, with two-tier thresholds for each. Automated R&D in key domains is covered as a capability threshold in the RSP and as a sub-tier under sabotage and loss of control in the FCF. The Risk Report provides supporting analysis for sabotage, AI R&D, and CB weapons.

Loss of control is broken down extensively. Beyond the high-level “sabotage and loss of control” category, the Risk Report enumerates eight distinct sabotage pathways with separate frequency, complexity, and mitigation assessments.

No canonical taxonomies are cited to justify the chosen scope. The Harmful Manipulation tiers are flagged as exploratory and likely to change, and radiological and nuclear risks have been dropped from analytical focus in favour of CB-only treatment, with partial justification.

To improve, Anthropic should reference canonical taxonomies (e.g. Weidinger et al. 2022, Hendrycks et al. 2023, or Slattery et al. 2024) to demonstrate that its risk identification process does not miss domains experts highlight, firm up its exploratory Harmful Manipulation tiers, and bring radiological and nuclear risks back into analytical focus.

Quotes:

“Based on this analysis, the FCF currently addresses the following systemic risk categories: Cyber offense […] CBRN […] Harmful manipulation […] Sabotage and loss of control, including evasion of oversight or unsupervised conduct, and autonomous behavior that would constitute serious crimes (such as assault, extortion, or theft) if committed by a human” (FCF, p.5)

“Campaign Infrastructure Enablement: Escalation to systematic infrastructure enabling sophisticated scaled influence campaigns. Automates >50% of steps normally requiring multiple sophisticated adversarial actors.” (FCF, p.8)

“Non-novel chemical/biological weapons production. AI systems with the ability to significantly help individuals or groups with basic technical backgrounds (e.g., undergraduate STEM degrees) create/obtain and deploy chemical and/or biological weapons with serious potential for catastrophic damages.” (RSP, p.6)

“Moderately resourced threat actors (including, for example, expert-backed teams) create/obtain and deploy novel chemical and/or biological weapons with potential for catastrophic damages far beyond those of past catastrophes in this category such as COVID-19.” (Risk Report, p.79)

“Diffuse sandbagging on safety-relevant R&D […] Targeted sabotage of safety-relevant R&D […] Code backdoors to help future misaligned models […] Poisoning the training data of future models […] Self-exfiltration and autonomous operation […] Persistent rogue internal deployment […] R&D sabotage within other high-resource AI developers […] Decision sabotage within major governments” (Risk Report, pp.37–38)

1.1.2 Exclusions are clearly justified and documented (50%) 50%

Anthropic covers cyber offense, CBRN, harmful manipulation, AI R&D, and sabotage/loss of control, with loss of control heavily decomposed. The only category de-scoped in the RSP and Risk Report is radiological and nuclear, with a justification given: Risk Report footnote 24 (p.62) combines an internal threat-model argument — that R/N risks are bottlenecked by access to physical materials rather than knowledge — with a named third-party engagement, Anthropic’s partnership with the U.S. Department of Energy on modeling nuclear weapons risk.

The justification still has gaps. The footnote asserts the materials-bottleneck argument but does not defend it, e.g. with respect to radiological materials that are materially less constrained than fissile material. The Department of Energy partnership covers nuclear; no analogous third party is named for radiological risk. Separately, the framework does not publish a list of research or exploratory categories with criteria that could elevate those categories into tracked ones. Harmful Manipulation is included but flagged as exploratory without stated criteria for graduating to firm tiers.

To improve, Anthropic should defend the materials-bottleneck argument explicitly for radiological risk (or name an external partner covering it as DoE covers nuclear), publish an exploratory-category list with promotion criteria (most directly applicable to Harmful Manipulation), and reconcile the CBRN labeling between the FCF and the Risk Report so the framework’s current scope is unambiguous.

Quotes:

“‘CB’ stands for ‘chemical and biological.’ We previously used the acronym ‘CBRN,’ for ‘chemical, biological, radiological, and nuclear,’ but as our threat modeling has progressed, we have decided to focus this analysis on chemical and biological weapons for the time being, since we believe these are most relevant to the kind of risk posed by AI models. That is, their development can be much more greatly accelerated by knowledge, whereas radiological and nuclear threats are more bottlenecked by access to materials. However, although we do not focus on them in this Risk Report, we are working closely with the Department of Energy to ensure we are adequately modelling risks from nuclear weapons.” (Risk Report, p.62)

“the FCF currently addresses the following systemic risk categories: Cyber offense […] Chemical, biological, radiological, and nuclear (CBRN) threats […] Harmful manipulation […] Sabotage and loss of control” (FCF, p.5)

“We will continually update the RSP as we learn more about AI capabilities and risks, develop and refine technical safety measures, and gain more experience navigating an ecosystem in which the risks to society depend on the actions of many developers.” (RSP, p.3)

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 10%

1.2.1 Internal open-ended red teaming (70%) 10%

There is some indication of internal red-teaming feeding risk identification, with the FCF stating that “to understand the full range of harmful outcomes that could arise from our models, we draw on internal expertise, extensive red-teaming conducted both internally and with external partners, and authoritative research in relevant fields”. This shows awareness that internal red-teaming contributes to understanding what harms a model could enable, beyond known and pre-specified categories.

The framework does not commit to an internal process explicitly for identifying novel risk domains or changed risk profiles within pre-specified domains, and provides no detail on methodology, resources, time, model access, or required expertise for such a process. The internal red-teaming commitments that are described in operational detail elsewhere — the Responsible Scaling Policy’s commitment to “develop our internal red-teaming of our deployment safeguards” and the Frontier Safety Roadmap’s “world-class internal red-teaming” target — are framed around finding jailbreaks against existing safeguards rather than discovering unknown risks.

To improve, Anthropic should commit to an internal process explicitly dedicated to identifying either novel risk domains or novel risk models and changed risk profiles within pre-specified domains (for example, capability shifts such as extended context length altering the scale of zero-shot learning feasibility), and provide methodology, resources and required expertise for it.

Quotes:

“To understand the full range of harmful outcomes that could arise from our models, we draw on internal expertise, extensive red-teaming conducted both internally and with external partners, and authoritative research in relevant fields.” (FCF, p.5)

“Develop our internal red-teaming of our deployment safeguards to the point where our internal red-teaming performs better at finding potential jailbreaks than the collective abilities of the participants in our established bug bounty programs.” (RSP, p.9)

1.2.2 Third party open-ended red teaming (30%) 10%

The framework acknowledges external red teaming as one input to risk identification. The FCF states that Anthropic understands “the full range of harmful outcomes that could arise from [their] models” through “internal expertise, extensive red-teaming conducted both internally and with external partners, and authoritative research in relevant fields”, and reserves the option to “solicit input from external actors in relevant domains […] including the identification of potential risks.” Separately, a footnote in the Risk Report (fn. 43, p.94) references a “third-party testing program” through which UK AISI surfaced a novel pre-deployment behavior in Claude Opus 4.5 that Anthropic had not identified itself, demonstrating that a third-party pre-deployment testing channel exists and can surface unknown issues.

However, the FCF commitments are single sentences with permissive language (“we may”) and no specification of methodology, resourcing, time, expertise criteria, or scope. The third-party testing program, the closest existing activity to the criterion, is mentioned only as an incidental footnote, with no framework-level commitment to its scope, selection of testers, pre-deployment access guarantees, or dedicated focus on identifying unknown risks. The external red teaming that Anthropic does describe in operational detail (e.g., bio expert red teaming by Deloitte and SecureBio) is scoped to validating performance against pre-specified threat models, not to open-ended discovery of novel risk domains or novel risk models within them.

To improve, Anthropic should formalize the third-party testing program as a framework-level commitment to external open-ended red teaming aimed at identifying unknown risks pre-deployment, with documented selection criteria, justified time and access allocations, and explicit expertise requirements. Separating risk-discovery red teaming from threat-model-validation red teaming in the framework would clarify that the former is a committed, dedicated process rather than an incidental byproduct of the latter.

Quotes:

“The closest thing that we are aware of to a counterexample is the finding by UK AISI that Claude Opus 4.5 will sometimes refuse to participate in certain AI safety research tasks for dubious reasons. While we did not identify this ourselves, it still came to our attention before public deployment through our third-party testing program, and was disclosed in the Claude Opus 4.5 System Card.” (Risk Report, p.94)

1.3 Risk modeling (40%) 35%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 50%

Anthropic’s risk modeling is now anchored in the Risk Report, which publishes threat models for each of the four priority risk domains it tracks: sabotage, automated R&D in key domains, non-novel chemical/biological weapons production, and novel chemical/biological weapons production. Each threat model follows a consistent structure covering the threat, the role of AI, potential magnitude and likelihood of impact, and rationale for prioritization. The sabotage threat model is decomposed into eight named pathways with frequency, complexity, and mitigating-factor analysis, and the non-novel CB threat model is decomposed into eight threat variants. Section 4.2 names expert collaborators (SecureBio, Deloitte Consulting, and a Frontier Model Forum workshop), satisfying the requirement that experts involved be identified. The RSP 3.0 (§3.3) makes “threat model identification” and “threat model specification” required contents for every Risk Report, embedding this as an ongoing commitment rather than a one-time publication.

Several gaps remain. The broader set of considered-but-deprioritized scenarios is not made public. Coverage is also uneven: the sabotage and non-novel CB models are detailed, but the automated R&D and novel CB models are thinner, and the novel CB model itself flags that “developing the relevant evaluations remains a work in progress.” Harmful manipulation appears in the FCF as one of four systemic risk categories, alongside a statement that Anthropic is “in the early stages of developing our approach”. No threat model for harmful manipulation is published. Cyber offense is likewise listed as a systemic risk category in the FCF with two defined risk tiers, yet has no corresponding threat model in the Risk Report; it is also absent from the RSP’s capability threshold table. The risk modeling methodology is not published.

To improve, Anthropic should publish the broader set of scenarios considered but deprioritized along with the reasoning for deprioritization, and bring harmful manipulation, cyber offense, novel CB, and automated R&D up to the level of detail seen in the sabotage and non-novel CB sections. They should also publish their risk modeling methodology.

Quotes:

“2.6.1 Pathway 1: Diffuse sandbagging on safety-relevant R&D; 2.6.2 Pathway 2: Targeted sabotage of safety-relevant R&D; 2.6.3 Pathway 3: Code backdoors to help future misaligned models; 2.6.4 Pathway 4: Poisoning the training data of future models; 2.6.5 Pathway 5: Self-exfiltration and autonomous operation; 2.6.6 Pathway 6: Persistent rogue internal deployment; 2.6.7 Pathway 7: R&D sabotage within other high-resource AI developers; 2.6.8 Pathway 8: Decision sabotage within major governments” (Risk Report, pp.37–38)

“Claude Opus 4.6 could insert security backdoors that are relatively easy for subsequent AI models to find, relatively hard for us to notice, and drastically increase the ease of compromising our systems and hence sabotaging us through other pathways listed here. […] Frequency and complexity: Low Frequency, High Complexity. […] To insert vulnerabilities that have very large marginal impact […] and are much easier for subsequent systems to notice than for us to notice […] would likely require significant strategic thinking and/or many attempts, and likely would require high subversion skill to reliably avoid raising alarms.” (Risk Report, p.41)

“The threat: Individuals or groups with relatively modest resources sometimes attempt mass murder […] The casualties from such an attempt might be dramatically increased if the attackers had access to chemical or biological weapons. […] Role of AI: Sufficiently capable AI models could provide an uplift on the production of chemical or biological weapons that goes well beyond what’s currently available on the public internet, and thus could raise novices to the level of experts. […] Potential magnitude of impact: Biological pathogens have caused among the most damaging catastrophes in history, and therefore an effective attack using one is clearly a catastrophic risk. […] Likelihood of impact: Terrorists (or terrorist actors) have previously used chemical and biological weapons in attacks, though these have (to date) been rare with limited casualties. Preliminary analysis suggests that AI capabilities could materially increase the likelihood of such events, although a great deal of uncertainty remains in current assessment methodologies. […] Why this is a priority threat: […] This is especially true because of the difficulty of constructing robust defences against biological and chemical weapons […] and because of the potential lack of early warning signs of such an attack.” (Risk Report, p.63)

“We developed these views by consulting with experts, including from Deloitte Consulting and SecureBio. Our threat models were also informed by an expert workshop organized by the Frontier Model Forum.” (Risk Report, p.65)

“We start with what we consider the most salient and likely variant of the relevant threat model: a threat actor uses a highly effective, publicly known universal jailbreak […] Threat variant 2: a threat actor finds a highly effective, nonpublic universal jailbreak on their own. […] Threat variant 3: a threat actor obtains a highly effective, nonpublic universal jailbreak from someone else […] Threat variant 4: a threat actor exploits a trusted user exemption. […] Threat variant 5: a threat actor steals our model weights […] Threat variant 6: an attacker steals our model weights and uses them to create a safeguard-free version […] Threat variant 8: an attacker obtains significant uplift from our model despite not doing any of the above, e.g. by asking questions about seemingly benign topics that ultimately inform their work on the uses of concern.” (Risk Report, pp.72-76)

“Factual information. We will describe how we identify, evaluate, and mitigate catastrophic risks. A Risk Report will document the following: 1. Threat model identification: Our criteria for determining which catastrophic risks (i.e., threat models) we assess. 2. Threat model specification: The relevant threat models (which will, at a minimum, include those discussed above).” (RSP, p.11)

“Prior to launching a model, we estimate the probability and severity of harm for CBRN, sabotage and loss of control, and cyber offense risks. We are in the early stages of developing our approach to assessing harmful manipulation risks.” (FCF, p.5)

Acknowledged uncertainty on novel CB:
“Threat model has especially high uncertainty: While we believe that the broad threat model here is worthy of serious attention, it is inherently difficult to connect to AI capability evaluations, and developing the relevant evaluations remains a work in progress.” (Risk Report, p.81)

1.3.2 Risk modeling methodology (40%) 12%

1.3.2.1 Methodology precisely defined (70%) 10%

Anthropic’s framework commits to documenting threat models in its Risk Reports. The RSP requires each report to include threat model identification criteria and threat model specifications, and the FCF states that risk identification “combines threat modeling with evaluations across multiple domains,” drawing on literature reviews, expert consultation, and red-teaming. These establish that a process exists, but do not constitute a defined methodology. There is no documentation of use of standard techniques such as event trees, fault trees, or Fishbone diagrams, and the pathway decompositions do not break risks into discrete, measurable steps with explicit causal links.

Expert elicitation is partly documented: for non-novel chemical and biological weapons, the Risk Report names Deloitte Consulting and SecureBio as consulting experts, and references an expert workshop organized by the Frontier Model Forum. Evaluations are also developed with SecureBio and Signature Science. However, the detail in the Risk Report is largely post-hoc description of past analysis rather than a forward-looking methodology commitment in the framework.

To improve, Anthropic should publish a precise risk-modeling methodology committed to in the framework itself, specifying a formal technique such as event trees or fault trees that decomposes each pathway into discrete, measurable steps, and document a structured expert-elicitation protocol rather than naming collaborators ex post.

Quotes:

“Factual information. We will describe how we identify, evaluate, and mitigate catastrophic risks. A Risk Report will document the following:
1. Threat model identification: Our criteria for determining which catastrophic risks (i.e., threat models) we assess.
2. Threat model specification: The relevant threat models (which will, at a minimum, include those discussed above).” (RSP, p.11)

“Our risk identification approach combines threat modeling with evaluations across multiple domains. We analyze both misuse opportunities (how a model’s capabilities could be exploited by threat actors) and risks arising from potential misaligned model behavior. To understand the full range of harmful outcomes that could arise from our models, we draw on internal expertise, extensive red-teaming conducted both internally and with external partners, and authoritative research in relevant fields.” (FCF, pp.4–5)

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 10%

Anthropic’s documents do not contain an explicit mechanism by which novel risks or risk pathways identified via red teaming or other evaluations trigger further risk modeling and scenario analysis. The closest relevant evidence is the first “Looking forward” section of the Risk Report, which enumerates load-bearing claims underpinning each threat model and the circumstances “that would call the applicability of this risk report into doubt”. This implicitly acknowledges that specific evaluation findings could necessitate reassessment. Similarly, for the CBRN pathway, the Risk Report commits to monitoring for “significant changes to our threat models” and lists bug-bounty and threat-intelligence findings as inputs. The RSP reinforces this by requiring each Risk Report to incorporate evaluation evidence and include a forward-looking plan, establishing the document through which reassessment would be registered.

However, framing circumstances as making a report “no longer applicable” is not the same as committing to trigger further risk modeling. None of Anthropic’s framework documents specify who conducts a reassessment, on what timeline, or at what evidentiary threshold. Open-ended red teaming is not named as a distinct input, and there is no general commitment spanning all risk domains.

To improve, Anthropic should commit in the RSP to a structured process by which novel findings from red teaming or evaluations trigger further risk modeling and scenario analysis, naming open-ended red teaming explicitly and specifying who reassesses, on what timeline, and with what evidentiary threshold.

Quotes:

“[Load-bearing claim:] We are aware of no dangerous coherent misaligned goals (hidden or otherwise) that could substantially increase risk from pathways laid out here, beyond those already discussed. [Example circumstance:] An updated alignment audit reports major new concerning findings relevant to the pathways discussed here.” (Risk Report, p.49)

“[Load-bearing claim:] The methods used in the alignment audit appear sufficient to identify dangerous goals, values, or propensities if they are present in models at or below Claude Opus 4.6’s capability level. [Example circumstances:] A model-organisms study shows that it is possible for such a model to circumvent the full suite of audit methods and still realize one of the pathways discussed here. […] We observe a model that has previously undergone an alignment audit attempting an unexpected and egregiously misaligned action that could substantially raise the risk from one of the pathways discussed here.” (Risk Report, p.50)

CB-pathway monitoring intention:
“We monitor for:
– Changes to the threat model. We will consider any significant changes to our threat models — for example, related to the length of model access required to achieve uplift, the number of potential threat actors, and changes in the complexity of the threat pathway. […]
– Public availability of capability-preserving universal jailbreaks for ASL-3 uses of concern. Our bug bounty program and threat intelligence work are intended to give us information about availability of jailbreaks.” (Risk Report, p.76)

1.3.2.3 Prioritization of severe and probable risks (15%) 25%

The Risk Report provides qualitative estimates of the severity and probability of the four chosen risks. While the “potential magnitude of impact” is uniformly judged to be extremely high, the “likelihood of impact” spans from “highly plausible” and “at least plausible” to “highly unlikely”, though they also note that “a great deal of uncertainty remains”. The Risk Report also includes a list of criteria Anthropic uses to prioritize risk models, one of them being “a combination of high potential damages and high likelihood for these potential damages”. However, the report does not include probabilities or likelihoods for de-prioritized risk models, making it difficult to independently assess Anthropic’s decision to include or exclude specific risk models. Further, while the FCF states that Anthropic “estimate the probability and severity of harm for CBRN, sabotage and loss of control, and cyber offense risks”, the Risk Report does not provide those estimates for radiological, nuclear, or cyber offense risks.

Quotes:

“Our criteria for prioritizing threat models are:
● A combination of high potential damages and high likelihood for these potential damages.
● A clear role for AI in creating risk beyond what is created by other technologies and background conditions.
● Threat scenarios that survive sanity checks considering historical analogies (e.g., we place lower priority on attacks that have been relatively rare or less damaging historically, unless there are specific reasons to expect AI to change historical patterns).
● Consideration of some additional factors, such as the extent to which addressing one threat model might help put us in position to address other priority threat models, and the extent to which threats are hard to get early warning signs of and respond to accordingly (the more the latter holds, the more appropriate a focus on prevention is).” (Risk Report, p.92)

“One relevant concept here is the idea of “expected” damages – the probability of each potential harm times the size of the harm, summed over all potential harms. We sometimes create rough estimates of expected damages, and other times simply reason informally about which threat models are most likely to have a combination of high likelihood and high potential damages that would imply high expected damages.” (Risk Report, p.92)

Per-threat magnitude/likelihood analysis with priority justification:
“Potential magnitude of impact: We are most concerned with major, enduring changes in the global balance of power. The value at stake could be a multiple of what’s at stake in any of the other threat models discussed here.
Likelihood of impact: The probabilities related to this threat model are difficult to assess, and we do not have consensus on a specific likelihood. […]
Why this is a priority threat: While there is a large number of possible threats from AI, our best working guess is that the automation of R&D will be the one most likely to lead to globally transformative impacts.” (Risk Report, p.58)

“Potential magnitude of impact: […] chemical and biological weapons have the potential for enormous damages, and damages may be even higher if such weapons are deliberately engineered to be worse than anything analogous observed to date (including natural pandemics).
Likelihood of impact: this would require several individually unlikely events: a well-resourced attempt […], success in developing them, and finally the weapons actually being deployed […]. We think each step is plausible, but collectively the whole sequence is highly unlikely.
Why this is a priority threat: despite the low likelihood, we believe the potential damages here would exceed those of any other threat model listed […].” (Risk Report, pp.79–80)

Framework-level commitment to estimating probability and severity:
“Prior to launching a model, we estimate the probability and severity of harm for CBRN, sabotage and loss of control, and cyber offense risks. We are in the early stages of developing our approach to assessing harmful manipulation risks.” (FCF, p.5)

1.3.3 Third party validation of risk models (20%) 50%

The RSP contains information on the qualifications that external reviewers must meet, how conflicts of interest will be avoided, and the reputational and other incentives that should be in place to motivate a candid assessment of the risks by the third party. It also specifies that such external review should cover “adequacy of information”, “analytical rigor”, “areas of disagreement”, and “risk reduction recommendations”. However, all of those requirements are mandatory only if the Risk Report is “significantly redacted” and the models it covers are “highly capable”, defined as crossing the RSP’s automated AI R&D threshold (for example, a doubling of the rate of AI progress attributable to automation). This bar is currently not met, and thus the external review process specified is not reflected in the current Risk Report.

If the above criteria are not met, Anthropic states that they will “usually also seek feedback from trusted external parties with relevant expertise”, and will “work toward a practice of seeking comprehensive, public external review on [their] Risk Reports”. While these commitments are weaker than the ones listed above, Anthropic justifies this by saying that “there are no well-established organizations or procedures for this sort of practice, and [they] are approaching it as an experiment”.

In practice, Anthropic do seem to solicit external input at least for some risk models, noting that they “developed [their] views by consulting with experts, including from Deloitte Consulting and SecureBio” in the context of non-novel chemical and biological weapons production. It is, however, unclear whether the involvement of those external experts extended beyond the construction of evaluations.

To improve, Anthropic should start soliciting feedback from external reviewers on their risk models and clearly state in which capacity those reviewers were involved, and otherwise provide information on what their criteria for independent third-party reviewers are, and why, in their view, no available organizations meet this bar.

Quotes:

“We will work toward a practice of seeking comprehensive, public external review on our Risk Reports. This means working with one or more third-party organizations that will receive private versions of our Risk Reports (unredacted or with minimal redactions, as discussed below) and publish comprehensive commentary on them. Commentary will address topics including the quality of our reasoning, the validity of our risk assessments, the overall level of risk, and whether the redactions we’ve made for the public version are reasonable and appropriate.

Our intent is for external reviewers’ judgments to carry significant weight in the eyes of the public as well as our employees. […]

At a minimum, we will complete a full external review process (described below) with at least one external reviewer anytime a Risk Report covers highly capable models and is significantly redacted, defined as follows:

A model is “highly capable” if we conclude that it crosses the threshold for automated AI R&D described in Section 1.” (RSP, p.13)

“We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D. We would consider scenario (2) to have occurred where (a) we observe or expect double the rate of progress in AI aggregate capabilities compared to the rate we’d expect in the absence of significant AI contributions to AI R&D and (b) it is plausible that this doubling is substantially attributable to the automation of research and/or engineering (as opposed to other factors, such as increased headcount, compute, or general productivity), such that continuation of the trend in AI progress could lead to even greater acceleration.” (RSP, pp.8-9)

“We will select external reviewers that:
● Have significant experience and expertise regarding evaluations for dangerous AI capabilities and propensities. It’s particularly important that they be knowledgeable about potential ways such evaluations might be misleading (for example, alignment faking).
● Have reputational and other incentives making them likely to be candid about their assessment of risks, rather than focused on writing comments that Anthropic will approve of. For example, external review parties should not be teams whose revenue, reputation and success depend entirely on Anthropic and similar companies continuing to work with them.
● Do not have conflicts of interest with respect to Anthropic. At a minimum, a reviewing organization itself may not have a financial interest in Anthropic; and the individuals involved in conducting the review, as well as anyone above them in the reporting chain within their organization, may not have a financial interest in Anthropic or close personal relationships with anyone at Anthropic (i.e., family relationships, romantic relationships, or shared living arrangements).” (RSP, p.14)

“The external review will address:
1. Adequacy of information: Whether the Risk Report contains sufficient information to assess the identified risks;
2. Analytical rigor: The strength of the Risk Report’s reasoning and analysis;
3. Areas of disagreement: Whether the external reviewer disagrees with any of the Risk Report’s key claims and, if so, the reasons for any such disagreements; and
4. Risk reduction recommendations: Recommendations for further reducing identified risks.” (RSP, p.14)

“Review and feedback: We will solicit comprehensive internal feedback on the report, focusing on identifying potential methodological weaknesses, analytical gaps, or areas requiring additional evidence or clarification. We may also use this feedback to improve or refine the report itself. We will usually also seek feedback from trusted external parties with relevant expertise.” (RSP, p.12)


2. Risk Analysis and Evaluation

Very Weak 19%

2.1 Setting a Risk Tolerance (35%) 7%

2.1.1 Risk tolerance is defined (80%) 8%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 25%

They mention that the framework outlines “what it would take, at an industry-wide level, to keep catastrophic risks reliably low”, but no qualitative definition is given of what “reliably low” means.

Implicitly, the capability thresholds define a proto-risk tolerance. To improve, they must set out the maximum amount of risk the company is willing to accept, for each risk domain (though they need not differ between risk domains), ideally expressed in terms of probabilities and severity (economic damages, physical lives, etc), and separate from KRIs.

Quotes:

“Our recommendations for industry-wide safety outline what it would take, at an industry-wide level, to keep catastrophic risks reliably low through a period of rapid advances in AI capabilities.” (RSP, p.3)

“‘Catastrophic risk’ as used in our RSP refers generally to risks of the most severe potential harms from advanced AI, such as existential threats or fundamental destabilization of global systems. We use this term in its plain meaning rather than adopting any specific statutory definition.” (RSP, p.4)

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 0%

The risk tolerance, implicit or otherwise, is not expressed fully or partly quantitatively. To improve, the risk tolerance should be expressed fully quantitatively or as a combination of scenarios with probabilities.

Quotes:

No relevant quotes found.

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

No mention of quantitative risk tolerance.

Quotes:

No relevant quotes found.

2.1.2 Process to define the tolerance (20%) 0%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 0%

No evidence of engaging in public consultations or seeking guidance from regulators for risk tolerance.

Quotes:

No relevant quotes found.

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.
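The percentages reported in section 2.1 appear consistent with each parent score being the weighted average of its children's scores. The sketch below reproduces those figures under that assumption. The helper name `weighted_score` is hypothetical, the three 33% weights are treated as equal thirds, and rounding to the nearest whole percent is inferred from the displayed values rather than documented.

```python
def weighted_score(children):
    """Return the weighted average of (weight, score) pairs, in percent.

    Weights are normalised by their sum, so three weights of 33
    behave as equal thirds (an assumption about the scoring method).
    """
    total_weight = sum(weight for weight, _ in children)
    return sum(weight * score for weight, score in children) / total_weight


# 2.1.1: three sub-criteria weighted 33% each, scored 25%, 0%, 0%.
score_2_1_1 = weighted_score([(33, 25), (33, 0), (33, 0)])

# 2.1.2: two sub-criteria weighted 50% each, both scored 0%.
score_2_1_2 = weighted_score([(50, 0), (50, 0)])

# 2.1: 2.1.1 weighted 80%, 2.1.2 weighted 20%.
score_2_1 = weighted_score([(80, score_2_1_1), (20, score_2_1_2)])

print(round(score_2_1_1), round(score_2_1_2), round(score_2_1))
```

Under these assumptions the rounded results match the displayed 8%, 0%, and 7%, which illustrates how the near-absence of a defined risk tolerance pulls down the whole 2.1 score.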

2.2 Operationalizing Risk Tolerance (65%) 26%

2.2.1 Key Risk Indicators (KRI) (30%) 35%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 50%

The RSP defines capability/usage thresholds for four risk domains: misaligned AI systems in high-stakes settings (previously “high-stakes sabotage opportunities”), automated AI R&D in key domains, and novel and non-novel chemical/biological weapons production. The risk report provides qualitatively described risk models for each of those thresholds. For the latter three, the risk report includes information on specific evaluations used to inform the risk evaluation; for sabotage, it provides a long, structured argument, partitioned into four key claims and drawing on past alignment research as well as evaluations for specific models like Claude Opus 4.6. While a lot of effort has evidently gone into the evaluation of those risks, the documents do not provide clear evaluation-level red lines, and thus the determination of whether risks are acceptable seems mostly discretionary.

Further, the FCF defines risk tiers for harmful manipulation and cyber offense. However, as those tiers are not assigned (un)acceptability statements, relevant evaluations, or required mitigations, it is unclear how they are used in practice to determine risk levels.

To improve, Anthropic should establish clear links between the evaluations conducted and their KRIs to allow external scrutiny of risk determination, and provide context around how the risk tiers for cyber offense and harmful manipulation in the FCF inform risk assessment in practice.

Quotes:

“The left column identifies capability thresholds that would call for heightened mitigations.” (RSP, p.4)

“Novel chemical/biological weapons production. AI systems that can functionally substitute for the scarce human expertise that is currently the primary barrier to novel development of chemical and biological weapons with potential for catastrophic harm. That is, a well-resourced team could, using the model, accomplish the end-to-end agent design and deployment (including, as relevant, agent design, verification and validation, formulation, and dissemination) that would otherwise require recruiting one of a small number of world-leading specialists.” (RSP, p.7)

“Automated R&D in key domains. AI systems that can fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power—for example, energy, robotics, weapons development and AI itself. […] We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D.” (RSP, pp.8–9)

“Tier 1: Meaningful technical assistance for active cyber operations using known attack techniques and methodologies. Some automation is involved, but still requires human input to complete successful large cyber-operations.
Tier 2: Completely autonomous cyber operations with novel offensive capability development and adaptive persistence. For example, autonomous discovery/exploitation of previously unknown vulnerability classes, self-directed campaign orchestration adapting to defenses, or sustained operations evolving without human intervention.” (FCF, p.6)

“Given its nascency as a systemic risk area within our risk management framework, our approach to Harmful Manipulation is exploratory and we expect it to evolve as we continue to explore and conduct further research. […]
Tier 1: Campaign Infrastructure Enablement: Escalation to systematic infrastructure enabling sophisticated scaled influence campaigns. Automates >50% of steps normally requiring multiple sophisticated adversarial actors […]
Tier 2: Autonomous adaptive techniques and campaign execution: End-to-end automation of deceptive influence operations with systematic targeting capabilities, requiring <10% human oversight.” (FCF, p.8)

“Tier 1: Misaligned AI systems in high-stakes settings. AI systems that write large amounts of critical code and/or are otherwise in a position where they are highly relied on and have extensive access to sensitive assets, as well as moderate capacity for autonomous, goal-directed operation and subterfuge […]
Tier 2: Automated R&D in key domains. AI systems that can fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power — particularly in energy, robotics, weapons development and AI itself. For the time being, we use AI R&D capabilities as a proxy for broader R&D capabilities, as this domain likely plays to AI systems’ current strengths and is more tractable to assess than capabilities in other domains. Additionally, AI R&D alone could cause acceleration in AI capabilities improvements, to the point where all of the threats listed above (and more) develop very quickly. We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D. We would consider scenario (2) to have occurred where (a) we observe or expect double the rate of progress in AI aggregate capabilities compared to the rate we’d expect in the absence of significant AI contributions to AI R&D and (b) it is plausible that this doubling is substantially attributable to the automation of research and/or engineering (as opposed to other factors, such as increased headcount, compute, or general productivity), such that continuation of the trend in AI progress could lead to even greater acceleration. Our working operationalization is to trigger this risk threshold at the point where we determine that a model could compress two years of 2018–2024 AI progress into a single year. It may be sensible to add earlier, and/or easier-to-measure, thresholds that trigger less demanding versions of the mitigations for this threshold.” (FCF, pp.9–10)

“We believe, in light of the above analysis, that: Some of our models do meet the left column’s threshold: they ‘write large amounts of critical code and/or are otherwise in a position where they are highly relied on and have extensive access to sensitive assets, as well as moderate capacity for autonomous, goal-directed operation.'” (Risk Report, p.56)

“Our pre-deployment alignment assessment reports the following, reproduced from the Claude Opus 4.6 System Card” (Risk Report, p.19)

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 25%

The RSP includes two high-level descriptions of quantitative KRIs for automated AI R&D, one defined as a doubling in the rate of progress of AI capabilities, and the other as an upper bound on the cost of replacing Anthropic’s entire set of research scientists and research engineers with AI. It is, however, unclear how exactly these would be measured in practice. Another quantitative KRI, for kernel optimization, is included under the sabotage section but actually comes from the autonomous AI R&D section of Claude Opus 4.6’s system card, and we thus treat it as an AI R&D rather than a sabotage KRI.

Quantitative ASL-3 thresholds appear for two evaluations in the non-novel CB section (2.8× uplift and 0.8 pass@5), but are absent from the majority of evaluations reported in the same table and from all other threat model sections. Their presence appears to reflect carry-over from the prior ASL-based framework rather than systematic threshold-setting under RSP 3.3, which has moved away from prescriptive ASL level specifications.

The FCF provides two high-level KRIs for harmful manipulation, one in terms of the percentage of steps the AI can automate, and the other in terms of the percentage of human oversight required; it is not clear how Anthropic intends to measure those.

There are no quantitative KRIs given for cyber offense, chemical, novel biological, radiological and nuclear weapons, or loss-of-control risks. On cyber offense, the FCF states that its risk tier system “quantifies model capabilities against cybersecurity threat metrics”, but the risk tiers given are exclusively qualitative.

To improve, Anthropic should provide at least one clear quantitative KRI threshold for each risk domain that can be used to determine the risk stemming from their models in that domain, and provide at least a list of evaluations that they will use to measure cyber offense, chemical, novel biological, radiological and nuclear weapons, or loss-of-control risks.

Quotes:

“We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D. We would consider scenario (2) to have occurred where (a) we observe or expect double the rate of progress in AI aggregate capabilities compared to the rate we’d expect in the absence of significant AI contributions to AI R&D and (b) it is plausible that this doubling is substantially attributable to the automation of research and/or engineering (as opposed to other factors, such as increased headcount, compute, or general productivity), such that continuation of the trend in AI progress could lead to even greater acceleration.” (RSP, p.9)

“‘Double the rate of progress’ means ‘as much progress in one year as one would see in two years at baseline.’ For example, if baseline progress involved a 3x scaleup in compute and a 3x improvement in algorithmic efficiency (for a 9x ‘effective scaleup’), ‘double the rate of progress’ would entail something like an 81x effective scaleup. This is not the same idea as ‘doubling researchers’ productivity,’ since doubling inputs does not necessarily double the rate of progress.” (RSP, p.9)
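The arithmetic in the RSP's example above can be checked directly: compressing two baseline years of progress into one year means the one-year effective scaleup compounds twice, i.e. is squared. This is a minimal sketch of that reading; the variable names are illustrative and do not come from the RSP.

```python
# Per-year baseline figures from the RSP's illustrative example.
compute_scaleup = 3
algorithmic_efficiency_gain = 3

# The two factors multiply into a single "effective scaleup" per baseline year.
baseline_effective_scaleup = compute_scaleup * algorithmic_efficiency_gain

# "Double the rate of progress" = two baseline years of progress in one year.
# Effective scaleups compound multiplicatively across years, so two baseline
# years correspond to raising the one-year scaleup to the power of two.
baseline_years_compressed = 2
doubled_rate_scaleup = baseline_effective_scaleup ** baseline_years_compressed

print(baseline_effective_scaleup, doubled_rate_scaleup)  # 9 81
```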

“Bioweapons acquisition uplift trial: […] Score: 63% ± 13%; Uplift: 2.53× (vs 25% control) […] Results for Claude Opus 4 as we did not conduct it for Claude Opus 4.6. ASL-3 threshold: 2.8× uplift.” (Risk Report, p.66)

“Long-form virology tasks: […] Virology Task 2: 0.912 (pass@5) […] ASL-3 threshold: 0.8.” (Risk Report, p.67)

“Internal AI research evaluation suite 2: […] Claude Opus 4.6 scored 0.613, surpassing our rule-out threshold (for equivalence to an entry-level research scientist or research engineer, set based on the evaluation suite designer’s judgment) of 0.6.” (Risk Report, p.59)

“Tier 1: Campaign Infrastructure Enablement: […] Automates >50% of steps normally requiring multiple sophisticated adversarial actors. […]
Tier 2: Autonomous adaptive techniques and campaign execution: End-to-end automation of deceptive influence operations with systematic targeting capabilities, requiring <10% human oversight.” (FCF, p.8)
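The few numeric thresholds quoted above can be compared against the reported scores mechanically. The sketch below is illustrative only: the record layout, the names, and the reading of "meets the threshold" as "score at or above threshold" are assumptions, while the numbers are those quoted from the Risk Report. Per the quote, the bioweapons uplift result is for Claude Opus 4, not Claude Opus 4.6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotedResult:
    name: str
    score: float
    threshold: float


# Scores and thresholds as quoted from the Risk Report (pp.59, 66, 67).
results = [
    QuotedResult("Bioweapons acquisition uplift (x)", 2.53, 2.8),
    QuotedResult("Virology Task 2 (pass@5)", 0.912, 0.8),
    QuotedResult("Internal AI research evaluation suite 2", 0.613, 0.6),
]


def meets_threshold(result):
    """Assumed semantics: a threshold is met when score >= threshold."""
    return result.score >= result.threshold


for r in results:
    status = "at or above" if meets_threshold(r) else "below"
    print(f"{r.name}: {r.score} is {status} the threshold of {r.threshold}")

# Recomputing uplift from the rounded percentages in the quote (63% vs 25%)
# gives about 2.52, slightly under the reported 2.53; the gap is presumably
# due to rounding in the published percentages.
uplift_from_rounded = 0.63 / 0.25
```

Even for these cases, a score crossing a threshold does not by itself say what follows, which is the gap between evaluations and KRIs noted in the assessment above.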

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 10%

Anthropic mentions monitoring several external indicators which have the potential to change the risk posed by their models, such as changes to threat models and novel capability-preserving jailbreaks. However, there are no concrete ways mentioned that these could serve as measurable KRIs. To improve, Anthropic should include measurable indicators of risk beyond model capability evaluations as a trigger for suitable KCIs.

Quotes:

2.2.2 Key Control Indicators (KCI) (30%) 18%

2.2.2.1 Containment KCIs (35%) 18%

2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 25%

Only one KRI threshold (for non-novel chemical/biological weapons production) has a clear KCI mapped to it (ASL-3 protections). For novel CB weapons production, ASL-3 protections are mentioned only as a floor, with more substantive KCI thresholds such as RAND SL4 security appearing only in the “ambitious industry-wide recommendations”, which Anthropic does not directly commit to. Sabotage and AI R&D do not have corresponding KCIs. For AI R&D, the RSP commits to conducting “moonshot R&D for security” projects with the goal of uncovering new ways of enhancing security, and describes various processes it will implement to track risk and develop red-teaming capabilities. For sabotage, the RSP commits to describing the current state of their models in their risk reports.

Further, starting from RSP 3.0, the frameworks have started moving away from using the ASL system, and continue its usage only to describe present levels of risk mitigation, with the approach envisioned for future models placing a larger focus on making structured arguments instead to enhance flexibility. However, as the frameworks lack a clear description of the criteria such arguments must meet, the strength of mitigations committed to for different risks becomes more discretionary.

To improve, Anthropic should expand their previously established ASL system to cover increasing KRI levels and commit to clear containment KCIs that are mapped to specific KRIs.

Quotes:

For non-novel chemical/biological weapons production:
“We will maintain or improve on our ASL-3 protections, which include classifier guards at least as robust as our initial Constitutional Classifiers; access controls for trusted users with exemptions to classifier guards; red-teaming, bug bounties, and threat intelligence for continually assessing the threat of jailbreaks; and a number of noteworthy security controls. […] We expect to continuously meet the criteria in the right column, although we cannot make guarantees about an evolving landscape with continually adaptive attackers.” (RSP, p.6)

“ASL-3 represents a substantially stronger set of safeguards applied to models that could provide meaningful uplift on priority threat models if fully accessible. Specifically, such safeguards aim to be:
– Robust to persistent attempts to misuse a potentially dangerous capability.
– Highly protected against most attackers’ attempts at stealing model weights.
– However, the following threat actors are explicitly out of scope for the above bullet points:
(a) state-sponsored programs that specifically target us […];
(b) A small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks;
(c) sophisticated insiders […].” (Risk Report, pp.99–100)

“Anthropic has developed a security program to protect ASL-3 model weights against most non-state attackers, including cybercriminal groups, hacktivists, and corporate espionage actors. […] Our threat model explicitly scopes ASL-3 protections against non-state actors and unsophisticated insiders. Sophisticated insiders, and nation-state attackers with capabilities like novel zero-day attack chains, remain out of scope for ASL-3: defending against such actors requires security investments beyond what we’ve currently achieved.” (Risk Report, pp.100–101)

For novel chemical/biological weapons production:
“We will apply protections at least as strong as our ASL-3 protections (see previous row) to an expanded set of potential use cases for AI, covering the most likely vectors for this threat. Additionally, we will identify the most concerning specific threat pathways, create policy recommendations for early detection and response for such threats, and share this content with policymakers.” (RSP, p.7)

“Earlier editions of our RSP defined ‘AI Safety Levels’ with specific lists of required controls. We still use this concept to refer to, and distinguish between, present levels of risk mitigations—those that we maintain for existing AI models. […] However, when defining the risk mitigations needed for future levels of AI capability, we have found that providing a specific list of controls is overly rigid, and we instead prefer to focus on what sort of argument an AI developer should make […].” (RSP, p.16)

“Because we cannot always anticipate what safety and security measures will be appropriate for models beyond the current frontier, the specific mitigations we implement may be determined when the relevant risk tier is reached, informed by the threat landscape at that time.” (FCF, p.6)

On sabotage:
“We will detail the state of our AI systems’ capabilities and propensities, our monitoring practices, and the overall level of risk in our Risk Reports. We expect to continually be able to meet the criteria in the right column, although we cannot make guarantees about an evolving technology that may increasingly have the ability to detect and manipulate testing.” (RSP, p.7)

On AI R&D:
“We will:
● Resource and complete significant ‘moonshot R&D for security’ projects, to explore ambitious and possibly unconventional ways to achieve unprecedented levels of security against the world’s best-resourced attackers.
● Achieve an ‘eyes on everything’ state for our internal AI development. We will comprehensively gather, centralize, and maintain logs for all critical AI-development activities […].” (RSP, p.8)

“Security is an ongoing and immediate priority for us, but it is also a long-term challenge where we’ll need to be creative and explore promising and incomplete ideas. This is because we may, at some point in the future, be targets of the world’s best-resourced attackers. Our moonshot R&D projects are exploring ambitious, possibly unconventional ways to achieve unprecedented levels of security.” (Frontier Safety Roadmap)

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 10%

Anthropic mostly refrains from providing quantitative KCI thresholds. The RSP mentions RAND SL4, but only as an “industry-wide recommendation” rather than a direct commitment. The only exception is the case in which Anthropic would be “in the lead” in AI capabilities, in which case the RSP states that they would require a “strong argument that catastrophic risk is contained, along the lines of [their] recommendations for industry-wide safety”. However, both the determination of when the “in the lead” classification applies and the decision of how close to the “industry-wide recommendations” they would need to be remain discretionary under the current RSP.

Quotes:

“Anthropic in the lead. We have developed or will imminently develop a highly capable model; and we have clear evidence that no other competitor will soon develop such a model: We will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety (see Section 1).” (RSP, p.17)

“Accomplishing this would likely mean security roughly in line with RAND SL4.” (RSP, p.8)

“when defining the risk mitigations needed for future levels of AI capability, we have found that providing a specific list of controls is overly rigid, and we instead prefer to focus on what sort of argument an AI developer should make (and what sorts of actors it should address) regarding the risk level from its systems.” (RSP, p.16)

2.2.2.2 Deployment KCIs (35%) 13%

2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 25%

The RSP includes some qualitative deployment KCI thresholds for novel and non-novel chemical/biological weapons production, mainly in the form of Anthropic’s ASL-3 protections. It does not include such KCI thresholds for its other two specified risk domains, sabotage and automated R&D. The RSP also states a decision to move away from the ASL system for future risk management, and to use it only for present levels of risk mitigation.

Quotes:

“We will maintain or improve on our ASL-3 protections, which include classifier guards at least as robust as our initial Constitutional Classifiers; access controls for trusted users with exemptions to classifier guards; red-teaming, bug bounties, and threat intelligence for continually assessing the threat of jailbreaks; and a number of noteworthy security controls.” (RSP, p.6)

“We will apply protections at least as strong as our ASL-3 protections (see previous row) to an expanded set of potential use cases for AI, covering the most likely vectors for this threat. Additionally, we will identify the most concerning specific threat pathways, create policy recommendations for early detection and response for such threats, and share this content with policymakers.” (RSP, p.7)

“Accomplishing this would likely mean security roughly in line with RAND SL4.” (RSP, p.8)

2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 0%

There are no quantitative deployment KCI thresholds given.

Quotes:

No relevant quotes found.

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 25%

For the most relevant advanced KRI threshold (“misaligned AI systems in high-stakes settings”), Anthropic does not commit to specific assurance process KCIs. Rather, they state that they “expect to continually be able to meet” their industry-wide recommendations, which include a “strong argument that AI systems will not carry out sabotage leading to irreversibly and substantially higher odds of a later global catastrophe”, along with some indications for what such an argument might require. However, this is framed as an expectation rather than a commitment, weakening its normative force. The Risk Report’s four-claim structure for the sabotage threat model demonstrates that Anthropic is operationalizing something like this argument in practice, though this is a post-hoc description rather than a pre-specified assurance threshold.

Quotes:

“Misaligned AI systems in high-stakes settings. AI systems that are highly relied on and have extensive access to sensitive assets as well as moderate capacity for autonomous, goal-directed operation and subterfuge—such that it is plausible these AI systems could (if directed toward this goal, either deliberately or inadvertently) carry out sabotage leading to irreversibly and substantially higher odds of a later global catastrophe. […]
We will detail the state of our AI systems’ capabilities and propensities, our monitoring practices, and the overall level of risk in our Risk Reports. We expect to continually be able to meet the criteria in the right column, although we cannot make guarantees about an evolving technology that may increasingly have the ability to detect and manipulate testing […]
A frontier developer should make a strong argument that AI systems will not carry out sabotage leading to irreversibly and substantially higher odds of a later global catastrophe. This case may initially be relatively simple and rely heavily on capability limitations, if it is first required when the risk is merely plausible. As risk becomes harder to rule out, this case will likely include some combination of:
● Internal compartmentalization, restriction, and code review to prevent excessive sabotage opportunities for AI models.
● Capability assessments demonstrating that AI models lack the ability to carry out irreversible (which would generally mean unnoticed) sabotage.
● Monitoring and/or restricting AI behavior and usage internally.
● Evidence that AI models lack the propensity to deceive, manipulate, or sabotage users.” (RSP, p.7)

“Our core argument around misaligned goals includes three key claims, made in the next three sections:
● Claim 1: Any model trained under our current standard practices in the near future is relatively unlikely to have dangerous coherent misaligned goals.
● Claim 2: Our alignment assessment, drawing on simulated scenarios, interpretability investigations, monitoring of internal and external pilot usage, and external testing, found no evidence of dangerous coherent misaligned goals in Claude Opus 4.6.
● Claim 3: Claude Opus 4.6 is very unlikely to be capable of hiding dangerous coherent misaligned goals from us in the face of such an assessment. […]
● Claim 4: Claude Opus 4.6 has limits to its ability to reason opaquely and its ability to implement complex long-term plans that collectively make it very difficult for it to take ambitious harmful actions in the presence of the security and monitoring safeguards described in the Section 2.5.” (Risk Report, p.15)

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 25%

The Risk Report includes ex-post qualitative justifications for why the risks from some of the eight different threat pathways for sabotage remain low, including assessments of the strength of mitigating factors. However, these justifications are not present for all sabotage pathways, with some of them missing a concrete risk assessment and a justification for it.

The RSP also states that Anthropic believes its ASL-3 protections would be sufficient to prevent model theft by all but the most advanced threat actors, but does not provide a justification through risk modeling for this.

As the RSP is explicitly moving away from pre-defined KCIs, it also does not provide ex-ante justifications for KCI sufficiency. Further, the RSP explicitly states that the mitigations deployed depend at least partly on the risk management practices of industry peers. This weakens the RSP’s normative strength and increases discretion over mitigation choices by introducing additional layers of uncertainty, such as the accuracy of information about peer models’ capabilities and mitigations.

Lastly, the RSP does not provide KCIs or justifications for their KRI-KCI pairing for cyber offense or radiological and nuclear risks.

Quotes:

“Threat-specific risk assessment: For each threat model, we will analyze remaining absolute risk–i.e., the leftover risk after accounting for our mitigations. We will also discuss whether we believe we’ve crossed relevant thresholds in our recommendations for industry-wide safety (see Section 1), and (as relevant) whether we believe we’re meeting the risk mitigation standards corresponding to them.” (RSP, p.11)

“We outline eight pathways toward potential catastrophic harm […]. For each pathway, we:
● Introduce the mechanism by which the model’s misaligned action would lead to unwanted and potentially-catastrophic outcomes.
● Give our rough sense of the frequency and complexity of misaligned actions that would be required for Claude Opus 4.6 to cause catastrophic harm via this pathway.
● Discuss the considerations that lead us to believe that Claude Opus 4.6 is unlikely to cause catastrophic harm through this pathway.
○ Alongside each of these mitigating factors, we note the degree to which it contributes to our overall assurance that the pathway does not pose a major risk.”

“Our previous RSP committed to implementing mitigations that would reduce our models’ absolute risk levels to acceptable levels, without regard to whether other frontier AI developers would do the same. […] We now separate our plans as a company […] from our more ambitious industry-wide recommendations. […] But we cannot commit to following [the industry-wide recommendations] unilaterally.” (RSP, p.3)

“Our threat model explicitly scopes ASL-3 protections against non-state actors and unsophisticated insiders. Sophisticated insiders, and nation-state attackers with capabilities like novel zero-day attack chains, remain out of scope for ASL-3: defending against such actors requires security investments beyond what we’ve currently achieved.” (Risk Report, p.101)

“While our understanding of this domain is somewhat immature, we do not currently expect that undetectably-subtle poisoning of large quantities of data, as envisioned in research on Emergent Misalignment, could lead to acute harms through this pathway.” (Risk Report, p.42)

“Weights of models that require ASL-3 safeguards are kept under enhanced security with respect to models that require ASL-2 safeguards. We believe this to be sufficient security that it would be difficult for most attackers—excluding sophisticated insiders and attackers with nation-state backing—to steal model weights.” (Risk Report, p.71)

2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 25%

In the RSP, the “commitments related to competitors” include commitments to “delay AI development and deployment as needed” contingent either on Anthropic being the only company to soon develop a “highly capable” model, or on other companies close to developing such models having more rigorous risk management practices. However, the bar for “highly capable model” is very high and narrowly defined in terms of AI R&D. Further, Anthropic notes that “in many cases, [they] will not have enough information to determine that the relevant scenario applies”, weakening the practicality and thus the strength of the commitments.
The RSP does not provide non-competitor-contingent criteria for pausing development and/or deployment, and does not mention details about de-deployment processes.

Quotes:

“These commitments are necessarily high-level and limited. In many cases, we will not have enough information to determine that the relevant scenario applies and will have to use our best judgment to deal with uncertainty. […] Further, the commitments below do not preclude us from taking cautionary action, such as refraining from training or deploying models, in other circumstances. Mitigating the risks from our models is a top priority for us, and we would strongly consider pausing development and/or deployment to improve the safety profiles of our models even in cases not covered below.” (RSP, p.17)

“Anthropic in the lead. We have developed or will imminently develop a highly capable model; and we have clear evidence that no other competitor will soon develop such a model:
We will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety (see Section 1). We will delay AI development and deployment as needed to achieve this, until and unless we no longer believe we have a significant lead.

Competitors have strong safety measures. We have strong evidence that all competitors who have developed, or will soon develop, a highly capable frontier model are able to make strong arguments that catastrophic risk is contained, in the spirit of our recommendations for industry-wide safety (see Section 1):
For our highly capable frontier models, we will meet or exceed the overall risk reduction posture of these competitors, as far as we can tell based on our best efforts to assess that posture. Until we are able to do so, we will delay AI development and deployment as needed to achieve this.” (RSP, p.17)

“A model is ‘highly capable’ if we conclude that it crosses the threshold for automated AI R&D described in Section 1.” (RSP, p.14)

“Executive approval: The Risk Report, along with the internal feedback and any available external feedback, will be sent to the CEO and Responsible Scaling Officer (RSO) for final review and approval. The CEO and RSO will make the ultimate determination regarding the adequacy of the risk assessment and any downstream deployment or development plans.” (RSP, p.12)

3. Risk Treatment

Weak 30%

3.1 Implementing Mitigation Measures (50%) 30%

3.1.1 Containment measures (35%) 25%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 25%

The risk report defines a post-hoc list of ten “notable security controls” for ASL-3, which are thus applicable to the KRIs for novel and non-novel biological/chemical weapons construction. Redactions are explicitly justified by information security considerations. Also, the RSP states that “specifics may change”, making these containment measures a discretionary choice rather than a transparent pre-commitment. Risk mitigations for ASL-2 models are present in the risk report but do not include containment measures. The KRIs for sabotage and automated R&D are not assigned any specific containment measures. Further, the RSP explicitly states a move away from specificity in its safeguard commitments, shifting towards an outcome-focused approach.

Quotes:

“More outcome-focused safeguard requirements: We have updated our ASL-3 safeguards requirements to be less prescriptive and more outcome-focused. Rather than detailing specific operational and technical safeguards, we now specify the overall security or deployment standards and requirements for meeting them.” (RSP, p.18)

“Earlier editions of our RSP defined ‘AI Safety Levels’ with specific lists of required controls. We still use this concept to refer to, and distinguish between, present levels of risk mitigations […]. However, when defining the risk mitigations needed for future levels of AI capability, we have found that providing a specific list of controls is overly rigid, and we instead prefer to focus on what sort of argument an AI developer should make […].” (RSP, p.16)

“1. Egress bandwidth controls: Network-level restrictions that limit data transfer rates out of sensitive environments […] 2. Multi-party access controls (2PC): Requires a second employee to approve access requests for model weights and other sensitive resources […] 3. Binary allowlisting: Only pre-approved software can execute on employee devices […] 4. Hardware security keys for authentication: Phishing-resistant MFA using hardware tokens bound to specific domains […] 5. Device authorization […] 6. Cloud storage restrictions […] 7. Restricted session lengths for privileged access: Privileged cloud identities require re-authentication hourly […] 8. Network segmentation […] 9. Centralized security monitoring: Aggregated logging with automated anomaly detection […] 10. Network source policies on privileged resources […].” (Risk Report, pp.101–102)

“Specifics may change, but we will maintain equally or more robust measures over time and will publish updates in our Risk Reports.” (RSP, p.6)

“Non-novel chemical/biological weapons production […]: We will maintain or improve on our ASL-3 protections, which include classifier guards […] and a number of noteworthy security controls. Specifics may change, but we will maintain equally or more robust measures over time and will publish updates in our Risk Reports.” (RSP, p.6)

“Novel chemical/biological weapons production […]: We will apply protections at least as strong as our ASL-3 protections (see previous row) to an expanded set of potential use cases for AI, covering the most likely vectors for this threat.” (RSP, p.7)

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 25%

The containment measures given in the RSP and risk report, which take the form of the ASL-3 mitigations, are not justified beyond pure assertions that the mitigations provide “sufficient security that it would be difficult for most attackers […] to steal model weights”. The risk report states that there was an “attempt” to “broadly align” the security program with established frameworks like ISO 27001 and 42001, which provides useful context but is not specific or committal enough to qualify as a justification. The FCF asserts that security controls are “regularly tested and independently reviewed” and lists some general security testing practices, but similarly does not provide any concrete justifications for containment measure sufficiency. This aligns with the RSP’s stated move from specific lists of controls to argument-based containment assurance.

Quotes:

“Earlier editions of our RSP defined ‘AI Safety Levels’ with specific lists of required controls. […] However, when defining the risk mitigations needed for future levels of AI capability, we have found that providing a specific list of controls is overly rigid, and we instead prefer to focus on what sort of argument an AI developer should make (and what sorts of actors it should address) regarding the risk level from its systems.” (RSP, p.16)

“The above argument for the sufficiency of our safeguards rests on several key assumptions about our threat models, the effectiveness of our defenses, and our ability to respond to emerging threats. We monitor for: […]
– Access control sufficiency.
– We monitor the effectiveness of our access control systems through threat intelligence monitoring for reports of credential leaks […].” (Risk Report, pp.76–77)

“Security control monitoring, testing, and assessments: Security controls are regularly tested and independently reviewed to ensure effectiveness. Penetration testing, vulnerability disclosure programs, third party risk assessments, and incident response tabletop exercises aim to help defenses remain robust, and insights from these activities are used to strengthen the company’s security posture over time.” (FCF, p.13)

“We’ve attempted to broadly align our program with established frameworks including SOC 2 Type 2, ISO 27001, and ISO 42001, while extending these to address AI-specific risks to model weights.” (Risk Report, p.101)

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 25%

The FCF provides a high-level commitment to, and description of, independent review of security controls to “ensure effectiveness”. The RSP describes the desiderata for the external review of their risk reports and for those conducting it. However, these commitments only apply if a model is “highly capable” and the risk report “significantly redacted”, which is a high bar given Anthropic’s definition of “highly capable”. That definition is also narrow, framed in terms of AI R&D capacity, and does not take into account capabilities in other risk domains. Furthermore, it is unclear whether the review of security measures outlined in the risk report would include an assessment of the sufficiency of containment measures and, if it does, whether it would take place before the risk level that activates those measures is crossed.

Quotes:

“Security controls are regularly tested and independently reviewed to ensure effectiveness. Penetration testing, vulnerability disclosure programs, third party risk assessments, and incident response tabletop exercises aim to help defenses remain robust, and insights from these activities are used to strengthen the company’s security posture over time.” (FCF, p.13)

“We will work toward a practice of seeking comprehensive, public external review on our Risk Reports. […] Commentary will address topics including the quality of our reasoning, the validity of our risk assessments, the overall level of risk, and whether the redactions we’ve made for the public version are reasonable and appropriate. […] At a minimum, we will complete a full external review process (described below) with at least one external reviewer anytime a Risk Report covers highly capable models and is significantly redacted […]” (RSP, p.13)

“We will select external reviewers that: have significant experience and expertise regarding evaluations for dangerous AI capabilities and propensities […]; have reputational and other incentives making them likely to be candid about their assessment of risks […]; do not have conflicts of interest with respect to Anthropic. At a minimum, a reviewing organization itself may not have a financial interest in Anthropic; and the individuals involved in conducting the review […] may not have a financial interest in Anthropic or close personal relationships with anyone at Anthropic […]” (RSP, p.13)

“Public comments. We will ask the external reviewer to memorialize its findings on the topics above in a written report that is made public. External reviewers will be bound by obligations not to disclose confidential information […], but beyond that will not be restricted in what they can publish, including concerns about the Risk Report or Anthropic’s conduct in connection with the external review.” (RSP, p.14)
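
The weights in the 3.1.1 headings can be read as a small scoring rule. The sketch below is one interpretation, not the rating's published method: it assumes the parent score is the weighted mean of 3.1.1.1 and 3.1.1.2, and that the third-party verification score replaces it only when it is strictly higher (the "100% if ... >" condition in the 3.1.1.3 heading). The function name `parent_score` is hypothetical.

```python
def parent_score(defined: float, proof: float, verification: float,
                 w_defined: int = 60, w_proof: int = 40) -> float:
    """Combine three sub-criterion scores (in percent) into a parent score.

    Assumed reading of "(100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2])":
    third-party verification gets full weight only if it beats the weighted mean.
    """
    weighted = (w_defined * defined + w_proof * proof) / 100
    return verification if verification > weighted else weighted


# Scores reported above for 3.1.1.1, 3.1.1.2 and 3.1.1.3 (all 25%).
containment = parent_score(25, 25, 25)
print(containment)
```

With the three reported sub-scores of 25%, this reading gives 25%, which agrees with the score shown for 3.1.1.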

3.1.2 Deployment measures (35%) 25%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 25%

The RSP commits to the ASL-3 deployment standard for non-novel chemical/biological weapons production, but leaves open the possibility of changing its specific implementation. The risk report provides ex-post mitigations for sabotage risks, although they only apply to internal deployments. No substantive definitions for deployment measures are given for novel chemical/biological weapons production or automated R&D. The RSP also includes a dedicated section justifying the move away from pre-specifying required controls.

Quotes:

“Our baseline protections for ASL-2 models include:
1. Acceptable usage policies and enforcement: Our Usage Policy prohibits catastrophic harm categories (CBRN, cyber attacks) and high harm categories.
2. Harmlessness training: our models are trained to refuse harmful requests, including with respect to chemical and biological weapons.
3. Fine-tuning API protections: In finetuning products, data is filtered for harmfulness, and models are subject to automated evaluation aiming to ensure harmlessness features are not degraded. […]
4. Vulnerability reporting channels: Paths within the product for users to report harmful or dangerous model outputs, as well as a bug bounty for universal jailbreaks. […]” (Risk Report, p.68)

“4.5.2 Risk mitigations for ASL-3 models
4.5.2.1 Real-time classifier guards: We implement safeguards based on Constitutional Classifiers, trained to target relevant uses of concern according to our threat models. These classifier guards monitor model inputs and outputs in real time, and block the model from producing information that would be needed to successfully carry out the threat […]
4.5.2.2 Offline monitoring […]
4.5.2.3 Access controls: We allow some users to access models whose output isn’t restricted by the real-time classifier guards, in cases where their legitimate business needs require this. We use a manual vetting process in these cases. […]
4.5.2.4 Bug bounty program: We run our Model Safety Bug Bounty Program through HackerOne […] We pay up to $45,000 per novel, universal jailbreak identified. […]
4.5.2.5 Threat intelligence: We contract with threat intelligence vendors to monitor and scrape the deep and dark web for:
(i) publicly available universal jailbreaks of concern […]
4.5.2.6 Rapid response: We have a range of response options with different tradeoffs to address jailbreaks and vulnerabilities.” (Risk Report, pp.69–71)

“These measures include:
– Employees routinely reporting surprising or concerning behaviors through in-app mechanisms and internal feedback Slack channels
– Automatic offline Claude Code behavior monitoring covering a ≥10% subsample of prompts by most users, based on a prompted Claude Opus 4.5 classifier
– Sandboxing and automatic monitors that block some risky actions of some Claude Code agents
– Security controls focused on protecting model weights […]
– Mandatory manual review and automatic (but non-blocking) Claude-based review of code before integrating it into shared Anthropic codebases.” (Risk Report, pp.32–33)

“For deployments to outside users, we do not have sabotage-oriented monitoring measures in place beyond basic ordinary reporting opportunities that rely on voluntary disclosure by users or external researchers.” (Risk Report, p.33)

“We currently target these risk mitigations (particularly strong classifier guards) specifically at reducing the risk discussed in section 4, and we have not yet extended them to target all of the uses of models that might be indicated by this threat model.” (Risk Report, p.84)

“For >90% of Anthropic employees, the LLM supervisor (powered by Claude Opus 4.5) runs on at least 10% of code edits and bash commands. This covers more than 10% of total actions, as the supervisor is asked to analyze not only the last action, but also all actions in the trajectory. This surfaces on the order of 10,000 positives per week. We use a Claude Sonnet 4.5 classifier to exclude common sources of false positives. The remaining transcripts are analyzed using a recursive summarization tool powered by Claude Sonnet 4.5. These summaries as well as on the order of 20 transcripts per week are reviewed by an AI control researcher.” (Risk Report, p.98)
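
The coverage claim in the last quote (a supervisor sampled on at least 10% of actions covers more than 10% of them, because each review also reads the earlier actions in the trajectory) can be illustrated with a toy simulation. This is a sketch under simplifying assumptions that are not from the Risk Report: actions are sampled independently at a fixed rate, trajectories have a fixed length, and a review covers every action up to and including the sampled one. All names are hypothetical.

```python
import random


def coverage(num_actions: int, sample_rate: float, rng: random.Random) -> tuple[float, float]:
    """Return (fraction of actions sampled, fraction of actions covered) for one trajectory."""
    sampled = [i for i in range(num_actions) if rng.random() < sample_rate]
    if not sampled:
        return 0.0, 0.0
    # A review triggered at action i is assumed to read actions 0..i as well.
    covered = max(sampled) + 1
    return len(sampled) / num_actions, covered / num_actions


rng = random.Random(0)  # fixed seed so the run is repeatable
results = [coverage(num_actions=50, sample_rate=0.10, rng=rng) for _ in range(1000)]
mean_sampled = sum(s for s, _ in results) / len(results)
mean_covered = sum(c for _, c in results) / len(results)
print(f"sampled: {mean_sampled:.2f}, covered: {mean_covered:.2f}")
```

In this toy model the covered fraction can never fall below the sampled fraction, which is the qualitative point of the quoted statement. The actual gap depends on trajectory lengths and sampling details that the report does not specify.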

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 25%

The risk report provides some justification for the sufficiency of safeguards, but it is provided ex-post and therefore does not fulfill the desiderata for this criterion. Further, there are processes for soliciting evidence of safeguard effectiveness, such as bug bounty programs. The final risk acceptability decision is explicitly discretionary and lies with the CEO and RSO of Anthropic, with no clear ex-ante sufficiency criteria for deployment mitigations. The RSP explicitly notes the move away from listing specific safeguards ex-ante.

Quotes:

“Less prescriptive evaluation methodology: We have replaced some specifics in our previous testing methodology (e.g., using 1% of compute for elicitation or creating a 6x buffer), with more general requirements to (a) match expected efforts of potential adversaries and (b) provide informal estimates of how further scaling and research developments will impact model capabilities and performance on the same tasks. […] Although still an aspirational goal, the science of evaluations is not currently mature enough to make confident predictions about the precise buffer we should require between current models and a Capability Threshold.” (RSP, p.18)

“Risk mitigations: For each in-scope model, the mitigations we are implementing across security, deployment safeguards, and alignment domains, along with discussion of their effectiveness.” (RSP, p.11)

“Changes to classifier guards (multiple and ongoing) — we have altered our classifier guards repeatedly both to improve robustness and reduce costs. In general, for significant changes, we run a dedicated bug bounty to verify robustness before deploying a new system.” (Risk Report, p.71)

“We run our Model Safety Bug Bounty Program through HackerOne to incentivize third-party researchers to discover and report vulnerabilities in our AI systems, particularly ‘jailbreaks’ that could bypass the classifier guards […]. We perform automated assessments of response informativeness to determine whether the hackers have succeeded. We pay up to $45,000 per novel, universal jailbreak identified. […] Since May 2025, we have averaged 100+ active hackers monthly.” (Risk Report, p.70)

“Since launching our upgraded classifier guards and bug bounty in May of 2025, we have not become aware of any publicly known universal jailbreaks […] We have become aware of jailbreaks via our bug bounty and via external red-teaming, and have remediated all but the most recent of these (for which remediation is underway). […] Overall, we believe that highly effective, publicly known universal jailbreaks for ASL-3 classifier guards are rare.” (Risk Report, p.74)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 25%

The risk report mentions running a bug bounty program to identify vulnerabilities and jailbreaks, which has been active and ongoing since May 2025. It also states that “in general, for significant changes, [they] run a dedicated bug bounty to verify robustness before deploying a new system”, creating an explicit mention of deployment measure sufficiency testing, while staying somewhat discretionary and non-binding.

The RSP contains a commitment to external review of their risk reports, which would likely include an assessment of the sufficiency of deployment measures, though this is not explicitly mentioned. It also elaborates on the qualifications that external reviewers must meet, how conflicts of interest will be avoided, and the reputational and other incentives that should be in place to motivate a candid assessment of the risks by the third party. However, these requirements are binding only if the risk report is “significantly redacted” and the models it covers are “highly capable”, defined as a 2x speedup in AI R&D. This bar is currently not met, and thus the external review process specified is not reflected in the current risk report.

Quotes:

3.1.3 Assurance processes (30%) 43%

3.1.3.1 Credible plans towards the development of assurance processes (40%) 25%

While Anthropic’s documents do not include clear assurance process KCIs, the frontier safety roadmap asserts that current alignment assessment methods are sufficient to determine that sabotage risk is “very low (though not negligible)”. Further, it provides a timeline estimate of when highly capable AI might arrive (“as soon as early 2027”) and lists goals to be achieved by specific dates before then, including “alignment assessments” of Claude’s compliance with Anthropic’s constitution (October 1, 2026) and an internal red-teaming method that “outperforms the collective contributions from the hundreds of participants in our bug bounty” (January 1, 2027). However, while providing information on implemented and planned processes, the documents do not contain explicit justifications for whether and why Anthropic believes that they will be able to keep meeting assurance process KCIs. Further, the RSP explicitly notes that the “Frontier Safety Roadmap is subject to change”, weakening the binding force of the dated commitments.

Quotes:

“We believe that AI capabilities will improve rapidly in the coming years. We will need to quickly and dramatically improve our state of preparedness in a number of areas, especially: Security […] Safeguards […] Alignment […] Policy […]” (Frontier Safety Roadmap)

“Additionally, today’s models show strong and improving capabilities for extended, autonomous technical work, and are increasingly relied on for writing high-stakes code. In our Risk Reports, we now regularly analyze the possibility that these models might execute high-stakes sabotage. We believe that our current alignment assessment methods and monitoring practices are sufficient to make the case that the risk of catastrophic sabotage is very low (though not negligible).” (Frontier Safety Roadmap)

“We believe it is plausible, as soon as early 2027, that our AI systems could fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power—for example, energy, robotics, weapons development and AI itself. By that point, we expect to have accomplished most of the goals listed above, including: […] Consistently implementing systematic alignment audits and other measures for upholding Claude’s Constitution.” (Frontier Safety Roadmap)

“We will: […] Perform systematic alignment assessments to examine Claude’s behavioral patterns and propensities, meaningfully incorporating mechanistic interpretability and adversarial red-teaming to test our auditing methods. […] Develop our internal red-teaming of our deployment safeguards to the point where our internal red-teaming performs better at finding potential jailbreaks than the collective abilities of the participants in our established bug bounty programs. […] Publish Risk Reports with the status of, and noteworthy findings from, all of the above, subject to external review by at least one expert, experienced, credible, candid and disinterested external party.” (RSP, pp.8–9)

“Alignment Target: October 1, 2026 — Upholding Claude’s Constitution […] We will perform systematic ‘alignment assessments’ to examine Claude’s behavioral patterns and propensities, and evaluate whether they are in line with the spirit of the Constitution. These alignment assessments will incorporate findings from our interpretability research, and will validate the effectiveness of our methods using testing exercises with intentionally misaligned models. We will publish our findings in our system cards or Risk Reports.” (Frontier Safety Roadmap)

“Safeguards Target: January 1, 2027 — World-class internal red-teaming […] We believe we can develop a method for red-teaming our systems (likely involving significant automation) that outperforms the collective contributions from the hundreds of participants in our bug bounty.” (Frontier Safety Roadmap)

“Cross-cutting Target: January 1, 2027 — Moving toward an ‘eyes on everything’ state for our internal AI development activities […] We aim to establish comprehensive, centralized records of all our critical AI development activities, and to use AI to analyze these records for issues including concerning behavior by insiders (both human and AI) and security threats.” (Frontier Safety Roadmap)

3.1.3.2 Evidence that the assurance processes are enough to achieve their corresponding KCI thresholds (40%) 50%

As assurance process KCIs are not directly defined, it is difficult to assess whether the evidence given shows that the assurance processes are sufficient. Despite this limitation, the frontier safety roadmap commits to performing alignment assessments that incorporate interpretability findings and model organisms, and to reporting on them in the system cards or risk reports. The risk report provides an example of implemented stress-testing of the alignment auditing processes; however, these tests are run mainly on weaker Claude versions, limiting their applicability to Anthropic’s most capable models like Claude Opus 4.6. Additionally, the risk report includes invalidation conditions that would call into question the sufficiency of alignment audit methods, but does not define a remediation path for this scenario.

Quotes:

“We will perform systematic ‘alignment assessments’ to examine Claude’s behavioral patterns and propensities, and evaluate whether they are in line with the spirit of the Constitution. These alignment assessments will incorporate findings from our interpretability research, and will validate the effectiveness of our methods using testing exercises with intentionally misaligned models. We will publish our findings in our system cards or Risk Reports.” (Frontier Safety Roadmap)

“We have conducted three major exercises to date in which we stress-tested the parts of the alignment auditing processes that we use in practice against intentionally-misaligned model organism models. The results suggest that our methods catch at least the large majority of concerning behaviors studied, at least in the context of the weaker models we study. We hope to do similar analysis on more capable models in the future.
– Our initial work on alignment audits showed that human investigators, using a subset of the methods that we used with Claude Opus 4.6, can identify and root-cause concerning behavior in a Haiku 3.5 model that was finetuned in a way meant to make it subtly misaligned. This is the only one of our major exercises in which we extensively optimized the model organism to make its misaligned trait difficult to discover.
– Our subsequent work on alignment auditing agents showed that the automated behavioral audit method that we used with Claude Opus 4.6 can identify concerning behaviors in Claude Sonnet 3.5 (Oct 2024) models that are system-prompted to show quirky behaviors 7 out of 10 times with no human participation. […]
– Our work on natural emergent misalignment studied four Claude Sonnet 4 variants that were trained to be subtly misaligned using realistic variants of our production training pipeline. In all four cases, the automated behavioral audit method that we used with Claude Opus 4.6 identified them as less well aligned than the corresponding baseline models. […]
– Our work on an overt saboteur studied three Claude Sonnet 4.5 variants that were trained to sabotage Anthropic but were not explicitly trained to conceal this behavior. A human working together with an automated auditing agent successfully identified all three overt sabotage agents, without flagging the two benign models.” (Risk Report, pp.27–28)

“The methods used in the alignment audit appear sufficient to identify dangerous goals, values, or propensities if they are present in models at or below Claude Opus 4.6’s capability level. […] [Invalidation conditions:] A model-organisms study shows that it is possible for such a model to circumvent the full suite of audit methods and still realize one of the pathways discussed here. An updated capabilities assessment casts into doubt the robustness of the alignment audit to the possibility of deliberate deception. We observe a model that has previously undergone an alignment audit attempting an unexpected and egregiously misaligned action that could substantially raise the risk from one of the pathways discussed here.” (Risk Report, p.50)

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 75%

The risk report provides a list of potential pitfalls in alignment evaluations and evaluates them separately, including evaluation awareness and realism, steganography ability, reasoning faithfulness, and sandbagging. The claims are generally qualified with a qualitative confidence level and backed up with reasoning that references evidence given in the report and system cards. However, the claim that the model’s reasoning does not harbor hidden information rests on an evaluation based on mathematics problems, which might not fully elicit this type of capability because the model has little motivation to hide its reasoning in that setting. Further, the risk report explicitly acknowledges that some types of sandbagging may not be caught with their current methods, but does not propose alternative detection methods.

Quotes:

“The biggest challenge we face in our alignment assessments is the possibility that the model under study can reliably identify our test scenarios as test scenarios, and acts differently in those scenarios in ways that render our results unrepresentative of what we will observe in deployment. […] We believe that, on the basis of all of the sources of evidence discussed above—including our heavy use of real data from pilot internal and external use—this risk is adequately managed at the current level of capability. However, this will remain a significant challenge in future assessments […].” (Risk Report, p.24)

“Results on SHADE-Arena shows that Claude Opus 4.6, when prompted to subtly complete a side task while appearing to only work on a primary task, succeeds only 18% of the time when thinking is available and monitored, and 16% of the time when thinking is unavailable. […] still implies that if the model displayed similar capability with a genuine hidden goal, it would be overwhelmingly likely to be caught.” (Risk Report, p.25)

“We have little confidence in the propensity of any frontier model to give a complete and accurate picture of how it reasons in its reasoning text […]. However, we believe that there is an important but much weaker claim that we can make about Claude Opus 4.6: The model does not appear to reliably hide or misrepresent the contents of sequential reasoning that is taking place in its reasoning text. […] The system card reports on an in-house fine-tuning based encoded thinking evaluation (System Card §6.4.2), which shows that the Claude Opus 4.6’s ability to complete mathematics problems is not hindered when its thinking text is replaced by a paraphrase of that thinking text, demonstrating that, at least in this test domain, it does not use its reasoning text in human-illegible ways.” (Risk Report, p.26)

“Our primary concern with these models is sandbagging: strategically choosing to underperform on evaluations. We find this very unlikely for reasons given in previous sections, but it is difficult to test for directly in many domains. […] We find the following: We see no legible evidence of sandbagging attempts.” (Risk Report, p.53)

“The residual risk of sandbagging largely accrues in cases where sandbagging would not significantly benefit from extended thinking—where both the incentive to sandbag and the best way to subtly sandbag are straightforwardly obvious from context […]. It seems plausible that models’ behavior in these cases need not be closely correlated with their behavior in more complex cases, such that occasional sandbagging may not be caught. We must rely on other factors to rule out significant impacts from this kind of subtle sandbagging.” (Risk Report, p.39)

“Our process for training and evaluating Claude Opus 4.6 involved the extensive use of Claude models as graders, monitors, or scorers in ways that could create the potential for collusion, where the scorer models systematically cooperate with Claude Opus 4.6 to hide evidence of coherent misalignment. We find this unlikely. […] we used our automated behavioral audit tool to run several small-sample-size investigations of honeypot scenarios where it might be appealing for a misaligned model, when acting as a monitor, to try to overlook bad behavior by another instance of the same model. We observed no concerning behavior in these settings, though they did largely rely on automatic grading (by Claude Opus 4.5, rather than Claude Opus 4.6). […] our self-preference evaluation […] evaluated Claude Opus 4.6, Opus 4.5 and Sonnet 4.5 for their degree of self-preference. […] Sonnet 4.5 shows modest self-preference in ambiguous qualitative judgment settings, while Claude Opus 4.6 and Opus 4.5 show effectively no self-preference.” (Risk Report, p.29)

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 29%

3.2.1 Monitoring of KRIs (40%) 24%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 50%

The risk report states that helpful-only model versions were used for biorisk evaluations, and describes the amount of time (“2 days”) that participants had to craft a plan. It also mentions that some model capabilities might have been under-elicited due to the removal of training data, which could potentially be undone by a threat actor through fine-tuning, but does not give an approach to address this. Similarly, the risk report mentions not using targeted fine-tuning for sabotage evaluations, and states that the model was “clearly under-elicited” on a subversion strategy evaluation; these are honest limitations, but no potential approach to address them is given. Lastly, the RSP changelog includes an explicit acknowledgement that elicitation specifics have been replaced with more general requirements, justified by stating that “specific methodologies may become outdated”.

Quotes:

“Less prescriptive evaluation methodology: We have replaced some specifics in our previous testing methodology (e.g., using 1% of compute for elicitation or creating a 6x buffer), with more general requirements to (a) match expected efforts of potential adversaries and (b) provide informal estimates of how further scaling and research developments will impact model capabilities and performance on the same tasks. We have found that specific methodologies may become outdated when new research developments are introduced. Although still an aspirational goal, the science of evaluations is not currently mature enough to make confident predictions about the precise buffer we should require between current models and a Capability Threshold.” (RSP, p.17)

“Controlled trial measuring AI assistance for bioweapons acquisition planning. Participants had 2 days to draft a comprehensive plan. Control group could access the internet but not our models; the test group had access to Claude with safeguards removed. Outputs graded by Deloitte using detailed rubric.” (Risk Report, p.66)

“Staff from Deloitte with biosecurity experience engaged in extended conversations testing ability to assist with known bioweapons development. Both helpful-only and standard variants tested over 3 days. Qualitative assessment of risk provided.” (Risk Report, p.67)

“Note that model capabilities were somewhat hampered by our deliberate removal of some training data (in order to lower dangerous capabilities). It’s plausible that a sophisticated threat actor with access to model weights could use fine-tuning to effectively undo this removal.” (Risk Report, p.83)

“This evaluation requires us to elicit the desired sabotage-related behavior. Our elicitation used our standard best practices for lightweight evaluations, including some prompt iteration, but did not involve targeted fine-tuning. This leaves open the possibility that the model could have succeeded under other circumstances.
Subversion Strategy intended to study intentional rare failure attacks […]. Unfortunately, we found Claude Opus 4.6 to be clearly under-elicited at this task, to a degree that we don’t have confidence in the value of the results.” (Risk Report, p.25)

3.2.1.2 Evaluation frequency (25%) 10%

No information is provided on self-imposed mandatory re-evaluation triggers for maximum-elicitation evaluations, either in terms of time or in terms of scaling progress. The FCF mentions a time-based trigger (“every nine months”), but this applies to lighter-touch evaluations, which by definition do not aim for maximal elicitation.

Quotes:

“In addition to carrying out full Systemic Risk Assessments as described above, we conduct lighter-touch model evaluations (which may include running our automatic evaluations and collaborating with external experts to test our models) to consider whether further systemic risk mitigations may be required or a full Systemic Risk Assessment and Model Report update is required. The below trigger points help us determine when a model is substantially modified enough to require an additional Model Report for the updated model as part of our obligations under the TFAIA.
– Every nine months, unless an update of the relevant model is planned within a month of the trigger point; and
– A new model is in training and test model snapshots are available and appropriate for early evaluation.” (FCF, p.14)

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 10%

The RSP changelog mentions that specific safety margins have been replaced by “informal estimates of how further scaling and research developments will impact model capabilities”. Thus, the RSP and risk report do not document specific methodologies for incorporating post-training enhancements, or the size of the safety margins applied.

Quotes:

“Less prescriptive evaluation methodology: We have replaced some specifics in our previous testing methodology (e.g., using 1% of compute for elicitation or creating a 6x buffer), with more general requirements to (a) match expected efforts of potential adversaries and (b) provide informal estimates of how further scaling and research developments will impact model capabilities and performance on the same tasks. We have found that specific methodologies may become outdated when new research developments are introduced. Although still an aspirational goal, the science of evaluations is not currently mature enough to make confident predictions about the precise buffer we should require between current models and a Capability Threshold.” (RSP, p.17)

“On one evaluation, kernel optimization, Opus 4.6 achieved a 427× speedup using a novel scaffold, far exceeding the 300x threshold for 40 human-expert-hours of work and more than doubling performance under our standard setup. This suggests some capability overhang constrained by current tooling rather than fundamental model limitations. As a result, while we do not believe Claude Opus 4.6 meets the threshold for ASL-4 autonomy safeguards, we find ourselves in a gray zone where clean rule-out is difficult and the margin to the threshold is unclear. We expect with high probability that models in the near future could cross this threshold.” (Risk Report, p.31)

3.2.1.4 Vetting of protocols by third parties (15%) 25%

The RSP details a process for third parties to review the risk reports, including required expertise and a guarantee of independence for evaluators. However, these requirements only apply when models are “highly capable” and risk reports “significantly redacted”, which sets a high bar and leaves discretion to Anthropic. Further, it is unclear whether these external reviews of risk reports would cover evaluation methodologies in depth or only at a superficial level.
A more general statement commits to external evaluations of all reports, but its binding force is significantly weakened by the wording that Anthropic will “usually also seek feedback” for those reports. The FCF similarly mentions that Anthropic “may solicit input from external actors in relevant domains […] in the process of implementing [their] risk assessment processes”, but does not include specific expertise requirements, and does not explicitly state that these inputs are also requested for capability evaluations specifically. Lastly, the RSP includes details about a “procedural compliance review” on an approximately annual basis, but notes that this review focuses on procedural compliance with the RSP rather than on substantive outcomes.

Quotes:

“Review and feedback: We will solicit comprehensive internal feedback on the report, focusing on identifying potential methodological weaknesses, analytical gaps, or areas requiring additional evidence or clarification. […] We will usually also seek feedback from trusted external parties with relevant expertise.” (RSP, p.12)

“We may solicit input from external actors in relevant domains, and other stakeholders, in the process of developing and implementing our systemic risk assessment processes (including the identification of potential risks and appropriate safety and security mitigations). We will also rely on commissioned research reports, discussions with domain experts, input from expert forecasters, public research, engagement with the Frontier Model Forum, and internal discussions in implementing our systemic risk assessment processes.” (FCF, p.14)

“Procedural compliance review: On approximately an annual basis, we will commission a third-party review that assesses whether we adhered to this policy’s main procedural commitments. This review will focus on procedural compliance, not substantive outcomes.” (RSP, p.15)

3.2.1.5 Replication of evaluations by third parties (15%) 10%

While the risk report provides multiple examples of external KRI assessments being carried out, predominantly for biorisk evaluations, there is no structural commitment in the RSP or elsewhere to such practices. The FCF includes a non-binding statement that external testing of models “may” be included in lighter-touch evaluations, but does not provide details.

Quotes:

“Bioweapons acquisition uplift trial: Controlled trial measuring AI assistance for bioweapons acquisition planning. […] Outputs graded by Deloitte using detailed rubric.” (Risk Report, p.66)

“ASL-3 expert red teaming: Staff from Deloitte with biosecurity experience engaged in extended conversations testing ability to assist with known bioweapons development. Both helpful-only and standard variants tested over 3 days. Qualitative assessment of risk provided.” (Risk Report, p.67)

“Long-form virology tasks: Task-based agentic evaluations developed with SecureBio, Deloitte, and Signature Science. Tests end-to-end completion of pathogen acquisition processes including workflow design and laboratory protocols.” (Risk Report, p.67)

“Multimodal Virology (VCT): Multiple-choice evaluation from SecureBio assessing virology-specific knowledge with images.” (Risk Report, Table 4.4.A, p.67)

“Short-Horizon Computational Biology Tasks: Faculty.ai evaluations testing multi-step pathogen analysis/engineering tasks.” (Risk Report, Table at §5.4, p.82)

“On simple Claude-3.7-Sonnet-generated red-team trajectories provided by UK AISI on their Control Arena ‘IAC’ setting, the supervisor achieves a recall of 100% […] On other Control Arena red-team trajectories provided by Apollo Research MATS scholars, it achieves a recall of 98%.” (Risk Report, §7.4.8, p.98)

3.2.2 Monitoring of KCIs (40%) 35%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 50%

The risk report provides a list of factors Anthropic monitors to inform its assessment of mitigation effectiveness, including changes to the threat model, inconsistencies across surfaces, public availability of universal jailbreaks, and bug bounty effectiveness. They further provide a failure mode analysis of how threat actors might be able to obtain harmful instructions from a model, including 6 listed threat variants and one redacted one. The framework does not quantify detection confidence. They also mention an ongoing bug bounty program to detect and report vulnerabilities. However, most of the monitoring processes concern only measures for deployment KCIs, while containment and assurance process KCIs are not analyzed in depth. The FCF briefly addresses containment KCIs, but only in a superficial way, stating that “security controls are regularly tested and independently reviewed to ensure effectiveness” and listing four broad categories of interventions to test the controls.

Quotes:

“The above argument for the sufficiency of our safeguards rests on several key assumptions about our threat models, the effectiveness of our defenses, and our ability to respond to emerging threats. We monitor for:
– Changes to the threat model. We will consider any significant changes to our threat models — for example, related to the length of model access required to achieve uplift, the number of potential threat actors, and changes in the complexity of the threat pathway.
– Persistent changes in our safeguards. We periodically inventory our deployed models across different surfaces, assessing whether we remain able to provide the risk mitigations described above.
– Inconsistencies across surfaces. We implement a synchronized monitoring system that runs regular batch tests on AWS, GCP, and 1P environments. We send identical sets of test prompts to each platform and verify that the same content is blocked across all surfaces, to check whether our real-time classifier guards are functioning uniformly.
– Public availability of capability-preserving universal jailbreaks for ASL-3 uses of concern. Our bug bounty program and threat intelligence work are intended to give us information about availability of jailbreaks.
– Access control sufficiency. We monitor the effectiveness of our access control systems through threat intelligence monitoring for reports of credential leaks, particularly for bio-classifier-exempt organizations.
– Jailbreak coverage sufficiency. We assess whether our classifier guards are adequately covering all areas of concern relevant to ASL-3 threat models. We do this through consultation with domain experts and internal evaluations.
– Bug bounty effectiveness. If we find evidence that suggests universal jailbreaks are more valuable than what we pay out in the bug bounty program, this may undermine the efficacy of the bug bounty program. […] We also regularly assess the number of active red-teamers on the bug bounty program.” (Risk Report, pp.76–77)

“Beyond highly effective, publicly known universal jailbreaks, there are a number of other (though less likely, in our view) ways a threat actor might obtain persistent guidance from an AI model:
Threat variant 2: a threat actor finds a highly effective, nonpublic universal jailbreak on their own […]
Threat variant 3: a threat actor obtains a highly effective, nonpublic universal jailbreak from someone else […]
Threat variant 4: a threat actor exploits a trusted user exemption […]
Threat variant 5: a threat actor steals our model weights […]
Threat variant 6: an attacker steals our model weights and uses them to create a safeguard-free version […]
Threat variant 8: an attacker obtains significant uplift from our model despite not doing any of the above.” (Risk Report, pp.75–76)

“Our previous analysis on this topic states: ‘Overall, we expect that highly effective, publicly known universal jailbreaks will generally be available for 1 day out of 5–10, or less […]’ On one hand, we are not aware of any cases when highly effective, publicly known universal jailbreaks were available over the last 6+ months. On the other hand, the jailbreaks that we found privately (via our bug bounty or external red-teaming) took longer to remediate than we had hoped. We are working to improve remediation times and will revisit this topic in our next Risk Report. Overall, we believe that highly effective, publicly known universal jailbreaks for ASL-3 classifier guards are rare.” (Risk Report, p.74)

“We run our Model Safety Bug Bounty Program through HackerOne to incentivize third-party researchers to discover and report vulnerabilities […]. We pay up to $45,000 per novel, universal jailbreak identified. […] Since May 2025, we have averaged 100+ active hackers monthly. […] We contract with threat intelligence vendors to monitor and scrape the deep and dark web for: (i) publicly available universal jailbreaks of concern […]; (ii) black markets for model jailbreaks; (iii) reports of API key leaks […].” (Risk Report, pp.70–71)

3.2.2.2 Vetting of protocols by third parties (30%) 25%

The FCF states that “security controls are […] independently reviewed to ensure effectiveness”, which provides a high-level commitment to third-party assessment of containment measure effectiveness, but not of the internal process used to measure effectiveness. Beyond this, Anthropic does not document a clear process for independent third parties to review the methods for assessing the efficacy of KCI measures. The RSP includes details on general external review of risk reports, which involves commentary on “topics including the quality of our reasoning, the validity of our risk assessments, the overall level of risk”, and whether redactions are reasonable. While this may include review of the KCI measure efficacy testing documented in the risk report, it is not explicitly named as a priority in the assessment scope. Further, the commitment to external reviews is only presented as binding if the models are “highly capable” and the risk report “significantly redacted”. This is a high bar that involves discretionary assessment by Anthropic and significantly weakens the binding force of the commitment.

Quotes:

“Review and feedback: We will solicit comprehensive internal feedback on the report, focusing on identifying potential methodological weaknesses, analytical gaps, or areas requiring additional evidence or clarification. […] We will usually also seek feedback from trusted external parties with relevant expertise.” (RSP, p.12)

3.2.2.3 Replication of evaluations by third parties (30%) 25%

The FCF includes a statement that security controls are “independently reviewed to ensure effectiveness”, providing a high-level commitment to external containment KCI efficacy assessment without operational detail on required expertise or guarantee of independence. Beyond this, the “Model Safety Bug Bounty Program through HackerOne” is brought up repeatedly as one of the main ongoing ways to spot vulnerabilities in Anthropic’s models, covering mainly deployment KCIs. The HackerOne affiliation provides a useful initial filter, but Anthropic does not specify any additional vetting done to guarantee relevant expertise and evaluator independence.

Quotes:

“Since launching our upgraded classifier guards and bug bounty in May of 2025, we have not become aware of any publicly known universal jailbreaks […]. We have become aware of jailbreaks via our bug bounty and via external red-teaming, and have remediated all but the most recent of these […]. On the other hand, the jailbreaks that we found privately (via our bug bounty or external red-teaming) took longer to remediate than we had hoped.” (Risk Report, p.74)

“We will maintain or improve on our ASL-3 protections, which include […] red-teaming, bug bounties, and threat intelligence for continually assessing the threat of jailbreaks […]. Specifics may change, but we will maintain equally or more robust measures over time and will publish updates in our Risk Reports.” (RSP, p.6)

3.2.3 Transparency of evaluation results (10%) 43%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 50%

The RSP commits to openly publishing Risk Reports, which include the results of evaluations. These are redacted if needed, with predefined criteria for redaction provided, including legal compliance, IP protection, safety considerations, or privacy. The FCF mentions maintaining a “Serious Incident Reporting Policy” which defines internal processes for keeping track of relevant information about serious AI incidents, which is then shared with the relevant authorities, as appropriate. However, this is a general requirement to monitor and report “serious”/“critical” incidents, rather than a commitment to notifying regulators or relevant government authorities in the event of specific KRI thresholds being crossed.

Quotes:

“We will publish Risk Reports discussing the risks of our systems and how we have made determinations about whether to continue AI development and deployment in light of the risks. […] Scope. A Risk Report will cover all publicly deployed models at the time of its publication. It will also cover internally deployed models that we determine (1) could pose risks related to high-stakes misalignment or automated R&D that (2) significantly exceed those posed by models covered by a prior Risk Report. […] Timing. We will publish a Risk Report every 3-6 months.” (RSP, p.10)

“We will publish a public version of our Risk Report. We will aim to minimize redactions to the public version of the report. Reasons we may redact material include but are not limited to:
1. Legal compliance […]
2. Intellectual property protection […]
3. Public safety considerations […]
4. Privacy […]” (RSP, p.13)

“A Risk Report will document the following:
1. Threat model identification […]
2. Threat model specification […]
3. Evidence (including evaluations) about relevant model capabilities and behaviors […]
4. Risk mitigations: For each in-scope model, the mitigations we are implementing across security, deployment safeguards, and alignment domains, along with discussion of their effectiveness. […]
Threat-specific risk assessment: For each threat model, we will analyze remaining absolute risk […] and (as relevant) whether we believe we’re meeting the risk mitigation standards corresponding to them.” (RSP, pp.11–12)

“When we publicly deploy a model that we determine is significantly more capable than any of the models covered in the most recent Risk Report, we will publish a discussion (in our System Card or elsewhere) of how that model’s capabilities and propensities affect or change the analysis in the Risk Report. […] Within 30 days of determining that we have an internally deployed model that is in-scope […], we will publish a discussion (in a System Card or elsewhere) […].” (RSP, p.10)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

There is no commitment to permitting reports that detail the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 10%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 10%

The FCF mentions identifying “systemic risks on an ongoing basis across the entire model lifecycle”, but does not provide any further information on this process.

Quotes:

“We identify systemic risks on an ongoing basis across the entire model lifecycle.” (FCF, p.5)

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 10%

For EU models, the FCF includes a commitment to conduct an “additional full systemic risk assessment” if Anthropic has “reasonable grounds to believe” that risk acceptability justifications have been “materially undermined”, but does not specify whether the discovery of novel risks or risk pathways is included in these “reasonable grounds”. It also does not explicitly apply to non-EU models.

Quotes:

“[F]or any of our EU models that are subject to this Framework, if we have reasonable grounds to believe that the justification for why risks stemming from the model are acceptable as set out in the relevant Model Report has been materially undermined, we will complete an additional full Systemic Risk Assessment. We will update our Model Report as appropriate following this additional Systemic Risk Assessment.” (FCF, p.13)
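
The percentages reported for section 3.2 appear consistent with a weighted average of the sub-criterion scores, using the weights shown in parentheses next to each heading. The sketch below illustrates that reading. It is an assumption made here, not the assessors’ documented method: the aggregation rule, the half-up rounding, and the function names `weighted_score` and `as_percent` are all hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def weighted_score(items):
    """Weighted average of (weight_percent, score_percent) pairs.

    Assumption: sub-criterion weights are percentages that sum to 100.
    """
    if sum(weight for weight, _ in items) != 100:
        raise ValueError("weights must sum to 100")
    total = sum(Decimal(weight) * Decimal(score) for weight, score in items)
    return total / Decimal(100)


def as_percent(value):
    """Round to a whole percent, rounding halves up (assumed convention)."""
    return int(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Weights and scores copied from the headings of sections 3.2.1 to 3.2.4.
kri = weighted_score([(30, 50), (25, 10), (15, 10), (15, 25), (15, 10)])
kci = weighted_score([(40, 50), (30, 25), (30, 25)])
transparency = weighted_score([(85, 50), (15, 0)])
novel_risks = weighted_score([(50, 10), (50, 10)])

# Section 3.2 aggregates the four unrounded sub-section scores.
section_3_2 = weighted_score(
    [(40, kri), (40, kci), (10, transparency), (10, novel_risks)]
)

for name, value in [
    ("3.2.1 Monitoring of KRIs", kri),
    ("3.2.2 Monitoring of KCIs", kci),
    ("3.2.3 Transparency of evaluation results", transparency),
    ("3.2.4 Monitoring for novel risks", novel_risks),
    ("3.2 overall", section_3_2),
]:
    print(f"{name}: {as_percent(value)}%")
```

Under these assumptions the computed values match the headline figures of 24%, 35%, 43%, 10%, and 29%. Exact decimal arithmetic is used because 3.2.3 works out to exactly 42.5, which Python’s built-in `round` would round down to 42.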

4. Risk Governance

Moderate 50%

4.1 Decision-making (25%) 44%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 50%

The company has the position of Responsible Scaling Officer (RSO), which is positive. However, it is not specified whether the RSO is the risk owner for all AI-related risks, or merely oversees implementation of the RSP. A dedicated risk owner is mentioned once in the FCF, but it is unclear whether risk ownership is an additional role of the RSO, a separate individual, or several individuals.

Quotes:

“Responsible Scaling Officer: We will maintain the position of RSO, a designated member of staff who is responsible for the implementation of this policy. The RSO’s duties will include (but are not limited to): (1) as needed, proposing updates to this policy; (2) approving relevant model development or deployment decisions based on our risk assessments; (3) reviewing major contracts (e.g., deployment partnerships) for consistency with this policy; (4) overseeing the implementation of this policy, including the allocation of sufficient resources; (5) receiving and addressing reports of potential instances of noncompliance; and (6) making judgment calls on policy interpretation and application.” (RSP, p.15)

“Provided the residual risk falls within acceptable levels, taking into account appropriate safety margins, the model is approved for continued development, internal use (where applicable), and launch (as the case may be). Where the residual risk exceeds acceptable levels, further mitigation measures are considered and implemented. In each case, the justification for proceeding will be documented by the risk owner.” (FCF, p.10)

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No specific committee is mentioned as the decision-making body for risk matters. The RSP includes a high-level description of the process by which the CEO, RSO, and Long-Term Benefit Trust (LTBT) approve the adequacy of risk assessments, but the method of making risk decisions is discretionary.

Quotes:

“3. Executive approval: The Risk Report, along with the internal feedback and any available external feedback, will be sent to the CEO and Responsible Scaling Officer (RSO) for final review and approval. The CEO and RSO will make the ultimate determination regarding the adequacy of the risk assessment and any downstream deployment or development plans.
4. Governance notification: Following approval of a Risk Report, the CEO and RSO will promptly share their decision(s), the underlying Risk Report, and internal feedback with both the Board and the LTBT.
5. Modified process when marginal risk analysis is important to our case. In the event that marginal risk analysis […] plays a major role in a decision to move forward, explicit approval of the Risk Report by the Board and LTBT (rather than just the CEO and RSO) will be required.” (RSP, p.12)

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 50%

The RSP describes how the CEO, RSO, and (under certain conditions) the LTBT make development and deployment decisions on the basis of the risk reports. The RSP Appendix lists two scenarios (“Anthropic in the lead” and “competitors have strong safety measures”) for which it defines somewhat more specific requirements that must be met before development or deployment can proceed (listed in the “industry-wide recommendations” column of the RSP). Otherwise, the decision process of these actors, together with any criteria that must be taken into account, remains discretionary.

Quotes:

“1. Initial assessment and drafting: Our internal subject matter experts will conduct risk assessments and draft the Risk Report.
2. Review and feedback: We will solicit comprehensive internal feedback on the report, focusing on identifying potential methodological weaknesses, analytical gaps, or areas requiring additional evidence or clarification. We may also use this feedback to improve or refine the report itself. We will usually also seek feedback from trusted external parties with relevant expertise.
3. Executive approval: The Risk Report, along with the internal feedback and any available external feedback, will be sent to the CEO and Responsible Scaling Officer (RSO) for final review and approval. The CEO and RSO will make the ultimate determination regarding the adequacy of the risk assessment and any downstream deployment or development plans.
4. Governance notification: Following approval of a Risk Report, the CEO and RSO will promptly share their decision(s), the underlying Risk Report, and internal feedback with both the Board and the LTBT.
5. Modified process when marginal risk analysis is important to our case. In the event that marginal risk analysis […] plays a major role in a decision to move forward, explicit approval of the Risk Report by the Board and LTBT (rather than just the CEO and RSO) will be required.” (RSP, p.12)

“A Risk Report will document the following:
1. Threat model identification […]
2. Threat model specification […]
3. Evidence (including evaluations) about relevant model capabilities and behaviors […]
4. Risk mitigations […]
5. Additional relevant factors […]
Risk analyses. […]
1. Threat-specific risk assessment: For each threat model, we will analyze remaining absolute risk […].
2. Overall risk assessment […].
3. Risk-benefit determination: We will explain whether, and if so why, we believe the identified risks are justified by corresponding benefits.
4. Looking forward […]” (RSP, p.11)

“We will address:
1. Changes in risk mitigation practices […].
2. Decisions to internally deploy in-scope models that would not otherwise be reviewed in a Risk Report (because they happened in between Risk Reports). We will discuss how these decisions were made and whether they appear reasonable in light of any noteworthy new information that has come to light since then.
3. Changes to our Frontier Safety Roadmap and any cases where we failed to meet our goals.” (RSP, p.12)

“Anthropic in the lead. […] We will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety (see Section 1). We will delay AI development and deployment as needed to achieve this, until and unless we no longer believe we have a significant lead.
Competitors have strong safety measures. […] For our highly capable frontier models, we will meet or exceed the overall risk reduction posture of these competitors […]. Until we are able to do so, we will delay AI development and deployment as needed to achieve this.” (RSP, p.16)
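
The approval rule quoted above can be summarized in a short sketch. This is an illustrative reading of the quoted RSP text, not Anthropic's actual procedure; the function name and the boolean flag are hypothetical, and the sketch assumes that Board and LTBT approval is required in addition to, not instead of, CEO and RSO approval.

```python
def required_approvers(marginal_risk_analysis_is_major: bool) -> list[str]:
    """Return who must approve a Risk Report under the quoted RSP process.

    Hypothetical helper: the RSP does not define such a function.
    """
    # Default path: the CEO and RSO make the final determination.
    approvers = ["CEO", "RSO"]
    # Modified process: explicit Board and LTBT approval is also required
    # when marginal risk analysis plays a major role in the decision.
    if marginal_risk_analysis_is_major:
        approvers += ["Board", "LTBT"]
    return approvers


print(required_approvers(False))
print(required_approvers(True))
```

Everything else about the decision, including the criteria the approvers apply, is left to their discretion, which is why the criterion is scored at 50%.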

4.1.4 The company has defined escalation procedures in case of incidents (25%) 75%

The FCF includes a detailed description of actions taken in the case of an incident, providing steps that will be taken to facilitate harm reduction and information sharing. However, it does not specify what kind of harm reduction steps might be taken, and the commitment to information sharing does not extend beyond the legally required authorities mentioned in the CoP and TFAIA. For internal incidents, the RSP specifies that noncompliance reports involving material safety risks are escalated to the Board and may trigger public disclosure.

Quotes:

“Anthropic maintains a detailed Serious Incident Reporting Policy which sets out our internal processes and measures for keeping track of, documenting, and reporting relevant information about:
● Critical Safety Incidents pertaining to Anthropic’s Frontier Models in pursuant to Section 22757.13 of California’s Transparency in Frontier AI Act (“TFAIA”); and
● Serious AI Incidents along the entire GPAISR model lifecycle, in accordance with Commitment 9 (Serious Incident Reporting) of the EU Code and the obligations in Article 55(1)(c) of the EU AI Act.
We have put the following reporting and detection measures in place for observable events that could signify the existence of a Serious AI Incident or Critical Safety Incident, but requires further investigation (an “AI Event”). AI Events are assessed to determine whether they amount to an AI Incident (and in turn a Serious AI Incident) and/or a Critical Safety Incident, as the terms are defined under the relevant regulation.
Anthropic uses various methods including detection and response tooling, end-user feedback, employee reporting channels, bug bounty programs, and community-driven model evaluations to identify AI Events and determine whether they amount to a Serious AI Incident and/or Critical Safety Incident. In some instances, an event may first be identified as a part of Anthropic’s cybersecurity incident response processes, and later assessed to also be a potential Serious AI Incident and/or Critical Safety Incident.
When an AI Event is identified, a member of our Security or Safeguards team (the AI Incident Commander) will be promptly notified and will be responsible for our investigation and response, including assembling an incident response team with appropriate subject matter expert support. One or more members of the incident response team then leads a technical investigation to enable the determination of whether the incident is an AI Incident (and in turn a Serious AI Incident) and/or a Critical Safety Incident and inform appropriate mitigation steps, including gathering relevant information for Anthropic’s reporting to appropriate authorities where applicable, pursuant to the relevant reporting deadlines. If the incident is determined to be a Critical Safety Incident, the AI Incident Commander also determines and documents whether the Critical Safety Incident poses an imminent risk of death or serious physical injury.
We also acknowledge the importance of rectifying harms related to our models and adopting corrective measures to prevent similar future incidents. Following the identification of a Serious AI Incidents or a Critical Safety Incident, Anthropic also works to identify any relevant lessons learned and where applicable consider ways to further assess and mitigate systemic risks related to the Incident.
To support our incident identification and response processes, we provide periodic training to relevant employees on their obligations related to incident response under the TFAIA and the EU AI Act, respectively.” (FCF, pp.11–12)

“Noncompliance reporting: We will maintain a process for Anthropic staff to submit anonymous or identified reports regarding potential noncompliance with this policy. Staff will have more than one option for who receives these reports, including the RSO, and at least one executive who does not report to the RSO. When we receive a report, we will promptly investigate, take appropriate and proportional corrective action if it is substantiated, and document the report and our findings. We will provide quarterly updates to the Board regarding reports of potential noncompliance, whether substantiated or not. If we determine that a report is (1) substantiated and (2) involves a material safety risk, we will promptly notify the Board and we may provide public notice of the same. Finally, we will protect reporters from retaliation, and where a report concerns the conduct of the RSO, at least one recipient will be a member of the Board.” (RSP, p.15)

4.2 Advisory and Challenge (20%) 50%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 75%

Anthropic maintains the position of Responsible Scaling Officer (RSO), who is responsible for overseeing the implementation of the RSP. However, the resources and staffing available to the RSO are not specified.

Quotes:

“Updates to this Framework may be proposed by Anthropic’s Head of Safeguards, Responsible Scaling Officer, General Counsel, Head of Integrity & Compliance, or Chief Information Security Officer.” (FCF, p.15)

“Staff will have more than one option for who receives these reports, including the RSO, and at least one executive who does not report to the RSO. […] Finally, we will protect reporters from retaliation, and where a report concerns the conduct of the RSO, at least one recipient will be a member of the Board.” (RSP, p.15)

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 50%

The Long-Term Benefit Trust (LTBT) provides some evidence for this criterion because it is a standing governance body that receives regular briefings on risk-relevant decisions, can request external review, approves external reviewer selection, and must approve certain high-stakes decisions involving marginal risk analysis. However, the RSP does not clearly state that the LTBT functions as a risk advisory committee to management, meets regularly for that purpose, or has sufficient risk expertise.

Quotes:

“Long Term Benefit Trust: We will regularly brief the LTBT on plans and developments related to our Risk Reports, including model training, capability evals, mitigations, and risk analyses.” (RSP, p.15)

“Policy changes: Changes to the RSP will be proposed by the CEO and RSO, and approved by the Board in consultation with the LTBT. If we update the RSP, we will publicly share the updated version prior to or on its effective date and will record the differences from the prior draft in the Change Log.” (RSP, p.15)

“In selecting external reviewers, we will consult with the Board and obtain the approval of the LTBT.” (RSP, p.14)

“Modified process when marginal risk analysis is important to our case. In the event that marginal risk analysis (see previous section) plays a major role in a decision to move forward, explicit approval of the Risk Report by the Board and LTBT (rather than just the CEO and RSO) will be required.” (RSP, p.12)

“In addition, upon the LTBT’s request, we will conduct a public or private external review of our Risk Report or sections thereof, subject to the procedures outlined below.” (RSP, p.13)

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 75%

The FCF describes ongoing lifecycle risk identification, deployed-model monitoring, incident investigations, post-deployment threat intelligence, bug bounty testing, automated detection, human review, and tools for monitoring flag rates in each systemic risk area. The Responsible Scaling Policy adds periodic Risk Reports every 3–6 months and requires Anthropic to track and report noteworthy changes or deviations from previously described mitigation practices.

However, neither the Frontier Compliance Framework nor the Responsible Scaling Policy clearly establishes a centralized risk dashboard or equivalent mechanism aggregating risk levels, KRIs/KCIs, incidents, mitigation status, and risk-owner decisions across systemic risk categories.

Quotes:

“Where the residual risks associated with the model exceed acceptable risk levels, additional mitigation measures are deployed. To identify whether additional mitigations are required, we may rely on the following techniques, among others:
● post-deployment threat intelligence monitoring that tests our detection (real-time and offline) capabilities as well as tracks how malicious actors use our models;
● a bug bounty program designed to test our real-time blocking classifiers and our offline classification systems;
● robust post-launch monitoring infrastructure that combines automated detection, human review, and threat intelligence to identify misuse patterns; and
● tools to guide automated detection and classifiers, or other detection techniques, that allow our enforcement and data science teams to monitor flag rates in each systemic risk area. The classifiers may run either in real-time or offline depending on the particular risk area.” (FCF, p.10)

“We identify systemic risks on an ongoing basis across the entire model lifecycle. Our risk assessment process draws on multiple sources: literature reviews and expert consultation, internal safety and alignment research, and insights from monitoring deployed models and investigating serious incidents and critical safety incidents.” (FCF, p.5)

“We will publish Risk Reports discussing the risks of our systems and how we have made determinations about whether to continue AI development and deployment in light of the risks. These have significant content in common with system cards, but we are adding additional structure and process aimed at presenting our overall assessments of risk. […] We will publish a Risk Report every 3-6 months.” (RSP, p.10)

“When a Risk Report describes risk mitigations we have in place (or plan to implement shortly), we will keep our future practices in line with this description or track and report noteworthy changes and deviations (generally via subsequent Risk Reports). We should make a strong attempt to ensure that ongoing (as opposed to temporary) changes do not significantly increase the level of risk.” (RSP, p.11)

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 10%

Anthropic does not identify a specific internal group or designated risk-expert function responsible for advising and challenging management’s risk-related decisions. The RSP states that the Risk Report, together with internal and external feedback, is sent to the CEO and Responsible Scaling Officer for final review and approval, and that the CEO and RSO make the ultimate determination regarding the adequacy of the risk assessment and downstream deployment or development plans. This provides some evidence that feedback is incorporated into the decision package, but it does not establish a standing advisory/challenge function over management’s risk decisions.

Quotes:

“The Risk Report, along with the internal feedback and any available external feedback, will be sent to the CEO and Responsible Scaling Officer (RSO) for final review and approval. The CEO and RSO will make the ultimate determination regarding the adequacy of the risk assessment and any downstream deployment or development plans.” (RSP, p.12)

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 90%

The Responsible Scaling Policy establishes a detailed system for aggregating and reporting risk information to senior management and the Board. It requires Risk Reports every 3–6 months, specifies that they include threat model identification and specification, evidence from capability and alignment evaluations, mitigations, additional relevant factors, threat-specific and overall risk assessments, risk-benefit determinations, and forward-looking monitoring plans. These reports, together with internal and available external feedback, are sent to the CEO and Responsible Scaling Officer for final review and approval, after which the CEO and RSO promptly share their decisions, the Risk Report, and internal feedback with the Board and LTBT.

The main limitation is that the Responsible Scaling Policy does not clearly specify a separate regular Board reporting cadence beyond prompt sharing of approved Risk Reports and regular LTBT briefings. Still, the reporting format, contents, and cadence are unusually well specified.

Quotes:

“We will publish Risk Reports discussing the risks of our systems and how we have made determinations about whether to continue AI development and deployment in light of the risks. These have significant content in common with system cards, but we are adding additional structure and process aimed at presenting our overall assessments of risk. […] We will publish a Risk Report every 3-6 months.” (RSP, p.10)

“A Risk Report will document the following:
1. Threat model identification […]
2. Threat model specification […]
3. Evidence (including evaluations) about relevant model capabilities and behaviors […]
4. Risk mitigations […]
5. Additional relevant factors […].” (RSP, p.11)

“Our analyses will include:
Threat-specific risk assessment […]
Overall risk assessment […]
Risk-benefit determination […]
Looking forward […].” (RSP, p.11)

“The Risk Report, along with the internal feedback and any available external feedback, will be sent to the CEO and Responsible Scaling Officer (RSO) for final review and approval.” (RSP, p.12)

“Following approval of a Risk Report, the CEO and RSO will promptly share their decision(s), the underlying Risk Report, and internal feedback with both the Board and the LTBT.” (RSP, p.12)

“We will regularly brief the LTBT on plans and developments related to our Risk Reports, including model training, capability evals, mitigations, and risk analyses.” (RSP, p.15)

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.
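
The 50% score for section 4.2 is consistent with a weighted average of the six criterion scores above. The sketch below shows that arithmetic. It assumes, since the assessment does not state its aggregation rule here, that a section score is the weight-normalized mean of its criterion scores, rounded to the nearest percent; the function name is hypothetical.

```python
def section_score(criteria):
    """Weighted mean of (weight, score) pairs, both given in percent."""
    total_weight = sum(weight for weight, _ in criteria)
    return sum(weight * score for weight, score in criteria) / total_weight


# (weight %, score %) for criteria 4.2.1 through 4.2.6, as listed above
advisory_and_challenge = [
    (16.7, 75),
    (16.7, 50),
    (16.7, 75),
    (16.7, 10),
    (16.7, 90),
    (16.7, 0),
]

print(round(section_score(advisory_and_challenge)))  # 50
```

Under the same assumption, the two equally weighted criteria in 4.3 (25% and 50%) average to 37.5%, which matches the reported 38% after rounding.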

4.3 Audit (20%) 38%

4.3.1 The company has an internal audit function involved in AI governance (50%) 25%

While Anthropic does not reference an internal audit function, they do specify independent validation of threat modeling and risk assessment results, as well as sampling-based audits of control effectiveness. However, this applies specifically to the security program rather than the Framework broadly, and the use of “expect” rather than “commit” weakens the assurance.

Quotes:

“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; periodic, broadly scoped, and independent testing with expert red-teamers who are industry-renowned and have been recognized in competitive challenges.” (p. 10)

4.3.2 The company involves external auditors (50%) 50%

Anthropic commits to an annual third-party procedural compliance review and a more substantive external review of Risk Reports covering reasoning quality, validity of risk assessments, and overall risk level, with strong independence criteria and full access to unredacted reports. However, this substantive review is only binding when a Risk Report both covers “highly capable” models (narrowly defined as crossing an automated AI R&D speedup threshold) and is “significantly redacted.” Neither condition is currently met, so the well-specified independence provisions have not been triggered for the current Risk Report. For all other models, Anthropic states it will “usually also seek feedback from trusted external parties,” a materially weaker commitment. The annual compliance review explicitly excludes substantive outcomes.

To improve, Anthropic should lower the double-trigger threshold to apply the detailed external review process to all Risk Reports, and explicitly extend the external review scope to cover evaluation methodology adequacy and control effectiveness.

Quotes:

“At a minimum, we will complete a full external review process (described below) with at least one external reviewer anytime a Risk Report covers highly capable models and is significantly redacted […] In addition, upon the LTBT’s request, we will conduct a public or private external review of our Risk Report or sections thereof, subject to the procedures outlined below.” (RSP, p.13)

“We will select external reviewers that:
● Have significant experience and expertise regarding evaluations for dangerous AI capabilities and propensities. […]
● Have reputational and other incentives making them likely to be candid about their assessment of risks, rather than focused on writing comments that Anthropic will approve of. For example, external review should not be teams whose revenue, reputation and success depend entirely on Anthropic and similar companies continuing to work with them.
● Do not have conflicts of interest with respect to Anthropic. At a minimum, a reviewing organization itself may not have a financial interest in Anthropic; and the individuals involved in conducting the review, as well as anyone above them in the reporting chain within their organization, may not have a financial interest in Anthropic or close personal relationships with anyone at Anthropic (i.e., family relationships, romantic relationships, or shared living arrangements).

In selecting external reviewers, we will consult with the Board and obtain the approval of the LTBT.” (RSP, pp.13–14)

“The external review process will involve sharing the Risk Report with one or more external reviewers within one week of submitting the same report to our Board and LTBT […] For purposes of external review, the only redactions to the Risk Report will be those necessary to comply with legal prohibitions or to maintain our legal rights.” (RSP, p.14)

“Review and feedback: We will solicit comprehensive internal feedback on the report […] We will usually also seek feedback from trusted external parties with relevant expertise.” (RSP, p.12)
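
The trigger conditions for the full external review can be expressed as a small predicate. This is a simplified sketch of the quoted RSP text, not a statement of how Anthropic implements it; the class and field names are hypothetical, and whether a report is “significantly redacted” is in practice a judgment made by the RSO, CEO, Board, or LTBT.

```python
from dataclasses import dataclass


@dataclass
class RiskReport:
    # All field names are illustrative assumptions, not terms from the RSP.
    covers_highly_capable_models: bool
    significantly_redacted: bool
    ltbt_requested_review: bool = False


def full_external_review_required(report: RiskReport) -> bool:
    """True when the quoted RSP text commits to a full external review."""
    # Double trigger: both conditions must hold at the same time.
    double_trigger = (
        report.covers_highly_capable_models and report.significantly_redacted
    )
    # Separately, the LTBT can request an external review of any report.
    return double_trigger or report.ltbt_requested_review


print(full_external_review_required(RiskReport(True, False)))   # False
print(full_external_review_required(RiskReport(True, True)))    # True
print(full_external_review_required(RiskReport(False, False, True)))  # True
```

The first case shows the gap the assessment highlights: a report covering highly capable models does not trigger the binding review unless it is also significantly redacted or the LTBT asks for one.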

4.4 Oversight (20%) 58%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 25%

Anthropic’s Board of Directors plays an active but limited role in risk governance. It approves changes to the RSP, receives quarterly updates on noncompliance reports regardless of outcome, and must be promptly notified if a report is both substantiated and involves a material safety risk. At least one Board member must receive reports concerning the RSO’s own conduct. The Frontier Compliance Framework adds that the board of directors of Anthropic Ireland Limited specifically oversees implementation of the Framework for EU purposes, with material updates requiring Board approval before taking effect.

No document describes a dedicated Board-level risk or audit committee. The Board functions as a whole, with no subcommittee assigned explicit responsibility for risk matters. Critically, the Board has no documented power to reverse or block deployment decisions.

Quotes:

“Policy changes: Changes to the RSP will be proposed by the CEO and RSO, and approved by the Board in consultation with the LTBT. […]” (RSP, p.15)

“We will provide quarterly updates to the Board regarding reports of potential noncompliance, whether substantiated or not. If we determine that a report is (1) substantiated and (2) involves a material safety risk, we will promptly notify the Board and we may provide public notice of the same. Finally, we will protect reporters from retaliation, and where a report concerns the conduct of the RSO, at least one recipient will be a member of the Board.” (RSP, p.15)

“The board of directors of Anthropic Ireland Limited oversees implementation of this Framework for EU purposes.” (FCF, p.15)

“Material updates will be presented to the board of directors of Anthropic Ireland Limited for oversight, with approved changes and justifications for material updates documented in a changelog and published within 30 days of the update.” (FCF, p.15)

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 90%

Anthropic’s Long-Term Benefit Trust (LTBT) is the only external oversight body among the providers assessed. RSP 3.3 gives it a well-defined and substantive role: it receives regular briefings on Risk Report developments, is consulted on all RSP policy changes, receives Risk Reports and underlying decisions following CEO/RSO approval, and holds explicit veto authority when marginal risk analysis drives a deployment decision. The LTBT can also unilaterally request external review of any Risk Report and holds approval rights over the selection of external reviewers.

What remains underdeveloped is the LTBT’s institutional description: the RSP does not specify its composition, member selection criteria, independence requirements, or how its decisions would be enforced in the event of disagreement with management.

Quotes:

“Policy changes: Changes to the RSP will be proposed by the CEO and RSO, and approved by the Board in consultation with the LTBT.” (RSP, p.15)

“Governance notification: Following approval of a Risk Report, the CEO and RSO will promptly share their decision(s), the underlying Risk Report, and internal feedback with both the Board and the LTBT.” (RSP, p.12)

“A report shall be deemed significantly redacted if, in the judgment of the RSO, CEO, Board, or LTBT, it meets this description. […] In addition, upon the LTBT’s request, we will conduct a public or private external review of our Risk Report or sections thereof, subject to the procedures outlined below.” (RSP, p.13)

“In selecting external reviewers, we will consult with the Board and obtain the approval of the LTBT.” (RSP, p.14)

“The external review process will involve sharing the Risk Report with one or more external reviewers within one week of submitting the same report to our Board and LTBT. We will ask the external reviewers to provide public commentary on our report within 30 days of receipt. We will try to work toward a process that involves the full external review being completed prior to Board/LTBT review (and may require this later).” (RSP, p.14)

4.5 Culture (10%) 67%

4.5.1 The company has a strong tone from the top (33.3%) 50%

Anthropic’s documents consistently present safety as central to the company’s purpose. The FCF opens with a mission statement tying responsible development to concrete activities: safety research, pre-deployment evaluation, and collaboration with the AI safety community. The RSP describes itself as a “voluntary framework for managing catastrophic risks” and commits to continual updates as capabilities advance. Most distinctively, the Frontier Safety Roadmap is framed as a deliberate “forcing function” requiring cross-departmental safety prioritization over competitive pressures. The RSP’s appendix further states that “mitigating the risks from our models is a top priority,” including willingness to pause development outside formally specified scenarios.

The gap is the absence of a description of how leadership signals these priorities in day-to-day operations. The marginal risk clauses, which permit safeguard reductions relative to competitor behavior, are also in tension with the stated commitment.

Quotes:

“Anthropic’s mission is the responsible development and maintenance of advanced AI for the long-term benefit of humanity. Central to this mission is our commitment to building AI systems that are reliable, interpretable, and steerable. We pursue this through extensive research on AI safety and alignment, rigorous model evaluation and testing to identify and mitigate potential risks before deployment, and active collaboration with the broader AI safety community to share research findings and contribute to industry-wide safety standards.” (FCF, p.3)

“Our Responsible Scaling Policy (RSP) is our voluntary framework for managing catastrophic risks from advanced AI systems. It establishes how we identify and evaluate risks, how we make decisions about AI development and deployment, and, from the perspective of the world at large, how we aim to make sure that the benefits of our models exceed their costs.” (RSP, p.3)

“Mitigating the risks from our models is a top priority for us, and we would strongly consider pausing development and/or deployment to improve the safety profiles of our models even in cases not covered below.” (RSP, p.16)

“Our RSP is only one part of our overall approach to safety. For instance, although this policy focuses on catastrophic risks, they are not the only risks we consider important—our Usage Policy and societal impacts research address other concerns.” (RSP, p.3)

“Anthropic in the lead. We have developed or will imminently develop a highly capable model; and we have clear evidence that no other competitor will soon develop such a model. […] We will require a strong argument that catastrophic risk is contained, along the lines of our recommendations for industry-wide safety (see Section 1). We will delay AI development and deployment as needed to achieve this, until and unless we no longer believe we have a significant lead.” (RSP, p.16)

“This report evaluates the degree to which Anthropic’s AI systems pose catastrophic risk in several categories, in light of what we know about both their capabilities and the measures we have in place for mitigating risk. […] This report is not scoped to a single AI model. Rather, it is a risk assessment of Anthropic’s activities as a whole.” (Risk report, p.5)

“By establishing this expectation, we hope to create a forcing function for work that would otherwise be challenging to appropriately prioritize and resource, as it requires collaboration (and in some cases sacrifices) from multiple parts of the company and can be at cross-purposes with immediate competitive and commercial priorities.” (RSP, p.10)

4.5.2 The company has a strong risk culture (33.3%) 50%

Anthropic describes various practices that contribute to a strong risk culture, including continual framework assessments and updates, risk trainings, reporting channels, and security drills. However, these practices are presented in isolation, without being linked to risk culture. To improve, Anthropic should explicitly state which practices are intended to contribute to its risk culture.

Quotes:

“We have always intended for our RSP to be a living document. We will continually update the RSP as we learn more about AI capabilities and risks, develop and refine technical safety measures, and gain more experience navigating an ecosystem in which the risks to society depend on the actions of many developers.” (RSP, p.3)

“Anthropic commits to ensuring that this Framework is state-of-the-art and reflects Anthropic’s current policies with respect to compliance with the TFAIA and the EU Code. […] The Legal and Compliance function will also determine which updates are required based on factors including, but not limited to, changes in law or regulatory guidance, changes in frontier model capabilities and related technologies, new approaches to mitigations and safeguards, other incidents affecting the industry, and new industry best practices and standards.” (FCF, pp.15–16)

“We will complete a Framework Assessment: (a) at least once every 12 months from the Effective Dates of the TFAIA and the EU Code; and (b) if the relevant factors in the update and approval process are satisfied.” (FCF, p.16)

“Internal review: We will regularly conduct an internal review of our implementation of this policy.” (RSP, p.15)

“Insider threat mitigations: We manage insider risk through personnel screening, regular training, and strict role-based access management. Staff have clear reporting channels to raise concerns, and internal monitoring supports early identification of suspicious activity.” (FCF, p.13)

“To support our incident identification and response processes, we provide periodic training to relevant employees on their obligations related to incident response under the TFAIA and the EU AI Act, respectively.” (FCF, p.12)

“Note: we have made redactions from our public Frontier Safety Roadmap for reasons including protecting sensitive IP and not giving too much information about our current protections to threat actors. The unredacted version is shared with all full-time employees as well as our board and Long-Term Benefit Trust (LTBT).” (Frontier Safety Roadmap)

4.5.3 The company has a strong speak-up culture (33.3%) 100%

Anthropic’s documents are industry-leading in terms of speak-up culture. The dedicated RSP Noncompliance Reporting and Anti-Retaliation Policy lays out in elaborate detail the “process by which employees may report concerns about RSP noncompliance”. It includes information on the different roles relevant to the whistleblowing procedures, what can be reported under the policy, the whistleblowing procedure, confidentiality protections, guidance on what to include in reports, the investigation process, and the anti-retaliation policy. Further, the policy enumerates subtle forms of retaliation—isolation, ostracism, exclusion from meetings, false performance accusations—signaling mature attention to how retaliation actually manifests. It also explicitly encourages vendors, independent contractors, and others temporarily performing work for Anthropic to report concerns.
Additionally, the RSP itself requires multiple reporting recipients, including at least one executive outside the RSO’s reporting line, mandatory quarterly Board updates on all noncompliance reports whether substantiated or not, and a unique commitment not to impose non-disparagement clauses that could chill public safety speech.

Quotes:

“While we encourage internal reporting and we hope you’ll come to us first, nothing in this policy or any other Anthropic agreement or policy prohibits you from reporting potential violations of law to appropriate government authorities without Anthropic’s authorization and without retaliation.” (RSP Noncompliance Reporting and Anti-Retaliation Policy, p.6)

“Our Safety & Compliance Report Hotline is powered by Navex EthicsPoint, our trusted third-party partner. […] The reporting system is designed to protect reporter confidentiality. Anthropic cannot unmask the identity of any reporter. The system allows anonymous correspondence with the RSO (or delegate) about reports throughout the investigation.” (RSP Noncompliance Reporting and Anti-Retaliation Policy, pp.4–5)

“Employee agreements: We will not impose contractual non-disparagement obligations on employees, candidates, or former employees in a way that could impede or discourage them from publicly raising safety concerns about Anthropic. If we offer agreements with a non-disparagement clause, that clause will not preclude raising safety concerns, nor will it preclude disclosure of the existence of that clause.” (RSP, p.15)

4.6 Transparency (5%) 63%

4.6.1 The company reports externally on what their risks are (33.3%) 90%

The RSP commits to publishing Risk Reports every 3–6 months and to including in them information on relevant threat models, evidence about relevant capabilities and behaviors, mitigations employed, the risk-benefit determination, and more. Anthropic states the reasons for redactions, including legal compliance, IP protection, public safety considerations, and privacy. They further commit to putting “significant effort” into investigating risk and sharing what they find.

Quotes:

“We will publish Risk Reports discussing the risks of our systems and how we have made determinations about whether to continue AI development and deployment in light of the risks. These have significant content in common with system cards, but we are adding additional structure and process aimed at presenting our overall assessments of risk.” (RSP, p.10)

“A Risk Report will cover all publicly deployed models at the time of its publication. It will also cover internally deployed models when we determine that these models could pose significant risks beyond those posed by models that are covered by a prior Risk Report. […] Timing. We will publish a Risk Report every 3-6 months. […] Off-cycle updates. Separate from our publication of Risk Reports, we will publish an analysis of an individual model’s relevant risks (e.g., how the model’s capabilities and propensities affect or change prior analyses) in the following circumstances:
● When we publicly deploy a model that we determine is (1) significantly more capable than (2) all models for which we have publicly analyzed risks related to chemical/biological weapons production, high-stakes misalignment, or automated R&D (e.g., in a prior Risk Report, System Card, or other public artifact).
● Within 30 days of determining that we have an internally deployed model that (1) could pose risks related to high-stakes misalignment or automated R&D that (2) significantly exceed those of all models for which we have publicly analyzed such risks (e.g., in a prior Risk Report, System Card, or other public artifact).” (RSP, p.10)

“A Risk Report will document the following:

“Our analyses will include:
Threat-specific risk assessment […]
Overall risk assessment […]
Risk-benefit determination […]
Looking forward […].” (RSP, p.11)

“– We intend for our Risk Reports to be direct, candid, and informative about how we see the risks of our systems and our state of preparedness for them.
– In particular, we will acknowledge when we view certain models as posing significant risks in absolute terms, even if our marginal contribution to overall ecosystem risk may be relatively limited […].
– We will put significant effort into investigating (for example, via capability evaluations) the case for risk, and into sharing what we find.” (RSP, p.10)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 90%

The RSP has a distinct section on governance, which provides plenty of details on the governance structure and how it is integrated into the company’s broader AI governance program.

Quotes:

“4. Governance
We commit to the following governance measures to promote internal and external accountability.
1. Responsible Scaling Officer: We will maintain the position of RSO, a designated member of staff who is responsible for the implementation of this policy. The RSO’s duties will include (but are not limited to): (1) as needed, proposing updates to this policy; (2) approving relevant model development or deployment decisions based on our risk assessments; (3) reviewing major contracts (e.g., deployment partnerships) for consistency with this policy; (4) overseeing the implementation of this policy, including the allocation of sufficient resources; (5) receiving and addressing reports of potential instances of noncompliance; and (6) making judgment calls on policy interpretation and application.
2. Long Term Benefit Trust: We will regularly brief the LTBT on plans and developments related to our Risk Reports, including model training, capability evals, mitigations, and risk analyses.
3. Internal transparency: We will share final, unredacted Risk Reports with Anthropic’s regular-clearance staff.
4. Noncompliance reporting: […] We will provide quarterly updates to the Board regarding reports of potential noncompliance […]. If we determine that a report is (1) substantiated and (2) involves a material safety risk, we will promptly notify the Board […]. […] where a report concerns the conduct of the RSO, at least one recipient will be a member of the Board.
8. Policy changes: Changes to the RSP will be proposed by the CEO and RSO, and approved by the Board in consultation with the LTBT. […] We will maintain the current version of the RSP on our website.” (RSP, p.15)

4.6.3 The company shares information with industry peers and government bodies (33.3%) 10%

The frameworks say little about whether, and how much, information is shared with industry peers and government bodies. The RSP states that policy recommendations for mitigating identified threat pathways will be shared with policymakers, and the FCF includes a commitment to “share research findings”, but it specifies neither with whom exactly nor what kind of research findings are meant.

Quotes:

“Additionally, we will identify the most concerning specific threat pathways, create policy recommendations for early detection and response for such threats, and share this content with policymakers.” (RSP, p.7)

“We pursue this through extensive research on AI safety and alignment, rigorous model evaluation and testing to identify and mitigate potential risks before deployment, and active collaboration with the broader AI safety community to share research findings and contribute to industry-wide safety standards.” (FCF, p.3)
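
As an illustration of how the three sub-criterion scores above appear to roll up into the 4.6 Transparency score of 63%, the following is a minimal Python sketch. It assumes the parent score is the weighted average of its sub-criteria, using the weights shown in parentheses in the headings; the aggregation rule is not spelled out in this section, and the function name `weighted_score` is hypothetical.

```python
def weighted_score(criteria):
    """Return the weighted average of (weight, score) pairs, both in percent.

    Assumption: a parent criterion's score is the weighted mean of its
    sub-criteria. Weights are normalized, so 33.3 + 33.3 + 33.3 is fine.
    """
    total_weight = sum(weight for weight, _ in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * score for weight, score in criteria) / total_weight


# Sub-criteria of 4.6 Transparency, as (weight %, score %).
transparency = [
    (33.3, 90),  # 4.6.1 external reporting on risks
    (33.3, 90),  # 4.6.2 external reporting on governance structure
    (33.3, 10),  # 4.6.3 sharing with industry peers and government bodies
]

transparency_score = weighted_score(transparency)
print(f"{transparency_score:.1f}%")  # prints 63.3%
```

Under this assumption, the low score on 4.6.3 is what pulls the category down: two sub-criteria at 90% and one at 10% average to roughly 63%, matching the reported figure.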
