3.2.1 Monitoring of KRIs (40%) 64%
3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 75%
The framework acknowledges the need to match realistic attacker capabilities and lists some of the elicitation methods used (scaffolding, fine-tuning, expert prompting). However, it does not provide quantitative specifics, such as how much compute is used for fine-tuning. More detail could be added on which elicitation methods they anticipate different threat actors would use under realistic settings, in order to justify their choice of elicitation methods.
Quotes:
“Elicitation: Demonstrate that, when given enough resources to extrapolate to realistic attackers, researchers cannot elicit sufficiently useful results from the model on the relevant tasks. We should assume that jailbreaks and model weight theft are possibilities, and therefore perform testing on models without safety mechanisms (such as harmlessness training) that could obscure these capabilities.” (p. 6)
“We will also consider the possible performance increase from using resources that a realistic attacker would have access to, such as scaffolding, finetuning, and expert prompting. At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates.” (p. 6)
“By ‘widely accessible,’ we mean techniques that are available to a moderately resourced group (i.e., do not involve setting up large amounts of custom infrastructure or using confidential information).” (Footnote 6, p. 6)
3.2.1.2 Evaluation frequency (25%) 100%
The framework clearly specifies evaluation frequency in terms of effective computing power: a 4x increase in Effective Compute triggers a comprehensive assessment. This is a quantitative threshold that directly addresses the criterion with appropriate detail. The framework also specifies a fixed six-month interval for accounting for accumulated post-training enhancements. A sketch of this trigger logic follows the quotes below.
Quotes:
“The term ‘notably more capable’ is operationalized as at least one of the following: 1. The model is notably more performant on automated tests in risk-relevant domains (defined as 4x or more in Effective Compute).” (pp. 5-6)
“Adjusted evaluation cadence: We adjusted the comprehensive assessment cadence to 4x Effective Compute or six months of accumulated post-training enhancements (this was previously three months).” (p. 17)
“Six months’ worth of finetuning and other capability elicitation methods have accumulated. This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely.” (p. 6)
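To make the cadence concrete, the following minimal sketch illustrates how the two quoted triggers might be combined into a single check. The function name, constants, and example values are our own illustrative assumptions, not anything specified by the framework.

```python
from datetime import date, timedelta

# Illustrative constants drawn from the quoted cadence; names are hypothetical.
EFFECTIVE_COMPUTE_TRIGGER = 4.0           # "4x or more in Effective Compute"
ELICITATION_WINDOW = timedelta(days=183)  # roughly six months of accumulated enhancements

def comprehensive_assessment_due(
    effective_compute_now: float,
    effective_compute_at_last_assessment: float,
    last_assessment_date: date,
    today: date,
) -> bool:
    """Return True if either cadence condition described in the quotes is met."""
    compute_ratio = effective_compute_now / effective_compute_at_last_assessment
    elapsed = today - last_assessment_date
    return compute_ratio >= EFFECTIVE_COMPUTE_TRIGGER or elapsed >= ELICITATION_WINDOW

# Example: a 3x Effective Compute increase after seven months still triggers
# reassessment via the calendar-time condition.
print(comprehensive_assessment_due(3.0, 1.0, date(2024, 1, 1), date(2024, 8, 1)))  # True
```

Under the framework, crossing either branch would prompt the comprehensive assessment described above; the calendar-time branch exists because, as the quotes note, there is not yet a metric to estimate the impact of accumulated post-training enhancements more precisely.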
3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 50%
The policy acknowledges the importance of accounting for “widely accessible” post-training enhancements in capability assessments. The buffer accounts both for the possibility that Anthropic’s own deployments are enhanced (for example via its finetuning products) and for the possibility that the model weights are stolen and further modified: “We include headroom to account for the possibility that the model is either modified via one of our own finetuning products or stolen in the months following testing, and used to create a model that has reached a Capability Threshold.” Both considerations show nuance.
However, since the assessment is restricted to “widely accessible post-training enhancements”, the framework should, to improve, justify why it focuses only on these enhancements.
It is commendable that they note that “exploring ways to integrate [post-training enhancements] into an overall metric is an ongoing area of research”, though an improvement would be to commit to contributing to this research.
Further, more detail could be added on how post-training enhancements are accounted for across different model architectures: their risk profiles are much more scalable with reasoning models, since inference compute can often be scaled to improve capabilities.
Quotes:
“For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements” (p. 6) and “By “widely accessible,” we mean techniques that are available to a moderately resourced group (i.e., do not involve setting up large amounts of custom infrastructure or using confidential information). We include headroom to account for the possibility that the model is either modified via one of our own finetuning products or stolen in the months following testing, and used to create a model that has reached a Capability Threshold. That said, estimating these future effects is very difficult given the state of research today” (Footnote 6, p. 6)
A model is notably capable if: “Six months’ worth of finetuning and other capability elicitation methods have accumulated. This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely” (p. 6) and “Exploring ways to integrate these types of improvements into an overall metric is an ongoing area of research.” (Footnote 5, p. 6)
3.2.1.4 Vetting of protocols by third parties (15%) 10%
The policy mentions soliciting external expert input in developing and conducting capability assessments, which partially addresses protocol vetting. However, this is general input rather than a strong commitment specific to the vetting of evaluation protocols. Further, the level of expertise required, and why the chosen experts satisfy this criterion, should be detailed.
Quotes:
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments.” (p. 13)
3.2.1.5 Replication of evaluations by third parties (15%) 50%
The framework mentions at multiple stages of the risk assessment process that they will share materials related to the evaluations and seek input from experts, but not that these experts will reproduce or audit the results directly. To improve, a process for having evaluations externally assessed or audited should be detailed.
Quotes:
“To advance the public dialogue on the regulation of frontier AI model risks and to enable examination of our actions, we will also publicly release key materials related to the evaluation and deployment of our models with sensitive information removed and solicit input from external experts in relevant domains.” (p. 13)
“We will solicit both internal and external expert feedback on the [Capability] report” (p. 7)
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments.” (p. 13)
3.2.2 Monitoring of KCIs (40%) 43%
3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 50%
The framework provides a high-level description of monitoring procedures for deployment measures, with examples such as “jailbreak bounties, doing historical analysis or background monitoring, and any necessary retention of logs for these activities.” To improve, they should define what a “reasonable cadence” for this monitoring means in practice.
They also mention that they will develop plans to audit the implementation of containment measures, but there is no commitment to audit assurance processes. They describe red-teaming of the model with deployment measures in place to “[demonstrate] that threat actors with realistic access levels and resources are highly unlikely to be able to consistently elicit information from any generally accessible systems that greatly increases their ability to cause catastrophic harm relative to other available tools”.
However, they note that “This criterion does not attempt to specify the exact red-teaming protocol (e.g., number of hours, level of access, or pass-fail criteria). Setting a principled pass-fail threshold will depend on other factors, such as the quality of our monitoring and ability to respond to jailbreaks rapidly.” An improvement would be to specify the protocol as fully as possible, to ensure transparency, and to conduct this red-teaming continuously or at a regular cadence.
It is commendable that they note the importance of “prespecify[ing] empirical evidence that would show the system is operating within the accepted risk range” for monitoring.
However, to improve, the framework should describe systematic, ongoing monitoring so that mitigation effectiveness is tracked continuously and the KCI threshold continues to be met whenever required.
Quotes:
“Monitoring: Prespecify empirical evidence that would show the system is operating within the accepted risk range and define a process for reviewing the system’s performance on a reasonable cadence. Process examples include monitoring responses to jailbreak bounties, doing historical analysis or background monitoring, and any necessary retention of logs for these activities.” (p. 8)
“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence” (p. 10)
3.2.2.2 Vetting of protocols by third parties (30%) 25%
The policy mentions soliciting external expert input in developing and conducting safeguard assessments, which partially addresses protocol vetting. However, this is general input rather than a strong commitment specific to the vetting of evaluation protocols.
Quotes:
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments.” (p. 13)
“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results” (p. 10)
3.2.2.3 Replication of evaluations by third parties (30%) 50%
The framework mentions that they will share materials related to KCI evaluations (i.e. safeguard assessments) and seek input from experts, but not that experts will reproduce the results directly.
Quotes:
“To advance the public dialogue on the regulation of frontier AI model risks and to enable examination of our actions, we will also publicly release key materials related to the evaluation and deployment of our models with sensitive information removed and solicit input from external experts in relevant domains.” (p. 13)
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments.” (p. 13)
“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results” (p. 10)
3.2.3 Transparency of evaluation results (10%) 77%
3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 90%
The policy demonstrates strong commitment to sharing evaluation results with multiple stakeholders: the public (summaries), government entities, internal staff, the Board of Directors, and the Long-Term Benefit Trust. Multiple channels and levels of disclosure are specified. There is a commitment to notifying a relevant authority if “a model requires stronger protections than the ASL-2 Standard.” They commit to publicly releasing “key information related to the evaluation and deployment of our models”; to improve, they should commit to publishing all KRI and KCI assessments (with sensitive information redacted).
Quotes:
“Public disclosures: We will publicly release key information related to the evaluation and deployment of our models (not including sensitive details). These include summaries of related Capability and Safeguards reports when we deploy a model” (p. 13)
“U.S. Government notice: We will notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard.” (p. 13)
“We will share summaries of Capability Reports and Safeguards Reports with Anthropic’s regular-clearance staff, redacting any highly-sensitive information.” (p. 12)
“[If] the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.” (p. 7)
3.2.3.2 Commitment to non-interference with findings (15%) 0%
The framework makes no commitment to permit reports detailing the results of external evaluations (i.e., any KRI or KCI assessments conducted by third parties) to be written independently, without interference or suppression.
Quotes:
No relevant quotes found.
3.2.4 Monitoring for novel risks (10%) 5%
3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 0%
Despite noting that “for each capability threshold, [we will] make a compelling case that we have mapped out the most likely and consequential threat models: combinations of actors (if relevant), attack pathways, model capability bottlenecks, and types of harms. We also make a compelling case that there does not exist a threat model that we are not evaluating that represents a substantial amount of risk”, there does not appear to be a process for identifying novel risks post-deployment which could signal alternative threat models. Hence, their risk modelling appears to be informed mostly a priori, rather than by empirical data from the model in deployment. To improve, they could establish a process for actively searching for novel risks or changed risk profiles of deployed models.
They do mention “periodic, broadly scoped, and independent testing with expert red-teamers” for auditing their security programs. However, this testing is not aimed at surfacing novel risk profiles, so credit is not given.
Quotes:
No relevant quotes found.
3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 10%
There is some indication that, if novel threat models they had not considered were identified, an effort would be made to incorporate them into the risk assessment. One indication of this is the general intention to incorporate findings from evaluations, which may uncover new risk profiles of models: “Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available.”
They also mention that “as our understanding evolves, we may identify additional [capability] thresholds.” However, this statement does not explicitly commit to incorporating novel risks into their risk identification and prioritization process. They note that they will maintain a “list of capabilities that we think require significant investigation” that “could pose serious risks, but the exact Capability Threshold and the Required Safeguards are not clear at present.” This gives some indication that additional risks may be incorporated into their risk assessment. To improve, they could commit to engaging in risk modelling that maps out the potential harms from changed risk profiles, in order to keep pace with the evolving risk landscape.
Quotes:
“Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available.” (p. 6)
“These Capability Thresholds represent our current understanding of the most pressing catastrophic risks. As our understanding evolves, we may identify additional thresholds. For each threshold, we will identify and describe the corresponding Required Safeguards as soon as feasible, and at minimum before training or deploying any model that reaches that threshold.”
“We will also maintain a list of capabilities that we think require significant investigation and may require stronger safeguards than ASL-2 provides. This group of capabilities could pose serious risks, but the exact Capability Threshold and the Required Safeguards are not clear at present. These capabilities may warrant a higher standard of safeguards, such as the ASL-3 Security or Deployment Standard. However, it is also possible that by the time these capabilities are reached, there will be evidence that such a standard is not necessary (for example, because of the potential use of similar capabilities for defensive purposes). Instead of prespecifying particular thresholds and safeguards today, we will conduct ongoing assessments of the risks with the goal of determining in a future iteration of this policy what the Capability Thresholds and Required Safeguards would be.” (p. 5)