Total Score: Moderate (2.2/5)

Risk Identification: Moderate (2.5/5)

Risk Tolerance & Analysis: Weak (2/5)

Risk Mitigation: Moderate (2.2/5)

Score legend:
0: Non-Existent
0–1: Very Weak
1–2: Weak
2–3: Moderate
3–4: Substantial
4–5: Strong
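As a reading aid, here is a minimal Python sketch of the score-to-label mapping this legend implies. The boundary handling is our inference from how the report labels its own scores (2/5 reads as "Weak" while 2.5/5 reads as "Moderate"), not something the legend states.

```python
# Minimal sketch of the legend's score-to-label mapping (our inference).
# Boundary scores are assigned to the lower band: exactly 2 -> "Weak".
def score_label(score: float) -> str:
    if not 0 <= score <= 5:
        raise ValueError("score must be in [0, 5]")
    if score == 0:
        return "Non-Existent"
    for upper, label in [(1, "Very Weak"), (2, "Weak"), (3, "Moderate"),
                         (4, "Substantial"), (5, "Strong")]:
        if score <= upper:
            return label

assert score_label(2.0) == "Weak" and score_label(2.5) == "Moderate"
```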

Risk Identification

In risk identification, we assess whether an AI developer is:

  • Approaching the risks outlined in the literature in an appropriate way.
  • Conducting extensive open-ended red teaming to identify new risks.
  • Leveraging a diverse range of risk identification techniques, including threat modeling when appropriate, to adequately identify new threats.
Scoring rubric:
0 - No information available.

1 - Some unclear links are made and some risks are mentioned. Some open-ended red teaming efforts are reported, along with some very basic threat modeling.

2 - Links are made for a number of risks, but important ones are missing. Significant open-ended red teaming efforts are reported, along with significant threat modeling efforts.

3 - Links are made for most of the important and commonly discussed risks. Consequential red teaming is precisely reported, along with significant threat modeling and use of structured risk identification techniques.

4 - Links are clearly made for most risks, and almost all risks are mentioned. There is a methodology outlining how structured risk identification is done across the lifecycle; precisely characterized red teaming (including by external parties) is conducted, along with advanced and broad threat modeling.

5 - There is a comprehensive, continued, and detailed effort to ensure all risks are found and covered. The red teaming effort is extremely extensive, jointly integrated with structured risk identification efforts, quantified, and done with third parties.
Score: Moderate (2.5/5)

Highlights

  • Anthropic covers a number of different risk types well and conducts threat and risk modeling for some, including biorisk, biases, and deception with their work on sleeper agents.
  • In “Responsible Scaling Policy Evaluations Report – Claude 3 Opus”, Anthropic describes their risk management procedure for Cybersecurity, CBRN information, and Model Autonomy.
  • In “The Claude 3 Model Family: Opus, Sonnet, Haiku”, Anthropic describes open-ended red teaming: “The team engaged the model in multi-turn conversations about sensitive or harmful topics to analyze responses, identify areas for improvement, and establish a baseline for evaluating models over time. Examples of tested topics include, but are not limited to: child safety, dangerous weapons and technology, hate speech, violent extremism, fraud, and illegal substances.”

Weaknesses

  • The open-ended red teaming procedures to identify risks, described in the Responsible Scaling Policy, lack crucial details, in particular regarding their integration with risk identification procedures. It is unclear how much threat and risk modeling is conducted for novel vulnerabilities identified during red teaming, or for certain key risks (e.g., cyber offense).
  • Anthropic does not address all high-severity risks. For example, their Responsible Scaling Policy does not address LLM persuasion capabilities.

Risk Tolerance & Analysis

In risk tolerance and analysis, we assess whether the AI developers have defined:

  • A global risk tolerance.
  • Operational capability thresholds and their equivalent risk levels, defined with precision and breadth.
  • Corresponding objectives of risk mitigation measures: AI developers should establish clear objectives for risk mitigation measures. These objectives should be grounded in strong rationales, including threat modeling, to justify that they are sufficient to address the identified risks and align with the organization's risk tolerance.
  • Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.
Global Risk Tolerance
Scoring rubric:
0 - No information available.

1 - Global risk tolerance is qualitatively defined.
E.g., “Our system should not increase the likelihood of extinction risks”.

2 - Global risk tolerance is quantitatively defined for casualties.

3 - Global risk tolerance is quantitatively defined for casualties and economic damages, with adequate ranges and rationale for the decision.

4 - Global risk tolerance is quantitatively defined for casualties, economic damages, and other high-severity risks (e.g., large-scale manipulation of public opinion), with robust methodology and decision-making processes to decide the tolerance (e.g., public consultation).

5 - Global risk tolerance is clearly and quantitatively defined for all significant threats and risks known in the literature. Any significant deviations in risk tolerance from industry norms are clearly justified and explained (e.g., through a comprehensive benefit/cost analysis).
Score: Very Weak (0.5/5)

Global Risk Tolerance

Highlights

  • In “Responsible Scaling Policy Evaluations Report – Claude 3 Opus”, Anthropic states: "Anthropic's Responsible Scaling Policy (RSP) aims to ensure we never train, store, or deploy models with catastrophically dangerous capabilities, except under a safety and security standard that brings risks to society below acceptable levels."

Weaknesses

  • While Anthropic acknowledges a tolerance relative to “catastrophically dangerous capabilities”, we encourage them to frame this statement in terms of risk (e.g., "catastrophic levels of risk") rather than capabilities, and to specify more precisely what constitutes "acceptable levels" of risk.
Operational Risk Tolerance
Scoring rubric:
0 - No information available.

1 - Some important capability thresholds are qualitatively defined and their corresponding mitigation objectives are qualitatively defined as well.

2 - Some important capability thresholds are precisely defined, and their corresponding mitigations are precisely defined as well.

3 - Almost all important hazardous capability thresholds and their corresponding mitigation objectives are precisely defined and grounded in extensive threat and risk modeling.

4 - All hazardous capabilities are precisely defined. The corresponding mitigation objectives are quantitatively defined and grounded in extensive threat and risk modeling. Assurance property targets are operationalized.
 
5 - All hazardous capabilities have a precisely defined threshold. Corresponding mitigation objectives are quantified and grounded in comprehensive threat and risk modeling with a clear and in-depth methodology. Assurance property targets are operationalized and justified.
Score: Moderate (2.5/5)

Operational Risk Tolerance

Best-in-class

  • Anthropic maintains an anonymous line to report misconduct in the application of their policy.

Highlights

  • Anthropic quantitatively characterizes the ASL-3 self-replication thresholds: "The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]".
  • They define "Yellow Line" indicators for each risk area, which, if crossed, require either new tests, higher security measures, or pausing training and deployment. For CBRN and cyber-related risks, Yellow Lines are quantitatively defined: a >25% increase in accuracy on CBRN risk questions compared to using Google alone, a >20% success rate on demanding cyber evaluations, and a ~25% jump on low-intensity misuse evaluations compared to previous models (see the sketch after this list).
  • Anthropic partially characterizes the level of risk-targeted post-mitigation through red-teaming operationalization: "Successfully pass red-teaming: World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse".
  • Anthropic partially operationalizes interpretability qualitatively in Core Views on AI Safety, and in Interpretability Dreams.
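To make the quantitative Yellow Lines above concrete, here is a hypothetical Python sketch of the checks they imply. The function name and inputs are our assumptions, not Anthropic's code, and we read the CBRN criterion as an absolute accuracy difference since the policy does not specify.

```python
# Hypothetical illustration of the ASL-3 Yellow Line checks quoted above
# (not Anthropic's code); all names are invented for illustration.
def yellow_line_crossed(cbrn_accuracy_with_model: float,
                        cbrn_accuracy_google_only: float,
                        cyber_eval_success_rate: float,
                        misuse_eval_jump_vs_previous: float) -> bool:
    """True if any Yellow Line is crossed, which would require new tests,
    higher security measures, or pausing training and deployment."""
    cbrn = cbrn_accuracy_with_model - cbrn_accuracy_google_only > 0.25
    cyber = cyber_eval_success_rate > 0.20
    misuse = misuse_eval_jump_vs_previous >= 0.25  # "~25% jump" read as a threshold
    return cbrn or cyber or misuse
```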

Weaknesses

  • Anthropic outlines a careful approach to setting its misuse capability thresholds: “Based on conversations with global experts, it is difficult to define strict pass/fail criteria for ASL-3 misuse evaluations with high confidence. Instead, we set the bar relatively low, such that passing the misuse evaluations would trigger discussion with relevant experts and extensive transcript reviewing to determine whether the model presents a true risk or the thresholds are too conservative.” However, they do not provide transparent details on the methodology used to determine that their chosen threshold is conservative.
  • Anthropic lacks threat modeling to justify the sufficiency of its information security goal to guarantee that misuse risks remain below the defined bar: "non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense."
Evaluation Protocols
Scoring rubric:
0 - No information available.

1 - Elements of the evaluation methodologies are described. The testing frequency is defined in terms of multiples of compute.

2 - The testing frequency is defined in terms of multiples of compute and there is a commitment to following it. The evaluation protocol is well-defined and includes relevant elicitation techniques. Independent third parties conduct pre-deployment evaluations with API access.

3 - The testing frequency is defined in terms of both multiples of compute and time, and there is a commitment to following it. The evaluation protocol is well-defined and incorporates state-of-the-art elicitation techniques. A justification is provided demonstrating that these techniques are comprehensive enough to elicit capabilities that could be found and exercised by external actors. AI developers implement and justify measures (such as appropriate safety buffers) to ensure protocols can effectively detect capability threshold crossings. Independent third parties conduct pre-deployment evaluations with fine-tuning access.

4 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes state-of-the-art elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold crossings.

5 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes relevant elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold crossings, and third parties are granted permission and resources to independently run their own evaluations to verify the accuracy of the results.
Score: Moderate (2.5/5)

Tolerance & Analysis Score = 1/4 × Global Risk Tolerance + 1/2 × Operational Risk Tolerance + 1/4 × Evaluation Protocols
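Applying these weights to the scores above: 1/4 × 0.5 + 1/2 × 2.5 + 1/4 × 2.5 = 0.125 + 1.25 + 0.625 = 2.0, which matches the category's 2/5 score.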
Evaluation Protocols

Best-in-class

  • Anthropic is the only developer to commit to running a suite of evaluations on their systems every three months, enabling them to avoid surprises from post-training enhancement improvements.
  • Anthropic is the only developer to implement safety buffers in their evaluation protocol. 

Highlights

  • Anthropic defines test frequency in both compute and time: "Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements."
  • In the Claude 3.5 Sonnet Model Card Addendum, Anthropic states: “we defined quantitative “thresholds of concern” that, if passed, would be a conservative indication of proximity to our ASL-3 threshold of concern. If the model had exceeded our preset thresholds during testing, we planned to convene a council consisting of our Responsible Scaling Officer, evaluations leads, and external subject matter experts to determine whether the model’s capabilities were close enough to a threshold of concern to warrant either more intensive evaluations or an increase in safety and security protections.”
  • Anthropic implements a safety buffer in their evaluation protocol to mitigate the risk of overshooting thresholds: "We have aimed to set the size of our safety buffer to 6x (larger than our 4x evaluation interval) so model training can continue safely while evaluations take place." (Both triggers and the buffer are sketched after this list.)
  • Anthropic has worked with the UK AISI to do third-party pre-deployment evaluations.
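A minimal sketch, under assumed names and inputs, of how the two evaluation triggers and the safety buffer quoted above fit together; this is illustrative, not Anthropic's tooling.

```python
# Illustrative sketch of the RSP evaluation triggers (names are assumptions).
from datetime import datetime, timedelta

COMPUTE_TRIGGER = 4.0              # evaluate after every 4x jump in effective compute
TIME_TRIGGER = timedelta(days=90)  # and every 3 months, catching fine-tuning/tooling gains
SAFETY_BUFFER = 6.0                # thresholds are set ~6x below the point of concern

def evaluation_due(effective_compute: float,
                   compute_at_last_eval: float,
                   last_eval_time: datetime,
                   now: datetime) -> bool:
    """True if either trigger has fired since the last evaluation."""
    compute_jump = effective_compute / compute_at_last_eval
    return compute_jump >= COMPUTE_TRIGGER or now - last_eval_time >= TIME_TRIGGER

# Because the buffer (6x) is larger than the evaluation interval (4x), training
# can continue while evaluations run without silently crossing a threshold.
assert SAFETY_BUFFER > COMPUTE_TRIGGER
```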

Weaknesses

  • Anthropic acknowledges significant limitations in its elicitation methodologies: “Our current prompting and scaffolding techniques are likely far from optimal, especially for our CBRN evaluations. As a result, we could be substantially underestimating the capabilities that external actors could elicit from our models.” However, we commend Anthropic's transparency about this limitation.

Risk Mitigation

In risk mitigation, we assess whether:

  • The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
  • There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.
Containment Measures
Scoring rubric:
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Score: Weak (2/5)

Containment Measures

Highlights

  • Anthropic offers great detail on ASL-3 information security measures.
  • They reference third-party assessment of mitigations: "Ongoing configuration management, compliance drills, integrated security approaches and mandatory external reviews should embed security within regular operations and harden processes during organizational changes."
  • Anthropic clearly commits to implementing some containment measures: "the containment measures we commit to implementing prior to training ASL-3 models primarily concern security: [...]".

Weaknesses

  • Anthropic does not provide explicit justification that the measures will suffice to meet the containment objectives.
  • In their Responsible Scaling Policy, Anthropic states that their containment measures are informed by external expert reports, including those from Sella Nevo of RAND. However, they do not explain their decision-making process for adopting or excluding specific recommendations from these reports. This lack of transparency undermines the claim of external validation for their containment measures.
Deployment Measures
Scoring rubric:
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Score: Weak (2/5)

Deployment Measures

Highlights

  • Anthropic has already implemented countermeasures for ASL-2, and commits to countermeasures for ASL-3 on red teaming evaluations, automated misuse detection, internal usage controls, tiered access, vulnerability/incident disclosure and rapid response to model vulnerabilities.
  • They reference third-party assessment of ASL-3 mitigations: "World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse."
  • Anthropic clearly commits to implementing some deployment measures: "deployment commitments include: [...]"

Weaknesses

  • Anthropic does not provide systematic evidence demonstrating that such deployment measures suffice to achieve the intended goal.
  • Some key evidence is lacking, such as monitoring data demonstrating the absence of API misuse, the effectiveness of their monitoring to catch malicious actors, or the effectiveness of their controls against harmful outputs.
Assurance Properties
Scoring rubric:
0 - No information available.

1 - Limited pursuit of some assurance properties, sparse evidence of how promising they are to reduce risks.

2 - Pursuit of some assurance properties along with research results indicating that they may be promising. Some of the key assumptions the assurance properties are operating under are stated.

3 - Pursuit of assurance properties, some evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. The assumptions the assurance properties are operating under are stated but some important ones are missing.

4 - Pursuit of assurance properties, solid evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. All the assumptions the assurance properties are operating under are stated.

5 - Broad consensus that one assurance property is likely to work, is being strongly pursued, and there is a strong case for it to be sufficient. All the assumptions the assurance properties are operating under are clearly stated and justified.
Score: Moderate (2.5/5)

Risk Mitigation Score = 1/3 × Containment Measures + 1/3 × Deployment Measures + 1/3 × Assurance Properties (all three grades carry equal weight).
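With equal weights, the scores above give (2 + 2 + 2.5) / 3 ≈ 2.17, consistent with the category's rounded 2.2/5 score.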
Assurance Properties

Best-in-class

  • The Anthropic interpretability team has been the largest driver of advances in interpretability research.
  • Anthropic's theoretical work on influence functions in LLMs at scale develops the ability to explain causal relationships between training data and model behaviors.

Weaknesses

  • While Anthropic covers a broad range of possibilities, they should state the mainline assumptions they operate under more clearly.
  • Anthropic rightly considers three possible scenarios of capability development; we would encourage them to outline how they allocate resources across these scenarios and which mainline plan they are operating under.

Sections

Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.

References

The main source of information is Anthropic's Responsible Scaling Policy. Unless otherwise specified, all information and references are derived from this document.