Overall Grade: 1.9/5

Risk Identification: Weak (2.5/5)
Risk Tolerance & Analysis: Weak (1.75/5)
Risk Mitigation: Weak (1.5/5)

Rating scale: 0 - 1 Very Weak | 1 - 2 Weak | 2 - 3 Moderate | 3 - 4 Substantial | 4 - 5 Strong
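The overall grade appears to be the unweighted mean of the three category scores; the following is a minimal sketch of that arithmetic, under our assumption that no weighting is applied (the aggregation method is not stated).

```python
# Assumption (ours): the overall grade is the unweighted mean of the three
# category scores, rounded to one decimal place.
scores = {
    "Risk Identification": 2.5,
    "Risk Tolerance & Analysis": 1.75,
    "Risk Mitigation": 1.5,
}
overall = sum(scores.values()) / len(scores)
print(round(overall, 1))  # 1.9
```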
Risk Identification
In risk identification, we assess whether an AI developer is:
- Addressing the risks outlined in the literature in an appropriate way.
- Doing extensive open-ended red teaming to identify new risks.
- Leveraging a diverse range of risk identification techniques, including threat modeling when appropriate, to adequately identify new threats.
1 - Some unclear links are made and some risks are mentioned. Some open-ended red-teaming efforts are reported, along with some very basic threat modeling.
2 - Links are made for a number of risks, but important ones are missing. Significant open-ended red-teaming efforts are reported, along with significant threat modeling efforts.
3 - Links are made for most of the important and commonly discussed risks. Consequential red teaming is precisely reported, along with significant threat modeling and use of structured risk identification techniques.
4 - Links are clearly made for most risks. Almost all risks are mentioned. There is a methodology outlining how structured risk identification is done across the lifecycle; precisely characterized red teaming (including by external parties) is conducted, along with advanced and broad threat modeling.
5 - There is a comprehensive, continuous, and detailed effort to ensure all risks are found and covered. The red teaming effort is extremely extensive, quantified, integrated with structured risk identification efforts, and done with third parties.
2.5/5
Best-in-class
- Anthropic was the first to share novel open-ended red teaming practices to discover new risks.
Highlights
- Anthropic covers a number of different risk types well and conducts threat and risk modeling for some of them, including biorisk, biases, and deception (through their work on sleeper agents).
- In “Responsible Scaling Policy Evaluations Report – Claude 3 Opus”, Anthropic describes their risk management procedure for Cybersecurity, CBRN information, and Model Autonomy.
- In “The Claude 3 Model Family: Opus, Sonnet, Haiku”, Anthropic describes open-ended red teaming: “The team engaged the model in multi-turn conversations about sensitive or harmful topics to analyze responses, identify areas for improvement, and establish a baseline for evaluating models over time. Examples of tested topics include, but are not limited to: child safety, dangerous weapons and technology, hate speech, violent extremism, fraud, and illegal substances.”
Weaknesses
- The open-ended red teaming procedures used to identify risks, described in the Responsible Scaling Policy, lack crucial details, in particular regarding their integration with risk identification procedures. It is unclear how much threat and risk modeling is conducted for novel vulnerabilities identified during red teaming, as well as for certain key risks (e.g., cyber offense).
- Anthropic does not address all high-severity risks. For example, their Responsible Scaling Policy does not address LLM persuasion capabilities.
Risk Tolerance & Analysis
In risk tolerance and analysis, we assess whether the AI developers have defined:
- A global risk tolerance.
- Operational capability thresholds and their equivalent risk, defined with precision and breadth.
- Corresponding objectives of risk mitigation measures: AI developers should establish clear objectives for risk mitigation measures. These objectives should be grounded in strong rationales, including threat modeling, to justify that they are sufficient to address the identified risks and align with the organization's risk tolerance.
- Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.
1 - Global risk tolerance is qualitatively defined.
E.g., “Our system should not increase the likelihood of extinction risks”.
2 - Global risk tolerance is quantitatively defined for casualties.
3 - Global risk tolerance is quantitatively defined for casualties and economic damages, with adequate ranges and rationale for the decision.
4 - Global risk tolerance is quantitatively defined for casualties, economic damages, and other high-severity risks (e.g., large-scale manipulation of public opinion), with robust methodology and decision-making processes to decide the tolerance (e.g., public consultation).
5 - Global risk tolerance is clearly and quantitatively defined for all significant threats and risks known in the literature. Any significant deviations in risk tolerance from industry norms are clearly justified and explained (e.g., through a comprehensive benefit/cost analysis).
0.5/5
Highlights
- In “Responsible Scaling Policy Evaluations Report – Claude 3 Opus”, Anthropic states: "Anthropic's Responsible Scaling Policy (RSP) aims to ensure we never train, store, or deploy models with catastrophically dangerous capabilities, except under a safety and security standard that brings risks to society below acceptable levels."
Weaknesses
- While Anthropic acknowledges a tolerance relative to “catastrophically dangerous capabilities”, we encourage them to focus their statement on risk (e.g., "catastrophic levels of risk") rather than capabilities, and to specify more precisely what constitutes "acceptable levels" of risk.
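For illustration only, and not a formulation Anthropic uses: a quantitative global risk tolerance of the kind the rubric asks for could bound expected annual harm across threat scenarios, for example:

```latex
% Illustrative sketch (ours, not Anthropic's): a quantitative global risk
% tolerance bounding expected annual harm. p_i is the annual probability of
% threat scenario i attributable to the system, s_i its severity (e.g.,
% casualties), and tau the tolerated bound.
\[
\mathrm{Risk} \;=\; \sum_{i} p_i \, s_i \;\leq\; \tau,
\qquad \text{e.g., } \tau = 10^{-4} \ \text{expected casualties per year.}
\]
```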
1 - Some important capability thresholds are qualitatively defined and their corresponding mitigation objectives are qualitatively defined as well.
2 - Some important capability thresholds are precisely defined, and their corresponding mitigations are precisely defined as well.
3 - Almost all important hazardous capability thresholds and their corresponding mitigation objectives are precisely defined and grounded in extensive threat and risk modeling.
4 - All hazardous capabilities are precisely defined. The corresponding mitigation objectives are quantitatively defined and grounded in extensive threat and risk modeling. Assurance property targets are operationalized.
5 - All hazardous capabilities have a precisely defined threshold. Corresponding mitigation objectives are quantified and grounded in comprehensive threat and risk modeling with a clear and in-depth methodology. Assurance property targets are operationalized and justified.
2/5
Best-in-class
- Anthropic maintains an anonymous line for reporting misconduct in the application of their policy.
Highlights
- Anthropic defines thresholds of concern and capability thresholds. Thresholds of concern are quantitatively defined and well-operationalized, serving as initial indicators. If a threshold of concern is passed, the model undergoes further testing to assess whether it meets the corresponding capability threshold, which is less precisely operationalized.
- For example, in “Responsible Scaling Policy Evaluations Report – Claude 3 Opus”, they define "Yellow Line" indicators for each risk area, which correspond to thresholds of concern. For CBRN and cyber-related risks, the Yellow Lines are quantitatively defined: a >25% increase in accuracy on CBRN risk questions compared to using Google alone, a >20% success rate on demanding cyber evaluations, and a ~25% jump on low-intensity misuse evaluations compared to previous models (a minimal sketch of these checks follows this list).
- In the Responsible Scaling Policy (October 2024 version), Anthropic qualitatively defines the CBRN misuse capability threshold: “The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons.” They define the Autonomous AI Research and Development capability threshold somewhat more quantitatively: the model would “cause dramatic acceleration in the rate of effective scaling”, operationalized as a roughly 1000x scale-up in one year.
- Anthropic partially characterizes the targeted post-mitigation level of risk through a red-teaming operationalization: "Conduct red-teaming that demonstrates that threat actors with realistic access levels and resources are highly unlikely to be able to consistently elicit information from any generally accessible systems that greatly increases their ability to cause catastrophic harm relative to other available tools”.
- Anthropic partially and qualitatively operationalizes interpretability in “Core Views on AI Safety” and in “Interpretability Dreams”.
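To make the quantitative Yellow Lines above concrete, the following is a minimal sketch of the threshold-of-concern check they imply. The function name and signature are ours, and we read the percentage thresholds as absolute gains; Anthropic's actual evaluation harness is not public.

```python
# Illustrative sketch (our naming and interpretation, not Anthropic's code):
# checking the quantitative "Yellow Line" thresholds of concern. Percentage
# thresholds are read as absolute gains, which may not match Anthropic's
# exact definition.

def crossed_yellow_line(cbrn_acc: float, cbrn_acc_google_only: float,
                        cyber_success_rate: float,
                        misuse_score: float, misuse_score_prev: float) -> bool:
    """Return True if any threshold of concern is crossed, which would
    trigger further testing against the capability threshold itself."""
    cbrn_uplift = cbrn_acc - cbrn_acc_google_only   # gain vs. Google alone
    misuse_jump = misuse_score - misuse_score_prev  # gain vs. previous model
    return (
        cbrn_uplift > 0.25            # >25% accuracy increase on CBRN questions
        or cyber_success_rate > 0.20  # >20% success on demanding cyber evals
        or misuse_jump >= 0.25        # ~25% jump on low-intensity misuse evals
    )

# Example: modest uplift everywhere stays below the Yellow Lines.
print(crossed_yellow_line(0.55, 0.40, 0.10, 0.50, 0.35))  # False
```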
Weaknesses
- Anthropic's approach of not quantitatively defining capability thresholds may lead to inconsistencies in risk assessment. While thresholds of concern are well-defined, the subsequent demonstration that a model's capabilities remain under the capability thresholds lacks precise criteria. The absence of exact goals for these demonstrations introduces a risk of "moving the goalposts." As incentives to continue scaling increase, there may be a temptation to adjust the interpretation of capability thresholds, potentially compromising the integrity of the risk assessment process.
- Even though they say they will remediate this in the future, Anthropic lacks the threat modeling needed to justify that its information security goal is sufficient to keep misuse risks below the defined bar. The ASL-3 security standards are not sufficient to protect against the following actors: "state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks."
1 - Elements of the evaluation methodologies are described. The testing frequency is defined in terms of multiples of compute.
2 - The testing frequency is defined in terms of multiples of compute and there is a commitment to following it. The evaluation protocol is well-defined and includes relevant elicitation techniques. Independent third parties conduct pre-deployment evaluations with API access.
3 - The testing frequency is defined in terms of both multiples of compute and time and there is a commitment to following it. The evaluation protocol is well-defined and incorporates state-of-the-art elicitation techniques. A justification is provided demonstrating that these techniques are comprehensive enough to elicit capabilities that could be found and exercised by external actors. AI developers implement and justify measures (such as appropriate safety buffers), to ensure protocols can effectively detect capability threshold crossings. Independent third parties conduct pre-deployment evaluations with fine-tuning access.
4 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why the chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes state-of-the-art elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing.
5 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes relevant elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing and third parties are granted permission and resources to independently run their own evaluations, to verify the accuracy of the evaluation results.
2.5/5
Best-in-class
- Anthropic is the only developer to commit to running a suite of evaluations on their systems every six months, limiting surprises from improvements due to post-training enhancements.
Highlights
- Anthropic defines test frequency in terms of both compute and time: they will conduct evaluations at every 4x increase in compute and every six months, to account for post-training enhancements (a minimal scheduling sketch follows this list).
- Anthropic implements a form of safety buffer by establishing "thresholds of concern" that are designed to be more conservative than their capability thresholds.
- Anthropic states that their comprehensive assessment must demonstrate that their elicitation techniques are sufficient to extrapolate the capabilities of realistic potential attackers. They also mention that they will use forecasting to assess whether post-training enhancements might push the model past the thresholds between scheduled tests.
- Anthropic has worked with the UK AISI to do third-party pre-deployment evaluations.
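A minimal sketch of the cadence this implies, assuming the two triggers combine with a logical OR (the function and its inputs are our construction, not Anthropic's tooling):

```python
# Illustrative sketch (ours): an evaluation round is due after every 4x
# increase in compute OR every ~6 months, whichever comes first.
from datetime import datetime, timedelta

def evaluation_due(compute_now: float, compute_at_last_eval: float,
                   last_eval: datetime, now: datetime) -> bool:
    compute_trigger = compute_now >= 4 * compute_at_last_eval
    time_trigger = now - last_eval >= timedelta(days=182)  # ~6 months
    return compute_trigger or time_trigger

# Example: only 3x more compute, but 7 months have elapsed -> a round is due.
print(evaluation_due(3.0, 1.0, datetime(2024, 1, 1), datetime(2024, 8, 1)))  # True
```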
Weaknesses
- Anthropic acknowledges significant limitations in its elicitation methodologies: “Our current prompting and scaffolding techniques are likely far from optimal, especially for our CBRN evaluations. As a result, we could be substantially underestimating the capabilities that external actors could elicit from our models.” However, we commend Anthropic's transparency about this limitation.
Risk Mitigation
In risk mitigation, we assess whether:
- The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
- There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.
1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.
3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown to be highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.
4 - Third parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.
5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
1/5
Highlights
- Anthropic presents some high-level measures such as perimeters and access controls, lifecycle security, and monitoring. They also say that they will invest significant resources in security: “We expect meeting this standard of security to require roughly 5-10% of employees being dedicated to security and security-adjacent work.”
- Anthropic will do threat modeling to justify that the measures they implement are sufficient to meet the containment objective.
Weaknesses
- Anthropic lacks specificity in defining the containment measures they intend to implement. Notably, the current version of their Responsible Scaling Policy provides less detailed information about these measures compared to the previous iteration, representing a step back in transparency.
- In their Responsible Scaling Policy, Anthropic states that their containment measures are informed by external expert reports, including those from Sella Nevo at RAND. However, they do not explain their decision-making process for adopting or excluding specific recommendations from these reports. This lack of transparency undermines the claim of external validation for their containment measures.
1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.
3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown to be highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.
4 - Third parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.
5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
1/5
Best-in-class
- An Anthropic model tops the LLM Safety leaderboard.
Highlights
- Anthropic will do threat modeling to justify that the measures they implement are sufficient to meet the deployment objective.
- Anthropic provides a high-level qualitative description of their deployment measures: “Defense in depth: Use a “defense in depth” approach by building a series of defensive layers, each designed to catch misuse attempts that might pass through previous barriers. As an example, this might entail achieving a high overall recall rate using harm refusal techniques. This is an area of active research, and new technologies may be added when ready.” (A minimal sketch of the layered-recall arithmetic follows this list.)
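As a rough illustration of why layering can achieve a high overall recall (our sketch; it assumes layers fail independently, which real defenses may not):

```python
# Illustrative sketch (ours): if defensive layers fail independently, the
# stack's overall recall is 1 minus the product of each layer's miss rate.
# Real layers are often correlated, so this is an optimistic estimate.
def combined_recall(layer_recalls: list[float]) -> float:
    miss = 1.0
    for r in layer_recalls:
        miss *= (1.0 - r)  # an attempt must slip past every layer
    return 1.0 - miss

# Three layers catching 80%, 70%, and 60% of attempts individually:
print(combined_recall([0.8, 0.7, 0.6]))  # ~0.976
```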
Weaknesses
- The description of deployment measures provided by Anthropic lacks specificity. Moreover, they don’t give any preliminary evidence of their effectiveness.
1 - Limited pursuit of some assurance properties, sparse evidence of how promising they are to reduce risks.
2 - Pursuit of some assurance properties along with research results indicating that they may be promising. Some of the key assumptions the assurance properties are operating under are stated.
3 - Pursuit of assurance properties, some evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. The assumptions the assurance properties are operating under are stated but some important ones are missing.
4 - Pursuit of assurance properties, solid evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. All the assumptions the assurance properties are operating under are stated.
5 - Broad consensus that one assurance property is likely to work, is being strongly pursued, and there is a strong case for it to be sufficient. All the assumptions the assurance properties are operating under are clearly stated and justified.
2.5/5
Best-in-class
- The Anthropic interpretability team has been the largest driver of advances in interpretability research.
- Anthropic's theoretical work on influence functions in LLMs at scale develops an ability to explain the causal relationship between training data and model behaviors.
Highlights
- Anthropic provides a moderately detailed case in defense of interpretability as a research direction:
- Some evidence at small scale that they may have found a solution to the superposition problem, one of the major obstacles to LLM interpretability.
- Evidence that their dictionary learning approach scales, increasing the monosemanticity of features extracted from LLMs.
- Evidence of attention heads’ interpretability.
- An argument for how interpretability may become an assurance property.
- A discussion of what hypothetical scenarios Anthropic is operating under.
- Anthropic provides an early case and substantive evidence to support the relevance of influence functions.
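For context, the quantity this line of work scales is the standard influence function from the literature (Koh & Liang, 2017); the notation below is the textbook definition, not a reproduction of Anthropic's derivation:

```latex
% Standard influence-function definition (Koh & Liang, 2017). Anthropic's
% contribution is scaling approximations of this quantity (notably of the
% inverse Hessian) to large language models.
\[
\mathcal{I}(z_{\text{train}}, z_{\text{test}})
  = -\,\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta} L(z_{\text{train}}, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}
     \nabla_{\theta}^{2} L(z_i, \hat{\theta})
\]
```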
Weaknesses
- While Anthropic covers a broad range of possibilities, they should state the mainline assumptions they operate under more clearly.
- Anthropic rightly considers three possible scenarios of capabilities development; we would encourage them to outline how they are allocating resources across these scenarios and which mainline plan they are operating under.
Sections
Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.
References
The main source of information is Anthropic's Responsible Scaling Policy. Unless otherwise specified, all information and references are derived from that document.