3.2.1 Monitoring of KRIs (40%) 64%
3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 75%
The framework acknowledges the need to match realistic attacker capabilities and lists some of the elicitation methods used (scaffolding, fine-tuning, expert prompting). However, it does not provide quantitative specifics, such as how much compute is used for fine-tuning. More detail could be added on which elicitation methods they anticipate different threat actors would use under realistic settings, in order to justify their choice of elicitation methods.
Quotes:
“Elicitation: Demonstrate that, when given enough resources to extrapolate to realistic attackers, researchers cannot elicit sufficiently useful results from the model on the relevant tasks. We should assume that jailbreaks and model weight theft are possibilities, and therefore perform testing on models without safety mechanisms (such as harmlessness training) that could obscure these capabilities.” (p. 6)
“We will also consider the possible performance increase from using resources that a realistic attacker would have access to, such as scaffolding, finetuning, and expert prompting. At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates.” (p. 6)
“By ‘widely accessible,’ we mean techniques that are available to a moderately resourced group (i.e., do not involve setting up large amounts of custom infrastructure or using confidential information).” (Footnote 6, p. 6)
3.2.1.2 Evaluation frequency (25%) 100%
The framework clearly specifies evaluation frequency in terms of effective computing power: a 4x increase in Effective Compute triggers a comprehensive assessment. This is a quantitative threshold that directly addresses the criterion with appropriate detail. The framework also specifies a fixed six-month interval for accounting for accumulated post-training enhancements. A sketch of this trigger logic follows the quotes below.
Quotes:
“The term ‘notably more capable’ is operationalized as at least one of the following: 1. The model is notably more performant on automated tests in risk-relevant domains (defined as 4x or more in Effective Compute).” (pp. 5-6)
“Adjusted evaluation cadence: We adjusted the comprehensive assessment cadence to 4x Effective Compute or six months of accumulated post-training enhancements (this was previously three months).” (p. 17)
“Six months’ worth of finetuning and other capability elicitation methods have accumulated. This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely.” (p. 6)
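To make the cadence concrete, the following minimal sketch illustrates how the two quoted triggers might be combined into a single check. The function name, constants, and example values are our own illustrative assumptions, not anything specified by the framework.

```python
from datetime import date, timedelta

# Illustrative constants drawn from the quoted cadence; names are hypothetical.
EFFECTIVE_COMPUTE_TRIGGER = 4.0           # "4x or more in Effective Compute"
ELICITATION_WINDOW = timedelta(days=183)  # roughly six months of accumulated enhancements

def comprehensive_assessment_due(
    effective_compute_now: float,
    effective_compute_at_last_assessment: float,
    last_assessment_date: date,
    today: date,
) -> bool:
    """Return True if either cadence condition described in the quotes is met."""
    compute_ratio = effective_compute_now / effective_compute_at_last_assessment
    elapsed = today - last_assessment_date
    return compute_ratio >= EFFECTIVE_COMPUTE_TRIGGER or elapsed >= ELICITATION_WINDOW

# Example: a 3x Effective Compute increase after seven months still triggers
# reassessment via the calendar-time condition.
print(comprehensive_assessment_due(3.0, 1.0, date(2024, 1, 1), date(2024, 8, 1)))  # True
```

Under the framework, crossing either branch would prompt the comprehensive assessment described above; the calendar-time branch exists because, as the quotes note, there is not yet a metric to estimate the impact of accumulated post-training enhancements more precisely.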
3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 50%
The policy acknowledges the importance of accounting for “widely accessible” post-training enhancements in capability assessments. The buffer accounts both for the possibility that Anthropic’s own deployments are enhanced (for example via its finetuning products) and for the possibility that the model weights are stolen and further modified: “We include headroom to account for the possibility that the model is either modified via one of our own finetuning products or stolen in the months following testing, and used to create a model that has reached a Capability Threshold.” Both considerations show nuance.
However, since the assessment is restricted to “widely accessible post-training enhancements”, the framework should, to improve, justify why it focuses only on these enhancements.
It is commendable that they note that “exploring ways to integrate [post-training enhancements] into an overall metric is an ongoing area of research”, though an improvement would be to commit to contributing to this research.
Further, more detail could be added on how post-training enhancements are accounted for across different model architectures: their risk profiles are much more scalable with reasoning models, since inference compute can often be scaled to improve capabilities.
Quotes:
“For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements” (p. 6) and “By “widely accessible,” we mean techniques that are available to a moderately resourced group (i.e., do not involve setting up large amounts of custom infrastructure or using confidential information). We include headroom to account for the possibility that the model is either modified via one of our own finetuning products or stolen in the months following testing, and used to create a model that has reached a Capability Threshold. That said, estimating these future effects is very difficult given the state of research today” (Footnote 6, p. 6)
A model is notably capable if: “Six months’ worth of finetuning and other capability elicitation methods have accumulated. This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely” (p. 6) and “Exploring ways to integrate these types of improvements into an overall metric is an ongoing area of research.” (Footnote 5, p. 6)
3.2.1.4 Vetting of protocols by third parties (15%) 10%
The policy mentions soliciting external expert input in developing and conducting capability assessments, which partially addresses protocol vetting. However, this is general input rather than a strong commitment specific to the vetting of evaluation protocols. Further, the level of expertise required, and why the chosen experts satisfy this criterion, should be detailed.
Quotes:
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments.” (p. 13)
3.2.1.5 Replication of evaluations by third parties (15%) 50%
The framework mentions at multiple stages of the risk assessment process that they will share materials related to the evaluations and seek input from experts, but not that these experts will reproduce or audit the results directly. To improve, a process for having evaluations externally assessed or audited should be detailed.
Quotes:
“To advance the public dialogue on the regulation of frontier AI model risks and to enable examination of our actions, we will also publicly release key materials related to the evaluation and deployment of our models with sensitive information removed and solicit input from external experts in relevant domains.” (p. 13)
“We will solicit both internal and external expert feedback on the [Capability] report” (p. 7)
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments.” (p. 13)
3.2.2 Monitoring of KCIs (40%) 43%
3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 50%
The framework provides a high-level description of monitoring procedures for deployment measures, with examples such as “jailbreak bounties, doing historical analysis or background monitoring, and any necessary retention of logs for these activities.” To improve, they should define what a “reasonable cadence” for this monitoring means in practice.
They also mention that they will develop plans to audit the implementation of containment measures, but there is no commitment to audit assurance processes. They describe red-teaming of the model with deployment measures in place to “[demonstrate] that threat actors with realistic access levels and resources are highly unlikely to be able to consistently elicit information from any generally accessible systems that greatly increases their ability to cause catastrophic harm relative to other available tools”.
However, they note that “This criterion does not attempt to specify the exact red-teaming protocol (e.g., number of hours, level of access, or pass-fail criteria). Setting a principled pass-fail threshold will depend on other factors, such as the quality of our monitoring and ability to respond to jailbreaks rapidly.” An improvement would be to specify the protocol as fully as possible, to ensure transparency, and to conduct this red-teaming continuously or at a regular cadence.
It is commendable that they note the importance of “prespecify[ing] empirical evidence that would show the system is operating within the accepted risk range” for monitoring.
However, to improve, the framework should describe systematic, ongoing monitoring so that mitigation effectiveness is tracked continuously and the KCI threshold continues to be met whenever required.
Quotes:
“Monitoring: Prespecify empirical evidence that would show the system is operating within the accepted risk range and define a process for reviewing the system’s performance on a reasonable cadence. Process examples include monitoring responses to jailbreak bounties, doing historical analysis or background monitoring, and any necessary retention of logs for these activities.” (p. 8)
“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence” (p. 10)
3.2.2.2 Vetting of protocols by third parties (30%) 25%
The policy mentions soliciting external expert input in developing and conducting safeguard assessments, which partially addresses protocol vetting. However, this is general input rather than a strong commitment specific to the vetting of evaluation protocols.
Quotes:
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments.” (p. 13)
“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results” (p. 10)
3.2.2.3 Replication of evaluations by third parties (30%) 50%
The framework mentions that they will share materials related to KCI evaluations (i.e. safeguard assessments) and seek input from experts, but not that experts will reproduce the results directly.
Quotes:
“To advance the public dialogue on the regulation of frontier AI model risks and to enable examination of our actions, we will also publicly release key materials related to the evaluation and deployment of our models with sensitive information removed and solicit input from external experts in relevant domains.” (p. 13)
“Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments.” (p. 13)
“Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results” (p. 10)
3.2.3 Transparency of evaluation results (10%) 77%
3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 90%
The policy demonstrates strong commitment to sharing evaluation results with multiple stakeholders: the public (summaries), government entities, internal staff, the Board of Directors, and the Long-Term Benefit Trust. Multiple channels and levels of disclosure are specified. There is a commitment to notifying a relevant authority if “a model requires stronger protections than the ASL-2 Standard.” They commit to publicly releasing “key information related to the evaluation and deployment of our models”; to improve, they should commit to publishing all KRI and KCI assessments (with sensitive information redacted).
Quotes:
“Public disclosures: We will publicly release key information related to the evaluation and deployment of our models (not including sensitive details). These include summaries of related Capability and Safeguards reports when we deploy a model” (p. 13)
“U.S. Government notice: We will notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard.” (p. 13)
“We will share summaries of Capability Reports and Safeguards Reports with Anthropic’s regular-clearance staff, redacting any highly-sensitive information.” (p. 12)
“[If] the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.” (p. 7)
3.2.3.2 Commitment to non-interference with findings (15%) 0%
The framework makes no commitment to permit reports detailing the results of external evaluations (i.e., any KRI or KCI assessments conducted by third parties) to be written independently, without interference or suppression.
Quotes:
No relevant quotes found.
3.2.4 Monitoring for novel risks (10%) 5%
3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 0%
Despite noting that “for each capability threshold, [we will] make a compelling case that we have mapped out the most likely and consequential threat models: combinations of actors (if relevant), attack pathways, model capability bottlenecks, and types of harms. We also make a compelling case that there does not exist a threat model that we are not evaluating that represents a substantial amount of risk”, there does not appear to be a process for identifying novel risks post-deployment which could signal alternative threat models. Hence, their risk modelling appears to be informed mostly a priori, rather than by empirical data from the model in deployment. To improve, they could establish a process for actively searching for novel risks or changed risk profiles of deployed models.
They do mention “periodic, broadly scoped, and independent testing with expert red-teamers” for auditing their security programs. However, this testing is not aimed at surfacing novel risk profiles, so credit is not given.
Quotes:
No relevant quotes found.
3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 10%
There is some indication that, if novel threat models they had not considered were identified, an effort would be made to incorporate them into the risk assessment. One indication of this is the general intention to incorporate findings from evaluations, which may uncover new risk profiles of models: “Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available.”
They also mention that “as our understanding evolves, we may identify additional [capability] thresholds.” However, this statement does not explicitly commit to incorporating novel risks into their risk identification and prioritization process. They note that they will maintain a “list of capabilities that we think require significant investigation” that “could pose serious risks, but the exact Capability Threshold and the Required Safeguards are not clear at present.” This gives some indication that additional risks may be incorporated into their risk assessment. To improve, they could commit to engaging in risk modelling that maps out the potential harms from changed risk profiles, in order to keep pace with the evolving risk landscape.
Quotes:
“Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available.” (p. 6)
“These Capability Thresholds represent our current understanding of the most pressing catastrophic risks. As our understanding evolves, we may identify additional thresholds. For each threshold, we will identify and describe the corresponding Required Safeguards as soon as feasible, and at minimum before training or deploying any model that reaches that threshold.”
“We will also maintain a list of capabilities that we think require significant investigation and may require stronger safeguards than ASL-2 provides. This group of capabilities could pose serious risks, but the exact Capability Threshold and the Required Safeguards are not clear at present. These capabilities may warrant a higher standard of safeguards, such as the ASL-3 Security or Deployment Standard. However, it is also possible that by the time these capabilities are reached, there will be evidence that such a standard is not necessary (for example, because of the potential use of similar capabilities for defensive purposes). Instead of prespecifying particular thresholds and safeguards today, we will conduct ongoing assessments of the risks with the goal of determining in a future iteration of this policy what the Capability Thresholds and Required Safeguards would be.” (p. 5)