2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 50%
The RSP defines capability/usage thresholds for four risk domains: misaligned AI systems in high-stakes settings (previously “high-stakes sabotage opportunities”), automated AI R&D in key domains, and novel and non-novel chemical/biological weapons production. The risk report provides qualitatively described risk models for each of those thresholds. For the latter three, the risk report includes information on the specific evaluations used to inform the risk evaluation; for sabotage, it provides a long, structured argument, partitioned into four key claims and drawing on past alignment research as well as evaluations of specific models such as Claude Opus 4.6. While a lot of effort has evidently gone into evaluating those risks, the documents do not provide clear evaluation-level red lines, so the determination of whether risks are acceptable seems mostly discretionary.
Further, the FCF defines risk tiers for harmful manipulation and cyber offense. However, as those tiers are not assigned (un)acceptability statements, relevant evaluations, or required mitigations, it is unclear how they are used in practice to determine risk levels.
To improve, Anthropic should establish clear links between the evaluations it conducts and its KRIs to allow external scrutiny of risk determination, and provide context on how the FCF risk tiers for cyber offense and harmful manipulation inform risk assessment in practice.
Quotes:
“The left column identifies capability thresholds that would call for heightened mitigations.” (RSP, p.4)
“Non-novel chemical/biological weapons production. AI systems with the ability to significantly help individuals or groups with basic technical backgrounds (e.g., undergraduate STEM degrees) create/obtain and deploy chemical and/or biological weapons with serious potential for catastrophic damages.” (RSP, p.6)
“Novel chemical/biological weapons production. AI systems that can functionally substitute for the scarce human expertise that is currently the primary barrier to novel development of chemical and biological weapons with potential for catastrophic harm. That is, a well-resourced team could, using the model, accomplish the end-to-end agent design and deployment (including, as relevant, agent design, verification and validation, formulation, and dissemination) that would otherwise require recruiting one of a small number of world-leading specialists.” (RSP, p.7)
“Misaligned AI systems in high-stakes settings. AI systems that are highly relied on and have extensive access to sensitive assets as well as moderate capacity for autonomous, goal-directed operation and subterfuge—such that it is plausible these AI systems could (if directed toward this goal, either deliberately or inadvertently) carry out sabotage leading to irreversibly and substantially higher odds of a later global catastrophe.” (RSP, pp.7–8)
“Automated R&D in key domains. AI systems that can fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power—for example, energy, robotics, weapons development and AI itself. […] We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D.” (RSP, pp.8–9)
“Tier 1: Meaningful technical assistance for active cyber operations using known attack techniques and methodologies. Some automation is involved, but still requires human input to complete successful large cyber-operations.
Tier 2: Completely autonomous cyber operations with novel offensive capability development and adaptive persistence. For example, autonomous discovery/exploitation of previously unknown vulnerability classes, self-directed campaign orchestration adapting to defenses, or sustained operations evolving without human intervention.” (FCF, p.6)
“Given its nascency as a systemic risk area within our risk management framework, our approach to Harmful Manipulation is exploratory and we expect it to evolve as we continue to explore and conduct further research. […]
Tier 1: Campaign Infrastructure Enablement: Escalation to systematic infrastructure enabling sophisticated scaled influence campaigns. Automates >50% of steps normally requiring multiple sophisticated adversarial actors […]
Tier 2: Autonomous adaptive techniques and campaign execution: End-to-end automation of deceptive influence operations with systematic targeting capabilities, requiring <10% human oversight.” (FCF, p.8)
“Tier 1: Misaligned AI systems in high-stakes settings. AI systems that write large amounts of critical code and/or are otherwise in a position where they are highly relied on and have extensive access to sensitive assets, as well as moderate capacity for autonomous, goal-directed operation and subterfuge […]
Tier 2: Automated R&D in key domains. AI systems that can fully automate, or otherwise dramatically accelerate, the work of large, top-tier teams of human researchers in domains where fast progress could cause threats to international security and/or rapid disruptions to the global balance of power — particularly in energy, robotics, weapons development and AI itself. For the time being, we use AI R&D capabilities as a proxy for broader R&D capabilities, as this domain likely plays to AI systems’ current strengths and is more tractable to assess than capabilities in other domains. Additionally, AI R&D alone could cause acceleration in AI capabilities improvements, to the point where all of the threats listed above (and more) develop very quickly. We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D. We would consider scenario (2) to have occurred where (a) we observe or expect double the rate of progress in AI aggregate capabilities compared to the rate we’d expect in the absence of significant AI contributions to AI R&D and (b) it is plausible that this doubling is substantially attributable to the automation of research and/or engineering (as opposed to other factors, such as increased headcount, compute, or general productivity), such that continuation of the trend in AI progress could lead to even greater acceleration. Our working operationalization is to trigger this risk threshold at the point where we determine that a model could compress two years of 2018–2024 AI progress into a single year. It may be sensible to add earlier, and/or easier-to-measure, thresholds that trigger less demanding versions of the mitigations for this threshold.” (FCF, pp.9–10)
“We believe, in light of the above analysis, that: Some of our models do meet the left column’s threshold: they ‘write large amounts of critical code and/or are otherwise in a position where they are highly relied on and have extensive access to sensitive assets, as well as moderate capacity for autonomous, goal-directed operation.'” (Risk Report, p.56)
“Our pre-deployment alignment assessment reports the following, reproduced from the Claude Opus 4.6 System Card” (Risk Report, p.19)
2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 25%
The RSP includes two high-level descriptions of quantitative KRIs for automated AI R&D: one defined as a doubling in the rate of progress of AI capabilities, and the other as an upper bound on the cost of replacing Anthropic’s entire set of Research Scientists and Research Engineers with AI. It is, however, unclear how exactly these would be measured in practice. Another quantitative KRI, for kernel optimization, is included under the sabotage section but actually comes from the autonomous AI R&D section of Claude Opus 4.6’s system card; we will thus treat it as an AI R&D rather than a sabotage KRI.
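To make the “doubling” KRI concrete, the RSP's own worked example (a 3x compute scaleup and a 3x algorithmic efficiency improvement per baseline year) can be reproduced as a short sketch. The function name is hypothetical, and the sketch assumes that the two factors multiply into one “effective scaleup” and that two baseline years compound multiplicatively, which is how the RSP example appears to arrive at “something like” 81x. It illustrates the arithmetic only and is not Anthropic's measurement method.

```python
def doubled_progress_scaleup(compute_scaleup, algorithmic_scaleup):
    """Effective scaleup if two baseline years of progress occur in one year.

    Assumption: compute and algorithmic gains multiply into a single
    effective scaleup, and successive years compound multiplicatively.
    """
    baseline_effective = compute_scaleup * algorithmic_scaleup  # one baseline year
    return baseline_effective ** 2  # two baseline years compressed into one


baseline = 3 * 3                          # 9x effective scaleup per baseline year
doubled = doubled_progress_scaleup(3, 3)  # 81x under "double the rate of progress"
print(baseline, doubled)
```

Under these assumptions the threshold sits at the square of the baseline scaleup rather than at twice it, which matches the RSP's note that doubling the rate of progress is not the same as doubling researchers' productivity.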
Quantitative ASL-3 thresholds appear for two evaluations in the non-novel CB section (2.8× uplift and 0.8 pass@5), but are absent from the majority of evaluations reported in the same table and from all other threat model sections. Their presence appears to reflect carry-over from the prior ASL-based framework rather than systematic threshold-setting under RSP 3.3, which has moved away from prescriptive ASL level specifications.
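For illustration, the two thresholds can be compared with the corresponding results quoted from the Risk Report in this section. This is a minimal sketch: the `ThresholdCheck` class is hypothetical, and the rule that a result at or above the threshold counts as crossing it is an assumption, since the documents do not state the comparison rule. Note that the uplift figure is the Claude Opus 4 result, as the trial was not run for Claude Opus 4.6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdCheck:
    """One reported evaluation result and its stated ASL-3 threshold."""
    name: str
    reported: float
    threshold: float

    @property
    def crossed(self) -> bool:
        # Assumption: "at or above the threshold" counts as crossing it.
        return self.reported >= self.threshold


# Values as quoted from the Risk Report (pp. 66-67) in this section.
checks = [
    ThresholdCheck("Bioweapons acquisition uplift (Claude Opus 4)", 2.53, 2.8),
    ThresholdCheck("Long-form virology task 2, pass@5", 0.912, 0.8),
]

for check in checks:
    print(f"{check.name}: {check.reported} vs {check.threshold} "
          f"-> crossed={check.crossed}")
```

Under that assumed rule, the uplift result stays below its threshold while the virology result exceeds it. The documents do not explain how such mixed outcomes feed into the overall risk determination.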
The FCF provides two high-level KRIs for harmful manipulation, one in terms of the percentage of tasks the AI can complete autonomously, and the other in terms of the percentage of tasks requiring human oversight; it is not clear how Anthropic intends to measure those.
There are no quantitative KRIs given for cyber offense, chemical, novel biological, radiological and nuclear weapons, or loss-of-control risks. On cyber offense, the FCF states that its risk tier system “quantifies model capabilities against cybersecurity threat metrics”, but the risk tiers given are exclusively qualitative.
To improve, Anthropic should provide at least one clear quantitative KRI threshold for each risk domain that can be used to determine the risk stemming from its models in that domain, and provide at least a list of the evaluations it will use to measure cyber offense, chemical, novel biological, radiological and nuclear weapons, and loss-of-control risks.
Quotes:
“We will consider this threshold to be met if we determine that either (1) our models would be able to fully substitute for our entire set of Research Scientists and Research Engineers, at competitive costs (i.e., within a factor of 5); or (2) there is “dramatic acceleration” of the pace of AI progress for reasons that likely relate to the automation of AI R&D. We would consider scenario (2) to have occurred where (a) we observe or expect double the rate of progress in AI aggregate capabilities compared to the rate we’d expect in the absence of significant AI contributions to AI R&D and (b) it is plausible that this doubling is substantially attributable to the automation of research and/or engineering (as opposed to other factors, such as increased headcount, compute, or general productivity), such that continuation of the trend in AI progress could lead to even greater acceleration.” (RSP, p.9)
“‘Double the rate of progress’ means ‘as much progress in one year as one would see in two years at baseline.’ For example, if baseline progress involved a 3x scaleup in compute and a 3x improvement in algorithmic efficiency (for a 9x ‘effective scaleup’), ‘double the rate of progress’ would entail something like an 81x effective scaleup. This is not the same idea as ‘doubling researchers’ productivity,’ since doubling inputs does not necessarily double the rate of progress.” (RSP, p.9)
“Bioweapons acquisition uplift trial: […] Score: 63% ± 13%; Uplift: 2.53× (vs 25% control) […] Results for Claude Opus 4 as we did not conduct it for Claude Opus 4.6. ASL-3 threshold: 2.8× uplift.” (Risk Report, p.66)
“Long-form virology tasks: […] Virology Task 2: 0.912 (pass@5) […] ASL-3 threshold: 0.8.” (Risk Report, p.67)
“Internal AI research evaluation suite 2: […] Claude Opus 4.6 scored 0.613, surpassing our rule-out threshold (for equivalence to an entry-level research scientist or research engineer, set based on the evaluation suite designer’s judgment) of 0.6.” (Risk Report, p.59)
“On one evaluation, kernel optimization, Opus 4.6 achieved a 427× speedup using a novel scaffold, far exceeding the 300x threshold for 40 human-expert-hours of work and more than doubling performance under our standard setup.” (Risk Report, p.31)
“Tier 1: Campaign Infrastructure Enablement: […] Automates >50% of steps normally requiring multiple sophisticated adversarial actors. […]
Tier 2: Autonomous adaptive techniques and campaign execution: End-to-end automation of deceptive influence operations with systematic targeting capabilities, requiring <10% human oversight.” (FCF, p.8)