1.1.1 Risks from literature and taxonomies are well covered (50%) 75%
Risks covered include Biological and Chemical risks, Cybersecurity, and AI Self-improvement as tracked categories, plus research categories (i.e. risk domains that are monitored to a lesser extent), including nuclear and radiological risks and various loss-of-control risks such as long-range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards. Breaking down loss-of-control risks in this way is commendable.
They exclude persuasion as a research or tracked category.
There is some mention of drawing on literature through “internal research”, and risk identification “incorporates feedback from academic researchers”, though no specific, structured approach is described and no documents are referenced.
The score is limited because 1.1.2 is not greater than 50% and persuasion is excluded.
Quotes:
“We evaluate whether frontier capabilities create a risk of severe harm through a holistic risk assessment process. This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (p. 4)
Tracked Categories include (pp. 5-6):
“Biological and Chemical: The ability of an AI model to accelerate and expand access to biological and chemical research, development, and skill-building, including access to expert knowledge and assistance with laboratory work.”
“Cybersecurity: The ability of an AI model to assist in the development of tools and executing operations for cyberdefense and cyberoffense.”
“AI Self improvement: The ability of an AI system to accelerate AI research, including to increase the system’s own capability.”
Research Categories include (p. 7):
“Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed)”
“Sandbagging: ability and propensity to respond to safety or capability evaluations in a way that significantly diverges from performance under real conditions, undermining the validity of such evaluations”
“Autonomous Replication and Adaptation: ability to survive, replicate, resist shutdown, acquire resources to maintain and scale its own operations, and commit illegal activities that collectively constitute causing severe harm (whether when explicitly instructed, or at its own initiative), without also utilizing capabilities tracked in other Tracked Categories.”
“Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.”
“Nuclear and Radiological: ability to meaningfully counterfactually enable the creation of a radiological threat or enable or significantly accelerate the development of or access to a nuclear threat while remaining undetected.”
1.1.2 Exclusions are clearly justified and documented (50%) 50%
The justification for not promoting the research categories to tracked categories is clear: they “need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on”. To improve, this justification should refer to at least one of: academic literature/scientific consensus; internal threat modelling with transparency; or third-party validation, with named expert groups and the reasons for their validation. In other words, while they cite that “these capabilities either need more research and threat modeling before they can be rigorously measured” as justification, they should also provide credible plans for how they are improving this threat modeling, or explain why the non-rigorous measurement options they have considered are not possible or helpful.
Some of their exclusion reasoning, however, is quite commendable. For instance, their justification for why nuclear and radiological capabilities are now a research category clearly links to risk models. Nonetheless, expert endorsement or more detailed reasoning would be an improvement.
They acknowledge that persuasion is no longer prioritised because “our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.” However, more detail is required for proper justification, for instance which criteria Persuasion fails to meet and why they believe this.
Implicitly, their criteria for inclusion (plausible, measurable, severe, net new, and instantaneous or irremediable) give justification for when risks are not included. However, a more explicit link between excluded risks and the criteria they fail is needed. Further, the requirement for a risk to be “measurable” may be overly strict; lacking capability evaluations that “measure capabilities that closely track the potential for the severe harm” does not necessarily mean the risk should be dismissed.
They do mention that they will “periodically review the latest research and findings for each Research Category”, but a more structured process should be given.
Quotes:
“AI Self-improvement (now a Tracked Category), Long-range Autonomy and Autonomous Replication and Adaptation (now Research Categories) are distinct aspects of what we formerly termed Model Autonomy. We have separated self-improvement because it presents a distinct plausible, net new, and potentially irremediable risk, namely that of a hard-to-track rapid acceleration in AI capabilities which could have hard-to-predict severely harmful consequences.
In addition, the evaluations we use to measure this capability are distinct from those applicable to Long-range Autonomy and Autonomous Replication and Adaptation. Meanwhile, while these latter risks’ threat models are not yet sufficiently mature to receive the scrutiny of Tracked Categories, we believe they justify additional research investment and could qualify in the future, so we are investing in them now as Research Categories.
Nuclear and Radiological capabilities are now a Research Category. While basic information related to nuclear weapons design is available in public sources, the information and expertise needed to actually create a working nuclear weapon is significant, and classified. Further, there are significant physical barriers to success, like access to fissile material, specialized equipment, and ballistics. Because of the significant resources required and the legal controls around information and equipment, nuclear weapons development cannot be fully studied outside a classified context. Our work on nuclear risks also informs our efforts on the related but distinct risks posed by radiological weapons. We build safeguards to prevent our models from assisting with high-risk queries related to building weapons, and evaluate performance on those refusal policies as part of our safety process. Our analysis suggests that nuclear risks are likely to be of substantially greater severity and therefore we will prioritize research on nuclear-related risks. We will also engage with US national security stakeholders on how best to assess these risks.” (pp. 7–8)
“Within our wider safety stack, our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.” (p. 8)
“There are also some areas of frontier capability that do not meet the criteria to be Tracked Categories, but where we believe work is required now in order to prepare to effectively address risks of severe harms in the future. These capabilities either need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on to mitigate existing Tracked Category risks. We call these Research Categories” (p. 7)
“Tracked Categories are those capabilities which we track most closely, measuring them during each covered deployment and preparing safeguards for when a threshold level is crossed. We treat a frontier capability as a Tracked Category if the capability creates a risk that meets five criteria:
1. Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.
2. Measurable: We can construct or adopt capability evaluations that measure capabilities that closely track the potential for the severe harm.
3. Severe: There is a plausible threat model within the capability area that would create severe harm.
4. Net new: The outcome cannot currently be realized as described (including at that scale, by that threat actor, or for that cost) with existing tools and resources (e.g., available as of 2021) but without access to frontier AI.
5. Instantaneous or irremediable: The outcome is such that once realized, its severe harms are immediately felt, or are inevitable due to a lack of feasible measures to remediate.” (p. 4)
“We will periodically review the latest research and findings for each Research Category” (p. 7)