1.1.1 Risks from literature and taxonomies are well covered (50%) 75%
Risks covered include Biological and Chemical, Cybersecurity, and AI self-improvement as tracked categories, plus research categories (monitored to a lesser extent) covering nuclear and radiological risks and various loss of control risks such as long-range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards. Breaking loss of control down into these distinct sub-risks is commendable.
The FGF also covers harmful manipulation as a systemic risk category (influence operations, election interference, coordinated opinion manipulation), though its treatment remains exploratory and is handled via post-deployment monitoring rather than pre-deployment evaluation.
Coverage is grounded in “internal research” and “feedback from academic researchers”, but no specific risk taxonomies are referenced and no structured domain-selection process is given.
Quotes:
“We evaluate whether frontier capabilities create a risk of severe harm through a holistic risk assessment process. This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (PF, p.4)
Tracked Categories include (PF, pp.5-6):
“Biological and Chemical: The ability of an AI model to accelerate and expand access to biological and chemical research, development, and skill-building, including access to expert knowledge and assistance with laboratory work.”
“Cybersecurity: The ability of an AI model to assist in the development of tools and executing operations for cyberdefense and cyberoffense.”
“AI Self-improvement: The ability of an AI system to accelerate AI research, including to increase the system’s own capability.”
Research Categories include (PF, p.7):
“Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed)”
“Sandbagging: ability and propensity to respond to safety or capability evaluations in a way that significantly diverges from performance under real conditions, undermining the validity of such evaluations”
“Autonomous Replication and Adaptation: ability to survive, replicate, resist shutdown, acquire resources to maintain and scale its own operations, and commit illegal activities that collectively constitute causing severe harm (whether when explicitly instructed, or at its own initiative), without also utilizing capabilities tracked in other Tracked Categories.”
“Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.”
“Nuclear and Radiological: ability to meaningfully counterfactually enable the creation of a radiological threat or enable or significantly accelerate the development of or access to a nuclear threat while remaining undetected.”
“this FGF definition currently addresses the following systemic risk categories: Cyber offense […] Chemical, biological, radiological & nuclear (CBRN) […] Harmful manipulation […] Loss of control” (FGF, p.4)
“Harmful manipulation: Risks stemming from the strategic distortion of human behavior, including the use of model capabilities to conduct influence operations, election interference, or other coordinated campaigns to manipulate public opinion or undermine democratic processes.” (FGF, p.4)
“OpenAI’s approach to harmful manipulation remains exploratory due to its nascency as a systemic risk area. This risk tier is subject to further research and may be substantially changed over time.” (FGF, p.9)
1.1.2 Exclusions are clearly justified and documented (50%) 50%
The justification for excluding research categories from becoming tracked categories is clear, namely that they “need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on”. To improve, the justification should refer to at least one of: academic literature or scientific consensus; transparent internal threat modelling; third-party validation with named expert groups. OpenAI should also give credible plans for advancing this threat modelling, or explain why the non-rigorous measurement options it considered are unworkable.
Some exclusion reasoning is commendable. The justification for treating nuclear and radiological capabilities as a research category links clearly to risk models, though expert endorsement or more detailed reasoning would strengthen it.
The FGF adds harmful manipulation as a covered systemic risk category, and gives a rationale for its lighter treatment: these risks are “best addressed through system level mitigations, such as post-deployment monitoring, rather than model evaluations before deployment”. This is more justification than the PF offers, but the FGF concedes the approach “remains exploratory” and still does not ground the decision in literature or named expert input.
The PF inclusion criteria (plausible, measurable, severe, net new, instantaneous or irremediable) implicitly justify when risks are excluded, but an explicit link between each excluded risk and the criteria it fails is needed. The “measurable” requirement may also be overly strict: lacking evaluations that “closely track the potential for the severe harm” does not mean a risk should be dismissed. Note too that the FGF operates on a separate severity definition (“>50 fatalities or $1 billion”, FGF, p.3) and does not restate these five criteria, so the two documents apply different scoping logic.
OpenAI commits to “periodically review the latest research and findings for each Research Category”, but a more structured review process and explicit criteria for promotion to a tracked category should be given.
Quotes:
“Within our wider safety stack, our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.” (PF, p.8)
“There are also some areas of frontier capability that do not meet the criteria to be Tracked Categories, but where we believe work is required now in order to prepare to effectively address risks of severe harms in the future. These capabilities either need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on to mitigate existing Tracked Category risks. We call these Research Categories” (PF, p.7)
“Tracked Categories are those capabilities which we track most closely, measuring them during each covered deployment and preparing safeguards for when a threshold level is crossed. We treat a frontier capability as a Tracked Category if the capability creates a risk that meets five criteria:
1. Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.
2. Measurable: We can construct or adopt capability evaluations that measure capabilities that closely track the potential for the severe harm.
3. Severe: There is a plausible threat model within the capability area that would create severe harm.
4. Net new: The outcome cannot currently be realized as described (including at that scale, by that threat actor, or for that cost) with existing tools and resources (e.g., available as of 2021) but without access to frontier AI.
5. Instantaneous or irremediable: The outcome is such that once realized, its severe harms are immediately felt, or are inevitable due to a lack of feasible measures to remediate.” (PF, p.4)
“AI Self-improvement (now a Tracked Category), Long-range Autonomy and Autonomous Replication and Adaptation (now Research Categories) are distinct aspects of what we formerly termed Model Autonomy. We have separated self-improvement because it presents a distinct plausible, net new, and potentially irremediable risk, namely that of a hard-to-track rapid acceleration in AI capabilities which could have hard-to-predict severely harmful consequences.
In addition, the evaluations we use to measure this capability are distinct from those applicable to Long-range Autonomy and Autonomous Replication and Adaptation. Meanwhile, while these latter risks’ threat models are not yet sufficiently mature to receive the scrutiny of Tracked Categories, we believe they justify additional research investment and could qualify in the future, so we are investing in them now as Research Categories.
Nuclear and Radiological capabilities are now a Research Category. While basic information related to nuclear weapons design is available in public sources, the information and expertise needed to actually create a working nuclear weapon is significant, and classified. Further, there are significant physical barriers to success, like access to fissile material, specialized equipment, and ballistics. Because of the significant resources required and the legal controls around information and equipment, nuclear weapons development cannot be fully studied outside a classified context. Our work on nuclear risks also informs our efforts on the related but distinct risks posed by radiological weapons. We build safeguards to prevent our models from assisting with high-risk queries related to building weapons, and evaluate performance on those refusal policies as part of our safety process. Our analysis suggests that nuclear risks are likely to be of substantially greater severity and therefore we will prioritize research on nuclear-related risks. We will also engage with US national security stakeholders on how best to assess these risks.” (PF, pp.7–8)
“We will periodically review the latest research and findings for each Research Category” (PF, p.7)
“Many of the risks stemming from harmful manipulation, such as the use of model capabilities to conduct influence operations, are best addressed through system level mitigations, such as post-deployment monitoring, rather than model evaluations before deployment.” (FGF, p.5)