OpenAI

Weak 1.7/5

Rating scale: very weak, weak, moderate, substantial, strong

Risk Identification: 33%

Risk Analysis and Evaluation: 28%

Risk Treatment: 38%

Risk Governance: 39%

Up to date as of October 2025
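
The headline 1.7/5 is consistent with an unweighted mean of the four category percentages, rescaled to a 5-point scale. The sketch below checks that arithmetic. The equal weighting is an assumption, since the page does not state how the categories are combined.

```python
# Category scores (percent) as reported on this page.
category_scores = {
    "Risk Identification": 33,
    "Risk Analysis and Evaluation": 28,
    "Risk Treatment": 38,
    "Risk Governance": 39,
}

# Assumption: the four categories are weighted equally.
mean_percent = sum(category_scores.values()) / len(category_scores)

# Rescale from a 0-100 percentage to the 0-5 scale used in the headline.
overall_out_of_5 = mean_percent / 100 * 5

print(f"{mean_percent:.1f}% -> {overall_out_of_5:.1f}/5")
```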

Best in class

OpenAI has commendably broken down loss of control risks into research categories, including long-range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards.
Their deployment mitigation thresholds, characterised by Robustness, Usage Monitoring, and Trust-based Access, are unique and show expertise and nuance. The same nuance appears in the assurance process thresholds that models must meet (such as lack of autonomous capability and value alignment).
The Safety Advisory Group, a risk committee that advises management, is commendable and innovative. The designation of this group’s specific role is best in class.

Overview

Documents analyzed

For OpenAI, the following documents were taken into account:

While most of the other companies publish their Frontier Safety Framework as one discrete document, OpenAI’s policies are spread out across the two documents listed above.

Highlights relative to others

Clearer criteria for deciding whether to track a risk domain.
More substantial detail and nuance for why they believe their elicitation methods will be comprehensive enough to match the elicitation efforts of potential threat actors.
Stronger commitments to share evaluation results with relevant stakeholders.

Weaknesses relative to others

Marginal risk clause makes deployment decisions contingent on other companies’ risk tolerance.
Risk tolerance could be made more precise.
Vague threshold for security measures.
Unclear how frequently evaluations are run during development and after deployment.
Poorer risk culture.

Changes

OpenAI would have attained a higher score had it not made certain changes to its framework.

Compared to their first Preparedness Framework (Beta), they:

Removed the emphasis on identifying “unknown unknowns”. The Beta framework strongly emphasised running a process to identify unknown categories of catastrophic risk as they emerge. Had this been retained, they would have scored higher on the risk identification category.
Removed safety drills. Had these been retained, they would have scored higher on escalation protocols.
Added the marginal risk clause. This lowers their score for 2.2.3.

1. Risk Identification

Weak 33%

1.1 Classification of Applicable Known Risks (40%) 63%

1.1.1 Risks from literature and taxonomies are well covered (50%) 75%

Risks covered include Biological and Chemical, Cybersecurity, and AI self-improvement as tracked categories, plus research categories (monitored to a lesser extent) covering nuclear and radiological risks and various loss of control risks such as long-range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards. Breaking down loss of control as such is commendable.

The FGF also adds harmful manipulation as a covered systemic risk category (influence operations, election interference, coordinated opinion manipulation), though its treatment remains exploratory and is handled via post-deployment monitoring rather than pre-deployment evaluation.

Coverage is grounded in “internal research” and “feedback from academic researchers”, but no specific risk taxonomies are referenced and no structured domain-selection process is given.

Quotes:

“We evaluate whether frontier capabilities create a risk of severe harm through a holistic risk assessment process. This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (PF, p.4)

Tracked Categories include (PF, pp.5-6):
“Biological and Chemical: The ability of an AI model to accelerate and expand access to biological and chemical research, development, and skill-building, including access to expert knowledge and assistance with laboratory work.”
“Cybersecurity: The ability of an AI model to assist in the development of tools and executing operations for cyberdefense and cyberoffense.”
“AI Self-improvement: The ability of an AI system to accelerate AI research, including to increase the system’s own capability.”

Research Categories include (PF, p.7):
“Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed)”
“Sandbagging: ability and propensity to respond to safety or capability evaluations in a way that significantly diverges from performance under real conditions, undermining the validity of such evaluations”
“Autonomous Replication and Adaptation: ability to survive, replicate, resist shutdown, acquire resources to maintain and scale its own operations, and commit illegal activities that collectively constitute causing severe harm (whether when explicitly instructed, or at its own initiative), without also utilizing capabilities tracked in other Tracked Categories.”
“Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.”
“Nuclear and Radiological: ability to meaningfully counterfactually enable the creation of a radiological threat or enable or significantly accelerate the development of or access to a nuclear threat while remaining undetected.”

“this FGF definition currently addresses the following systemic risk categories: Cyber offense […] Chemical, biological, radiological & nuclear (CBRN) […] Harmful manipulation […] Loss of control” (FGF, p.4)

“Harmful manipulation: Risks stemming from the strategic distortion of human behavior, including the use of model capabilities to conduct influence operations, election interference, or other coordinated campaigns to manipulate public opinion or undermine democratic processes.” (FGF, p.4)

“OpenAI’s approach to harmful manipulation remains exploratory due to its nascency as a systemic risk area. This risk tier is subject to further research and may be substantially changed over time.” (FGF, p.9)

1.1.2 Exclusions are clearly justified and documented (50%) 50%

The justification for excluding research categories from becoming tracked categories is clear, namely that they “need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on”. To improve, this should refer to at least one of: academic literature or scientific consensus; transparent internal threat modelling; third-party validation with named expert groups. They should also give credible plans for advancing this threat modelling, or explain why non-rigorous measurement options they considered are unworkable.

Some exclusion reasoning is commendable. The justification for treating nuclear and radiological capabilities as a research category links clearly to risk models, though expert endorsement or more detailed reasoning would strengthen it.

The FGF adds harmful manipulation as a covered systemic risk category, and gives a rationale for its lighter treatment: these risks are “best addressed through system level mitigations, such as post-deployment monitoring, rather than model evaluations before deployment”. This is more justification than the PF offers, but the FGF concedes the approach “remains exploratory” and still does not ground the decision in literature or named expert input.

Their PF inclusion criteria (plausible, measurable, severe, net new, instantaneous or irremediable) implicitly justify when risks are excluded, but an explicit link between each excluded risk and the criteria it fails is needed. The “measurable” requirement may also be overly strict: lacking evaluations that “closely track the potential for the severe harm” does not mean a risk should be dismissed. Note too that the FGF operates on a separate severity definition (“>50 fatalities or $1 billion”, FGF, p.3) and does not restate these five criteria, so the two documents apply different scoping logic.

They commit to “periodically review the latest research and findings for each Research Category”, but a more structured process and explicit promotion criteria should be given.

Quotes:

“Within our wider safety stack, our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.” (PF, p.8)

“There are also some areas of frontier capability that do not meet the criteria to be Tracked Categories, but where we believe work is required now in order to prepare to effectively address risks of severe harms in the future. These capabilities either need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on to mitigate existing Tracked Category risks. We call these Research Categories” (PF, p.7)

“Tracked Categories are those capabilities which we track most closely, measuring them during each covered deployment and preparing safeguards for when a threshold level is crossed. We treat a frontier capability as a Tracked Category if the capability creates a risk that meets five criteria:
1. Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.
2. Measurable: We can construct or adopt capability evaluations that measure capabilities that closely track the potential for the severe harm.
3. Severe: There is a plausible threat model within the capability area that would create severe harm.
4. Net new: The outcome cannot currently be realized as described (including at that scale, by that threat actor, or for that cost) with existing tools and resources (e.g., available as of 2021) but without access to frontier AI.
5. Instantaneous or irremediable: The outcome is such that once realized, its severe harms are immediately felt, or are inevitable due to a lack of feasible measures to remediate.” (PF, p.4)

“AI Self-improvement (now a Tracked Category), Long-range Autonomy and Autonomous Replication and Adaptation (now Research Categories) are distinct aspects of what we formerly termed Model Autonomy. We have separated self-improvement because it presents a distinct plausible, net new, and potentially irremediable risk, namely that of a hard-to-track rapid acceleration in AI capabilities which could have hard-to-predict severely harmful consequences.
In addition, the evaluations we use to measure this capability are distinct from those applicable to Long-range Autonomy and Autonomous Replication and Adaptation. Meanwhile, while these latter risks’ threat models are not yet sufficiently mature to receive the scrutiny of Tracked Categories, we believe they justify additional research investment and could qualify in the future, so we are investing in them now as Research Categories.

Nuclear and Radiological capabilities are now a Research Category. While basic information related to nuclear weapons design is available in public sources, the information and expertise needed to actually create a working nuclear weapon is significant, and classified. Further, there are significant physical barriers to success, like access to fissile material, specialized equipment, and ballistics. Because of the significant resources required and the legal controls around information and equipment, nuclear weapons development cannot be fully studied outside a classified context. Our work on nuclear risks also informs our efforts on the related but distinct risks posed by radiological weapons. We build safeguards to prevent our models from assisting with high-risk queries related to building weapons, and evaluate performance on those refusal policies as part of our safety process. Our analysis suggests that nuclear risks are likely to be of substantially greater severity and therefore we will prioritize research on nuclear-related risks. We will also engage with US national security stakeholders on how best to assess these risks.” (PF, pp.7–8)

“We will periodically review the latest research and findings for each Research Category” (PF, p.7)

“Many of the risks stemming from harmful manipulation, such as the use of model capabilities to conduct influence operations, are best addressed through system level mitigations, such as post-deployment monitoring, rather than model evaluations before deployment.” (FGF, p.5)

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 0%

1.2.1 Internal open-ended red teaming (70%) 0%

The framework does not mention any pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. the emergence of an extended context length that allows improved zero-shot learning changes the risk profile), and provide methodology, resources and required expertise.

The framework does mention that red-teaming is to be conducted by human experts, but not explicitly for the purpose of identifying unknown risks. It is also only required if a capability threshold is passed.

Quotes:

“The SAG [Safety Advisory Group] reviews the Capabilities Report and decides on next steps. These can include: […] Recommend deep dive research: This is appropriate if SAG needs additional evidence in order to make a recommendation.” (p. 9)

“Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (p. 8)

1.2.2 Third party open-ended red teaming (30%) 0%

The framework does not mention any pre-deployment third-party procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to an external process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. the emergence of an extended context length that allows improved zero-shot learning changes the risk profile), and provide methodology, resources and required expertise.

Quotes:

“The SAG reviews the Capabilities Report and decides on next steps. These can include: […] Recommend deep dive research: This is appropriate if SAG needs additional evidence in order to make a recommendation.” (p. 9)

Third-party evaluation of tracked model capabilities: “If we deem that a deployment warrants deeper testing of Tracked Categories of capability (as described in Section 3.1), for example based on results of Capabilities Report presented to them, then when available and feasible, OpenAI will work with third-parties to independently evaluate models.” (p. 13)

1.3 Risk modeling (40%) 19%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 25%

The preparedness framework describes having threat models for each Tracked Category (i.e. risk domain), though not for the Research Categories (“For each Tracked Category, we develop and maintain a threat model to identify specific risks of severe harms that could arise from the frontier capabilities in that domain”). The FGF commits to systemic risk modelling that estimates “the severity and probability of harm of risks related to CBRN, cyber offense, and loss of control”, bringing loss of control into scope with its own risk tiers, with harmful manipulation still at an early stage.

Because every Tracked Category must be ‘Plausible’, some risk modelling is presumably also performed for Research Categories, in order to determine whether they should become Tracked Categories (“Plausible: it must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI”). Keeping some risks as Research Categories on the grounds that they require more threat modelling also indicates awareness that risk models are necessary for all monitored risk areas. However, more detail should be given on how the threat models for these categories will be brought to the required level of precision.

The risk models themselves are not published, though there is some indication of intending to share findings. The FGF publishes tier tables, but these are capability-threshold descriptions derived from the threat models, not the underlying pathways, methodology, experts, or scenario lists. The most concrete pathway sketch is the biological one in the PF: “Our evaluations test acquiring critical and sensitive information across the five stages of the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and Release”. More detail should be provided.

Quotes:

“[capabilities are Tracked Categories if they are] Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.” (PF, p.4)

“For each Tracked Category, we develop and maintain a threat model to identify specific risks of severe harms that could arise from the frontier capabilities in that domain” (PF, p.4)

“Our evaluations test acquiring critical and sensitive information across the five stages of the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and Release. These evaluations, developed by domain experts, cover things like how to troubleshoot the laboratory processes involved.”

“These [Research Category] capabilities either need more research and threat modeling before they can be measured […] [for these] we will take the following steps, both internally and in collaboration with external experts: Further developing the threat models for the area […] Sharing summaries of our findings with the public where feasible.” (PF, pp.6-7)

“We conduct systemic risk modelling that informs how we evaluate systemic risks before deployment. Currently, we estimate the severity and probability of harm of risks related to CBRN, cyber offense, and loss of control under this FGF. We are still in the early stages of developing an approach for assessing risks from harmful manipulation.” (FGF, p.5)

1.3.2 Risk modeling methodology (40%) 11%

1.3.2.1 Methodology precisely defined (70%) 10%

The Preparedness Framework does not make clear what the risk modelling methodology is, or whether any particular methodology is followed. However, it does mention identifying causal pathways, which implies some methodology. More detail should be given.

Quotes:

“It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.” (PF, p.4)

“Capability thresholds concretely describe things an AI system might be able to help someone do or might be able to do on its own that could meaningfully increase risk of severe harm.” (PF, p.4)

“OpenAI has implemented a variety of structured processes to identify systemic risks stemming from our frontier models and to develop risk scenarios and threat models through which these systemic risks may develop or materialize.” (FGF, p.3)

“We conduct systemic risk modelling that informs how we evaluate systemic risks before deployment.” (FGF, p.5)

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 0%

No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.

Quotes:

No relevant quotes found.

1.3.2.3 Prioritization of severe and probable risks (15%) 25%

For a risk area to be a tracked category, the capability must create a risk that is “Severe: There is a plausible threat model within the capability area that would create severe harm.” This suggests that severity is prioritised, and plausibility here suggests the risk model must have nonzero probability. However, these threat models are developed post-hoc – after deciding which categories to track: “For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain […]”

They then prioritise monitoring for High and Critical capabilities, implicitly defining these as those capabilities with higher probability × severity of harm: “High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm”; “Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector for severe harm with no ready precedent.”

The FGF commits to estimate both dimensions explicitly: “Currently, we estimate the severity and probability of harm of risks related to CBRN, cyber offense, and loss of control under this FGF”, and it gives a concrete severity-based prioritisation, prioritising “biological capability evaluations due to the higher potential severity of biological threats”. However, there is still minimal detail on how severity and probability are determined, harmful manipulation is excluded as still early-stage, and these results are not published.

Overall, there is awareness that threat models should focus on severe harms, and the FGF commits to estimating probability and severity, but with little evidence of systematic prioritisation among multiple risk models. Risk modelling is still completed after deciding what to track, which differs from the required criterion of using prioritisation of risk models to determine focus areas.

Quotes:

“For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain […] High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm. Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector for severe harm with no ready precedent.” (PF, p.4)

“Where we determine that a capability presents a real risk of severe harm, we may decide to monitor it as a Tracked Category or a Research Category.” (PF, p.4)

For a capability to be a Tracked Category (PF, p.4):
“Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.”
“Severe: There is a plausible threat model within the capability area that would create severe harm.”

“Currently, we estimate the severity and probability of harm of risks related to CBRN, cyber offense, and loss of control under this FGF.” (FGF, p.5)

“we prioritize biological capability evaluations due to the higher potential severity of biological threats” (FGF, p.8)

1.3.3 Third party validation of risk models (20%) 25%

While “threat models are informed by […] specific information that we gather across OpenAI teams and external experts”, they are not validated by third parties. Indeed, risk models are only approved internally: “For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain and set corresponding capability thresholds that would lead to a meaningful increase in risk of severe harm. SAG [Safety Advisory Group] reviews and approves these threat models.” (p. 4)
“Informed by”, “in collaboration with”, and “gather information from” suggest consultation and input during development of the risk models, rather than independent validation of completed models. To improve, they should make an explicit commitment to have third parties validate risk models.

Quotes:

“For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain and sets corresponding capability thresholds that would lead to a meaningful increase in risk of severe harm. SAG [Safety Advisory Group] reviews and approves these threat models.” (p. 4)

“Threat models are informed both by our broader risk assessment process, and by more specific information that we gather across OpenAI teams and external experts.” (p. 4)

“For [Research Categories], in collaboration with external experts, we commit to further developing the associated threat models and advancing the science of capability measurement for the area, including by investing in the development of rigorous capability evaluations.” (p. 14)
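
The percentages reported for section 1 can be reproduced from the weights and scores in the criterion headings above. The sketch below rolls leaf scores up the criterion tree as weighted averages. It assumes that each level is rounded half-up before being passed to its parent, which matches the displayed figures but is not stated on the page; the `Criterion` class and `rollup` function are illustrative names, not part of any published methodology.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Criterion:
    """One node of the scoring tree: leaves carry a score, parents carry children."""
    name: str
    weight: int                   # percent weight within the parent
    score: Optional[int] = None   # percent score, set on leaves only
    children: List["Criterion"] = field(default_factory=list)


def rollup(node: Criterion) -> int:
    """Weighted average of child scores, rounded half-up at every level (assumed)."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name} has no score")
        return node.score
    total = sum(child.weight * rollup(child) for child in node.children)
    weight_sum = sum(child.weight for child in node.children)
    # Integer half-up rounding of total / weight_sum.
    return (2 * total + weight_sum) // (2 * weight_sum)


# Weights and leaf scores copied from the section 1 headings.
# The root weight of 100 is a placeholder and is never used.
section_1 = Criterion("1", 100, children=[
    Criterion("1.1", 40, children=[
        Criterion("1.1.1", 50, 75),
        Criterion("1.1.2", 50, 50),
    ]),
    Criterion("1.2", 20, children=[
        Criterion("1.2.1", 70, 0),
        Criterion("1.2.2", 30, 0),
    ]),
    Criterion("1.3", 40, children=[
        Criterion("1.3.1", 40, 25),
        Criterion("1.3.2", 40, children=[
            Criterion("1.3.2.1", 70, 10),
            Criterion("1.3.2.2", 15, 0),
            Criterion("1.3.2.3", 15, 25),
        ]),
        Criterion("1.3.3", 20, 25),
    ]),
])

print(rollup(section_1))  # section 1 score in percent
```

Integer arithmetic is used for the rounding because Python's built-in `round` rounds halves to the nearest even number, which would turn the 62.5% for criterion 1.1 into 62 rather than the 63% shown.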

2. Risk Analysis and Evaluation

Weak 28%

2.1 Setting a Risk Tolerance (35%) 16%

2.1.1 Risk tolerance is defined (80%) 20%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 50%

There is a qualitative definition of scenarios that implicitly represent ‘unacceptable’ levels of risk, given under the Critical capability threshold. For instance, “Proliferating the ability to create a novel threat vector of the severity of a CDC Class A biological agent (i.e., high mortality, ease of transmission) could cause millions of deaths and significantly disrupt public life, with few available societal safeguards” implicitly sets the risk tolerance at this level of risk.

To improve, they must set out the maximum amount of risk the company is willing to accept, for each risk domain (though they need not differ between risk domains), ideally expressed in terms of probabilities and severity (economic damages, physical lives, etc), and separate from KRIs.

Partial credit is given for the definition of “severe harm” as “the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage.” However, the capability thresholds are not explicitly linked to this proto-risk tolerance, and it should be more specific (e.g., specifying how many thousands of people).

Quotes:

“By “severe harm” in this document, we mean the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage.” (PF, p.1)

“This FGF’s definition of systemic risk includes foreseeable and material risks of severe harm from the development, storage, use, or deployment of our most advanced frontier models, including risks that a model will materially contribute to greater than 50 fatalities or $1 billion of property damages or losses arising from a single incident.” (FGF, p. 3)

Some examples from Tracked Categories table, under the Critical category for ‘Associated risk of severe harm’ (PF, p.5):

“Proliferating the ability to create a novel threat vector of the severity of a CDC Class A biological agent (i.e., high mortality, ease of transmission) could cause millions of deaths and significantly disrupt public life, with few available societal safeguards.”

“Finding and executing end-to-end exploits for all software could lead to catastrophe from unilateral actors, hacking military or industrial systems, or OpenAI infrastructure. Novel cyber operations, e.g., those involving novel zero-days or novel methods of command-and-control, generally pose the most serious threat, as they are unpredictable and scarce.”

“A major acceleration in the rate of AI Self-improvement could rapidly increase the rate at which new capabilities and risks emerge, to the point where our current oversight practices are insufficient to identify and mitigate new risks, including risks to maintaining human control of the AI system itself.”

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 10%

The qualitative risk tolerances do not have quantitative probabilities, and are vague in description. The definition of severe harm implies some awareness of quantitative measurement, though this is used to classify critical capability thresholds rather than defined as a risk tolerance itself.

Quotes:

“High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm” (p. 4)

“Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector” (p. 4)

“Scalable evaluations have associated “indicative thresholds,” which are levels of performance that we have pre-determined to indicate that a deployment may have reached a capability threshold.” (p. 8)

“By “severe harm” in this document, we mean the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage.” (p. 1)

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

Whilst they mention the criterion of “severe harm” via “the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage”, this is still vague, and does not apply as a specific risk tolerance for specific risks. None of the specific risks mention quantitative probabilities, and the implicit risk tolerances from capability thresholds are not fully quantitative either.
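
For contrast, a fully quantitative tolerance of the kind this criterion asks for would let an estimated risk (probability × severity) be compared directly with a stated maximum. The sketch below shows only the shape of that check. The class, the function, the choice of fatalities per year as the unit, and every number are hypothetical placeholders; none of them comes from OpenAI's documents or from this assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEstimate:
    """A quantitative estimate for one risk scenario (all fields hypothetical)."""
    scenario: str
    annual_probability: float   # estimated probability of the scenario per year
    severity_fatalities: float  # estimated fatalities if the scenario occurs

    @property
    def expected_fatalities(self) -> float:
        # Risk expressed as probability x severity.
        return self.annual_probability * self.severity_fatalities


def within_tolerance(estimate: RiskEstimate, max_expected_fatalities: float) -> bool:
    """True if the estimated risk does not exceed the stated tolerance."""
    return estimate.expected_fatalities <= max_expected_fatalities


# Placeholder numbers for illustration only; not taken from any framework.
example = RiskEstimate(
    scenario="illustrative scenario",
    annual_probability=1e-4,
    severity_fatalities=10_000,
)
print(within_tolerance(example, max_expected_fatalities=0.1))
```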

Quotes:

“High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm” (p. 4)
“Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector” (p. 4)

Some examples from Tracked Categories table, under the Critical category for ‘Associated risk of severe harm’ (p. 5):

2.1.2 Process to define the tolerance (20%) 0%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 0%

No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels.

Quotes:

No relevant quotes found.

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.
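
The 16% for section 2.1 follows from the sub-criterion scores above. The three 2.1.1 sub-criteria are each weighted 33%, which sums to 99%, so the sketch below normalises by the sum of the weights; that normalisation is an assumption about how the page handles the shortfall, and `weighted_score` is an illustrative helper.

```python
from typing import Sequence, Tuple


def weighted_score(items: Sequence[Tuple[float, float]]) -> float:
    """Weighted mean of (weight, score) pairs, normalised by the sum of the weights."""
    weight_sum = sum(weight for weight, _ in items)
    return sum(weight * score for weight, score in items) / weight_sum


# (weight %, score %) pairs copied from the 2.1 headings.
score_2_1_1 = weighted_score([(33, 50), (33, 10), (33, 0)])
score_2_1_2 = weighted_score([(50, 0), (50, 0)])
score_2_1 = weighted_score([(80, score_2_1_1), (20, score_2_1_2)])

print(score_2_1_1, score_2_1_2, score_2_1)
```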

2.2 Operationalizing Risk Tolerance (65%) 35%

2.2.1 Key Risk Indicators (KRI) (30%) 35%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 50%

Under the PF, each Tracked Category has two thresholds, which are specific. They appear to be grounded in risk modelling, but this is hard to confirm because the risk models are not clear. The Research Categories have less developed KRIs but are still well-defined.

The FGF adds a Tier 1 below the High/Critical thresholds for Cybersecurity and CBRN, along with an “Examples” column giving more qualitative detail than the PF’s bare threshold definitions, and develops loss of control (a Research Category in the PF) into three defined tiers. Against this, the FGF introduces harmful manipulation as a risk domain with no defined KRI, which works against coverage of all risks. To improve, more precision is needed in all of the Research Categories, as well as in the High thresholds for Cybersecurity and Biological & Chemical; the reference points that thresholds rely on, e.g. the “baseline of tools available in 2021” or “relative to those researchers’ 2024 baseline”, should be evidenced. Additionally, KRIs should map directly to the evaluation tests performed.

Quotes:

From Tracked Categories, under ‘Capability threshold that could lead to the risk’, PF, pp.5–6:
Biological and chemical:
“[High] The model can provide meaningful counterfactual assistance (relative to unlimited access to baseline of tools available in 2021) to “novice” actors (anyone with a basic relevant technical background) that enables them to create known biological or chemical threats.”

“[Critical] The model can enable an expert to develop a highly dangerous novel threat vector (e.g., comparable to a novel CDC Class A biological agent) OR the model can be connected to tools and equipment to complete the full engineering and/or synthesis cycle of a regulated or novel biological threat without human intervention.”

Autonomous AI R&D:
“[High] The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.”

“[Critical] The model is capable of recursively self improving (i.e., fully automated AI Self-improvement), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months.”

Research Categories:
“Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed)” (PF, p.7). More Research Category definitions can be found on the same page.

“Tier 1: The model’s capabilities provide minimal advantage compared to publicly available information or search engines, or provides minor uplift or lowers barriers to experts.” (FGF, p.8)

“[Tier 2] Provides accurate, detailed, and contextually relevant information that is similar to direct interactions with subject matter experts, that would be difficult for a non-specialist to obtain via open internet or manuals […]” (FGF, p.8)

“Tier 1 The model can reliably complete complex, multi-step tasks but lacks the execution of abilities approaching expert humans. The model […] executes deceptive behaviors in narrow, highly prompted scenarios.” (FGF, p.10)
“Outside of risks related to AI self-improvement, these risk tiers remain exploratory and may evolve substantially.” (FGF, p.10)

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 25%

The Critical capability threshold for AI Self-improvement, and to a lesser extent the High capability thresholds for AI Self-improvement and Biological & Chemical, give baselines, which means these KRIs can in theory be measured quantitatively. However, more detail on the actual thresholds implemented is needed. No quantitative thresholds are given for the other categories. The FGF does not add quantitative KRI thresholds: it claims its tier systems “quantify” capabilities and incorporate “risk estimates,” but the tiers as presented are qualitative descriptions and examples.

Quotes:

All the below are from Tracked Categories under ‘Capability threshold that could lead to the risk’, PF, pp.5-6:

High, Biological & Chemical: “[High] The model can provide meaningful counterfactual assistance (relative to unlimited access to baseline of tools available in 2021) to “novice” actors (anyone with a basic relevant technical background) that enables them to create known biological or chemical threats”

High, AI Self-improvement: “[High] The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.”

Critical, AI Self-improvement: “[Critical] The model is capable of recursively self improving (i.e., fully automated AI Self-improvement), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months.”

“[…] we have established a tier system that seeks to quantify model capabilities against cybersecurity threats and incorporates risk estimates and other suitable metrics. We use these measurable thresholds […]” (FGF, p.7)

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 10%

The KRIs in the PF only mention model capabilities. They do mention monitoring and incident response, which could feasibly lead to KRIs which satisfy this criterion. The FGF strengthens this slightly: it commits to drawing on model-independent and external signals — external research and data, market analyses, insights from other deployed models, post-release monitoring — as inputs to risk assessment, and to post-deployment threat intelligence monitoring as a trigger for additional measures. But these signals feed a holistic risk-assessment judgment rather than functioning as measurable external KRIs with thresholds that trigger specific KCIs, so the criterion is acknowledged and partially set up but not operationalized.

Quotes:

“Monitoring and Incident Response: Monitor security and event logs continuously to detect, triage, and respond to security incidents rapidly by 24×7 on-call staff.” (PF, p.21)

“Our risk assessment process may take place throughout a model’s lifecycle, including during development and after deployment. Risk assessments draw on various sources, including a range of model evaluations, model-independent information such as external research and data, literature reviews, market analyses, consultation with internal and external experts, insights from other deployed models, post-release monitoring, and, if applicable, investigations of serious incidents or critical safety incidents.” (FGF, p.5)

“We rely on a variety of techniques to identify whether additional measures are needed, including post-deployment threat intelligence monitoring, post-launch monitoring tools (classifiers, automated detection, human review, and investigations), consultation with experts, and other techniques.” (FGF, p.11)

2.2.2 Key Control Indicators (KCI) (30%) 32%

2.2.2.1 Containment KCIs (35%) 5%

2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 10%

They do not detail qualitative KCI thresholds for containment. Whilst they detail practices for ensuring security controls, and require these for High capability thresholds, they do not describe what would count as sufficient containment for that risk level. They only state that some standard is required: “As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require sufficient safeguards during development.” (p. 16) However, the purpose of a containment KCI is to precisely define what counts as “sufficient” in these contexts.

They also don’t specify what would be considered sufficient for the Critical threshold, despite this having instrumental effects if not met: “Until we have specified safeguards and security controls standards that would meet a Critical standard, halt further development” (p. 6)

However, they show understanding that different capability levels need different containment approaches.

Quotes:

“Require security controls meeting High standard (Appendix C.3)”, p. 6

“As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require sufficient safeguards during development.” (p. 16)

“Until we have specified safeguards and security controls standards that would meet a Critical standard, halt further development” (p. 6)

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 0%

There is no mention of quantitative thresholds for containment KCIs, i.e. measurement of security controls.

Quotes:

No relevant quotes found.

2.2.2.2 Deployment KCIs (35%) 43%

2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 75%

There are three general deployment KCIs, i.e. targets that the mitigations of risks from malicious users must reach, required for High capability models: “Require safeguards against misuse meeting High standard (Appendix C.1) before external deployment”. However, the actual threshold is still somewhat vague and unspecific, e.g. “sufficiently minimize” requires more detail. KCIs for Critical capabilities are also not defined: whilst they state that “Until we have specified safeguards and security controls that would meet a Critical standard, halt further development”, what counts as a “Critical standard” is left open to interpretation.

Nonetheless, the qualitative detail in the three deployment KCIs is commendable, showing nuance and expertise.

Quotes:

“Each capability threshold has a corresponding class of risk-specific safeguard guidelines under the Preparedness Framework. We use the following process to select safeguards for a deployment:
We first identify the plausible ways in which the associated risk of severe harm can come to fruition in the proposed deployment.

For each of those, we then identify specific safeguards that either exist or should be implemented that would address the risk.
For each identified safeguard, we identify methods to measure their efficacy and an efficacy threshold.” (p. 10)

“Potential claims:

Robustness: Malicious users cannot use the model to cause the severe harm because they cannot elicit the necessary capability, such as because the model is modified to refuse to provide assistance to harmful tasks and is robust to jailbreaks that would circumvent those refusals.
Usage Monitoring: If a model does not refuse and provides assistance to harmful tasks, monitors can stop or catch malicious users before they have achieved an unacceptable scale of harm, through a combination of automated and human detection and enforcement within an acceptable time frame.
Trust-based Access: The actors who gain access to the model are not going to use it in a way that presents an associated risk of severe harm under our threat model.” (p. 11)

“Safeguards should sufficiently minimize the risk of severe harm associated with misuse of the model’s capabilities. This can be done by establishing that all plausible known vectors of enabling severe harm are sufficiently addressed by one or more of the following claims:

Robustness: […]” (p. 16)

“Covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require safeguards that sufficiently minimize associated risks during development.” (p. 11)

“SAG can request further evaluation of the effectiveness of the safeguards to evaluate if the associated risk of severe harm is sufficiently minimized” (p. 11)

2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 10%

Whilst “we should ensure that harmful behaviors are detected by monitors with a high recall rate” (p. 19) demonstrates some awareness of quantitative deployment KCI thresholds, there are no actual quantitative deployment KCI thresholds.

Quotes:

“Red-teamers can train an arbitrarily adversarial agent or can pretend to be an agent in our setup and we should ensure that harmful behaviors are detected by monitors with a high recall rate” (p. 19)

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 50%

The framework provides desiderata which can be seen as proto assurance KCIs, though they are somewhat vague: phrases such as “the model consistently understands […] instructions” or “sufficiently addressed by one or more of the following claims” lack detail. It is also unclear whether they must all be covered in the Safeguards Report, and how often Safeguards Reports are required. The KCIs are also not explicitly linked to specific KRIs, though the document says they will be.

Table 5 lists some desiderata indicating further proto assurance process KCI thresholds. However, these require more development, such as defining what constitutes “significant evidence” or “consistently takes action”.

Quotes:

“Safeguards should sufficiently minimize the risk associated with a misaligned model circumventing human control and oversight and executing severe harms. This can be done by establishing that all plausible known vectors of enabling severe harm are sufficiently addressed by one or more of the following claims:

Lack of Autonomous Capability: The model is not capable of carrying out tasks autonomously, including the risk of severe harm.
Value Alignment: The model consistently applies human values in novel settings (without any instructions) to avoid taking actions that cause harm, and has shown sufficiently minimal indications of misaligned behaviors like deception or scheming.
Instruction Alignment: The model consistently understands and follows user or system instructions, even when vague, and those instructions rule out pathways to causing severe harm.
Reliable and Robust System Oversight: Effective AI and human oversight of model actions detects and prevents execution of harm or subversion of safeguards.
System Architecture: The model can’t take actions that cause harm because it lacks access to output channels or mechanisms to persist sufficiently to execute the harm, due to containerization and restricted permissions.” (p. 18) See also Table 3, p. 11.

From Table 5, p. 19:
“Internal and external user reports, monitoring, or manual inspection for historical model deployments show no significant evidence of unexpected value alignment failures.”

“Large scale usage or testing in realistic settings should show the model consistently takes action aligned to the user’s instructions or goals (while maintaining value alignment and avoiding harm). Examples against would include trying to escape a container, lying to the user to appease them, etc. […] “Canary evaluations” which test model capabilities to bypass less complex, easier-to exploit versions of our security controls, establishing that our implemented controls are robust”

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 25%

The process of presenting a Capabilities Report and a Safeguards Report is a proto pairing of KRI and KCI thresholds, as are the risk-specific safeguard guidelines for each Tracked Category capability threshold. Hence, the Preparedness Framework shows awareness of this concept and partial implementation. However, it does not provide explicit detail, and the linkage is only a ‘guideline’. Further, the framework refers to the Safety Advisory Group making decisions about the level of risk of models based on these reports; an improvement would be to detail the criteria SAG will use to make its determinations.

Overall, more detail should be given on why, ex ante, the KCI thresholds chosen will be sufficient to keep residual risk below the risk tolerance, if satisfied. The FGF adds residual-risk and safety-margin language, but this remains a discretionary determination without risk modelling linking a satisfied KCI to a quantified residual-risk claim below tolerance. In addition, their marginal risk claim makes the residual risk tolerance contingent on other companies’. This does not follow the criterion; the required level of safeguards should be relative to their pre-determined risk tolerance.

Quotes:

“[We] evaluate the likelihood that severe harms could actually occur in the context of deployment, using threat models that take our safeguards into account.” (PF, p.3)

“We compile the information on the planned safeguards needed to minimize the risk of severe harm into a Safeguards Report. The Safeguards Report should include the following information:
• Identified ways a risk of severe harm can be realized for the given deployment, each mapped to the associated security controls and safeguards
• Details about the efficacy of those safeguards
• An assessment on the residual risk of severe harm based on the deployment
• Any notable limitations with the information provided” (PF, p.10)

“SAG is responsible for assessing whether the safeguards associated with a given deployment sufficiently minimize the risk of severe harm associated with the proposed deployment. The SAG will make this determination based on:
• The level of capability in the Tracked Category based on the Capabilities Report.
• The associated risks of severe harm, as described in the threat model and where needed, advice of internal or external experts
• The safeguards in place and their effectiveness based on the Safeguards Report.
• The baseline risk from other deployments, based on a review of any non-OpenAI deployments of models which have crossed the capability thresholds and any public evidence of the safeguards applied for those models.” (PF, pp.10-11)

“We recognize that another frontier AI model developer might develop or release a system with High or Critical capability in one of this Framework’s Tracked Categories and may do so without instituting comparable safeguards to the ones we have committed to. Such an action could significantly increase the baseline risk of severe harm being realized in the world, and limit the degree to which we can reduce risk using our safeguards. If we are able to rigorously confirm that such a scenario has occurred, then we could adjust accordingly the level of safeguards that we require in that capability area, but only if:
• we assess that doing so does not meaningfully increase the overall risk of severe harm,
• we publicly acknowledge that we are making the adjustment,
• and, in order to avoid a race to the bottom on safety, we keep our safeguards at a level more protective than the other AI developer, and share information to validate this claim” (PF, p.12)

“Our consideration of residual risk takes account of the scale and probability of harm, and is evaluated in light of the sufficiency of implemented safeguards proportionate to the level of risk. In assessing whether to accept residual risk, we review the risk tiers for each systemic risk category and incorporate appropriate safety margins.” (FGF, p.5)

2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 50%

The Preparedness Framework commits to halting development if Critical-standard safeguards cannot be specified, and the FGF generalises this into a broader deployment gate covering all systemic risk categories, plus re-assessment when the basis for accepting a model’s risk is “materially undermined”. However, there are multiple gaps remaining:
First, halting requires only that Critical safeguards be specified, not demonstrated sufficient. Second, the trigger is crossing the Critical capability threshold, which permits a model with Critical-level capabilities but inadequate safeguards to exist during development before detection; no mechanism pauses development before Critical capabilities manifest. Third, gating turns on a discretionary acceptable-risk judgment rather than a demonstrated KCI threshold, and no dedeployment process is specified — the FGF’s “materially undermined” trigger prompts re-assessment, not dedeployment.
To improve, OpenAI should commit to demonstrating safeguard sufficiency rather than mere specification, develop a process for pausing before Critical capabilities are reached, and define a dedeployment process.

Quotes:

For each of the critical thresholds of the tracked categories, PF, pp.5-6:
“Until we have specified safeguards and security controls that would meet a Critical standard, halt further development”

“SAG can find the safeguards do not sufficiently minimize the risk of severe harm and recommend potential alternative deployment conditions or additional or more effective safeguards that would sufficiently minimize the risk.” (PF, p.11)

“Models that have reached or are forecasted to reach Critical capability in a Tracked Category present severe dangers and should be treated with extreme caution.
Such models require additional safeguards (safety and security controls) during development, regardless of whether or when they are externally deployed. We do not currently possess any models that have Critical levels of capability, and we expect to further update this Preparedness Framework before reaching such a level with any model.
Our approach to Critical capabilities will need to be robust to both malicious actors (either internal or external) and model misalignment risks. The SAG retains discretion over when to request deep dive evaluations of models whose scalable evaluations indicate that they may possess or may be nearing critical capability thresholds.” (PF, p.12)

“[…] if we have reasonable grounds to believe that the basis for considering the model’s systemic risks acceptable has been materially undermined, we will update our Model Report as appropriate after completing a systemic risk assessment.” (FGF, p.16)
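
As a reading aid, the bracketed weights and trailing scores in the section 2.2 headings can be rolled up arithmetically. The sketch below assumes, rather than knows, that each parent score is the weighted average of its children and that headline figures are rounded to the nearest percent; the `rollup` helper and the dictionary layout are illustrative and not part of the published methodology.

```python
# Assumption: a parent score is the weighted average of its children,
# using the bracketed weights and trailing scores from the headings.

def rollup(node):
    """Return a leaf's score, or the weighted average of a node's children."""
    if "score" in node:
        return node["score"]
    return sum(weight * rollup(child) for weight, child in node["children"])

section_2_2 = {"children": [
    (0.30, {"children": [            # 2.2.1 Key Risk Indicators
        (0.45, {"score": 50}),       # 2.2.1.1 qualitative KRI thresholds
        (0.45, {"score": 25}),       # 2.2.1.2 quantitative KRI thresholds
        (0.10, {"score": 10}),       # 2.2.1.3 external environment
    ]}),
    (0.30, {"children": [            # 2.2.2 Key Control Indicators
        (0.35, {"children": [        # 2.2.2.1 Containment KCIs
            (0.50, {"score": 10}),   # 2.2.2.1.1 qualitative
            (0.50, {"score": 0}),    # 2.2.2.1.2 quantitative
        ]}),
        (0.35, {"children": [        # 2.2.2.2 Deployment KCIs
            (0.50, {"score": 75}),   # 2.2.2.2.1 qualitative
            (0.50, {"score": 10}),   # 2.2.2.2.2 quantitative
        ]}),
        (0.30, {"score": 50}),       # 2.2.2.3 assurance process KCIs
    ]}),
    (0.20, {"score": 25}),           # 2.2.3 pairs grounded in risk modeling
    (0.20, {"score": 50}),           # 2.2.4 policy to put development on hold
]}

score_2_2 = rollup(section_2_2)
print(f"2.2 roll-up: {score_2_2:.2f}%")
```

Under these assumptions the roll-up gives roughly 34.9, in line with the 35% shown for 2.2, and the intermediate values (34.75 for 2.2.1, 5 for 2.2.2.1, 42.5 for 2.2.2.2, 31.625 for 2.2.2) agree with the rounded figures in the headings.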

3. Risk Treatment

Weak 38%

3.1 Implementing Mitigation Measures (50%) 37%

3.1.1 Containment measures (35%) 40%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 50%

The PF describes the security controls required for High capability models in C.3 in detail, though not for Critical, and many measures remain high-level desiderata rather than operational ones (some are more specific). The FGF restates this posture and adds a few more concrete items anchored to named certifications (ISO 27001, SOC 2 Type II). But these form a flat baseline for all frontier models rather than measures mapped to specific containment KCI thresholds, and Critical remains uncovered.

Quotes:

From appendix C.3, PF, pp.20-21:
“Adopt a layered security strategy, ensuring robust protection through multiple defensive barriers, including physical and datacenter security, network segmentation and controls, workload isolation, data encryption, and other overlapping and complementary security controls.”
“Employees must authenticate using multi-factor authentication (MFA) and managed devices meeting security baselines. Access must be logged and reviewed for detection and investigative purposes.”
“Integrate automated code analysis, formal security reviews, and penetration testing in engineering processes. Apply security reviews and validation to higher-sensitivity critical components prior to deployment.”

“OpenAI maintains an Information Security and Privacy Program aligned with ISO 27001, 27017, 27018, and 27701, and supported by SOC 2 Type II evaluations.” (FGF, p.14)

“Model weights are protected through encryption (at rest and in transit), continuous monitoring, and access controls (e.g., multi-factor authentication, multi-party approval, and detailed logging).” (FGF, p.15)

“Model execution is sandboxed, with restricted egress by default.” (FGF, p.15)

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 25%

The PF commits to validating controls regularly but gives no detail on why the measures are likely sufficient to meet a containment KCI threshold, and sufficiency is ultimately a post-hoc SAG/leadership determination rather than decided before the KRI threshold is crossed. The FGF adds a documented residual-risk acceptance justification, marginally reinforcing that a process for soliciting proof exists, but still provides no ex-ante proof against a defined containment KCI and no quantified threshold.

Quotes:

“Continuous Monitoring and Validation: Ensure security threat models and updates inform where security and data privacy controls should be implemented, improved, and monitored to further reduce risk. Internal and external assessments to validate these controls are conducted regularly and reports are provided to OpenAI leadership.” (PF, p.20)

“We document a justification for why the systemic risks stemming from the model are acceptable, including reasonably foreseeable conditions under which the justification may no longer hold […]” (FGF, p.11)

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 25%

The PF requires independent audits for High capability models but not Critical, and validates already-implemented protocols rather than the ex-ante case that containment measures meet a containment KCI. The FGF strengthens the evidence by naming specific recurring independent mechanisms — SOC 2 Type II audits, ISO 27001 certification, external red teaming and penetration testing. However, these verify compliance with general security standards and existing controls, not the prior case for KCI sufficiency.

Quotes:

“Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors to ensure compliance with relevant standards and robustness against identified threats.” (PF, p.21)

“Security controls are validated through internal and external assessments, including red teaming, penetration testing, vulnerability scanning, SOC 2 Type II audits, and ISO 27001 certification.” (FGF, p.15)
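
The parenthetical in the 3.1.1.3 heading can be read as an override rule: the containment score is normally 60% of 3.1.1.1 plus 40% of 3.1.1.2, but is set to 100% if the third-party verification score exceeds that weighted base. This reading is an assumption drawn from the heading's wording, and `section_score` is a hypothetical helper written only to illustrate it.

```python
def section_score(defined, proven, third_party):
    """Weighted base score, overridden to 100 when third-party
    verification scores higher than the base (assumed reading)."""
    base = 0.60 * defined + 0.40 * proven
    return 100.0 if third_party > base else base

# Scores taken from the 3.1.1.1, 3.1.1.2 and 3.1.1.3 headings.
containment = section_score(defined=50, proven=25, third_party=25)
print(f"3.1.1 containment: {containment:.0f}%")
```

With these inputs the base is 40% and the override does not fire, which agrees with the 40% shown for 3.1.1. The same three scores appear under 3.1.2, so the same reading gives 40% there as well.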

3.1.2 Deployment measures (35%) 40%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 50%

Section C.1 in the Appendix details “potential safeguards” for models with High capability, without explicit commitment to implementing them: “the safeguards should not be construed as a definitive or comprehensive list of the safeguards we will or could apply to a given launch”. Nonetheless, the measures are defined for each KCI threshold, namely robustness, usage monitoring and trust-based access.

To improve, deployment measures must also be defined for the Critical capability.

Quotes:

From Table 4, p. 17:
“Robustness:

Training the model to refuse to help with high-risk tasks or to otherwise produce low risk responses
Unlearning or training-data filtering to erase specific risk-enabling knowledge from the model’s knowledge-base
Interpretability-based approaches, like activation steering, that directly edit models’ thinking at inference time
Jailbreak robustness, including through adversarial training, inference-time deliberation, and more”

More quotes may be found in Table 4.

“This Appendix provides illustrative examples of potential safeguards, and safeguard efficacy assessments that could be used to establish that we have sufficiently mitigated the risk of severe harm. The examples aim to provide insight on our thinking, but many of the techniques require further research. The safeguards should not be construed as a definitive or comprehensive list of the safeguards we will or could apply to a given launch.

As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require sufficient safeguards during development.” (p. 16)

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 25%

The PF’s Section C.1 in the Appendix details “potential safeguard efficacy assessments”, without explicit commitment to implementing them, and it does not provide actual proof or evidence that the deployment measures are sufficient ex ante. Instead, it relies on the Safety Advisory Group’s judgment at the time when High or Critical deployment standards need to be implemented, making the decision vulnerable to discretion. The FGF likewise determines sufficiency through holistic risk-acceptance judgment incorporating safety margins, not through pre-specified, justified efficacy targets.

Quotes:

From PF, Table 4, p.17:
“Robustness:
• Automated and expert redteaming (identifying success per resources)
• Prevalence of jailbreaks identified via monitoring and reports, in historical deployments
• Results from public jailbreak bounties and results from private and public jailbreak benchmarks”

More quotes may be found in PF, Table 4.

“The examples aim to provide insight on our thinking but should not be construed as a definitive checklist of the safeguards we will apply to a given launch.” (PF, p.10)

“In assessing whether to accept residual risk, we review the risk tiers for each systemic risk category and incorporate appropriate safety margins.” (FGF, p.6)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 25%

While they mention third-party stress testing of safeguards, this is not specific to deployment measures, and appears optional. The FGF ties such testing to models approaching or reaching a new risk tier, but it remains optional, and the FGF specifies neither how experts are selected and kept independent nor whether external reports are published.

Quotes:

“Third-party stress testing of safeguards: If we deem that a deployment warrants third party stress testing of safeguards and if high quality third-party testing is available, we will work with third parties to evaluate safeguards. We may seek this out in particular for models that are over a High capability threshold.” (PF, p.13)

“Independent expert opinions for evidence produced to SAG: The SAG may opt to get independent expert opinion on the evidence being produced to SAG. The purpose of this input is to add independent analysis from individuals or organizations with deep expertise in domains of relevant risks (e.g., biological risk). If provided, these opinions will form part of the analysis presented to SAG in making its decision on the safety of a deployment. These domain experts may not necessarily be AI experts and their input will form one part of the holistic evidence that SAG reviews.” (PF, p.13)

“We may solicit and obtain input from external experts in relevant domains, and other stakeholders, to assist in systemic risk assessment or in determining the sufficiency of safety mitigations. This may include independent third-party evaluators, stress testing of safeguards for models approaching or reaching a new risk tier, or to provide independent expert opinions to assist the Safety Advisory Group in assessing the safety of a proposed deployment.” (FGF, p.17)

3.1.3 Assurance processes (30%) 30%

3.1.3.1 Credible plans towards the development of assurance processes (40%) 25%

The framework mentions a commitment to developing assurance processes for Critical capabilities. However, it does not provide further detail on how this will be achieved, or by what point the processes will need to be in place (i.e. it is unclear whether assurance processes must be solidified before or after a model has been deemed to possess Critical-level capabilities). Further, the commitment is only to “update” the framework, meaning the plans, or the KCIs/KRIs necessitating Critical-level assurance, may themselves be modified.

Quotes:

“Models that have reached or are forecasted to reach Critical capability in a Tracked Category present severe dangers and should be treated with extreme caution. Such models require additional safeguards (safety and security controls) during development, regardless of whether or when they are externally deployed. We do not currently possess any models that have Critical levels of capability, and we expect to further update this Preparedness Framework before reaching such a level with any model. Our approach to Critical capabilities will need to be robust to both malicious actors (either internal or external) and model misalignment risks. The SAG retains discretion over when to request deep dive evaluations of models whose scalable evaluations indicate that they may possess or may be nearing critical capability thresholds.” (p. 12)

3.1.3.2 Evidence that the assurance processes are enough to achieve their corresponding KCI thresholds (40%) 50%

Whilst they detail safeguard efficacy assessments, which is commendable, these are only “potential”. In addition, the descriptions remain high-level rather than precise, and do not cover the Critical case. To improve, they could describe what evidence would be sufficient to show that the safeguards for misalignment meet the required KCI threshold.

Quotes:

“The table below illustrates possible examples of safeguards and safeguard efficacy assessments we could consider to claim we have sufficiently minimized the risks of severe harm associated with High level of capability under any of our Tracked Categories” (p. 19). The table (Table 5) goes on to define “potential safeguard efficacy assessments” for each category. For instance,
“Reliable and Robust System Oversight:

Red-teamers can train an arbitrarily adversarial agent or can pretend to be an agent in our setup and we should ensure that harmful behaviors are detected by monitors with a high recall rate
Monitor escalations are either outright blocked or, in empirical studies, users with the relevant level of expertise were shown to reliably understand and act on monitor flags before significant harm occurred
SLAs for asynchronous monitors to analyze and detect harmful agent actions posthoc”

More examples can be found in Table 5.

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 10%

The Framework does not explicitly outline the assumptions for assurance processes’ efficacy to hold. However, they demonstrate some implicit awareness that “deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code” could all undermine these KCI measures. To improve, the framework should detail the key technical assumptions necessary for the assurance processes to meet the KCI threshold, and evidence for why these assumptions are justified.

Quotes:

“Research Category: Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.
Potential response: If a model has High or Critical capabilities in any of the Tracked Categories, require the Safeguards case to be robust to the discovered capability and/or propensity” (p. 7, under ‘Potential response’ to Research Category “Undermining Safeguards” in Table 2.)

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 40%

3.2.1 Monitoring of KRIs (40%) 39%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 90%

The framework outlines multiple elicitation strategies and commits to fulfill this criterion almost word for word. The elicitation methods detailed show nuance and expertise. To improve, the framework could include measurable information, such as how much compute is used for fine-tuning. More detail could be added on which elicitation methods they anticipate would be used by different threat actors, under realistic settings, to justify their elicitation method.

Quotes:

“Our evaluations are intended to approximate the full capability that the adversary contemplated by our threat model could extract from the deployment candidate model, including by using the highest capability tier of system settings, using a version of the model that has a negligible rate of safety-based refusals on our Tracked Category capability evaluations (which may require a separate model variant), and with the best presently-available scaffolds. These measures are taken to approximate the high end of expected elicitation by threat actors attempting to misuse the model, and should be tailored depending on the level of expected access (e.g., doing finetuning if the weights will be released). Nonetheless, given the continuous progress in model scaffolding and elicitation techniques, we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse. We incorporate this uncertainty into our assessments. We monitor the technical landscape for changes to the elicitation techniques and best practices, and reassess our evaluations as needed.” (p. 8)

3.2.1.2 Evaluation frequency (25%) 10%

The FGF mentions the option of conducting lighter-touch model evaluations “from time to time”, which might be triggered when OpenAI has “reason to believe a model’s risk profile may have materially changed”, or when it releases an updated model. However, these evaluations are discretionary rather than binding commitments.

Quotes:

“we may from time to time conduct lighter-touch model evaluations at appropriate trigger points […] These trigger points may include (1) the release of an updated model or (2) where we have reason to believe a model’s risk profile may have materially changed.” (FGF, p.16)

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 25%

There is some recognition of how post-training enhancements can factor into capability assessments, but this description remains high level.

The commitment to “monitor the technical landscape for changes to the elicitation techniques and best practices, and reassess our evaluations as needed” is vague; it is not clear how evaluations are “reassessed” based on changes in best practices. To improve, the framework could give an explicit commitment to adopt best practices, or complete forecasting exercises to justify its assumptions on the rate of progress in post-training enhancements. However, “we incorporate this uncertainty into our assessments”, whilst vague, shows partial accounting for uncertainty about future progress in post-training enhancements.

Importantly, more detail could be provided on precisely how post-training enhancements are factored into capability assessments – for instance, the size of the “uncertainty” or the safety buffer they give to account for uncertainty concerning the progress of post-training enhancements.

Further, more detail could be added on how they account(ed) for how post-training enhancements’ risk profiles change with different model structures – namely, post-training enhancements are much more scalable with reasoning models, as inference compute can often be scaled to improve capabilities.

Quotes:

3.2.1.4 Vetting of protocols by third parties (15%) 10%

The framework demonstrates discretionary commitment to third-party vetting of evaluation protocols. They do not have a specific structure in place for regularly vetting capabilities assessments by third parties, but they do indicate that they measure the Research Categories capabilities in collaboration with external experts. They also mention a general commitment to soliciting expert opinion on the overall holistic risk assessment process.

Quotes:

“We call these Research Categories, and in these areas we will take the following steps, both internally and in collaboration with external experts:

Further developing the threat models for the area,
Advancing the science of capability measurement in the area and investing towards the development of rigorous evaluations (which could be achieved internally or via partnerships), and
Sharing summaries of our findings with the public where feasible.” (pp. 5-6)

Deeper capability assessments: “Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (p. 8)

3.2.1.5 Replication of evaluations by third parties (15%) 25%

The Preparedness Framework gives some recognition of evaluations being conducted independently by third parties, but only if deemed necessary. Further, they only commit to “work[ing] with” these parties. They do not explicitly commit in the document to have any evaluations replicated, unless replication is part of a deeper capability assessment (‘Deep Dive’) requested by the Safety Advisory Group (i.e., at the discretion of OpenAI leadership).

Quotes:

“Third-party evaluation of tracked model capabilities: If we deem that a deployment warrants deeper testing of Tracked Categories of capability (as described in Section 3.1), for example based on results of Capabilities Report presented to them, then when available and feasible, OpenAI will work with third-parties to independently evaluate models.” (PF, p.13)

Deeper capability assessments: “Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (PF, p.8)
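
The section scores above appear to be consistent with a weighted average of their sub-criterion scores. The page does not state the aggregation method, so the following is only a minimal sketch under two assumptions: that each parent score is the weight-averaged sub-score, and that results are rounded half up to the nearest percent. The function name and the list layout are hypothetical; the weights and scores are the ones shown for 3.2.1, 3.2.2 and 3.2.3.

```python
def weighted_score(items):
    """Aggregate (weight_percent, score_percent) integer pairs into one score.

    Assumption: the parent score is a weighted average, rounded half up.
    Integer arithmetic is used so that e.g. 38.5 rounds to 39 rather than
    hitting Python's round-half-to-even behaviour or float error.
    """
    total_weight = sum(weight for weight, _ in items)
    if total_weight != 100:
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    weighted_sum = sum(weight * score for weight, score in items)
    return (weighted_sum + 50) // 100  # round half up to a whole percent


# 3.2.1 Monitoring of KRIs: sub-criteria 3.2.1.1 to 3.2.1.5
KRI_MONITORING = [(30, 90), (25, 10), (15, 25), (15, 10), (15, 25)]
# 3.2.2 Monitoring of KCIs: sub-criteria 3.2.2.1 to 3.2.2.3
KCI_MONITORING = [(40, 50), (30, 50), (30, 25)]
# 3.2.3 Transparency of evaluation results: sub-criteria 3.2.3.1 and 3.2.3.2
TRANSPARENCY = [(85, 75), (15, 0)]

if __name__ == "__main__":
    print(weighted_score(KRI_MONITORING))  # 39
    print(weighted_score(KCI_MONITORING))  # 43
    print(weighted_score(TRANSPARENCY))    # 64
```

Under these assumptions the sketch reproduces the 39%, 43% and 64% shown for 3.2.1, 3.2.2 and 3.2.3. It is not a statement of how the ratings are actually computed.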

3.2.2 Monitoring of KCIs (40%) 43%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 50%

There is some reference to monitoring systems, but without explicit justification that KCI thresholds will not be crossed unnoticed, nor a detailed description of the methodology of monitoring or assurance that monitoring is ongoing.

There is implicit justification that KCI thresholds are sufficient in the safeguards for misaligned models. For instance, a “monitor AI supervises agent actions to detect agent in the case of trying to take harmful actions” and “agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm”. This shows that monitoring is being conducted in order to satisfy the assurance process KCI. However, this monitoring is not specifically for measuring mitigation effectiveness; rather, it is to “review [agent] actions for evidence of harm”, which may not catch KCI thresholds being crossed unnoticed.

Indeed, there are multiple “potential safeguard efficacy assessments” detailed in the Preparedness Framework, e.g., “Redteamer success creating a certain number of accounts without detection (success per resources)”. However, none of these efficacy assessments are ongoing; instead, it appears they only need to be conducted when drafting the Safeguards Report. Further, they are only “potential” assessments.

To improve, the Preparedness Framework should describe systematic, ongoing monitoring so that mitigation effectiveness is tracked continuously and the KCI threshold continues to be met when required.

Quotes:

“Reliable and Robust System Oversight:
• Monitor AI supervises agent actions to detect agent in the case of trying to take harmful actions, scheming, or similar
• System forces human approval of agent actions (either by default, or by escalation of the monitor)
• Model’s reasoning is provided faithfully and interpretably to humans to review
• Agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm” (PF, p.19)

See PF, Tables 4 and 5, pages 18 and 19, “Potential safeguard efficacy assessments”.

“We rely on a variety of techniques to identify whether additional measures are needed, including post-deployment threat intelligence monitoring, post-launch monitoring tools (classifiers, automated detection, human review, and investigations), consulting with experts, and other techniques.” (FGF, p.11)

3.2.2.2 Vetting of protocols by third parties (30%) 50%

The Preparedness Framework demonstrates discretionary commitment to third-party vetting of KCI protocols. They do not have a specific structure in place for regularly vetting safeguards assessments by third parties, beyond the quotes below. They therefore do not explicitly commit to third-party vetting of KCI protocols, except for containment KCIs.

Quotes:

“SAG is responsible for assessing whether the safeguards associated with a given deployment sufficiently minimize the risk of severe harm associated with the proposed deployment. The SAG will make this determination based on:
[…] The associated risks of severe harm, as described in the threat model and where needed, advice of internal or external experts.” (PF, p.10)

“Continuous Monitoring and Validation: Ensure security threat models and updates inform where security and data privacy controls should be implemented, improved, and monitored to further reduce risk. Internal and external assessments to validate these controls are conducted regularly and reports are provided to OpenAI leadership.” (PF, p.20)

“Monitoring and Incident Response: Monitor security and event logs continuously to detect, triage, and respond to security incidents rapidly by 24×7 on-call staff.” (PF, p.21)

3.2.2.3 Replication of evaluations by third parties (30%) 25%

The framework gives some recognition of evaluations being conducted independently by third parties, but only if deemed necessary. Further, they only commit to “work[ing] with” these parties. They do not explicitly commit in the document to have any evaluations replicated.

Quotes:

3.2.3 Transparency of evaluation results (10%) 64%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 75%

There are commitments to share evaluation results with the public if models are deployed. However, they do not commit to alerting any stakeholders if or when Critical capabilities are reached.

Quotes:

“Public disclosures: We will release information about our Preparedness Framework results in order to facilitate public awareness of the state of frontier AI capabilities for major deployments. This published information will include the scope of testing performed, capability evaluations for each Tracked Category, our reasoning for the deployment decision, and any other context about a model’s development or capabilities that was decisive in the decision to deploy. Additionally, if the model is beyond a High threshold, we will include information about safeguards we have implemented to sufficiently minimize the associated risks. Such disclosures about results and safeguards may be redacted or summarized where necessary, such as to protect intellectual property or safety.” (p. 12)

“Transparency in Security Practices: Ensure security findings, remediation efforts, and key metrics from internal and independent audits are periodically shared with internal stakeholders and summarized publicly to demonstrate ongoing commitment and accountability.” (p. 21)

“Internal Transparency. We will document relevant reports made to the SAG and of SAG’s decision and reasoning. Employees may also request and receive a summary of the testing results and SAG recommendation on capability levels and safeguards (subject to certain limits for highly sensitive information).” (p. 12)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

There is no commitment to permit reports detailing the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 10%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 10%

There is some indication of monitoring; however, this is not explicitly to gain information on novel risk profiles. To improve, such a process should be detailed, for instance by building on the current monitoring infrastructure.

They do mention that monitoring should be conducted to assert there is “no significant evidence of unexpected value alignment failures”, as a safeguard efficacy assessment. Partial credit is given here for the use of “unexpected”, as this could be further developed to analyse novel risk profiles.

Quotes:

“Internal and external user reports, monitoring, or manual inspection for historical model deployments show no significant evidence of unexpected value alignment failures” (p. 19)

“Prevalence of jailbreaks identified via monitoring and reports, in historical deployments” (p. 17)

“Expanding human monitoring and investigation capacity to track capabilities that pose a risk of severe harm, and developing data infrastructure and review tools to enable human investigations” (p. 17)

“Agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm” (p. 19)

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 10%

There is a commitment to developing threat models for some of the Research Categories. However, this is not explicitly linked to incorporating novel risks, which were unexpected or not previously anticipated. To improve, an encounter with a possibly novel risk profile of a model should trigger risk modelling exercises, to analyse how this finding may impact all other risk models.

They do mention that if a capability “presents a real risk of severe harm, we may decide to monitor it as a Tracked Category or a Research Category”. Whilst this remains general, partial credit is given here for having some reference to incorporating additional risks – noting that “a capability” could refer to any capability.

Quotes:

“Where we determine that a capability presents a real risk of severe harm, we may decide to monitor it as a Tracked Category or a Research Category.” (p. 4)

“There are also some areas of frontier capability that do not meet the criteria to be Tracked Categories, but where we believe work is required now in order to prepare to effectively address risks of severe harms in the future. These capabilities either need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on to mitigate existing Tracked Category risks.” (p. 6)

4. Risk Governance

Weak 39%

4.1 Decision-making (25%) 40%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 10%

The framework states that the CEO or a designated person is the decision-maker, but it is unclear whether this applies on a risk-by-risk basis, or how often risk ownership is delegated to someone other than the CEO.

Quotes:

“OpenAI Leadership, i.e., the CEO or a person designated by them, is responsible for: Making all final decisions, including accepting any residual risks and making deployment go/no-go decisions, informed by SAG’s recommendations. Resourcing the implementation of the Preparedness Framework (e.g., additional work on safeguards where necessary).” (p. 15)

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No mention of a management risk committee.

Quotes:

No relevant quotes found.

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 75%

The company outlines clear protocols for their decision-making, including who makes the decisions and on what basis. It specifies its use of residual risk (net of safeguards).

Quotes:

“SAG then has the following decision points: 1. SAG can find that it is confident that the safeguards sufficiently minimize the associated risk of severe harm for the proposed deployment, and recommend deployment. 2. SAG can request further evaluation… 3. SAG can find the safeguards do not sufficiently minimize the risk…The SAG will strive to recommend further actions that are as targeted and non-disruptive as possible while still mitigating risks of severe harm. All of SAG’s recommendations will go to OpenAI Leadership for final decision-making in accordance with the decision-making practices outlined in Appendix B.” (PF, p.11)

“If residual risks associated with the model exceed acceptable risk levels, the model is not deployed unless additional mitigation measures are implemented that sufficiently minimize risk. We rely on a variety of techniques to identify whether additional measures are needed, including postdeployment threat intelligence monitoring, post-launch monitoring tools (classifiers, automated detection, human review, and investigations), consultation with experts, and other techniques. Where the residual risk falls within acceptable levels, taking into account appropriate safety margins, the model may be approved for continued development and internal use (where applicable), and may be deployed under this Frontier Governance Framework. We document a justification for why the systemic risks stemming from the model are acceptable, including reasonably foreseeable conditions under which the justification may no longer hold, as informed by recommendations from OpenAI’s Safety Advisory Group and external experts, as available and appropriate.” (FGF, p.11)

4.1.4 The company has defined escalation procedures in case of incidents (25%) 75%

The PF’s “fast-track” process describes internal escalation (SAG processing reports urgently and coordinating with Leadership) but does not specify what actions would be taken to address the risk itself, focusing on governance processes rather than operational response measures. The FGF refers to a separate AI Safety Incident Response Plan (AIRP) and summarizes it at a high level: it states that incidents are detected across multiple channels (automated monitoring, employee escalation, end-user feedback, regulator/press notification), triaged, investigated for root cause, scope, and impact, mitigated and contained by dedicated response teams, then followed by retrospectives and external reporting to authorities where reporting obligations are triggered. The procedures behind these steps — how an event qualifies as an incident, how severity is determined, the concrete containment and mitigation actions, escalation steps, and timelines — are deferred to the AIRP, which is not public.

Quotes:

“Fast-track. In the rare case that a risk of severe harm rapidly develops (e.g., there is a change in our understanding of model safety that requires urgent response), we can request a fast track for the SAG to process the report urgently. The SAG Chair should also coordinate with OpenAI Leadership for immediate reaction as needed to address the risk.” (PF, p.15)

“OpenAI maintains an AI Safety Incident Response Plan (AIRP) that outlines OpenAI’s plan for identifying and responding to AI safety incidents. The AIRP has a broad scope and uses definitions optimized for operational decision-making, covering a range of different types of safety incidents that may arise, including those that are reportable under applicable frontier safety laws and regulations, including critical safety incidents under the TFAIA. We also maintain a Cybersecurity Incident Response Plan that may cover certain cybersecurity incidents that are reportable.
We have measures in place to monitor for and report on potential AI safety incidents discovered by both internal and external actors. Potential AI safety incidents are triaged, investigated, escalated, and remediated as appropriate according to the procedures described in the AIRP. These incidents are in turn analyzed to determine whether they meet the criteria for external reporting under applicable laws and regulations.

Detection and triage
A potential AI safety incident may be detected through various channels, including automated monitoring, employee escalation, end-user feedback, (including support tickets and external reporting forms), notification from regulators or the press, and review of on- or off-platform activity. Once a potential AI safety incident is identified, the AIRP dictates procedures for assessing whether the event qualifies as an AI safety incident and for determining severity and notifying appropriate internal teams.

Investigation, mitigation, and response
OpenAI maintains response teams that investigate and, if necessary, take action to mitigate and contain incidents. The investigation includes determining the root cause, scope, and impact of the incident.
Once the investigation has been completed, OpenAI takes steps to implement longer-term solutions to address root causes. As part of our commitment to preventing similar future incidents, we may also conduct a retrospective if needed to document key learnings and address open action items.

External reporting
As part of our response, we analyze whether we have reporting obligations relating to the incident under applicable laws and regulations or if any other type of external outreach is advisable, including under any voluntary commitments. If an incident is determined to be reportable, we will gather relevant information from the investigation and remediation phase for reporting to appropriate authorities within the required deadlines.” (FGF, pp.12–13)

4.2 Advisory and Challenge (20%) 48%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 0%

No mention of an executive risk officer.

Quotes:

No relevant quotes found.

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 90%

The Safety Advisory Group (SAG) fills this role, which is described in detail.

Quotes:

“The Safety Advisory Group (SAG) is responsible for: Overseeing the effective design, implementation, and adherence to the Preparedness Framework in partnership with the safety organization leader. For each deployment in scope under the Preparedness Framework, reviewing relevant reports and all other relevant materials and assessing of the level of Tracked Category capabilities and any post-safeguards residual risks. For each deployment under the Preparedness Framework, providing recommendations on potential next steps and any applicable risks to OpenAI Leadership, as well as rationale. Making other recommendations to OpenAI Leadership on longer-term changes or investments that are forecasted to be necessary for upcoming models to continue to keep residual risks at acceptable levels.” (p. 15)

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 75%

The FGF commits to monitoring risk over time through post-deployment threat-intelligence and post-launch monitoring, an incident detection-and-response process, and periodic (6-monthly) model-report review with light-touch evaluations at defined trigger points. Pre-deployment, this rests on repeated capability evaluations against tracked-category thresholds. It does not aggregate risk information into a holistic view — there is no risk register or dashboard, and risk indicators beyond capability evaluations are not specified.

Quotes:

“We rely on a variety of techniques to identify whether additional measures are needed, including post-deployment threat intelligence monitoring, post-launch monitoring tools (classifiers, automated detection, human review, and investigations), consultation with experts, and other techniques.” (FGF, p.11)

“We will in any event determine whether to update the Model Report for our most capable frontier models once every six months.” (FGF, p.16)

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 50%

The Safety Advisory Group (SAG) partly plays this role. However, it is unclear how much challenge it offers to management. The framework specifies explicitly that “OpenAI Leadership can also make decisions without the SAG’s participation”.

Quotes:

“The Safety Advisory Group (SAG), including the SAG Chair, provides a diversity of perspectives to evaluate the strength of evidence related to catastrophic risk and recommend appropriate actions.” (p. 15)

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 75%

The framework clearly outlines risk information to be gathered and shared with management. To improve further, the company should specify more details on these reports and how they describe the risk levels.

Quotes:

“The results of these evaluations… are compiled into a Capabilities Report that is submitted to the SAG.” (p. 9)
“We compile the information on the planned safeguards needed to minimize the risk of severe harm into a Safeguards Report.” (p. 10)

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.

4.3 Audit (20%) 25%

4.3.1 The company has an internal audit function involved in AI governance (50%) 0%

No mention of an internal audit function.

Quotes:

No relevant quotes found.

4.3.2 The company involves external auditors (50%) 50%

OpenAI references third-party validation of security controls and conditional third-party stress testing of safeguards (“if we deem that a deployment warrants” and “if high quality third-party testing is available”). However, they do not commit to external auditing of Framework adherence or risk assessment quality. Access levels for external auditors are not specified.

Quotes:

“Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors” (PF, p.21)

4.4 Oversight (20%) 45%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 90%

The company has a dedicated committee of the Board.

Quotes:

“The Safety and Security Committee (SSC) of the OpenAI Board of Directors will be given visibility into processes, and can review decisions and otherwise require reports and information from OpenAI Leadership as necessary to fulfill the Board’s oversight role. Where necessary, the Board may reverse a decision and/or mandate a revised course of action.” (PF, p.15)

“Material updates will be presented to the Safety and Security Committee of the board of directors of the OpenAI Foundation and to the board of directors of OpenAI Ireland Limited for oversight, with changes and justifications for material updates documented in a changelog and published within 30 days of the update.” (FGF, p.19)

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 0%

No mention of any additional governance bodies.

Quotes:

No relevant quotes found.

4.5 Culture (10%) 20%

4.5.1 The company has a strong tone from the top (33.3%) 25%

The frameworks include a commitment to safety, but do not go into detail on the existence of risks and how they need to be balanced against benefits and AI capabilities.

Quotes:

“OpenAI’s mission is to ensure that AGI (artificial general intelligence) benefits all of humanity. To pursue that mission, we are committed to safely developing and deploying highly capable AI systems” (PF, p.1)

“To pursue that mission, we are committed to safely developing and deploying highly capable AI models, which create significant benefits and also bring new risks. We build for safety at every step and share our learnings so that society can make well-informed choices to manage new risks from frontier AI.” (FGF, p.1)

4.5.2 The company has a strong risk culture (33.3%) 10%

The Preparedness Framework offers employees only limited access to summary information, and it states that the company is deprioritizing safety drills. The FGF adds a documented change-management process for the framework (material updates are reviewed by board-level committees, with justifications recorded in a changelog published within 30 days) and provides for incident retrospectives, conducted if needed, that document key learnings and address open action items. It does not commit to risk training or describe building a risk culture as such.

Quotes:

“Deprioritize safety drills, as we are shifting our attention to a more durable approach of continuously red-teaming and assessing the effectiveness of our safeguards.” (PF, p.14)

“OpenAI commits to ensuring that this FGF is state-of-the-art and kept up to date with OpenAI’s policies and procedures with respect to the TFAIA and the EU Code of Practice. […] Material updates will be presented to the Safety and Security Committee of the board of directors of the OpenAI Foundation and to the board of directors of OpenAI Ireland Limited for oversight, with changes and justifications for material updates documented in a changelog and published within 30 days of the update.” (FGF, p.19)

“As part of our commitment to preventing similar future incidents, we may also conduct a retrospective if needed to document key learnings and address open action items.” (FGF, p.13)

4.5.3 The company has a strong speak-up culture (33.3%) 25%

The framework refers to a “Raising Concerns Policy”, but it does not guarantee anonymity or protection from retaliation.

Quotes:

“Noncompliance. Any employee can raise concerns about potential violations of this policy, or about its implementation, via our Raising Concerns Policy. We will track and appropriately investigate any reported or otherwise identified potential instances of noncompliance with this policy, and where reports are substantiated, will take appropriate and proportional corrective action.” (p. 12)

4.6 Transparency (5%) 58%

4.6.1 The company reports externally on what their risks are (33.3%) 75%

The Preparedness Frameworks state the risks in scope and include commitments to public transparency regarding the risks from models and their mitigation.

Quotes:

“For models in scope of this Framework, results of our systemic risk assessment and mitigation processes and measures are documented in a Safety and Security Model Report […] if we have reasonable grounds to believe that the basis for considering the model’s systemic risks acceptable has been materially undermined, we will update our Model Report as appropriate after completing a systemic risk assessment.” (FGF, p.16)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 75%

The Preparedness Frameworks clearly state OpenAI’s governance mechanisms.

Quotes:

“An internal, cross-functional group of OpenAI leaders called the Safety Advisory Group (SAG) oversees the Preparedness Framework and makes expert recommendations on the level and type of safeguards required for deploying frontier capabilities safely and securely. OpenAI Leadership can approve or reject these recommendations, and our Board’s Safety and Security Committee provides oversight of these decisions.” (PF, p.3)

“OpenAI OpCo LLC and OpenAI Ireland Limited maintain internal governance structures and practices designed to meet the requirements of applicable laws and ensure implementation of the processes in this Framework. OpenAI’s internal governance practices include managing risks across the lifecycle of our models and ongoing legal and compliance reviews to ensure that risk management functions adhere to this Framework. OpenAI OpCo LLC is responsible for compliance with the TFAIA for covered models in the United States. OpenAI Ireland Limited is the provider of OpenAI’s GPAI-SR models in the EU and responsible for compliance with the EU Code of Practice. The board of directors of OpenAI Ireland Limited exercise systemic risk oversight under this Framework for EU purposes.” (FGF, p.18)

4.6.3 The company shares information with industry peers and government bodies (33.3%) 25%

The Preparedness Framework mentions working with, for example, the Frontier Model Forum and the government, but frames these as sources of input rather than recipients of information. The FGF adds an outbound channel to government: reportable critical safety incidents are reported to appropriate authorities within required deadlines. Sharing with industry peers remains input-only (the Frontier Model Forum is named as a consultation source, not a recipient), and the government channel is triggered by external reporting obligations rather than by a self-defined protocol specifying what is shared, with whom, and when.

Quotes:

“Heighten safeguards (and consider further actions) in consultation with appropriate US government actors, accounting for the complexity of classified information handling.” (PF, p.7)

“This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (PF, p.4)

“If an incident is determined to be reportable, we will gather relevant information from the investigation and remediation phase for reporting to appropriate authorities within the required deadlines.” (FGF, p.13)
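
The section scores above (4.4 Oversight 45%, 4.5 Culture 20%, 4.6 Transparency 58%) appear to follow from their sub-criteria by a weighted average. The page does not state the aggregation rule, so the sketch below is an assumption: it treats each parent score as the weight-normalized mean of its children, which reproduces the reported figures after rounding. The function name `weighted_score` is hypothetical and used only for illustration.

```python
from typing import Sequence, Tuple


def weighted_score(children: Sequence[Tuple[float, float]]) -> float:
    """Weighted mean of (weight %, score %) pairs.

    Assumption: a parent criterion's score is the weight-normalized
    average of its sub-criteria. Normalizing by the total weight also
    absorbs rounded weights such as 3 x 33.3% = 99.9%.
    """
    total_weight = sum(weight for weight, _ in children)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * score for weight, score in children) / total_weight


# (weight %, score %) pairs copied from the sub-criteria listed above.
oversight = weighted_score([(50, 90), (50, 0)])                      # 4.4.1, 4.4.2
culture = weighted_score([(33.3, 25), (33.3, 10), (33.3, 25)])       # 4.5.1 to 4.5.3
transparency = weighted_score([(33.3, 75), (33.3, 75), (33.3, 25)])  # 4.6.1 to 4.6.3

print(round(oversight), round(culture), round(transparency))
```

Under this assumption, the 0% on 4.4.2 halves the Oversight score despite the 90% on 4.4.1, and the single 25% on 4.6.3 accounts for most of the gap between the Transparency score and its two 75% sub-criteria.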
