OpenAI

Weak 1.7/5 (on a scale of very weak, weak, moderate, substantial, strong)

Risk Identification: 32%
Risk Analysis and Evaluation: 25%
Risk Treatment: 38%
Risk Governance: 39%

Best in class

  • OpenAI has commendably broken down loss of control risks into research categories including long range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards. 
  • Their deployment mitigation thresholds, characterised by Robustness, Usage Monitoring, and Trust-based Access, are unique and show expertise and nuance. They also show this nuance when defining the assurance process thresholds that models must meet (such as lack of autonomous capability, value alignment, etc.).
  • The Safety Advisory Group, i.e. a risk committee advising management, is commendable and shows innovation. Their designation of the specific role of this group is best in class.
Overview
Highlights relative to others

Clearer criteria for deciding whether to track a risk domain.

More substantial detail and nuance for why they believe their elicitation methods will be comprehensive enough to match the elicitation efforts of potential threat actors.

Stronger commitments to share evaluation results with relevant stakeholders.

Weaknesses relative to others

Marginal risk clause makes deployment decisions contingent on other companies' risk tolerance.

Risk tolerance could be made more precise.

Vague threshold for security measures.

Unclear how frequently evaluations are run during development and after deployment.

Poorer risk culture.

Changes

Had they not made certain changes to their framework, they would have attained a higher score.

Compared to their first Preparedness Framework (Beta), they:

1. Removed the emphasis on identifying “unknown unknowns”. Their Beta framework placed strong emphasis on running a process to identify unknown categories of catastrophic risk as they emerge. They would have scored higher on the Risk Identification category if this were still included.

2. Removed safety drills. If these were still included, they would have scored higher on escalation protocols.

3. Added the marginal risk clause. This harms their score for 2.2.3.

1.1 Classification of Applicable Known Risks (40%) 63%

1.1.1 Risks from literature and taxonomies are well covered (50%) 75%

Risks covered include the Tracked Categories of Biological and Chemical risks, Cybersecurity, and AI Self-improvement, plus Research Categories (i.e. risk domains that are monitored to a lesser extent), including nuclear and radiological risks and various loss of control risks such as long-range autonomy, sandbagging, autonomous replication and adaptation, and undermining safeguards. Breaking down loss of control risks in this way is commendable.

They exclude persuasion as a research or tracked category.

There is some mention of referencing literature through “internal research”, and risk identification “incorporates feedback from academic researchers”, though no specific, structured approach is given nor documents referenced.

The score is capped because 1.1.2 is not greater than 50% and persuasion is excluded.

Quotes:

“We evaluate whether frontier capabilities create a risk of severe harm through a holistic risk assessment process. This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (p. 4)

Tracked Categories include (pp. 5-6):

“Biological and Chemical: The ability of an AI model to accelerate and expand access to biological and chemical research, development, and skill-building, including access to expert knowledge and assistance with laboratory work.”
“Cybersecurity: The ability of an AI model to assist in the development of tools and executing operations for cyberdefense and cyberoffense.”
“AI Self improvement: The ability of an AI system to accelerate AI research, including to increase the system’s own capability.”

Research Categories include (p. 7):
“Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed)”
“Sandbagging: ability and propensity to respond to safety or capability evaluations in a way that significantly diverges from performance under real conditions, undermining the validity of such evaluations”
“Autonomous Replication and Adaptation: ability to survive, replicate, resist shutdown, acquire resources to maintain and scale its own operations, and commit illegal activities that collectively constitute causing severe harm (whether when explicitly instructed, or at its own initiative), without also utilizing capabilities tracked in other Tracked Categories.”
“Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.”
“Nuclear and Radiological: ability to meaningfully counterfactually enable the creation of a radiological threat or enable or significantly accelerate the development of or access to a nuclear threat while remaining undetected.”

 

1.1.2 Exclusions are clearly justified and documented (50%) 50%

The justification for excluding the research categories from becoming tracked categories is clear, namely that they “need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on”. To improve, this justification should refer to at least one of: academic literature/scientific consensus; internal threat modelling with transparency; or third-party validation, with named expert groups and reasons for their validation. That is, whilst they cite that “these capabilities either need more research and threat modeling before they can be rigorously measured” as justification, they should provide credible plans for how they are improving this threat modeling, or explain why the non-rigorous measurement options they have considered are not possible or helpful.

Some of their exclusion reasoning, however, is quite commendable. For instance, their justification for why nuclear and radiological capabilities are now a Research Category clearly links to risk models. Nonetheless, expert endorsement or more detailed reasoning would be an improvement.

They acknowledge that persuasion is no longer prioritised because “our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.” However, more detail is required for proper justification, for instance what criteria Persuasion does not fit and why they believe this.

Implicitly, their criteria for inclusion (plausible, measurable, severe, net new, and instantaneous or irremediable) give justification for when risks are not included. However, a more explicit link between excluded risks and the criteria they fail is needed. Further, their requirement for a risk to be “measurable” may be overly strict; lacking the capability evaluations to “measure capabilities that closely track the potential for the severe harm” does not necessarily mean the risk should be dismissed.

They do mention that they will “periodically review the latest research and findings for each Research Category”, but a more structured process should be given.

Quotes:

“AI Self-improvement (now a Tracked Category), Long-range Autonomy and Autonomous Replication and Adaptation (now Research Categories) are distinct aspects of what we formerly termed Model Autonomy. We have separated self-improvement because it presents a distinct plausible, net new, and potentially irremediable risk, namely that of a hard-to-track rapid acceleration in AI capabilities which could have hard-to-predict severely harmful consequences.
In addition, the evaluations we use to measure this capability are distinct from those applicable to Long-range Autonomy and Autonomous Replication and Adaptation. Meanwhile, while these latter risks’ threat models are not yet sufficiently mature to receive the scrutiny of Tracked Categories, we believe they justify additional research investment and could qualify in the future, so we are investing in them now as Research Categories.

“Nuclear and Radiological capabilities are now a Research Category. While basic information related to nuclear weapons design is available in public sources, the information and expertise needed to actually create a working nuclear weapon is significant, and classified. Further, there are significant physical barriers to success, like access to fissile material, specialized equipment, and ballistics. Because of the significant resources required and the legal controls around information and equipment, nuclear weapons development cannot be fully studied outside a classified context. Our work on nuclear risks also informs our efforts on the related but distinct risks posed by radiological weapons. We build safeguards to prevent our models from assisting with high-risk queries related to building weapons, and evaluate performance on those refusal policies as part of our safety process. Our analysis suggests that nuclear risks are likely to be of substantially greater severity and therefore we will prioritize research on nuclear-related risks. We will also engage with US national security stakeholders on how best to assess these risks.” (pp. 7–8)

“Within our wider safety stack, our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.” (p. 8)

“There are also some areas of frontier capability that do not meet the criteria to be Tracked Categories, but where we believe work is required now in order to prepare to effectively address risks of severe harms in the future. These capabilities either need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on to mitigate existing Tracked Category risks. We call these Research Categories” (p. 7)

“Tracked Categories are those capabilities which we track most closely, measuring them during each covered deployment and preparing safeguards for when a threshold level is crossed. We treat a frontier capability as a Tracked Category if the capability creates a risk that meets five criteria:
1. Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.
2. Measurable: We can construct or adopt capability evaluations that measure capabilities that closely track the potential for the severe harm.
3. Severe: There is a plausible threat model within the capability area that would create severe harm.
4. Net new: The outcome cannot currently be realized as described (including at that scale, by that threat actor, or for that cost) with existing tools and resources (e.g., available as of 2021) but without access to frontier AI.
5. Instantaneous or irremediable: The outcome is such that once realized, its severe harms are immediately felt, or are inevitable due to a lack of feasible measures to remediate.” (p. 4)

“We will periodically review the latest research and findings for each Research Category” (p. 7)

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 0%

1.2.1 Internal open-ended red teaming (70%) 0%

The framework doesn’t mention any procedures pre-deployment to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. emergence of an extended context length allowing improved zero shot learning changes the risk profile), and provide methodology, resources and required expertise.

The framework does mention that red-teaming is to be conducted by human experts, but not explicitly for the purpose of identifying unknown risks. It is also only required if a capability threshold is passed.

Quotes:

“The SAG [Safety Advisory Group] reviews the Capabilities Report and decides on next steps. These can include: […] Recommend deep dive research: This is appropriate if SAG needs additional evidence in order to make a recommendation.” (p. 9)

“Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (p. 8)

1.2.2 Third party open-ended red teaming (30%) 0%

The framework doesn’t mention any third-party procedures pre-deployment to identify novel risk domains or risk models for the frontier model. To improve, they should commit to an external process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. emergence of an extended context length allowing improved zero shot learning changes the risk profile), and provide methodology, resources and required expertise.

The framework does mention that red-teaming is to be conducted by human experts, but not explicitly for the purpose of identifying unknown risks. It is also only required if a capability threshold is passed.

Quotes:

“The SAG reviews the Capabilities Report and decides on next steps. These can include: […] Recommend deep dive research: This is appropriate if SAG needs additional evidence in order to make a recommendation.” (p. 9)

“Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (p. 8)

Third-party evaluation of tracked model capabilities: “If we deem that a deployment warrants deeper testing of Tracked Categories of capability (as described in Section 3.1), for example based on results of Capabilities Report presented to them, then when available and feasible, OpenAI will work with third-parties to independently evaluate models.” (p. 13)

1.3 Risk modeling (40%) 18%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 25%

The framework describes having ‘threat models’ for each Tracked Category (i.e. risk domain), though not for the Research Categories (“For each Tracked Category, we develop and maintain a threat model to identify specific risks of severe harms that could arise from the frontier capabilities in that domain”).

The fact that all Tracked Categories must be ‘Plausible’ indicates some risk modelling is being performed even for Research Categories, in order to determine if they should be Tracked Categories (“Plausible: it must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI”.)

The justification for keeping some risks as Research Categories, namely that they require more threat modelling, indicates awareness that risk models need to be developed for all areas of monitored risk. However, more detail on how they will achieve this precision should be given.

Details of risk models are not published, but there is some indication of an intention to share findings. There is a glimpse of the risk model for Biological threats: “Our evaluations test acquiring critical and sensitive information across the five stages of the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and Release.” However, more detail should be provided.

Quotes:

“[capabilities are Tracked Categories if they are] Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.” (p. 4)

“For each Tracked Category, we develop and maintain a threat model to identify specific risks of severe harms that could arise from the frontier capabilities in that domain” (p. 4)

“Our evaluations test acquiring critical and sensitive information across the five stages of the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and Release. These evaluations, developed by domain experts, cover things like how to troubleshoot the laboratory processes involved.”

“These [Research Category] capabilities either need more research and threat modeling before they can be measured […] [for these] we will take the following steps, both internally and in collaboration with external experts: Further developing the threat models for the area […] Sharing summaries of our findings with the public where feasible.” (pp. 6-7)

1.3.2 Risk modeling methodology (40%) 9%

1.3.2.1 Methodology precisely defined (70%) 10%

It is not clear what the methodology is from the framework, or that a particular methodology is followed. However, they do mention identifying causal pathways, which implies some methodology. More detail should be given.

Quotes:

“It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.” (p. 4)

“Capability thresholds concretely describe things an AI system might be able to help someone do or might be able to do on its own that could meaningfully increase risk of severe harm.” (p. 4)

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 0%

No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.

Quotes:

No relevant quotes found.

1.3.2.3 Prioritization of severe and probable risks (15%) 10%

For a risk area to be a tracked category, the capability must create a risk that is “Severe: There is a plausible threat model within the capability area that would create severe harm.” This suggests that severity is prioritised, and plausibility here suggests the risk model must have nonzero probability. However, these threat models are developed post-hoc – after deciding which categories to track: “For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain […]”

They then prioritise monitoring for High and Critical capabilities, implicitly defining these as those capabilities with higher probability x severity of harm: “High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm”; “Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector for severe harm with no ready precedent.”

However, there is minimal detail on how the severity and probability of risk models are determined, or on whether these results are published.

In addition, whether there is a “real risk” of “severe harm” is not explicitly determined by probabilities. The probability and the magnitude of harm should be explicitly estimated for each risk model.

Overall, there is an awareness that they should focus threat models on severe harms, but with little evidence of systematic prioritization among multiple risk models. Risk modelling is only completed after already deciding what to track. This is different from the required criterion of using prioritization of risk models to determine focus areas.
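
As an illustration of the explicit estimation this criterion asks for (our notation, not OpenAI's), each risk model could be assigned a prioritization score of the form

$$\text{Priority}(m) = P(\text{severe harm} \mid m) \times \text{Severity}(m),$$

where $P(\text{severe harm} \mid m)$ is the estimated probability that threat model $m$ is realised over a defined period, and $\text{Severity}(m)$ is the estimated magnitude of harm (e.g. expected fatalities or economic damage). Ranking risk models by such a score, before deciding which categories to track, would satisfy the criterion; the framework does not describe anything of this form.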

Quotes:

“For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain […] High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm. Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector for severe harm with no ready precedent.” (p. 4)

“Where we determine that a capability presents a real risk of severe harm, we may decide to monitor it as a Tracked Category or a Research Category.” (p. 4)

For a capability to be a Tracked Category (p. 4):
“Plausible: It must be possible to identify a causal pathway for a severe harm in the capability area, enabled by frontier AI.”
“Severe: There is a plausible threat model within the capability area that would create severe harm.”

 

1.3.3 Third party validation of risk models (20%) 25%

While “threat models are informed by […] specific information that we gather across OpenAI teams and external experts”, they are not validated by third parties. Indeed, risk models are only approved internally: “For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain and set corresponding capability thresholds that would lead to a meaningful increase in risk of severe harm. SAG [Safety Advisory Group] reviews and approves these threat models.” (p. 4)
“Informed by”, “in collaboration with”, and “gather information from” suggest consultation and input during the development of risk models, rather than independent validation of completed models. To improve, an explicit commitment to having third parties validate risk models should be made.

Quotes:

“For each Tracked Category, we develop and maintain a threat model identifying specific risks of severe harms that could arise from the frontier capabilities in that domain and sets corresponding capability thresholds that would lead to a meaningful increase in risk of severe harm. SAG [Safety Advisory Group] reviews and approves these threat models.” (p. 4)

“Threat models are informed both by our broader risk assessment process, and by more specific information that we gather across OpenAI teams and external experts.” (p. 4)

“For [Research Categories], in collaboration with external experts, we commit to further developing the associated threat models and advancing the science of capability measurement for the area, including by investing in the development of rigorous capability evaluations.” (p. 14)


2.1 Setting a Risk Tolerance (35%) 16%

2.1.1 Risk tolerance is defined (80%) 20%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 50%

There is a qualitative definition of scenarios which are implicitly ‘unacceptable’ levels of risk, under the Critical capability threshold. For instance, “Proliferating the ability to create a novel threat vector of the severity of a CDC Class A biological agent (i.e., high mortality, ease of transmission) could cause millions of deaths and significantly disrupt public life, with few available societal safeguards” implicitly states this risk as the risk tolerance.

To improve, they must set out the maximum amount of risk the company is willing to accept for each risk domain (though the tolerance need not differ between domains), ideally expressed in terms of probabilities and severity (economic damage, lives lost, etc.), and kept separate from KRIs.

Partial credit is given for the definition of “severe harm” as “the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage.” However, the capability thresholds are not explicitly linked to this proto-risk tolerance, and it should be more specific (e.g., specifying how many thousands of people).

Quotes:

“By “severe harm” in this document, we mean the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage.” (p. 1)

Some examples from Tracked Categories table, under the Critical category for ‘Associated risk of severe harm’ (p. 5):

“Proliferating the ability to create a novel threat vector of the severity of a CDC Class A biological agent (i.e., high mortality, ease of transmission) could cause millions of deaths and significantly disrupt public life, with few available societal safeguards.”

“Finding and executing end-to-end exploits for all software could lead to catastrophe from unilateral actors, hacking military or industrial systems, or OpenAI infrastructure. Novel cyber operations, e.g., those involving novel zero-days or novel methods of command-and-control, generally pose the most serious threat, as they are unpredictable and scarce.”

“A major acceleration in the rate of AI Self-improvement could rapidly increase the rate at which new capabilities and risks emerge, to the point where our current oversight practices are insufficient to identify and mitigate new risks, including risks to maintaining human control of the AI system itself.”

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 10%

The qualitative risk tolerances do not have quantitative probabilities, and are vague in description. The definition of severe harm implies some awareness of quantitative measurement, though this is used to classify critical capability thresholds rather than defined as a risk tolerance itself.

Quotes:

“High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm” (p. 4)

“Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector” (p. 4)

“Scalable evaluations have associated “indicative thresholds,” which are levels of performance that we have pre-determined to indicate that a deployment may have reached a capability threshold.” (p. 8)

“By “severe harm” in this document, we mean the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage.” (p. 1)

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

Whilst they mention the criterion of “severe harm” via “the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage”, this is still vague, and doesn’t apply as a specific risk tolerance for specific risks. None of the specific risks mention quantitative probabilities, and the implicit risk tolerances from capability thresholds are not fully quantitative either.
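
For contrast, a fully quantitative risk tolerance would take a form such as the following hypothetical statement (the figures are placeholders, not values OpenAI has committed to):

$$P(\text{deployment causes} \geq 1{,}000 \text{ deaths within one year}) \leq 10^{-4}.$$

Nothing of this form, pairing a quantitative severity with a quantitative probability, appears in the framework.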

Quotes:

“High capability thresholds mean capabilities that significantly increase existing risk vectors for severe harm” (p. 4)
“Critical capability thresholds mean capabilities that present a meaningful risk of a qualitatively new threat vector” (p. 4)

“Scalable evaluations have associated “indicative thresholds,” which are levels of performance that we have pre-determined to indicate that a deployment may have reached a capability threshold.” (p. 8)

Some examples from Tracked Categories table, under the Critical category for ‘Associated risk of severe harm’ (p. 5):

“Proliferating the ability to create a novel threat vector of the severity of a CDC Class A biological agent (i.e., high mortality, ease of transmission) could cause millions of deaths and significantly disrupt public life, with few available societal safeguards.”

“Finding and executing end-to-end exploits for all software could lead to catastrophe from unilateral actors, hacking military or industrial systems, or OpenAI infrastructure. Novel cyber operations, e.g., those involving novel zero-days or novel methods of command-and-control, generally pose the most serious threat, as they are unpredictable and scarce.”

“A major acceleration in the rate of AI Self-improvement could rapidly increase the rate at which new capabilities and risks emerge, to the point where our current oversight practices are insufficient to identify and mitigate new risks, including risks to maintaining human control of the AI system itself.”

2.1.2 Process to define the tolerance (20%) 0%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 0%

No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels.

Quotes:

No relevant quotes found.

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.

2.2 Operationalizing Risk Tolerance (65%) 29%

2.2.1 Key Risk Indicators (KRI) (30%) 33%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 50%

Each Tracked Category has two thresholds, which are specific. They appear to be grounded in risk modelling, but this is hard to verify given that the risk models are not clear. The Research Categories have less developed KRIs but are still well-defined. To improve, more precision is needed in all of the Research Categories, as well as in the High thresholds of Cybersecurity and Biological & Chemical; evidence is needed of the specific thresholds, e.g. of the “baseline of tools available in 2021” or “relative to those researchers’ 2024 baseline”. The Critical threshold for AI Self-improvement is especially good. In addition, KRIs should map directly to the evaluation tests performed.

Quotes:

From Tracked Categories, under ‘Capability threshold that could lead to the risk’, pp. 5-6:
Biological and chemical:
“[High] The model can provide meaningful counterfactual assistance (relative to unlimited access to baseline of tools available in 2021) to “novice” actors (anyone with a basic relevant technical background) that enables them to create known biological or chemical threats.”

“[Critical] The model can enable an expert to develop a highly dangerous novel threat vector (e.g., comparable to a novel CDC Class A biological agent) OR the model can be connected to tools and equipment to complete the full engineering and/or synthesis cycle of a regulated or novel biological threat without human intervention.”

Autonomous AI R&D:
“[High] The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.”

“[Critical] The model is capable of recursively self improving (i.e., fully automated AI Self-improvement), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months.”

Research Categories:
“Long-range Autonomy: ability for a model to execute a long-horizon sequence of actions sufficient to realize a “High” threat model (e.g., a cyberattack) without being directed by a human (including successful social engineering attacks when needed)” (p. 7). More can be found on page 7.

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 25%

The Critical capability threshold for AI Self-improvement, and to some extent the High capability thresholds for AI Self-improvement and Biological & Chemical, give baselines, meaning these KRIs can in principle be measured quantitatively. However, more detail on the actual thresholds implemented is needed. No quantitative thresholds are given for the others.
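
The Critical AI Self-improvement threshold is the clearest case of a threshold that is measurable in principle, since it fixes a speedup factor against a stated baseline:

$$\frac{T_{2024}}{T_{\text{model-assisted}}} \geq 5,$$

where $T_{2024}$ is the wall-clock time a generational model improvement took in 2024 and $T_{\text{model-assisted}}$ is the time with the model's assistance (the quoted example implies roughly 20 weeks reduced to 4 weeks), sustained for several months. Comparable baselines and factors are not given for the other thresholds.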

Quotes:

All the below are from Tracked Categories under ‘Capability threshold that could lead to the risk’, pp. 5-6:

High, Biological & Chemical: “[High] The model can provide meaningful counterfactual assistance (relative to unlimited access to baseline of tools available in 2021) to “novice” actors (anyone with a basic relevant technical background) that enables them to create known biological or chemical threats”

High, AI Self-improvement: “[High] The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.”

Critical, AI Self-improvement: “[Critical] The model is capable of recursively self improving (i.e., fully automated AI Self-improvement), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months.”

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 0%

The KRIs only mention model capabilities. They do mention monitoring and incident response, which could feasibly lead to KRIs which satisfy this criterion.

Quotes:

“– Monitoring and Incident Response: Monitor security and event logs continuously to detect, triage, and respond to security incidents rapidly by 24×7 on-call staff.” (p. 21)

2.2.2 Key Control Indicators (KCI) (30%) 32%

2.2.2.1 Containment KCIs (35%) 5%
2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 10%

They do not detail qualitative KCI thresholds for containment. Whilst they detail practices for ensuring security controls, and require these for High capability thresholds, they do not describe what would count as sufficient containment for that risk level. They only state that some standard is required: “As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require sufficient safeguards during development.” (p. 16) However, the purpose of a containment KCI is to precisely define what counts as “sufficient” in these contexts.

They also don’t specify what would be considered sufficient for the Critical threshold, despite this having instrumental effects if not met: “Until we have specified safeguards and security controls standards that would meet a Critical standard, halt further development” (p. 6)

However, they show understanding that different capability levels need different containment approaches.

Quotes:

“Require security controls meeting High standard (Appendix C.3)”, p. 6

“As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require sufficient safeguards during development.” (p. 16)

“Until we have specified safeguards and security controls standards that would meet a Critical standard, halt further development” (p. 6)

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 0%

There is no mention of quantitative thresholds for containment KCIs, i.e. measurements of the effectiveness of security controls.
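
For illustration (a hypothetical formulation, not one found in the framework), a quantitative containment KCI could take a form such as:

$$P(\text{model weights exfiltrated by a specified class of attacker within one year}) \leq p,$$

for a pre-specified value of $p$, supported by red-team exercises and audit evidence. No measurement target of this kind is stated for the security controls.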

Quotes:

No relevant quotes found.

2.2.2.2 Deployment KCIs (35%) 43%
2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 75%

There are three general deployment KCIs, i.e. targets that mitigations of risks from malicious users must reach, required for High capability models: “Require safeguards against misuse meeting High standard (Appendix C.1) before external deployment”. However, the actual threshold is still somewhat vague and unspecific; e.g. “sufficiently minimize” requires more detail. KCIs for Critical capabilities are also not defined: they state that “Until we have specified safeguards and security controls that would meet a Critical standard, halt further development”, but a “Critical standard” is left open to interpretation.

Nonetheless, the qualitative detail in the three deployment KCIs is commendable, showing nuance and expertise.

Quotes:

“Each capability threshold has a corresponding class of risk-specific safeguard guidelines under the Preparedness Framework. We use the following process to select safeguards for a deployment:

  • We first identify the plausible ways in which the associated risk of severe harm can come to fruition in the proposed deployment.
  • For each of those, we then identify specific safeguards that either exist or should be implemented that would address the risk.
  • For each identified safeguard, we identify methods to measure their efficacy and an efficacy threshold.” (p. 10)

“Potential claims:

  • Robustness: Malicious users cannot use the model to cause the severe harm because they cannot elicit the necessary capability, such as because the model is modified to refuse to provide assistance to harmful tasks and is robust to jailbreaks that would circumvent those refusals.
  • Usage Monitoring: If a model does not refuse and provides assistance to harmful tasks, monitors can stop or catch malicious users before they have achieved an unacceptable scale of harm, through a combination of automated and human detection and enforcement within an acceptable time frame.
  • Trust-based Access: The actors who gain access to the model are not going to use it in a way that presents an associated risk of severe harm under our threat model.” (p. 11)

“Safeguards should sufficiently minimize the risk of severe harm associated with misuse of the model’s capabilities. This can be done by establishing that all plausible known vectors of enabling severe harm are sufficiently addressed by one or more of the following claims:

  • Robustness: […]” (p. 16)

“Covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require safeguards that sufficiently minimize associated risks during development.” (p. 11)

“SAG can request further evaluation of the effectiveness of the safeguards to evaluate if the associated risk of severe harm is sufficiently minimized” (p. 11)

2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 10%

Whilst “we should ensure that harmful behaviors are detected by monitors with a high recall rate” (p. 19) demonstrates some awareness of quantitative deployment KCIs, no actual quantitative deployment KCI thresholds are given.
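
For illustration (hypothetical figures, not drawn from the framework), a quantitative deployment KCI corresponding to the Usage Monitoring claim could take a form such as:

$$\text{Recall} = \frac{\text{harmful uses detected by monitors}}{\text{all harmful uses in red-team evaluation}} \geq 0.99,$$

measured under a specified red-teaming protocol and within a specified response time. The framework gestures at “a high recall rate” but does not fix such a figure.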

Quotes:

“Red-teamers can train an arbitrarily adversarial agent or can pretend to be an agent in our setup and we should ensure that harmful behaviors are detected by monitors with a high recall rate” (p. 19)

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 50%

The framework provides desiderata which can be seen as proto assurance process KCIs, though they are somewhat vague (e.g. “the model consistently understands […] instructions” or “sufficiently addressed by one or more of the following claims” lack detail). It is also unclear whether they must all be covered in the Safeguards Report, and how often Safeguards Reports are required. The KCIs are also not explicitly linked to specific KRIs, though the document says they will be.

Table 5 lists some desiderata indicating further proto assurance process KCI thresholds. However, these require more development, such as specifying what constitutes “significant evidence” or “consistently takes action”.

Quotes:

“Safeguards should sufficiently minimize the risk associated with a misaligned model circumventing human control and oversight and executing severe harms. This can be done by establishing that all plausible known vectors of enabling severe harm are sufficiently addressed by one or more of the following claims:

  • Lack of Autonomous Capability: The model is not capable of carrying out tasks autonomously, including the risk of severe harm.
  • Value Alignment: The model consistently applies human values in novel settings (without any instructions) to avoid taking actions that cause harm, and has shown sufficiently minimal indications of misaligned behaviors like deception or scheming.
  • Instruction Alignment: The model consistently understands and follows user or system instructions, even when vague, and those instructions rule out pathways to causing severe harm.
  • Reliable and Robust System Oversight: Effective AI and human oversight of model actions detects and prevents execution of harm or subversion of safeguards.
  • System Architecture: The model can’t take actions that cause harm because it lacks access to output channels or mechanisms to persist sufficiently to execute the harm, due to containerization and restricted permissions.” (p. 18) See also Table 3, p. 11.

From Table 5, p. 19:
“Internal and external user reports, monitoring, or manual inspection for historical model deployments show no significant evidence of unexpected value alignment failures.”

“Large scale usage or testing in realistic settings should show the model consistently takes action aligned to the user’s instructions or goals (while maintaining value alignment and avoiding harm). Examples against would include trying to escape a container, lying to the user to appease them, etc. […] “Canary evaluations” which test model capabilities to bypass less complex, easier-to-exploit versions of our security controls, establishing that our implemented controls are robust”

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 25%

The process of presenting a Capabilities Report and a Safeguards Report is a proto pairing of KRI and KCI thresholds, as are the risk-specific safeguard guidelines for each Tracked Category capability threshold. Hence, the framework shows awareness of this concept and partial implementation. However, it does not provide explicit detail, and the linkage is only a ‘guideline’. Further, there is reference to the Safety Advisory Group making decisions about the level of risk of models based on these reports; an improvement would be providing detail on the criteria SAG will use to make its determinations.

Overall, more detail should be given on why, ex ante, the KCI thresholds chosen will be sufficient to keep residual risk below the risk tolerance, if satisfied. In addition, their marginal risk claim makes the residual risk tolerance contingent on other companies’. This does not follow the criterion; the required level of safeguards should be relative to their pre-determined risk tolerance.

Quotes:

“[We] evaluate the likelihood that severe harms could actually occur in the context of deployment, using threat models that take our safeguards into account.” (p. 3)

“We compile the information on the planned safeguards needed to minimize the risk of severe harm into a Safeguards Report. The Safeguards Report should include the following information:

  • Identified ways a risk of severe harm can be realized for the given deployment, each mapped to the associated security controls and safeguards
  • Details about the efficacy of those safeguards
  • An assessment on the residual risk of severe harm based on the deployment
  • Any notable limitations with the information provided” (p. 10)

“SAG is responsible for assessing whether the safeguards associated with a given deployment sufficiently minimize the risk of severe harm associated with the proposed deployment. The SAG will make this determination based on:

  • The level of capability in the Tracked Category based on the Capabilities Report.
  • The associated risks of severe harm, as described in the threat model and where needed, advice of internal or external experts
  • The safeguards in place and their effectiveness based on the Safeguards Report.
  • The baseline risk from other deployments, based on a review of any non-OpenAI deployments of models which have crossed the capability thresholds and any public evidence of the safeguards applied for those models.” (pp. 10-11)

“We recognize that another frontier AI model developer might develop or release a system with High or Critical capability in one of this Framework’s Tracked Categories and may do so without instituting comparable safeguards to the ones we have committed to. Such an action could significantly increase the baseline risk of severe harm being realized in the world, and limit the degree to which we can reduce risk using our safeguards. If we are able to rigorously confirm that such a scenario has occurred, then we could adjust accordingly the level of safeguards that we require in that capability area, but only if:

  • we assess that doing so does not meaningfully increase the overall risk of severe harm,
  • we publicly acknowledge that we are making the adjustment,
  • and, in order to avoid a race to the bottom on safety, we keep our safeguards at a level more protective than the other AI developer, and share information to validate this claim” (p. 12)

2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 25%

There is a clear statement that if the Critical safeguards threshold is not specified, then development will be halted. However, this only requires specification of the Critical safeguards, not actual proof that the safeguards are sufficient.

Further, halting is only triggered if models pass the Critical capability threshold; this permits the existence of a model with Critical-level capabilities but without sufficient safeguards or security controls. However, models may be critically dangerous during development, or before the critical capability is detected. In other words, a credible plan or process for pausing before critical capabilities manifest should be developed. Finally, detail should be added on when deployment would be halted, and the process for doing so.

Quotes:

For each of the critical thresholds of the tracked categories, pp. 5-6:
“Until we have specified safeguards and security controls that would meet a Critical standard, halt further development”

“SAG can find the safeguards do not sufficiently minimize the risk of severe harm and recommend potential alternative deployment conditions or additional or more effective safeguards that would sufficiently minimize the risk.” (p. 11)

“Models that have reached or are forecasted to reach Critical capability in a Tracked Category present severe dangers and should be treated with extreme caution.  Such models require additional safeguards (safety and security controls) during development, regardless of whether or when they are externally deployed. We do not currently possess any models that have Critical levels of capability, and we expect to further update this Preparedness Framework before reaching such a level with any model. Our approach to Critical capabilities will need to be robust to both malicious actors (either internal or external) and model misalignment risks. The SAG retains discretion over when to request deep dive evaluations of models whose scalable evaluations indicate that they may possess or may be nearing critical capability thresholds.” (p. 12)


3.1 Implementing Mitigation Measures (50%) 37%

3.1.1 Containment measures (35%) 40%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 50%

The framework describes in detail the security controls required for High capability models (Appendix C.3), though not for Critical capability models. Many measures remain high-level desiderata rather than operational measures, although some are more specific, as quoted below.

Quotes:

From appendix C.3, pp. 20-21:
“Adopt a layered security strategy, ensuring robust protection through multiple defensive barriers, including physical and datacenter security, network segmentation and controls, workload isolation, data encryption, and other overlapping and complementary security controls.”
“Employees must authenticate using multi-factor authentication (MFA) and managed devices meeting security baselines. Access must be logged and reviewed for detection and investigative purposes.”
“Integrate automated code analysis, formal security reviews, and penetration testing in engineering processes. Apply security reviews and validation to higher-sensitivity critical components prior to deployment.”

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 25%

The framework mentions that “internal and external assessments to validate [the security] controls are conducted regularly”, but gives no further detail on why the measures listed are likely to be sufficient to meet the containment KCI threshold. Further, their ‘sufficiency’ is ultimately determined by SAG (“reports are provided to OpenAI leadership”), rather than decided prior to the KRI threshold being passed.

Further, without quantified risk tolerance or probability estimates, there’s no foundation for proving containment adequacy.

Quotes:

“Continuous Monitoring and Validation: Ensure security threat models and updates inform where security and data privacy controls should be implemented, improved, and monitored to further reduce risk. Internal and external assessments to validate these controls are conducted regularly and reports are provided to OpenAI leadership.” (p. 20)

 

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 25%

Whilst the framework requires independent audits for High capability models, it does not describe a similar process for Critical capability models. Further, the process appears to validate already-implemented security protocols, whereas this criterion requires validation, in advance of implementation, of the case for why containment measures are sufficient to meet containment KCI thresholds.

Quotes:

In C.3 Security Controls: “Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors to ensure compliance with relevant standards and robustness against identified threats.” (p. 21)

 

3.1.2 Deployment measures (35%) 40%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 50%

Section C.1 in the Appendix details “potential safeguards” for models with High capability, without explicit commitment to implementing them: “the safeguards should not be construed as a definitive or comprehensive list of the safeguards we will or could apply to a given launch”. Nonetheless, the measures are defined for each KCI threshold, namely robustness, usage monitoring and trust-based access.

To improve, deployment measures must also be defined for the Critical capability.

Quotes:

From Table 4, p. 17:
“Robustness:

  • Training the model to refuse to help with high-risk tasks or to otherwise produce low risk responses
  • Unlearning or training-data filtering to erase specific risk-enabling knowledge from the model’s knowledge-base
  • Interpretability-based approaches, like activation steering, that directly edit models’ thinking at inference time
  • Jailbreak robustness, including through adversarial training, inference-time deliberation, and more”

More quotes may be found in Table 4.

“This Appendix provides illustrative examples of potential safeguards, and safeguard efficacy assessments that could be used to establish that we have sufficiently mitigated the risk of severe harm. The examples aim to provide insight on our thinking, but many of the techniques require further research. The safeguards should not be construed as a definitive or comprehensive list of the safeguards we will or could apply to a given launch.

As a reminder, covered systems that reach High capability must have safeguards that sufficiently minimize the associated risk of severe harm before they are deployed. Systems that reach Critical capability also require sufficient safeguards during development.” (p. 16)

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 25%

Section C.1 in the Appendix details “potential safeguard efficacy assessments”, without explicit commitment to implementing them. The framework does not provide proof or evidence ex ante that the deployment measures are sufficient. Instead, it relies on the Safety Advisory Group’s judgment at the time when High or Critical deployment standards need to be implemented, leaving the decision vulnerable to discretion.

Quotes:

From Table 4, p. 17:
“Robustness:

  • Automated and expert redteaming (identifying success per resources)
  • Prevalence of jailbreaks identified via monitoring and reports, in historical deployments
  • Results from public jailbreak bounties and results from private and public jailbreak benchmarks”

More quotes may be found in Table 4.

“The examples aim to provide insight on our thinking but should not be construed as a definitive checklist of the safeguards we will apply to a given launch.” (p. 10)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 25%

While they mention third-party stress testing of safeguards, this is not specific to deployment measures, and appears optional.

Quotes:

“Third-party stress testing of safeguards: If we deem that a deployment warrants third party stress testing of safeguards and if high quality third-party testing is available, we will work with third parties to evaluate safeguards. We may seek this out in particular for models that are over a High capability threshold.” (p. 13)

“Independent expert opinions for evidence produced to SAG: The SAG may opt to get independent expert opinion on the evidence being produced to SAG. The purpose of this input is to add independent analysis from individuals or organizations with deep expertise in domains of relevant risks (e.g., biological risk). If provided, these opinions will form part of the analysis presented to SAG in making its decision on the safety of a deployment. These domain experts may not necessarily be AI experts and their input will form one part of the holistic evidence that SAG reviews.” (p. 13)

3.1.3 Assurance processes (30%) 30%

3.1.3.1 Credible plans towards the development of assurance properties (40%) 25%

The framework mentions a commitment to developing assurance processes for Critical capabilities. However, it does not provide further detail on how they will achieve this, or by what point it must be in place (i.e. it is unclear whether assurance processes must be solidified before or after a model has been deemed to possess Critical-level capabilities). Further, their commitment is only to “update” the framework, meaning their plans, or the KCIs/KRIs necessitating Critical-level assurance, may themselves be modified.

Quotes:

“Models that have reached or are forecasted to reach Critical capability in a Tracked Category present severe dangers and should be treated with extreme caution. Such models require additional safeguards (safety and security controls) during development, regardless of whether or when they are externally deployed. We do not currently possess any models that have Critical levels of capability, and we expect to further update this Preparedness Framework before reaching such a level with any model. Our approach to Critical capabilities will need to be robust to both malicious actors (either internal or external) and model misalignment risks. The SAG retains discretion over when to request deep dive evaluations of models whose scalable evaluations indicate that they may possess or may be nearing critical capability thresholds.” (p. 12)

 

3.1.3.2 Evidence that the assurance properties are enough to achieve their corresponding KCI thresholds (40%) 50%

Whilst they detail safeguard efficacy assessments, which is commendable, these are only “potential”. In addition, their descriptions remain more high-level than precise, and do not cover the Critical case. To improve, they could describe what evidence would be sufficient to show that the safeguards against misalignment meet the required KCI threshold.

Quotes:

“The table below illustrates possible examples of safeguards and safeguard efficacy assessments we could consider to claim we have sufficiently minimized the risks of severe harm associated with High level of capability under any of our Tracked Categories” (p. 19). The table (Table 5) goes on to define “potential safeguard efficacy assessments” for each category. For instance,
“Reliable and Robust System Oversight:

  • Red-teamers can train an arbitrarily adversarial agent or can pretend to be an agent in our setup and we should ensure that harmful behaviors are detected by monitors with a high recall rate
  • Monitor escalations are either outright blocked or, in empirical studies, users with the relevant level of expertise were shown to reliably understand and act on monitor flags before significant harm occurred
  • SLAs for asynchronous monitors to analyze and detect harmful agent actions posthoc”

More examples can be found in Table 5.

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 10%

The Framework does not explicitly outline the assumptions that must hold for the assurance processes to be effective. However, they demonstrate some implicit awareness that “deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code” could all undermine these KCI measures. To improve, the framework should detail the key technical assumptions necessary for the assurance processes to meet the KCI threshold, and provide evidence for why these assumptions are justified.

Quotes:

“Research Category: Undermining Safeguards: ability and propensity for the model to act to undermine safeguards placed on it, including e.g., deception, colluding with oversight models, sabotaging safeguards over time such as by embedding vulnerabilities in safeguards code, etc.
Potential response: If a model has High or Critical capabilities in any of the Tracked Categories, require the Safeguards case to be robust to the discovered capability and/or propensity” (p. 7, under ‘Potential response’ to Research Category “Undermining Safeguards” in Table 2.)

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 39%

3.2.1 Monitoring of KRIs (40%) 36%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 90%

The framework outlines multiple elicitation strategies and commits to fulfilling this criterion almost word for word. The elicitation methods detailed show nuance and expertise. To improve, the framework could include measurable information, such as how much compute is used for fine-tuning. More detail could also be added on which elicitation methods they anticipate different threat actors would use, under realistic settings, to further justify their elicitation approach.

Quotes:

“Our evaluations are intended to approximate the full capability that the adversary contemplated by our threat model could extract from the deployment candidate model, including by using the highest capability tier of system settings, using a version of the model that has a negligible rate of safety-based refusals on our Tracked Category capability evaluations (which may require a separate model variant), and with the best presently-available scaffolds. These measures are taken to approximate the high end of expected elicitation by threat actors attempting to misuse the model, and should be tailored depending on the level of expected access (e.g., doing finetuning if the weights will be released). Nonetheless, given the continuous progress in model scaffolding and elicitation techniques, we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse. We incorporate this uncertainty into our assessments. We monitor the technical landscape for changes to the elicitation techniques and best practices, and reassess our evaluations as needed.” (p. 8)

3.2.1.2 Evaluation frequency (25%) 0%

There is no mention of evaluation frequency defined in terms of relative increases in effective compute.

Quotes:

No relevant quotes found.

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 25%

There is some recognition of how post-training enhancements can factor into capability assessments, but this description remains high level.

The commitment to “monitor the technical landscape for changes to the elicitation techniques and best practices, and reassess our evaluations as needed” is vague; it is not clear how evaluations are “reassessed” based on changes in best practices. To improve, an explicit commitment to adopt best practices should be given, or forecasting exercises could be completed to justify their assumptions about the rate of progress in post-training enhancements. However, “we incorporate this uncertainty into our assessments”, whilst vague, shows partial implementation of factoring in uncertainty about future progress in post-training enhancements.

Importantly, more detail could be provided on precisely how post-training enhancements are factored into capability assessments, for instance the size of the “uncertainty” or the safety buffer applied to account for expected progress in post-training enhancements.

Further, more detail could be added on how they account for the way post-training enhancements’ risk profiles change with model architecture: post-training enhancements are much more scalable with reasoning models, as inference compute can often be scaled to improve capabilities.

Quotes:

“Our evaluations are intended to approximate the full capability that the adversary contemplated by our threat model could extract from the deployment candidate model, including by using the highest capability tier of system settings, using a version of the model that has a negligible rate of safety-based refusals on our Tracked Category capability evaluations (which may require a separate model variant), and with the best presently-available scaffolds. These measures are taken to approximate the high end of expected elicitation by threat actors attempting to misuse the model, and should be tailored depending on the level of expected access (e.g., doing finetuning if the weights will be released). Nonetheless, given the continuous progress in model scaffolding and elicitation techniques, we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse. We incorporate this uncertainty into our assessments. We monitor the technical landscape for changes to the elicitation techniques and best practices, and reassess our evaluations as needed.” (p. 8)

3.2.1.4 Vetting of protocols by third parties (15%) 10%

The framework demonstrates only a discretionary commitment to third-party vetting of evaluation protocols. They do not have a specific structure in place for regularly vetting capability assessments by third parties, but they do indicate that they measure Research Category capabilities in collaboration with external experts. They also mention a general commitment to soliciting expert opinion on the overall holistic risk assessment process.

Quotes:

“We evaluate whether frontier capabilities create a risk of severe harm through a holistic risk assessment process. This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (p. 4)

“We call these Research Categories, and in these areas we will take the following steps, both internally and in collaboration with external experts:

  • Further developing the threat models for the area,
  • Advancing the science of capability measurement in the area and investing towards the development of rigorous evaluations (which could be achieved internally or via partnerships), and
  • Sharing summaries of our findings with the public where feasible.” (pp. 5-6)

Deeper capability assessments: “Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (p. 8)

3.2.1.5 Replication of evaluations by third parties (15%) 25%

The framework gives some recognition of evaluations being conducted independently by third parties, but only if deemed necessary. Further, they only commit to “work[ing] with” these parties. They do not explicitly commit in the document to have any evaluations replicated, unless it is part of a deeper capability assessment (‘Deep Dive’) requested by the Safety Advisory Group (i.e., at OpenAI leadership’s discretion).

Quotes:

“Third-party evaluation of tracked model capabilities: If we deem that a deployment warrants deeper testing of Tracked Categories of capability (as described in Section 3.1), for example based on results of Capabilities Report presented to them, then when available and feasible, OpenAI will work with third-parties to independently evaluate models.” (p. 13)

Deeper capability assessments: “Deep Dives: designed to provide additional evidence validating the scalable evaluations’ findings on whether a capability threshold has been crossed. These may include a wide range of evidence gathering activities, such as human expert red-teaming, expert consultations, resource-intensive third party evaluations (e.g., bio wet lab studies, assessments by independent third party evaluators), and any other activity requested by SAG.” (p. 8)

3.2.2 Monitoring of KCIs (40%) 43%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 50%

There is some reference to monitoring systems, but without explicit justification that KCI thresholds will not be crossed unnoticed, nor a detailed description of the monitoring methodology or assurance that monitoring is ongoing.

There is implicit justification that KCI thresholds are sufficient in the safeguards for misaligned models. For instance, a “monitor AI supervises agent actions to detect agent in the case of trying to take harmful actions” and “agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm”. This shows that monitoring is being conducted in order to satisfy the assurance process KCI. However, this monitoring is not specifically for measuring mitigation effectiveness; rather, it is to “review [agent] actions for evidence of harm”, which may not detect when KCI thresholds are crossed.

Indeed, there are multiple “potential safeguard efficacy assessments” detailed in the framework, e.g., “Redteamer success creating a certain number of accounts without detection (success per resources)”. However, none of these efficacy assessments is ongoing; instead, they appear to be conducted only when drafting the Safeguards Report. Further, they are only “potential” assessments.

To improve, the framework should describe systematic, ongoing monitoring that tracks mitigation effectiveness continuously, so that the KCI thresholds continue to be met when required.

Quotes:

“Reliable and Robust System Oversight:

  • Monitor AI supervises agent actions to detect agent in the case of trying to take harmful actions, scheming, or similar
  • System forces human approval of agent actions (either by default, or by escalation of the monitor)
  • Model’s reasoning is provided faithfully and interpretably to humans to review
  • Agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm” (p. 19)

See Tables 4 and 5, pages 18 and 19, “Potential safeguard efficacy assessments”.

3.2.2.2 Vetting of protocols by third parties (30%) 50%

The framework demonstrates only a discretionary commitment to third-party vetting of KCI protocols. Beyond the quotes below, they do not have a specific structure in place for regularly vetting safeguards assessments by third parties. They therefore do not explicitly commit to third-party vetting of KCI protocols, except for containment KCIs.

Quotes:

“Independent expert opinions for evidence produced to SAG: The SAG may opt to get independent expert opinion on the evidence being produced to SAG. The purpose of this input is to add independent analysis from individuals or organizations with deep expertise in domains of relevant risks (e.g., biological risk). If provided, these opinions will form part of the analysis presented to SAG in making its decision on the safety of a deployment. These domain experts may not necessarily be AI experts and their input will form one part of the holistic evidence that SAG reviews.” (p. 13)

“SAG is responsible for assessing whether the safeguards associated with a given deployment sufficiently minimize the risk of severe harm associated with the proposed deployment. The SAG will make this determination based on: […] The associated risks of severe harm, as described in the threat model and where needed, advice of internal or external experts.” (p. 10)

“Continuous Monitoring and Validation: Ensure security threat models and updates inform where security and data privacy controls should be implemented, improved, and monitored to further reduce risk. Internal and external assessments to validate these controls are conducted regularly and reports are provided to OpenAI leadership.” (p. 20)

“Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors to ensure compliance with relevant standards and robustness against identified threats.” (p. 21)

“Monitoring and Incident Response: Monitor security and event logs continuously to detect, triage, and respond to security incidents rapidly by 24×7 on-call staff.” (p. 21)

 

3.2.2.3 Replication of evaluations by third parties (30%) 25%

The framework gives some recognition of evaluations being conducted independently by third parties, but only if deemed necessary. Further, they only commit to “work[ing] with” these parties. They do not explicitly commit in the document to have any evaluations replicated.

Quotes:

“Third-party stress testing of safeguards: If we deem that a deployment warrants third party stress testing of safeguards and if high quality third-party testing is available, we will work with third parties to evaluate safeguards. We may seek this out in particular for models that are over a High capability threshold.” (p. 13)

3.2.3 Transparency of evaluation results (10%) 64%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 75%

There are commitments to share evaluation results with the public if models are deployed. However, they do not commit to alerting any stakeholders if or when Critical capabilities are reached.

Quotes:

“Public disclosures: We will release information about our Preparedness Framework results in order to facilitate public awareness of the state of frontier AI capabilities for major deployments. This published information will include the scope of testing performed, capability evaluations for each Tracked Category, our reasoning for the deployment decision, and any other context about a model’s development or capabilities that was decisive in the decision to deploy. Additionally, if the model is beyond a High threshold, we will include information about safeguards we have implemented to sufficiently minimize the associated risks. Such disclosures about results and safeguards may be redacted or summarized where necessary, such as to protect intellectual property or safety.” (p. 12)

“Transparency in Security Practices: Ensure security findings, remediation efforts, and key metrics from internal and independent audits are periodically shared with internal stakeholders and summarized publicly to demonstrate ongoing commitment and accountability.” (p. 21)

“Internal Transparency. We will document relevant reports made to the SAG and of SAG’s decision and reasoning. Employees may also request and receive a summary of the testing results and SAG recommendation on capability levels and safeguards (subject to certain limits for highly sensitive information).” (p. 12)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

No commitment to permitting the reports that detail the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 10%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 10%

There is some indication of monitoring; however, it is not explicitly aimed at gaining information on novel risk profiles. To improve, such a process should be detailed, for instance by building on the current monitoring infrastructure.

They do mention that monitoring should be conducted to establish that there is “no significant evidence of unexpected value alignment failures”, as a safeguard efficacy assessment. Partial credit is given here for the use of “unexpected”, as this could be further developed to analyse novel risk profiles.

Quotes:

“Internal and external user reports, monitoring, or manual inspection for historical model deployments show no significant evidence of unexpected value alignment failures” (p. 19)

“Prevalence of jailbreaks identified via monitoring and reports, in historical deployments” (p. 17)

“Expanding human monitoring and investigation capacity to track capabilities that pose a risk of severe harm, and developing data infrastructure and review tools to enable human investigations” (p. 17)

“Agent actions are all logged in an uneditable database, and asynchronous monitoring routines review those actions for evidence of harm” (p. 19)

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 10%

There is a commitment to developing threat models for some of the Research Categories. However, this is not explicitly linked to incorporating novel, previously unanticipated risks. To improve, encountering a possibly novel risk profile in a model should trigger risk modelling exercises to analyse how the finding may impact all other risk models.

They do mention that if a capability “presents a real risk of severe harm, we may decide to monitor it as a Tracked Category or a Research Category”. Whilst this remains general, partial credit is given here for having some reference to incorporating additional risks – noting that “a capability” could refer to any capability.

Quotes:

“Where we determine that a capability presents a real risk of severe harm, we may decide to monitor it as a Tracked Category or a Research Category.” (p. 4)

“There are also some areas of frontier capability that do not meet the criteria to be Tracked Categories, but where we believe work is required now in order to prepare to effectively address risks of severe harms in the future. These capabilities either need more research and threat modeling before they can be rigorously measured, or do not cause direct risks themselves but may need to be monitored because further advancement in this capability could undermine the safeguards we rely on to mitigate existing Tracked Category risks.” (p. 6)
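For orientation, the headline percentages in this section are consistent with rounded weighted averages of the sub-criterion scores listed in the headings above. A minimal sketch of that aggregation, assuming simple weighted averaging with rounding to whole percentages (the helper function is illustrative):

```python
def weighted_score(items):
    """Weighted average of (weight, score) pairs; weights are assumed to sum to 1."""
    return sum(w * s for w, s in items)

# 3.2.1 Monitoring of KRIs: 0.30*90 + 0.25*0 + 0.15*25 + 0.15*10 + 0.15*25 = 36.0
kri_monitoring = weighted_score([
    (0.30, 90),   # 3.2.1.1 elicitation comprehensiveness
    (0.25, 0),    # 3.2.1.2 evaluation frequency
    (0.15, 25),   # 3.2.1.3 post-training enhancements
    (0.15, 10),   # 3.2.1.4 vetting of protocols by third parties
    (0.15, 25),   # 3.2.1.5 replication of evaluations by third parties
])

# 3.2 Continuous monitoring: 0.40*36 + 0.40*43 + 0.10*64 + 0.10*10 = 39.0
continuous_monitoring = weighted_score([
    (0.40, round(kri_monitoring)),  # 3.2.1 monitoring of KRIs
    (0.40, 43),                     # 3.2.2 monitoring of KCIs
    (0.10, 64),                     # 3.2.3 transparency of evaluation results
    (0.10, 10),                     # 3.2.4 monitoring for novel risks
])

print(round(kri_monitoring), round(continuous_monitoring))  # 36 39
```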


4.1 Decision-making (25%) 34%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 10%

The framework states that the CEO or a designated person is the decision-maker, but it is unclear whether this is assigned on a risk-by-risk basis, or how often risk ownership is delegated to someone other than the CEO.

Quotes:

“OpenAI Leadership, i.e., the CEO or a person designated by them, is responsible for: Making all final decisions, including accepting any residual risks and making deployment go/no-go decisions, informed by SAG’s recommendations. Resourcing the implementation of the Preparedness Framework (e.g., additional work on safeguards where necessary).” (p. 15)

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No mention of a management risk committee.

Quotes:

No relevant quotes found.

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 75%

The company outlines clear protocols for its decision-making, including who makes the decisions and on what basis. It specifies its use of residual risk (net of safeguards). It could improve further by being clearer on when decisions are made and whether and when they are revisited.

Quotes:

“SAG then has the following decision points: 1. SAG can find that it is confident that the safeguards sufficiently minimize the associated risk of severe harm for the proposed deployment, and recommend deployment. 2. SAG can request further evaluation… 3. SAG can find the safeguards do not sufficiently minimize the risk…The SAG will strive to recommend further actions that are as targeted and non-disruptive as possible while still mitigating risks of severe harm. All of SAG’s recommendations will go to OpenAI Leadership for final decision-making in accordance with the decision-making practices outlined in Appendix B.” (p. 11)

“OpenAI Leadership, i.e., the CEO or a person designated by them, is responsible for: Making all final decisions, including accepting any residual risks and making deployment go/no-go decisions, informed by SAG’s recommendations. Resourcing the implementation of the Preparedness Framework (e.g., additional work on safeguards where necessary).” (p. 15)

4.1.4 The company has defined escalation procedures in case of incidents (25%) 50%

The framework has some detail on what is to happen in the case of a rapid change in risk level, but does not elaborate much.

Quotes:

“Fast-track. In the rare case that a risk of severe harm rapidly develops (e.g., there is a change in our understanding of model safety that requires urgent response), we can request a fast track for the SAG to process the report urgently. The SAG Chair should also coordinate with OpenAI Leadership for immediate reaction as needed to address the risk.” (p. 15)

4.2 Advisory and Challenge (20%) 48%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 0%

No mention of an executive risk officer.

Quotes:

No relevant quotes found.

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 90%

The Safety Advisory Group (SAG) plays this role, and its responsibilities are described in detail.

Quotes:

“The Safety Advisory Group (SAG) is responsible for: Overseeing the effective design, implementation, and adherence to the Preparedness Framework in partnership with the safety organization leader. For each deployment in scope under the Preparedness Framework, reviewing relevant reports and all other relevant materials and assessing of the level of Tracked Category capabilities and any post-safeguards residual risks. For each deployment under the Preparedness Framework, providing recommendations on potential next steps and any applicable risks to OpenAI Leadership, as well as rationale. Making other recommendations to OpenAI Leadership on longer-term changes or investments that are forecasted to be necessary for upcoming models to continue to keep residual risks at acceptable levels.” (p. 15)

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 75%

The framework outlines a fairly detailed system for tracking and monitoring risks, at least in terms of capability evaluations. To improve, further detail could be provided on other risk indicators and how risk information is aggregated and processed for a holistic view.

Quotes:

“We invest deeply in developing or adopting new science-backed evaluations that provide high precision and high recall indications of whether a covered system has reached a capability threshold in one of our Tracked Categories.” (p. 8)

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 50%

The Safety Advisory Group (SAG) partly plays this role. However, it is unclear how much challenge it offers to management. The framework specifies explicitly that “OpenAI Leadership can also make decisions without the SAG’s participation”.

Quotes:

“The Safety Advisory Group (SAG), including the SAG Chair, provides a diversity of perspectives to evaluate the strength of evidence related to catastrophic risk and recommend appropriate actions.” (p. 15)

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 75%

The framework clearly outlines risk information to be gathered and shared with management. To improve further, the company should provide more detail on these reports and on how they describe risk levels.

Quotes:

“The results of these evaluations… are compiled into a Capabilities Report that is submitted to the SAG.” (p. 9)
“We compile the information on the planned safeguards needed to minimize the risk of severe harm into a Safeguards Report.”(p. 10)

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.

4.3 Audit (20%) 38%

4.3.1 The company has an internal audit function involved in AI governance (50%) 0%

No mention of an internal audit function.

Quotes:

No relevant quotes found.

4.3.2 The company involves external auditors (50%) 75%

The framework includes several mentions of third-party auditors for security and controls. To improve, these audits could be applied more broadly.

Quotes:

“Independent Security Audits: Ensure security controls and practices are validated regularly by third-party auditors”. (p. 21)
“Third-party stress testing of safeguards: If we deem that a deployment warrants third party stress testing of safeguards and if high quality third-party testing is available, we will work with third parties to evaluate safeguards.” (p. 13)

4.4 Oversight (20%) 45%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 90%

The framework specifies that there is a dedicated Board committee for safety and security.

Quotes:

“The Safety and Security Committee (SSC) of the OpenAI Board of Directors will be given visibility into processes, and can review decisions and otherwise require reports and information from OpenAI Leadership as necessary to fulfill the Board’s oversight role. Where necessary, the Board may reverse a decision and/or mandate a revised course of action.” (p. 15)

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 0%

No mention of any additional governance bodies.

Quotes:

No relevant quotes found.

4.5 Culture (10%) 15%

4.5.1 The company has a strong tone from the top (33.3%) 25%

The framework includes a commitment to safety. However, it does not go into detail on the risks that are present and how they need to be balanced with benefits and AI capabilities.

Quotes:

“OpenAI’s mission is to ensure that AGI (artificial general intelligence) benefits all of humanity. To pursue that mission, we are committed to safely developing and deploying highly capable AI systems”. (p. 1)

4.5.2 The company has a strong risk culture (33.3%) 10%

The framework mentions some possibility for employees to receive summary information regarding risks. However, this seems somewhat limited and should be made more comprehensive. The framework, in its change log, also states that the company is moving away from safety drills, which does not seem aligned with best practice.

Quotes:

“Internal Transparency. We will document relevant reports made to the SAG and of SAG’s decision and reasoning. Employees may also request and receive a summary of the testing results and SAG recommendation on capability levels and safeguards (subject to certain limits for highly sensitive information).” (p. 12)

“Deprioritize safety drills, as we are shifting our attention to a more durable approach of continuously red-teaming and assessing the effectiveness of our safeguards.” (p. 14)

4.5.3 The company has a strong speak-up culture (33.3%) 10%

The framework includes a “Raising Concerns Policy”. However, to improve the score, it would need to include guarantees of anonymity and protection against retaliation.

Quotes:

“Noncompliance. Any employee can raise concerns about potential violations of this policy, or about its implementation, via our Raising Concerns Policy. We will track and appropriately investigate any reported or otherwise identified potential instances of noncompliance with this policy, and where reports are substantiated, will take appropriate and proportional corrective action.” (p. 12)

4.6 Transparency (5%) 53%

4.6.1 The company reports externally on what their risks are (33.3%) 75%

The framework states the risks in scope and includes commitments to public transparency regarding the risks and their mitigation. Further information could be provided on the process of selecting these specific risks and what other risks have been considered.

Quotes:

“Public disclosures: We will release information about our Preparedness Framework results in order to facilitate public awareness of the state of frontier AI capabilities for major deployments. This published information will include the scope of testing performed, capability evaluations for each Tracked Category, our reasoning for the deployment decision, and any other context about a model’s development or capabilities that was decisive in the decision to deploy. Additionally, if the model is beyond a High threshold, we will include information about safeguards we have implemented to sufficiently minimize the associated risks. Such disclosures about results and safeguards may be redacted or summarized where necessary, such as to protect intellectual property or safety.” (p. 12)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 75%

The framework clearly states the governance mechanisms, in a section on “internal governance” under “building trust”.

Quotes:

“An internal, cross-functional group of OpenAI leaders called the Safety Advisory Group (SAG) oversees the Preparedness Framework and makes expert recommendations on the level and type of safeguards required for deploying frontier capabilities safely and securely. OpenAI Leadership can approve or reject these recommendations, and our Board’s Safety and Security Committee provides oversight of these decisions.” (p. 3)

4.6.3 The company shares information with industry peers and government bodies (33.3%) 10%

The framework mentions working with, e.g., the Frontier Model Forum and the government, but only as inputs. To gain a higher score, the company would need to specify what information would be shared with them.

Quotes:

“Heighten safeguards (and consider further actions) in consultation with appropriate US government actors, accounting for the complexity of classified information handling.” (p. 7)
“This process draws on our own internal research and signals, and where appropriate incorporates feedback from academic researchers, independent domain experts, industry bodies such as the Frontier Model Forum, and the U.S. government and its partners, as well as relevant legal and policy mandates.” (p. 4)
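Read the same way, the six section scores above (4.1 through 4.6, with the weights given in their headings) combine into a Risk Governance figure; this is an illustrative calculation under the same rounded weighted-average assumption:

```python
# 4.1-4.6 section scores, weighted as stated in the headings above
governance = (0.25 * 34    # 4.1 Decision-making
              + 0.20 * 48  # 4.2 Advisory and Challenge
              + 0.20 * 38  # 4.3 Audit
              + 0.20 * 45  # 4.4 Oversight
              + 0.10 * 15  # 4.5 Culture
              + 0.05 * 53)  # 4.6 Transparency
print(round(governance, 2))  # 38.85, i.e. roughly 39%
```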
