DeepMind

Weak 1.0/5

Risk Identification: 22%
Risk Analysis and Evaluation: 18%
Risk Treatment: 27%
Risk Governance: 12%
Up to date as of October 2025
Best in class

  • Google DeepMind stands out for their robust descriptions of safety cases, including many different factors which feed into their risk determinations.
Overview
Highlights relative to others
  • Uniquely includes the risk category of instrumental reasoning, with recognition that automated monitoring may not be sufficient as a mitigation.
  • References the use of a safety case, displaying stronger recognition of the need to prove the safety of models.
  • Covers harmful manipulation.
Weaknesses relative to others
  1. Lacking some key governance measures, such as risk owners, internal audit, and an executive responsible for the risk management process.
  2. Escalation procedures are not specified.
  3. No details of a strong speak-up culture or external information sharing.
  4. Risk tolerance and KRIs are particularly vague.
Changes

Compared to their Frontier Safety Framework Version 1.0, they:

  1. Added an explicit dependency on industry-wide adoption of these practices before mitigations are committed to. This is similar to a marginal risk clause, which detracts from the spirit of risk management.
  2. Removed the target date for implementation. This detracts from having credible plans to develop assurance processes (and other mitigations).
  3. Removed the set cadence for evaluations, now instead opting for “regular” evaluations.

1.1 Classification of Applicable Known Risks (40%) 43%

1.1.1 Risks from literature and taxonomies are well covered (50%) 75%

Risk domains covered include CBRN, Cyber, Machine Learning R&D, harmful manipulation, and instrumental reasoning. More justification could be given for why they focus on instrumental reasoning as the main metric of loss of control risks as opposed to other metrics of loss of control, though it is commendable they are breaking down loss of control risks into more measurable risk areas for their models.

There is a reference to “early research” informing which domains of risk they focus on. There is no further justification for why they selected these domains; to improve, they could include the documents which informed their risk identification process. However, they do note that their Framework overall is informed by other frameworks, which they link, showing awareness of the importance of linking to the wider literature.

Note that criterion 1.1.2 scores below 50%, and persuasion is excluded.

Quotes:

“The Framework is informed by the broader conversation on Frontier AI Safety Frameworks.” (p. 2) followed by Footnote 1: “See https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety, https://metr.org/faisc, https://www.anthropic.com/news/anthropics-responsible-scaling-policy, https://openai.com/index/updating-our-preparedness-framework/, https://www.frontiermodelforum.org/publications/#technical-reports”

“The Framework addresses misuse risk, risks from machine learning research and development (ML R&D), and misalignment risk.” (p. 2)

“We describe three sets of CCLs: misuse CCLs, machine learning R&D CCLs, and misalignment CCLs. For misuse risk, we define CCLs in the following risk domains where the misuse of model capabilities may result in severe harm:

  • CBRN: Risks of models assisting in the development, preparation, and/or execution of a chemical, biological, radiological, or nuclear (“CBRN”) threat.
  • Cyber: Risks of models assisting in the development, preparation, and/or execution of a cyber attack.
  • Harmful Manipulation: Risks of models with high manipulative capabilities potentially being misused in ways that could reasonably result in large scale harm.

For machine learning R&D risk, we define CCLs that identify when ML R&D capabilities in our models may, if not properly managed, reduce society’s overall ability to manage AI risks. Such capabilities may serve as a substantial cross-cutting risk factor for other pathways to severe harm. For misalignment risk, we outline an exploratory approach that focuses on detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied. Most CCLs define one important component of our risk acceptance criteria. Because the CCLs for misalignment risk are exploratory and intended for illustration only, we do not associate them with explicit risk acceptance criteria.” (p. 4)

“As part of our broader research into frontier AI models, we continue to assess whether there are other risk domains where severe risks may arise and will update our approach as appropriate.” (p. 5)

“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:

  • Update our risk domains and CCLs, where necessary.
  • Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding. The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)

1.1.2 Exclusions are clearly justified and documented (50%) 10%

They justify in a footnote that misalignment is excluded from the typical risk identification (i.e. risk modelling) procedure due to its “exploratory nature”. However, more justification here could be given for why they believe this. To improve, justification could refer to at least one of: academic literature/scientific consensus; internal threat modelling with transparency; third-party validation, with named expert groups and reasons for their validation.

There is no justification for why other risks, such as other forms of loss of control risks like autonomy or autonomous self-replication, have not been considered.

Quotes:

Footnote 4: “We exclude misalignment risk from this list of domains because of its exploratory nature.” (p. 5)

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 0%

1.2.1 Internal open-ended red teaming (70%) 0%

The framework doesn’t mention any procedures pre-deployment to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. emergence of an extended context length allowing improved zero shot learning changes the risk profile), and provide methodology, resources and required expertise.

Quotes:

No relevant quotes found.

1.2.2 Third party open-ended red teaming (30%) 0%

The framework doesn’t mention any third-party procedures pre-deployment to identify novel risk domains or risk models for the frontier model. To improve, they should commit to an external process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. emergence of an extended context length allowing improved zero shot learning changes the risk profile), and provide methodology, resources and required expertise.

Quotes:

No relevant quotes found.

1.3 Risk modeling (40%) 13%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 25%

There is a commitment to engage in risk modelling (i.e. Critical Capability Levels “are determined by” [threat modeling]), and evidence of partial implementation, such as an explicit commitment to undertake risk modelling for each identified risk domain.

However, any completed risk models are not published. To improve, they could reference literature in which their risk models have been published, e.g. refer to (Rodriguez et al. 2025). There should also be evidence of a sincere attempt to map out risk models as much as possible.

Quotes:

“[Critical Capability Levels] are determined by identifying and analyzing the main foreseeable paths through which a model could result in severe harm: we then define the CCLs as the minimal set of capabilities a model must possess to do so.” (p. 4)

“For each of the four identified domains, we have developed specific scenarios in which these risks could materialize.” (p. 5)

1.3.2 Risk modeling methodology (40%) 9%

1.3.2.1 Methodology precisely defined (70%) 10%

There is an indication of an awareness of risk modeling methodologies, but there are no details about implementation.

Quotes:

“[Critical Capability Levels] can be determined by identifying and analyzing the main foreseeable paths through which a model could cause severe harm: we then define the CCLs as the minimal set of capabilities a model must possess to do so.” (p. 2)

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 0%

No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.

Quotes:

No relevant quotes found.

1.3.2.3 Prioritization of severe and probable risks (15%) 10%

There is an explicit intent to prioritize monitoring capabilities in “high-risk domains” which “may be most likely” to cause severe harm, or “may pose heightened risk of severe harm.” However, they do not identify these capabilities from multiple risk models which they then prioritize; rather, they describe a high-level preference. In other words, the list of identified scenarios, plus justification for why their chosen risk models are the most severe or probable, is not detailed.

Quotes:

“The Framework is built around capability thresholds called “Critical Capability Levels (CCLs).” These are capability levels at which, absent mitigation measures, frontier AI models or systems may pose heightened risk of severe harm.” (p. 4)

“As explained above, we have identified risk domains where, based on early research, we have determined severe risks may be most likely to arise from future models” (p. 5)

“The Frontier Safety Framework focuses on possible severe risks stemming from high-impact capabilities of frontier AI models.” (p. 4)

1.3.3 Third party validation of risk models (20%) 0%

There is a brief mention that third parties may help “inform and guide our approach” and that “where appropriate, we may engage relevant and appropriate external actors […] to inform our responsible development and deployment practices”, but this is very broad and applicable to the whole of the framework. There is no explicit or implicit mention that there is third party input into risk modelling, let alone third party validation.

Quotes:

“The core components of Frontier AI Safety Frameworks are to: […] Where appropriate, involve external parties to help inform and guide our approach.” (p. 1)

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)


2.1 Setting a Risk Tolerance (35%) 4%

2.1.1 Risk tolerance is defined (80%) 3%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 10%

They indicate that they will not tolerate certain risks of “severe harm”, which is not further defined. Each capability threshold functions as an implicit risk tolerance, e.g. “Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.” There are also multiple mentions of bringing risk to an “appropriate” or “acceptable” level for the risk acceptance criteria. However, these are vague and discretionary. While indeed “the science of AI risk assessment is still developing”, it would be an improvement to state what they would currently use as their risk tolerance, as they will nonetheless operate with one. They also provide no risk acceptance criteria for misalignment risk.

To improve, they must set out the maximum amount of risk the company is willing to accept for each risk domain (though this need not differ between domains), ideally expressed in terms of probability and severity (economic damages, lives lost, etc.), and kept separate from the KRIs.
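
For illustration only (the figures below are hypothetical and not drawn from the Framework), a risk tolerance of the kind this criterion looks for could be stated as a maximum acceptable probability of a harm of defined severity:

```latex
% Hypothetical illustration; these figures do not appear in the Framework
P\big(\text{deployment enables an incident causing} \geq 1{,}000 \text{ deaths or} \geq \$10\text{B in damages, per year}\big) \;\leq\; 10^{-4}
```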

Quotes:

“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 4)

“Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.” (p. 10)

“Most CCLs define one important component of our risk acceptance criteria. Because the CCLs for misalignment risk are exploratory and intended for illustration only, we do not associate them with explicit risk acceptance criteria.” (p. 4)

“A model for which the risk assessment indicates a machine learning R&D CCL has been reached will be deemed to pose an acceptable level of risk for further development or deployment, if, for example: We assess that the deployment mitigations have brought the risk of severe harm to an appropriate level proportionate to the risk, based on considerations such as whether the risk has been reduced to an acceptable level by mitigations, and information pertaining to model propensities and the severity of related events.” (pp. 6–7)

“In particular, we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level.” (p. 9)

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 0%

The risk tolerance, implicit or otherwise, is not expressed fully or partly quantitatively.
No indication of expressing the risk tolerance beyond “severe harm”, which is not further defined. To improve, the risk tolerance should be expressed fully quantitatively or as a combination of scenarios with probabilities.

Quotes:

“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

No indication of expressing the risk tolerance beyond “severe harm”, which is not further defined. There is no quantitative definition of severity nor probabilities given.

Quotes:

“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)

2.1.2 Process to define the tolerance (20%) 5%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 10%

No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels. However, there is a process which draws on “relevant high-quality research” and “information shared through industry forums” that informs the CCLs (which function as risk tolerances/unacceptable risk tiers). Partial credit is therefore given.

Quotes:

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:

  • Update our risk domains and CCLs, where necessary.
  • Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.

The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.

2.2 Operationalizing Risk Tolerance (65%) 25%

2.2.1 Key Risk Indicators (KRI) (30%) 24%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%

Each risk domain has at least one KRI, which is qualitatively defined. The KRIs appear to be grounded in risk modelling, but are overly vague. To improve, they could define more KRIs of higher severity (i.e. ‘Level 2’) to show preparation, akin to OpenAI’s ‘Critical’ thresholds; they have done this for Instrumental Reasoning capabilities but not for the other domains. KRIs should also map directly to the evaluation tests performed.

Quotes:

pp. 10–11:
“CBRN uplift level 1: Provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.” Footnote 10: “Here, and in other misuse CCLs, we intend this to mean relative to a baseline without generative AI.”

“Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.”

“Harmful manipulation level 1 (exploratory): Possesses manipulative capabilities sufficient to enable it to systematically and substantially change beliefs and behavior in identified high stakes contexts over the course of interactions with the model, reasonably resulting in additional expected harm at severe scale.”

“ML R&D acceleration level 1: Has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates.”

“ML R&D automation level 1: Can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.”

“Instrumental Reasoning Level 1: The instrumental reasoning abilities of the model enable enough situational awareness (ability to work out and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.”

“Instrumental Reasoning Level 2: The instrumental reasoning abilities of the model enable enough situational awareness and stealth that, even when relevant model outputs (including, e.g. scratchpads) are being monitored, we cannot detect or rule out the risk of a model significantly undermining human control.”

“Updated set of risks and mitigations: There may be additional risk domains and critical capabilities that fall into scope as AI capabilities improve and the external environment changes. Future work will aim to include additional pressing risks, which may include additional risk domains or higher CCLs within existing domains.” (p. 8)

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

“Note on Machine Learning R&D CCLs: Risk assessment must take into account the fact that other actors may put significantly more effort into eliciting capabilities than we put into assessing risk, thus requiring conservatism in the form of evaluations. However, as a frontier AI company, we do not expect other groups to put significantly more effort into ML R&D than we do ourselves. As a result, to assess the ML R&D CCLs, we may use sources of information about our own progress at accelerating ML R&D to assess whether we are near or at the CCLs, in addition to evaluations of ML R&D capabilities. Similarly, our alert threshold may be defined based on these sources of information, rather than on evaluation scores.” (pp. 5–6)

“Where model capabilities remain quite distant from a CCL, a response plan may involve the adoption of additional capability assessment processes to flag when heightened mitigations are required.” (p. 6)

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%

The KRIs could be made quantitative, e.g. by specifying precisely what counts as “sufficient uplift”, “high stakes contexts”, “severe scale”, “historical rates”, “comparable all-inclusive costs”, and so on.
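
As a hypothetical sketch of what such a quantification could look like (the operationalisation and the 0.2 figure are ours, not the Framework’s), “sufficient uplift” might be defined as a measured performance gap on reference scenarios:

```latex
% Hypothetical operationalisation; the threshold value is illustrative only
\text{uplift} \;=\; p_{\text{actor with model}} - p_{\text{actor without model}} \;\geq\; 0.2
\quad \text{averaged over a pre-registered suite of reference attack scenarios}
```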

Quotes:

pp. 10–11, 13–14 and 15:
“CBRN uplift level 1: Provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.” Footnote 10: “Here, and in other misuse CCLs, we intend this to mean relative to a baseline without generative AI.”

“Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.”

“Harmful manipulation level 1 (exploratory): Possesses manipulative capabilities sufficient to enable it to systematically and substantially change beliefs and behavior in identified high stakes contexts over the course of interactions with the model, reasonably resulting in additional expected harm at severe scale.”

“ML R&D acceleration level 1: Has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates.”

“ML R&D automation level 1: Can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.”

“Instrumental Reasoning Level 1: The instrumental reasoning abilities of the model enable enough situational awareness (ability to work out and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.”

“Instrumental Reasoning Level 2: The instrumental reasoning abilities of the model enable enough situational awareness and stealth that, even when relevant model outputs (including, e.g. scratchpads) are being monitored, we cannot detect or rule out the risk of a model significantly undermining human control.”

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 50%

The framework references reviewing “model independent information” and adjusting the alert threshold (i.e., the KRI) if “the rate of progress suggests our safety buffer is no longer adequate.” Whilst this could be more specific, it shows partial implementation of KRIs monitoring the level of risk in the external environment. The ML R&D CCLs also take into account information such as Google DeepMind’s “own progress at accelerating ML R&D”. Mitigation efficacy assessment also takes into account “the historical incidence and severity of related events” for both misuse and ML R&D risks. To improve, such KRIs should be measurable, with specific thresholds.

Quotes:

“The Framework is informed by the broader conversation on Frontier AI Safety and Security Frameworks. The core components of such Frameworks are to:

  • Identify capability levels at which frontier AI models, without additional mitigations, could pose severe risk.
  • Implement protocols to detect the attainment of such capability levels throughout the model lifecycle.
  • Prepare and articulate proactive mitigation plans to ensure severe risks are adequately mitigated when such capability levels are attained.
  • Where required or appropriate, involve external parties to help inform and guide the approach.” (p. 2)

“We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“Note on Machine Learning R&D CCLs: Risk assessment must take into account the fact that other actors may put significantly more effort into eliciting capabilities than we put into assessing risk, thus requiring conservatism in the form of evaluations. However, as a frontier AI company, we do not expect other groups to put significantly more effort into ML R&D than we do ourselves. As a result, to assess the ML R&D CCLs, we may use sources of information about our own progress at accelerating ML R&D to assess whether we are near or at the CCLs, in addition to evaluations of ML R&D capabilities. Similarly, our alert threshold may be defined based on these sources of information, rather than on evaluation scores.” (pp. 5–6)

“We assess that the deployment mitigations have brought the risk of severe harm to an appropriate level proportionate to the risk, based on considerations such as whether the risk has been reduced to an acceptable level by mitigations, the scope of the deployment, what capabilities and mitigations are available on other publicly available models (e.g. if other models are similarly capable and have few mitigations, then the marginal risk added by our release is likely low), and the historical incidence and severity of related events. This is required only for external deployment, not further development.” (p. 7)

“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, whether tests run on mitigated models suggest that the refusal rate and jailbreak robustness together imply the risk has been brought substantially lower than that posed by a model reaching the CCL without mitigations.
  2. The likelihood and consequences of model misuse, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. What capabilities and mitigations are available on other publicly available models. For example, whether another (non-Google) publicly deployed model is at the same CCL, and has mitigations that are less effective at preventing misuse than that of the model being assessed, in which case the deployment of this model is less likely to materially increase risk.
  5. The historical incidence and severity of related events: for example, whether data suggests a high (or low) likelihood of attempted misuse of models at the CCL. Mitigations would consequently have to be stronger (or would not have to be so strong) for deployment to be appropriate.” (pp. 8–9)

“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:

a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.

b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, tests run on the safeguards may suggest that it is very unlikely they can be circumvented by external threat actors or the model in question to increase ML R&D risk.
  2. The likelihood and consequences of model misuse or misalignment, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. Model propensity for, historical incidence of and severity of related events: for example, such data may suggest a high (or low) likelihood of misalignment in or misuse of models at the CCL, and mitigations would consequently have to be stronger (or not as strong) for deployment to be appropriate.” (p. 12)

2.2.2 Key Control Indicators (KCI) (30%) 38%

2.2.2.1 Containment KCIs (35%) 63%

2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 75%

For each of the misuse KRIs, they have qualitative containment KCI thresholds related to the RAND security levels, though with a vague qualifier: “at a level generally aligned with RAND SL 2.” It is especially good that some reasoning behind each containment measure is given. However, containment KCIs need to also be defined for the misalignment KRIs.

Quotes:

From pp. 10–11, 13–14, 15:

CBRN, uplift 1: “Security level 2 The difficulty of building defenses against certain CBRN threats means the exfiltration and leak of model weights with this capability could be highly damaging. However, the low to medium resourced actors who would be likely to experience the most CBRN uplift are unlikely to pose a substantial exfiltration threat at the level of RAND OC3 groups.”

Cyber, uplift level 1: “Security level 2 Models able to greatly assist cyber attack might be of interest to well-resourced state actors. However, the potential for automated cyber-defense and social adaptation as a response to exfiltration means that higher levels of security, and the resulting costs to innovation, are likely not warranted.”

Harmful manipulation level 1 (exploratory): “Security level 2 The lower velocity of harm scenarios associated with this CCL and the viability of social defenses against large scale misuse of such models count against security mitigations with substantial costs to innovation.”

Machine Learning R&D, acceleration level 1: “Security level 3 Unrestricted access to models at this level of capability could significantly increase a threat actor’s ability to progress to yet more powerful models and other critical capabilities. The exfiltration of such a model may therefore have a significant effect on society’s ability to adapt to and govern powerful AI models, effects that may have long-lasting consequences. Substantially strengthened security is therefore recommended. However, we expect that acceleration will stem from systems of models integrated with workflows, rather than the model alone. The overall reduced impact of model weights counts against security levels with substantial innovation costs.”

Machine Learning R&D, automation level 1: “We recommend Security level 4 for this capability threshold, but emphasize that this must be taken on by the frontier AI field as a whole. Unrestricted access to models at this level of capability could give an actor (or AI systems) with adequate computational resources the ability to reach capabilities much more powerful than those in the other CCLs listed in a short amount of time. This could be catastrophic if there is no effective way of defending against rapidly improving and potentially superhuman AI systems wielded by threat actors. Therefore, we recommend models at this level of capability have exceptional security even though they may have substantial innovation costs.”

“Given its nascency, we expect our approach to misalignment risk to evolve substantially. This section is therefore illustrative only. Accordingly, we do not indicate security mitigations for models at these CCLs. The table below summarizes the initial approach:” (p. 15)

Footnote 8: “In other words, “security level N” indicates security controls and detections at a level generally aligned with RAND SL N. See https://www.rand.org/pubs/research_reports/RRA2849-1.html, pp 21-22. In aligning our security levels with RAND’s, we are referring to the security goals and principles in the RAND framework, rather than the benchmarks (i.e. concrete measures) also described in the RAND report. As the authors point out, the “security level benchmarks represent neither a complete standard nor a compliance regime—they are provided for informational purposes only and should inform security teams’ decisions rather than supersede them.”” (p. 8)

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 50%

For each of the misuse KRIs, they reference the RAND security levels as the relevant containment KCI, though with a vague qualifier: “at a level generally aligned with RAND SL 2”.

These RAND levels count somewhat as quantitative containment KCIs, but would need to be coupled with probabilities to be fully quantitative. For instance, the RAND levels state criteria such as ‘A system that can likely thwart most professional opportunistic efforts by attackers that execute moderate effort or non targeted attacks (OC2)’, and the actor is defined quantitatively (“Operations roughly less capable than or comparable to a single individual who is broadly capable in information security spending several weeks with a total budget of up to $10,000 on the specific operation, with preexisting personal cyber infrastructure but no preexisting access to the organization”), but ‘likely’ could be defined quantitatively as a probability.

It is especially good that some reasoning behind each containment measure is given. However, containment KCIs also need to be defined for the misalignment KRIs.
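
As a hypothetical sketch (the probability below is ours, not RAND’s or Google DeepMind’s), a fully quantitative containment KCI would attach an explicit probability to the security goal, for example:

```latex
% Hypothetical sketch; the threshold is illustrative only
P\big(\text{successful weight exfiltration by an OC2-class attacker over the deployment period}\big) \;\leq\; 0.01
```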

Quotes:

From pp. 10–11, 13–14, 15:

CBRN, uplift 1: “Security level 2 The difficulty of building defenses against certain CBRN threats means the exfiltration and leak of model weights with this capability could be highly damaging. However, the low to medium resourced actors who would be likely to experience the most CBRN uplift are unlikely to pose a substantial exfiltration threat at the level of RAND OC3 groups.”

Cyber, uplift level 1: “Security level 2 Models able to greatly assist cyber attack might be of interest to well-resourced state actors. However, the potential for automated cyber-defense and social adaptation as a response to exfiltration means that higher levels of security, and the resulting costs to innovation, are likely not warranted.”

Harmful manipulation level 1 (exploratory): “Security level 2 The lower velocity of harm scenarios associated with this CCL and the viability of social defenses against large scale misuse of such models count against security mitigations with substantial costs to innovation.”

Machine Learning R&D, acceleration level 1: “Security level 3 Unrestricted access to models at this level of capability could significantly increase a threat actor’s ability to progress to yet more powerful models and other critical capabilities. The exfiltration of such a model may therefore have a significant effect on society’s ability to adapt to and govern powerful AI models, effects that may have long-lasting consequences. Substantially strengthened security is therefore recommended. However, we expect that acceleration will stem from systems of models integrated with workflows, rather than the model alone. The overall reduced impact of model weights counts against security levels with substantial innovation costs.”

Machine Learning R&D, automation level 1: “We recommend Security level 4 for this capability threshold, but emphasize that this must be taken on by the frontier AI field as a whole. Unrestricted access to models at this level of capability could give an actor (or AI systems) with adequate computational resources the ability to reach capabilities much more powerful than those in the other CCLs listed in a short amount of time. This could be catastrophic if there is no effective way of defending against rapidly improving and potentially superhuman AI systems wielded by threat actors. Therefore, we recommend models at this level of capability have exceptional security even though they may have substantial innovation costs.”

“Given its nascency, we expect our approach to misalignment risk to evolve substantially. This section is therefore illustrative only. Accordingly, we do not indicate security mitigations for models at these CCLs. The table below summarizes the initial approach:” (p. 15)

Footnote 8: “In other words, “security level N” indicates security controls and detections at a level generally aligned with RAND SL N. See https://www.rand.org/pubs/research_reports/RRA2849-1.html, pp 21-22. In aligning our security levels with RAND’s, we are referring to the security goals and principles in the RAND framework, rather than the benchmarks (i.e. concrete measures) also described in the RAND report. As the authors point out, the “security level benchmarks represent neither a complete standard nor a compliance regime—they are provided for informational purposes only and should inform security teams’ decisions rather than supersede them.”” (p. 8)

2.2.2.2 Deployment KCIs (35%) 25%

2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 50%

The implicit KCI for misuse is “[the ability to] counter the misuse of critical capabilities in deployments.” However, this is still quite vague – there is no detail as to what would indicate this ability or inability. The measures for the deployment KCI include “developing and improving a suite of safeguards targeting the capability”, suggesting each KRI has a deployment KCI, but it is not clear how the KCIs differ for each KRI.


Quotes:

“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, whether tests run on mitigated models suggest that the refusal rate and jailbreak robustness together imply the risk has been brought substantially lower than that posed by a model reaching the CCL without mitigations.
  2. The likelihood and consequences of model misuse, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. What capabilities and mitigations are available on other publicly available models. For example, whether another (non-Google) publicly deployed model is at the same CCL, and has mitigations that are less effective at preventing misuse than that of the model being assessed, in which case the deployment of this model is less likely to materially increase risk.
  5. The historical incidence and severity of related events: for example, whether data suggests a high (or low) likelihood of attempted misuse of models at the CCL. Mitigations would consequently have to be stronger (or would not have to be so strong) for deployment to be appropriate.” (pp. 8–9)

“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:

a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.

b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, tests run on the safeguards may suggest that it is very unlikely they can be circumvented by external threat actors or the model in question to increase ML R&D risk.
  2. The likelihood and consequences of model misuse or misalignment, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. Model propensity for, historical incidence of and severity of related events: for example, such data may suggest a high (or low) likelihood of misalignment in or misuse of models at the CCL, and mitigations would consequently have to be stronger (or not as strong) for deployment to be appropriate.” (p. 12)


2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 0%

There are no quantitative deployment KCI thresholds given.

Quotes:

No relevant quotes found.

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 25%

For deceptive alignment, the implicit KCI is that “the model is [in]capable of effective instrumental reasoning in ways that cannot be monitored.” However, this requires more detail, e.g. what would constitute such ability or inability. It is also not specific to the KRIs.

Quotes:

“Mitigation Approach and Critical Capability Levels.
An initial mitigation approach [for deceptive alignment] focuses on detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied. When models reach this capability level, we believe applying an automated monitor to the model’s explicit reasoning (e.g. chain-of-thought output) is an effective mitigation. Once a model is capable of effective instrumental reasoning in ways that cannot be monitored, additional mitigations may be warranted—the development of which is an area of active research.”

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 25%

There is a clear process for pairing KRIs and KCIs via the ‘safety case’ model, for both misuse and ML R&D risks. However, this should be conducted before the model is developed, so that the justification for why the KCI is high enough to mitigate the risk if a KRI is crossed exists before the KRI is actually crossed.

Further, safety cases should be developed for misalignment risks, too. The KRIs and KCIs should also be specifically linked via risk models.
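
Conceptually, the linkage this criterion looks for can be written as a residual-risk argument. A simplified sketch (our formulation, not the Framework’s):

```latex
% Simplified residual-risk sketch; our formulation, not the Framework's
\underbrace{P(\text{KRI crossed})}_{\text{capability}}
\times
\underbrace{P(\text{KCI not met} \mid \text{KRI crossed})}_{\text{mitigation failure}}
\times
\text{Severity}
\;\leq\;
\text{Risk tolerance}
```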

Quotes:

“Acceptance determination and mitigations: We then determine whether the model has met or will meet a CCL and, if so, whether we need to implement any further mitigations to reduce the risk to an acceptable level (see below).” (p. 5)

“We assess that the deployment mitigations have brought the risk of severe harm to an appropriate level proportionate to the risk, based on considerations such as whether the risk has been reduced to an acceptable level by mitigations, the scope of the deployment, what capabilities and mitigations are available on other publicly available models (e.g. if other models are similarly capable and have few mitigations, then the marginal risk added by our release is likely low), and the historical incidence and severity of related events. This is required only for external deployment, not further development.” (p. 7)

“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, whether tests run on mitigated models suggest that the refusal rate and jailbreak robustness together imply the risk has been brought substantially lower than that posed by a model reaching the CCL without mitigations.
  2. The likelihood and consequences of model misuse, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. What capabilities and mitigations are available on other publicly available models. For example, whether another (non-Google) publicly deployed model is at the same CCL, and has mitigations that are less effective at preventing misuse than that of the model being assessed, in which case the deployment of this model is less likely to materially increase risk.
  5. The historical incidence and severity of related events: for example, whether data suggests a high (or low) likelihood of attempted misuse of models at the CCL. Mitigations would consequently have to be stronger (or would not have to be so strong) for deployment to be appropriate.” (pp. 8–9)

“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:

a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.

b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, tests run on the safeguards may suggest that it is very unlikely they can be circumvented by external threat actors or the model in question to increase ML R&D risk.
  2. The likelihood and consequences of model misuse or misalignment, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. Model propensity for, historical incidence of and severity of related events: for example, such data may suggest a high (or low) likelihood of misalignment in or misuse of models at the CCL, and mitigations would consequently have to be stronger (or not as strong) for deployment to be appropriate.” (p. 12)


2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 10%

There is no clear commitment to put development on hold, only that external deployment is subject to review by the appropriate governance function. The commitment that “we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level” should make clearer that deployment will be put on hold if the corresponding KCI cannot be met for a given KRI (i.e. CCL). This must be made explicit so that there is as little discretion as reasonably possible at the time of decision-making.

Quotes:

“external deployments and large scale internal deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate. In particular, we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level.” (pp. 12–13)


3.1 Implementing Mitigation Measures (50%) 29%

3.1.1 Containment measures (35%) 25%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 25%

The framework outlines potential containment measures, but does not commit to them. To improve, they should be precise as to what containment measures they plan to implement. This transparency allows public scrutiny so their measures can improve.

Quotes:

“Security mitigations against exfiltration risk, such as identity and access management practices and hardening interface-access to unreleased model parameters, are important for models reaching CCLs.” (p. 8)

Footnote 11: “Mitigations at this level may include model access management, physical security controls, authentication measures, endpoint security, access management, secure model storage, vulnerability detection & management, detection of & response to suspected malicious activity.” (p. 10)

Footnote 15: “This level may include mitigations aligned with SL 2, plus additional mitigations designed to prevent unilateral access, harden infrastructure, and prevent data exfiltration.” (p. 13)

Footnote 16: “This level may include mitigations aligned with SL 2 and 3, plus additional mitigations aimed to isolate model weights, enhanced data center security, further hardening of infrastructure and minimizing potential attack surface.” (p. 14)

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 25%

Whilst the framework mentions internal validation that containment measures are sufficient, it does not provide proof of why the given containment measures are likely to be sufficient.

Quotes:

“We will use various processes to evaluate the effectiveness and limitations of mitigations:

  • Security mitigations: security infrastructure at Google is subject to penetration testing and other kinds of assessments, and is continually improved based on these results.” (p. 6)

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 0%

There is no mention of third-party verification of containment measures meeting the threshold.

Quotes:

No relevant quotes found.

3.1.2 Deployment measures (35%) 40%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 50%

The framework mentions some possible deployment measures (‘deployment mitigations’), but without explicit commitment to implementing them. To improve, they should detail precisely the deployment measures which will be implemented to meet the relevant deployment KCI threshold.

Quotes:

“Developing and improving a suite of safeguards targeting the capability, which may include measures such as safety post-training, monitoring and analysis, account moderation, jailbreak detection and patching, user verification, and bug bounties” (p. 8)

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 25%

The framework describes a process, presumably internal, to “evaluate the effectiveness and limitations of mitigations”, but does not detail why they believe ex ante that their deployment measures will be sufficient. Instead, it relies on the discretion of the “appropriate corporate governance body”. To improve, this proof should be gathered in advance, so that the measures are known to be sufficient to meet the KCI threshold once the model crosses the relevant KRI threshold, in keeping with having “proactive mitigation plans”.

Quotes:

“We will use various processes to evaluate the effectiveness and limitations of mitigations: […] Deployment mitigations: we will use a combination of threat modeling, empirical testing, and other sources of information to assess the effectiveness and limitations of our deployment mitigations. These will form the basis of a safety case for models reaching CCLs, that will be reviewed before deployment.” (p. 6)

“Prepare and articulate proactive mitigation plans to ensure severe risks are adequately mitigated when such capability levels are attained.” (p. 2)

“This process is designed to ensure that residual risk remains at acceptable levels: evidence of efficacy collected during development and testing, as well as expert-driven estimates of other parameters, will enable us to assess residual risk and to detect substantial changes that invalidate our risk assessment. With iteration on safeguards and safety cases, we believe that we are able to make informed decisions about the level of risk via a CCL before a model is released, and reliably prevent models posing unacceptable levels of risk from being deployed.” (p. 9)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 0%

There is no mention of third-party verification of deployment measures meeting the threshold.

Quotes:

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

3.1.3 Assurance processes (30%) 22%

3.1.3.1 Credible plans towards the development of assurance properties (40%) 25%

The framework mentions they are “actively researching approaches to addressing models” that reach the highest misalignment capability level, Instrumental Reasoning Level 2. However, they do not provide detail on how they will achieve this, or by what point it will need to be in place (i.e., whether assurance processes must be settled before the model comes within some margin of the critical capability).

Quotes:

“Instrumental Reasoning Level 2: […] Mitigation: Future work: We are actively researching approaches to addressing models that reach this CCL.”

“Prepare and articulate proactive mitigation plans to ensure severe risks are adequately mitigated when such capability levels are attained.” (p. 2)

3.1.3.2 Evidence that the assurance properties are enough to achieve their corresponding KCI thresholds (40%) 10%

They acknowledge that automated monitoring may become redundant “once a model is capable of effective instrumental reasoning in ways that cannot be monitored”, but do not provide evidence for why they believe current models are not yet at this point, or how they will measure this. To improve, they should describe a methodology for testing the assumptions required for their assurance processes to be effective, such as via evaluations on model organisms or theoretical proofs.
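
A minimal sketch of what such a methodology could look like is below. It is our illustration, not DeepMind’s procedure: the names (Trajectory, monitor_flags, required_detection_rate) are hypothetical, and a real control evaluation would require far more careful elicitation and statistics.

```python
# Minimal sketch of stress-testing an automated reasoning monitor against
# "model organism" trajectories. Illustrative only; not DeepMind's methodology.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    chain_of_thought: str   # the model's explicit reasoning trace
    is_subversive: bool     # ground truth: does this trace attempt to evade oversight?


def evaluate_monitor(
    monitor_flags: Callable[[str], bool],
    trajectories: List[Trajectory],
    required_detection_rate: float = 0.99,
) -> dict:
    """Estimate how often the monitor catches deliberately subversive reasoning,
    and how often it falsely flags benign reasoning, then compare the detection
    rate against a pre-committed threshold."""
    subversive = [t for t in trajectories if t.is_subversive]
    benign = [t for t in trajectories if not t.is_subversive]

    detection_rate = sum(monitor_flags(t.chain_of_thought) for t in subversive) / len(subversive)
    false_positive_rate = sum(monitor_flags(t.chain_of_thought) for t in benign) / len(benign)

    return {
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
        "monitoring_assumption_holds": detection_rate >= required_detection_rate,
    }
```

Publishing the threshold and the results of an evaluation of this kind, even in summarised form, would provide the sort of evidence this criterion looks for.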

Quotes:

“Here we describe an approach for addressing misalignment risk that focuses specifically on when models may develop a baseline instrumental reasoning ability at which, without additional mitigations, they may have the potential to undermine human control. When models reach this capability level, one possible mitigation is to apply an automated monitor to the model’s explicit reasoning (e.g. chain-of-thought output). Once a model is capable of effective instrumental reasoning in ways that cannot be monitored, additional mitigations may be warranted—the development of which is an area of active research.” (p. 15)

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 50%

They do explicitly outline that safety cases rest on many assumptions and factors, but they do not spell out exactly what those assumptions are. Whilst they refer to other work (https://arxiv.org/pdf/2505.01420) which spells out these safety cases in more detail, it would be preferable for this to be included in an appendix of the framework, to keep commitments very clear.

More detail could also be given on the specific assumptions. For instance, the framework could explore what is required for automated monitoring to be successful and for the control evaluation to be legitimate; these might include faithful chain-of-thought reasoning, sufficient red-teaming expertise of the human red team or the weaker monitor, and the absence of collusion between the monitor and the model. Overall, to improve, the framework should detail the key technical assumptions necessary for the assurance processes to meet the KCI threshold, and provide evidence for why these assumptions are justified.

Quotes:

“We will use various processes to evaluate the effectiveness and limitations of mitigations:

  • Security mitigations: security infrastructure at Google is subject to penetration testing and other kinds of assessments, and is continually improved based on these results.
  • Deployment mitigations: we will use a combination of threat modeling, empirical testing, and other sources of information to assess the effectiveness and limitations of our deployment mitigations. These will form the basis of a safety case5 for models reaching CCLs, that will be reviewed before deployment. See the deployment mitigations sections below for misuse and machine learning R&D for more.” (p. 6)

“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, whether tests run on mitigated models suggest that the refusal rate and jailbreak robustness together imply the risk has been brought substantially lower than that posed by a model reaching the CCL without mitigations.
  2. The likelihood and consequences of model misuse, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. What capabilities and mitigations are available on other publicly available models. For example, whether another (non-Google) publicly deployed model is at the same CCL, and has mitigations that are less effective at preventing misuse than that of the model being assessed, in which case the deployment of this model is less likely to materially increase risk.
  5. The historical incidence and severity of related events: for example, whether data suggests a high (or low) likelihood of attempted misuse of models at the CCL. Mitigations would consequently have to be stronger (or would not have to be so strong) for deployment to be appropriate.” (pp. 8–9)

“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:

a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.

b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:

  1. How much the risk has been reduced by mitigations. For example, tests run on the safeguards may suggest that it is very unlikely they can be circumvented by external threat actors or the model in question to increase ML R&D risk.
  2. The likelihood and consequences of model misuse or misalignment, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.
  3. The scope of the deployment. For example, small scale and private deployments may pose substantially less risk than large scale or public deployments.
  4. Model propensity for, historical incidence of and severity of related events: for example, such data may suggest a high (or low) likelihood of misalignment in or misuse of models at the CCL, and mitigations would consequently have to be stronger (or not as strong) for deployment to be appropriate.” (p. 12)

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 25%

3.2.1 Monitoring of KRIs (40%) 24%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 50%

Whilst they express commitment to developing intensive elicitation methods, they do not provide justification that their evaluations are comprehensive enough. Further, “we seek to equip the model” only signals an intent, rather than a commitment. Nonetheless, they do acknowledge that evaluations require “conservatism” in case of extra elicitation effort. More detail could be added on which elicitation methods they anticipate would be used by different threat actors, under realistic settings, and their exact elicitation setup.

Quotes:

“Risk assessment will necessarily involve evaluating cross-cutting capabilities such as agency, tool use, reasoning, and scientific understanding.” (p. 2)

“Analysis: Central to our model evaluations are “early warning evaluations,” to assess the proximity of the model to a CCL. We define “alert thresholds” for these evaluations that are designed to flag when a CCL may be reached before a risk assessment is conducted again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“Risk assessment must take into account the fact that other actors may put significantly more effort into eliciting capabilities than we put into assessing risk, thus requiring conservatism in the form of evaluations.” (p. 5)

3.2.1.2 Evaluation frequency (25%) 10%

They demonstrate an intent to run evaluations frequently, according to a “safety buffer”, implying that frequency depends on the rate of progress of AI capabilities, but they do not describe what this safety buffer is or what determines how often evaluations are run. They commit to evaluating at least at the first external deployment of a new frontier model and whenever “the model has meaningful new capabilities or a material increase in performance.” However, to improve, evaluation frequency should not depend on noticing capability jumps, as jumps may be larger than mitigations can prepare for by the time they are noticed; a fixed, frequent cadence would instead guarantee consistent measurement.

Quotes:

“Analysis: Central to our model evaluations are “early warning evaluations,” to assess the proximity of the model to a CCL. We define “alert thresholds” for these evaluations that are designed to flag when a CCL may be reached before a risk assessment is conducted again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“For each risk domain, we conduct aspects of our risk assessment at various moments throughout the model development process, both before and after deployment. We conduct a risk assessment for the first external deployment of a new frontier AI model. For subsequent versions of the model, we conduct a further risk assessment if the model has meaningful new capabilities or a material increase in performance, until the model is retired or we deploy a more capable model. The reason for this is because a material change in the model’s capabilities may mean that the risk profile of the model has changed or the justification for why the risks stemming from the model are acceptable has been materially undermined. To identify meaningful new capabilities or material increases in performance, we conduct model capability evaluations, including our automated benchmarks. These evaluations are primarily aimed at understanding the capabilities of the model and may be triggered, for example, upon the completion of a pre-training or post-training run, on various candidates of a model version. These evaluations include a broad range of areas, including general capability evaluations, model behavior, efficiency, coding capabilities, multilinguality, or reasoning. Data from these evaluations are collected and analyzed to give us an indication as to how the model is performing and whether a risk assessment is necessary. At a high level, our risk assessment involves the following steps (which do not need to be repeated where a previous risk assessment is still appropriate):” (pp. 4–5)

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 25%

The “safety buffer” quoted here likely refers to the assumption that capability evaluations are underestimating future capabilities, given post-training enhancements. It would be an improvement to make this more explicit. They also note that safety cases must take into account “capability improvements after the risk assessment”. More detail on this methodology, e.g. the enhancements used, or the forecasting exercises completed to assure a wide enough safety buffer, would improve the score.

Further, more detail could be added on how they account for the way post-training enhancements change a model’s risk profile across different model architectures; notably, post-training enhancements are much more scalable for reasoning models, as inference compute can often be scaled to improve capabilities.

Quotes:

“Acceptance determination and mitigations: We then determine whether the model has met or will meet a CCL and, if so, whether we need to implement any further mitigations to reduce the risk to an acceptable level (see below).” (p. 5)

“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as: […] The likelihood and consequences of model misuse, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.” (pp. 8–9)

“We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate.” (p. 5)

3.2.1.4 Vetting of protocols by third parties (15%) 10%

There is no mention of having the evaluation methodology vetted by third parties. However, they do make a discretionary commitment to involve external experts when determining the level of risk after a KRI threshold is crossed, showing some awareness that external opinion is helpful when assessing the risks and capabilities of a model.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)

“We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

3.2.1.5 Replication of evaluations by third parties (15%) 10%

There is no mention of having evaluations replicated, though they mention that they “may use additional external evaluators […] if evaluators with relevant expertise are needed to provide an additional signal about a model’s proximity to CCLs.” This only shows partial implementation.

Quotes:

“We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

“we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed.” (p. 5)

3.2.2 Monitoring of KCIs (40%) 23%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 75%

There is mention of updating mitigations as a result of post-market monitoring, but not necessarily of measuring mitigation effectiveness outright, and more detail could be given on how frequently this occurs. An improvement would be to commit to a systematic, ongoing monitoring scheme so that mitigation effectiveness is tracked continuously and the KCI threshold continues to be met when required.

Finally, it is commendable that they conduct “post-deployment processes”, where the “safety cases and mitigations may be updated if deemed necessary by post-market monitoring.” More detail could be provided on what would constitute a necessary update.

Quotes:

“We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“We will use various processes to evaluate the effectiveness and limitations of mitigations: […] Deployment mitigations: we will use a combination of threat modeling, empirical testing, and other sources of information to assess the effectiveness and limitations of our deployment mitigations.” (p. 6)

“our safety cases and mitigations may be updated if deemed necessary by post-market monitoring.” (p. 9)

“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may: […] Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.” (p. 16)

“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:

a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.
b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as: […] likelihood and consequences of our mitigations being circumvented, deactivated, or subverted […] Model propensity for, historical incidence of and severity of related events: for example, such data may suggest a high (or low) likelihood of misalignment in or misuse of models at the CCL, and mitigations would consequently have to be stronger (or not as strong) for deployment to be appropriate.” (p. 12)

3.2.2.2 Vetting of protocols by third parties (30%) 10%

External input into mitigation protocols is optional and only ‘informs’ the response plan.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

3.2.2.3 Replication of evaluations by third parties (30%) 0%

There is no mention of control evaluations/mitigation testing being replicated or conducted by third-parties.

Quotes:

“Analysis: Central to our model evaluations are “early warning evaluations,” to assess the proximity of the model to a CCL. We define “alert thresholds” for these evaluations that are designed to flag when a CCL may be reached before a risk assessment is conducted again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

3.2.3 Transparency of evaluation results (10%) 43%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 50%

They mention sharing information with the government when models have critical capabilities, though the content of this information remains discretionary. There are no commitments to share evaluation reports to the public if models are deployed.

Quotes:

“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

“If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share relevant information with appropriate government authorities where it will facilitate safety of frontier AI. Where appropriate, and subject to adequate confidentiality and security measures and considerations around proprietary and sensitive information, this information may include: 

  • Model information: characteristics of the AI model relevant to the risk it may pose with its critical capabilities.
  • Evaluation results: such as details about the evaluation design, the results, and any robustness tests.
  • Mitigation plans: descriptions of our mitigation plans and how they are expected to reduce the risk.

We may also consider disclosing information to other external organisations to promote shared learning and coordinated risk mitigation. We will continue to review and evolve our disclosure process over time.” (p. 16)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

There is no commitment to permit reports detailing the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently, without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 18%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 10%

They show a commitment to assess “whether there are other risk domains where severe risks may arise” and to “update our risk domains and CCLs, where necessary” at least annually. To improve, the process for identifying novel risks or novel risk models should be detailed, for example threat modeling exercises or post-deployment monitoring.

This is especially important because “we cannot detect or rule out the risk of a model significantly undermining human control” is itself a critical capability level, and so represents “a foreseeable path to severe harm”. Monitoring for changes in this risk profile, or for factors that make it more or less likely, is therefore highly relevant to assessing risk. Whilst they state an intent to update their set of risks and mitigations, a monitoring setup specifically designed to detect novel risk profiles is not described.

Quotes:

“As part of our broader research into frontier AI models, we continue to assess whether there are other risk domains where severe risks may arise and will update our approach as appropriate.” (p. 5)

“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:

  • Update our risk domains and CCLs, where necessary.
  • Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.

The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 25%

There is no formal mechanism for incorporating risks identified post-deployment into a structured risk modeling process. However, they do indicate that they may update risk domains at least annually (though not necessarily risk models). To improve, novel risks or risk pathways identified via post-deployment monitoring should trigger further risk modeling and scenario analysis. This may include updating multiple or all risk models.

Quotes:

“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:

  • Update our risk domains and CCLs, where necessary.
  • Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.

The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)


4.1 Decision-making (25%) 13%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 0%

No mention of risk owners.

Quotes:

No relevant quotes found.

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No mention of a management risk committee.

Quotes:

No relevant quotes found.

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 50%

The framework outlines fairly detailed protocols for decision-making in terms of the capability levels, but to improve, it should specify more detail on who makes the decisions and the basis for them.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)

“2. Pre-deployment review of safety case: external deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate. In particular, we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level. 3. Post-deployment processes: our safety cases and mitigations may be updated if deemed necessary by post-market monitoring. Material updates to a safety case will be submitted to the appropriate governance function for review.” (p. 9)

“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies”. (p. 7)

4.1.4 The company has defined escalation procedures in case of incidents (25%) 0%

No mention of escalation procedures.

Quotes:

No relevant quotes found.

4.2. Advisory and Challenge (20%) 14%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 0%

No mention of an executive risk officer.

Quotes:

No relevant quotes found.

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 10%

The company has a number of councils that advise management on AI risk matters, but the only structures mentioned in the Framework are “appropriate corporate governance bodies” and the “appropriate governance function”.

Quotes:

“Pre-deployment review of safety case: external deployments and large scale internal deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate.” (pp. 12–13)

“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16) 

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 50%

The framework provides some detail on their system for monitoring risk levels via capability evaluations. To improve, they should monitor risk indicators beyond capabilities alone and integrate these into a holistic view of risk.

Quotes:

“Critical Capability Levels. These are capability levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)

“We intend to evaluate our most powerful frontier models regularly to check whether their AI capabilities are approaching a CCL.” (p. 3)

“We will define a set of evaluations called ‘early warning evaluations,’ with a specific ‘alert threshold’ that flags when a CCL may be reached before the evaluations are run again.” (p. 3)

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 0%

No mention of people that challenge decisions.

Quotes:

No relevant quotes found.

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 25%

The framework refers to reviews of relevant information by the appropriate governance bodies. However, to improve, it should make clearer what risk information is reported to senior management and in what format.

Quotes:

“Pre-deployment review of safety case: external deployments and large scale internal deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate.” (pp. 12–13)

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.

4.3 Audit (20%) 18%

4.3.1 The company has an internal audit function involved in AI governance (50%) 25%

The framework mentions they will assess their adherence to the Framework, though they don’t say by what process or which body. To improve, they should describe how this audit will remain independent.

Quotes:

“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:

  • Update our risk domains and CCLs, where necessary.
  • Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.

The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)

4.3.2 The company involves external auditors (50%) 10%

The framework mentions potentially involving external expertise, but it is tentative. Further, it does not mention external independent review.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed.” (p. 3)

“We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)

4.4 Oversight (20%) 5%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 0%

No mention of a Board risk committee.

Quotes:

No relevant quotes found.

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 10%

The company has a number of councils that advise management on AI risk matters, but the only structures mentioned in the Framework are “appropriate corporate governance bodies” and the “appropriate governance function”. To improve further, the company should clarify whether these are advisory or oversight bodies, as per the Three Lines model.

Quotes:

“Pre-deployment review of safety case: external deployments and large scale internal deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate.” (pp. 12–13)

“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16) 

4.5 Culture (10%) 12%

4.5.1 The company has a strong tone from the top (33.3%) 10%

The framework includes a few references that reinforce the tone from the top, but it would benefit from more substantial commitments to managing risk.

Quotes:

“It is intended to complement Google’s existing suite of AI responsibility and safety practices, and enable AI innovation and deployment consistent with our AI Principles.” (p. 1)

“We expect the Framework to evolve substantially as our understanding of the risks and benefits of frontier models improves, and we will publish substantive revisions as appropriate.” (p. 1)

4.5.2 The company has a strong risk culture (33.3%) 10%

The framework includes a few references to updating the approach over time, which is important for risk culture. To improve, further elements such as risk training and internal transparency would be needed.

Quotes:

“We may change our approach over time as we gain experience and insights on the projected capabilities of future frontier models. We will review the Framework periodically and we expect it to evolve substantially as our understanding of the risks and benefits of frontier models improves.” (p. 2)

4.5.3 The company has a strong speak-up culture (33.3%) 0%

No mention of elements of speak-up culture.

Quotes:

No relevant quotes found.

4.6 Transparency (5%) 15%

4.6.1 The company reports externally on what their risks are (33.3%) 25%

The framework states which capabilities the company is tracking as part of this framework. To improve its score, the company could specify how it will provide information regarding risks going forward, e.g. in model cards.

Quotes:

“In the Framework, we specify protocols for the detection of capability levels at which frontier AI models may pose severe risks (which we call “Critical Capability Levels (CCLs)”), and articulate mitigation approaches to address such risks. The Framework addresses misuse risk, risks from machine learning research and development (ML R&D), and misalignment risk. For each type of risk, we define a set of CCLs and a mitigation approach for them.” (p. 2)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 10%

The company has a number of councils, but the only structures mentioned in the Framework are “appropriate corporate governance bodies” and the “appropriate governance function”. To improve, more transparency on the governance structure should be provided.

Quotes:

“Pre-deployment review of safety case: external deployments and large scale internal deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate.” (pp. 12–13)

“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16) 

4.6.3 The company shares information with industry peers and government bodies (33.3%) 10%

The framework suggests potential information sharing, but the language is fairly vague, with e.g. “may” and “aim to”. For a higher score, the company would need to add precision.

Quotes:

“Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)

“If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share relevant information with appropriate government authorities where it will facilitate safety of frontier AI. Where appropriate, and subject to adequate confidentiality and security measures and considerations around proprietary and sensitive information, this information may include:

  • Model information: characteristics of the AI model relevant to the risk it may pose with its critical capabilities.
  • Evaluation results: such as details about the evaluation design, the results, and any robustness tests.
  • Mitigation plans: descriptions of our mitigation plans and how they are expected to reduce the risk.

We may also consider disclosing information to other external organisations to promote shared learning and coordinated risk mitigation. We will continue to review and evolve our disclosure process over time.” (p. 16)

Summary of scores by category (DeepMind | Best in class):

1. Risk Identification: 22% | 32%
  1.1 Classification of Applicable Known Risks (40%): 43% | 63%
    1.1.1 Risks from literature and taxonomies are well covered (50%): 75% | 75%
    1.1.2 Exclusions are clearly justified and documented (50%): 10% | 50%
  1.2 Identification of Unknown Risks (Open-ended red teaming) (20%): 0% | 10%
    1.2.1 Internal open-ended red teaming (70%): 0% | 10%
    1.2.2 Third party open-ended red teaming (30%): 0% | 10%
  1.3 Risk modeling (40%): 13% | 41%
    1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%): 25% | 50%
    1.3.2 Risk modeling methodology (40%): 9% | 39%
      1.3.2.1 Methodology precisely defined (70%): 10% | 50%
      1.3.2.2 Mechanism to incorporate red teaming findings (15%): 0% | 10%
      1.3.2.3 Prioritization of severe and probable risks (15%): 10% | 75%
    1.3.3 Third party validation of risk models (20%): 0% | 50%
2. Risk Analysis and Evaluation: 18% | 30%
  2.1 Setting a Risk Tolerance (35%): 4% | 22%
    2.1.1 Risk tolerance is defined (80%): 3% | 28%
      2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%): 10% | 75%
      2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%): 0% | 10%
      2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%): 0% | 0%
    2.1.2 Process to define the tolerance (20%): 5% | 5%
      2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%): 10% | 10%
      2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%): 0% | 0%
  2.2 Operationalizing Risk Tolerance (65%): 25% | 34%
    2.2.1 Key Risk Indicators (KRI) (30%): 24% | 33%
      2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%): 25% | 50%
      2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%): 10% | 25%
      2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%): 50% | 50%
    2.2.2 Key Control Indicators (KCI) (30%): 38% | 38%
      2.2.2.1 Containment KCIs (35%): 63% | 63%
        2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%): 75% | 90%
        2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%): 50% | 50%
      2.2.2.2 Deployment KCIs (35%): 25% | 43%
        2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%): 50% | 75%
        2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%): 0% | 10%
      2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%): 25% | 50%
    2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%): 25% | 25%
    2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%): 10% | 75%
3. Risk Treatment: 27% | 41%
  3.1 Implementing Mitigation Measures (50%): 29% | 38%
    3.1.1 Containment measures (35%): 25% | 74%
      3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%): 25% | 90%
      3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%): 25% | 50%
      3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]): 0% | 25%
    3.1.2 Deployment measures (35%): 40% | 40%
      3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%): 50% | 50%
      3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%): 25% | 25%
      3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]): 0% | 25%
    3.1.3 Assurance processes (30%): 22% | 30%
      3.1.3.1 Credible plans towards the development of assurance properties (40%): 25% | 25%
      3.1.3.2 Evidence that the assurance properties are enough to achieve their corresponding KCI thresholds (40%): 10% | 50%
      3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%): 50% | 50%
  3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%): 25% | 51%
    3.2.1 Monitoring of KRIs (40%): 24% | 64%
      3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%): 50% | 90%
      3.2.1.2 Evaluation frequency (25%): 10% | 100%
      3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%): 25% | 50%
      3.2.1.4 Vetting of protocols by third parties (15%): 10% | 25%
      3.2.1.5 Replication of evaluations by third parties (15%): 10% | 50%
    3.2.2 Monitoring of KCIs (40%): 23% | 43%
      3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%): 75% | 75%
      3.2.2.2 Vetting of protocols by third parties (30%): 10% | 50%
      3.2.2.3 Replication of evaluations by third parties (30%): 0% | 50%
    3.2.3 Transparency of evaluation results (10%): 43% | 77%
      3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%): 50% | 90%
      3.2.3.2 Commitment to non-interference with findings (15%): 0% | 25%
    3.2.4 Monitoring for novel risks (10%): 18% | 75%
      3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%): 10% | 75%
      3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%): 25% | 75%
4. Risk Governance: 12% | 49%
  4.1 Decision-making (25%): 13% | 60%
    4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%): 0% | 75%
    4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%): 0% | 90%
    4.1.3 The company has defined protocols for how to make go/no-go decisions (25%): 50% | 75%
    4.1.4 The company has defined escalation procedures in case of incidents (25%): 0% | 75%
  4.2 Advisory and Challenge (20%): 14% | 48%
    4.2.1 The company has an executive risk officer with sufficient resources (16.7%): 0% | 75%
    4.2.2 The company has a committee advising management on decisions involving risk (16.7%): 10% | 90%
    4.2.3 The company has an established system for tracking and monitoring risks (16.7%): 50% | 75%
    4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%): 0% | 50%
    4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%): 25% | 75%
    4.2.6 The company has an established central risk function (16.7%): 0% | 50%
  4.3 Audit (20%): 18% | 70%
    4.3.1 The company has an internal audit function involved in AI governance (50%): 25% | 75%
    4.3.2 The company involves external auditors (50%): 10% | 90%
  4.4 Oversight (20%): 5% | 50%
    4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%): 0% | 90%
    4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%): 10% | 75%
  4.5 Culture (10%): 12% | 63%
    4.5.1 The company has a strong tone from the top (33.3%): 10% | 50%
    4.5.2 The company has a strong risk culture (33.3%): 10% | 75%
    4.5.3 The company has a strong speak-up culture (33.3%): 0% | 90%
  4.6 Transparency (5%): 15% | 72%
    4.6.1 The company reports externally on what their risks are (33.3%): 25% | 75%
    4.6.2 The company reports externally on what their governance structure looks like (33.3%): 10% | 90%
    4.6.3 The company shares information with industry peers and government bodies (33.3%): 10% | 90%
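
The area percentages in this summary appear to be weighted averages of their sub-criterion grades, rounded to the nearest percent, although the aggregation rule is not restated here and a few figures suggest additional adjustments may apply. A minimal sketch under that assumption, using the weights and grades listed above for 3.2.1 "Monitoring of KRIs":

```python
# Hedged sketch: reproduces the 3.2.1 "Monitoring of KRIs" area score under the
# assumption that it is a simple weighted average of its sub-criterion grades.
# The weights and grades come from the summary above; the rounding rule is assumed.
kri_monitoring = [
    # (weight, DeepMind grade in %)
    (0.30, 50),  # 3.2.1.1 Elicitation comprehensiveness
    (0.25, 10),  # 3.2.1.2 Evaluation frequency
    (0.15, 25),  # 3.2.1.3 Post-training enhancements
    (0.15, 10),  # 3.2.1.4 Vetting of protocols by third parties
    (0.15, 10),  # 3.2.1.5 Replication of evaluations by third parties
]

area_score = sum(weight * grade for weight, grade in kri_monitoring)
print(round(area_score))  # 24, matching the 24% reported for 3.2.1
```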