Best in class
- Google DeepMind stands out for its robust descriptions of safety cases, which include many different factors that feed into its risk determinations.
Compared to their Frontier Safety Framework Version 1.0, they:
Risk domains covered include CBRN, Cyber, Machine Learning R&D and misalignment, and harmful manipulation. Loss of control risks (covered by machine learning R&D and misalignment) are partly decomposed into a “stealth and situational awareness TCL” and “acceleration” and “automation” CCLs, though it is unclear how the TCL and CCLs interact to form a combined risk picture of loss of control risks. They also link to external analyses and discussions on safety frameworks, including by Anthropic, METR, OpenAI, the Frontier Model Forum, and the UK government.
“We identify CCLs for two kinds of risks: misuse risk and risks related to machine learning R&D and misalignment. For misuse risk, we define CCLs in the following risk domains where the misuse of model capabilities may result in severe harm:
● CBRN: Risks of models assisting in the development, preparation, and/or execution of a chemical, biological, radiological, or nuclear (“CBRN”) threat.
● Cyber: Risks of models assisting in the development, preparation, and/or execution of a cyber attack.
● Harmful Manipulation: Risks of models with high manipulative capabilities potentially being misused in ways that could reasonably result in large scale harm.
For machine learning R&D and misalignment risks, we define CCLs that identify when ML R&D capabilities in our models may, if not properly managed, reduce society’s overall ability to manage AI risks.” (p.4)
“We consider a wide range of risks as part of our ongoing research, taking into account the characteristics, capabilities, propensities, and affordances of our models and other sources of information, such as our internal risk taxonomies, internal expertise and relevant external research. As explained above, we have identified risk domains where, based on early research, we have determined significant or severe risks may be most likely to arise from future models: CBRN, cyber, harmful manipulation, as well as machine learning R&D and misalignment.” (p.5)
“Stealth and Situational Awareness TCL: The instrumental reasoning abilities of the model enable enough situational awareness (ability to discover and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.” (p.14)
“ML R&D acceleration level 1: Has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates. […]
ML R&D automation level 1: Can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.” (p.15)
“The Framework is informed by the broader conversation on Frontier AI Safety and Security Frameworks.1” (p.2) followed by Footnote 1: “1 See https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety, https://metr.org/faisc, https://www.anthropic.com/rsp-updates, https://www.anthropic.com/news/compliance-framework-SB53, https://openai.com/index/updating-our-preparedness-framework/, https://www.frontiermodelforum.org/publications/#technical-reports.”
“As part of our broader research into and development of frontier AI models, we continue to assess whether there are other risk domains where significant or severe risks may arise and will update our approach as appropriate.” (p.5)
“The Frontier Safety Framework will be reviewed at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of significant and severe risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework.” (p.17)
The framework covers the main risks present in the literature. However, it only provides a basic breakdown of loss of control risks into “stealth and situational awareness”, as well as “acceleration” and “automation” of AI R&D. There is no justification for why other aspects of loss of control risks like autonomy or autonomous self-replication have not been considered.
Further, the framework asserts that they “may […] update [their] risk domains and T/CCLs, where necessary”, but does not specify what concrete evidence would trigger such an update.
“As part of our broader research into and development of frontier AI models, we continue to assess whether there are other risk domains where significant or severe risks may arise and will update our approach as appropriate.” (p.5)
“We may include TCLs for additional risks in the future, as our threat modeling develops.” (p.4)
“The Frontier Safety Framework will be reviewed at least once a year […] Following this assessment, we may: Update our risk domains and T/CCLs, where necessary.” (p.17)
The framework asserts that they “continue to assess whether there are other risk domains where significant or severe risks may arise and will update [their] approach as appropriate”, but does not specify the method of this assessment or whether it is also carried out pre-deployment. Red-teaming is not mentioned in the context of novel risk identification.
“As part of our broader research into and development of frontier AI models, we continue to assess whether there are other risk domains where significant or severe risks may arise and will update our approach as appropriate.” (p.5)
The framework doesn’t mention any third-party procedures pre-deployment to identify novel risk domains or risk models for the frontier model.
No relevant quotes found.
The framework is centered around Critical Capability Levels (CCLs), which are “determined by identifying and analyzing the main foreseeable paths through which a model could result in severe harm”, a reasonably precise definition of risk modeling. However, none of the completed risk models are published. To improve, DeepMind could reference literature in which their risk models have been published, e.g. (Rodriguez et al. 2025). There should also be evidence of a sincere attempt to map out the risk space as fully as possible using representative scenarios.
“The Framework is built primarily around capability thresholds called ‘Critical Capability Levels (CCLs).’ These are capability levels at which, absent mitigation measures, frontier AI models or systems may pose heightened risk of severe harm. CCLs are determined by identifying and analyzing the main foreseeable paths through which a model could result in severe harm: we then define the CCLs as the minimal set of capabilities a model must possess to do so.” (p.4)
“As explained above, we have identified risk domains where, based on early research, we have determined significant or severe risks may be most likely to arise from future models: CBRN, cyber, harmful manipulation, as well as machine learning R&D and misalignment. […] For each of the four identified domains, we have developed specific scenarios and T/CCLs in which these risks could materialize.” (p.5)
“Central to our critical capability assessments are ‘early warning evaluations,’ which we use to test the specific threats and risk scenarios identified through our threat modeling, determine a model’s capability, and assess the proximity of the model to a T/CCL.” (p.5)
“CBRN uplift level 1: Provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.” (p.11)
There is an indication of an awareness of risk modeling methodologies, but there are no concrete details about implementation.
“CCLs are determined by identifying and analyzing the main foreseeable paths through which a model could result in severe harm: we then define the CCLs as the minimal set of capabilities a model must possess to do so.” (p.4)
“We identify potential risks that could stem from our models and analyze their characteristics to determine which of the identified risks could be significant or severe risks.” (p.5)
No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.
No relevant quotes found.
The framework centers around critical capability levels (CCLs), which “may pose heightened risk of severe harm” and are “determined by identifying and analyzing the main foreseeable paths through which a model could result in severe harm”. However, no concrete risk models are given, and the descriptions of severity and probability do not go beyond “heightened risk” and “severe harm”, which are imprecise terms that can be interpreted in various ways. No examples are given of specific risk models that were considered and excluded.
“The Framework is built primarily around capability thresholds called ‘Critical Capability Levels (CCLs).’ These are capability levels at which, absent mitigation measures, frontier AI models or systems may pose heightened risk of severe harm. CCLs are determined by identifying and analyzing the main foreseeable paths through which a model could result in severe harm: we then define the CCLs as the minimal set of capabilities a model must possess to do so.” (p.4)
“As explained above, we have identified risk domains where, based on early research, we have determined severe risks may be most likely to arise from future models” (p.5)
“The Frontier Safety Framework focuses on possible severe risks stemming from high-impact capabilities of frontier AI models.” (p.4)
“Critical Capability Levels (CCLs): are the main capability thresholds around which we have built the Framework process. They represent the capability levels at which, absent mitigation measures, frontier AI models or systems may pose heightened risk of severe harm.
Tracked Capability Levels (TCLs): are capability thresholds which capture a lower level of risks than our CCLs. They represent the capability levels at which, absent mitigation measures, frontier AI models or systems may pose heightened risk of significant but not severe levels of harm.” (p.18)
There is no mention of the external review of risk models.
No relevant quotes found.
They indicate that they will not tolerate certain risks of “severe harm”, a term which is not further defined. Each capability threshold functions as an implicit risk tolerance, e.g. “Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.” There are also multiple mentions of bringing risk to an “appropriate” or “acceptable” level for the risk acceptance criteria. However, these are vague and discretionary. While indeed “the science of AI risk assessment is still developing”, it would be an improvement to state what they would currently use as their risk tolerance, since they will nonetheless operate with one. They also provide no risk acceptance criteria for misalignment risk.
To improve, they must set out the maximum amount of risk the company is willing to accept, for each risk domain (though they need not differ between risk domains), ideally expressed in terms of probabilities and severity (economic damages, physical lives, etc), and separate from KRIs.
“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 4)
“Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.” (p. 10)
“Most CCLs define one important component of our risk acceptance criteria. Because the CCLs for misalignment risk are exploratory and intended for illustration only, we do not associate them with explicit risk acceptance criteria.” (p. 4)
“A model for which the risk assessment indicates a machine learning R&D CCL has been reached will be deemed to pose an acceptable level of risk for further development or deployment, if, for example: We assess that the deployment mitigations have brought the risk of severe harm to an appropriate level proportionate to the risk, based on considerations such as whether the risk has been reduced to an acceptable level by mitigations, and information pertaining to model propensities and the severity of related events.” (pp. 6–7)
“In particular, we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level.” (p. 9)
The risk tolerance, implicit or otherwise, is not expressed fully or partly quantitatively.
No indication of expressing the risk tolerance beyond “severe harm”, which is not further defined. To improve, the risk tolerance should be expressed fully quantitatively or as a combination of scenarios with probabilities.
“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)
No indication of expressing the risk tolerance beyond “severe harm”, which is not further defined. There is no quantitative definition of severity nor probabilities given.
“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)
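As an illustration of what a fully quantitative risk tolerance of the kind recommended above could look like, the following minimal sketch expresses tolerance as a cap on expected harm, summed over scenarios that each have a probability and a severity. It is not drawn from the framework: the `Scenario` type, the function names, the scenario labels, and every number are hypothetical placeholders chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A hypothetical risk scenario (all values are illustrative assumptions)."""
    name: str
    annual_probability: float  # estimated probability of occurrence per year
    severity: float            # estimated harm if it occurs, in one chosen unit


def expected_harm(scenarios):
    """Sum of probability times severity across all scenarios."""
    return sum(s.annual_probability * s.severity for s in scenarios)


def within_tolerance(scenarios, max_expected_harm):
    """True if total expected harm does not exceed the stated tolerance."""
    return expected_harm(scenarios) <= max_expected_harm


# Placeholder scenarios: these numbers are not estimates of any real risk.
scenarios = [
    Scenario("hypothetical scenario A", annual_probability=0.001, severity=1000.0),
    Scenario("hypothetical scenario B", annual_probability=0.01, severity=50.0),
]

total = expected_harm(scenarios)
```

Stating the cap (`max_expected_harm`) separately from the capability thresholds is the point of the sketch: the tolerance is a fixed number that risk estimates are compared against, rather than something implied by the CCL wording.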
No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels. However, there is a process that draws on “relevant high-quality research” and “information shared through industry forums” to inform CCLs (which function as risk tolerances/unacceptable risk tiers). Partial credit is therefore given.
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:
The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)
No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.
No relevant quotes found.
Each risk domain has at least one KRI, which is qualitatively defined. The KRIs appear to be grounded in risk modelling, but are overly vague. To improve, they could define more KRIs of higher severity (i.e. ‘Level 2’) to show preparation, akin to OpenAI’s ‘Critical’ thresholds. They have done this for Instrumental Reasoning capabilities but not for others. KRIs should also map directly to the evaluation tests performed.
pp. 10–11:
“CBRN uplift level 1: Provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.” Footnote 10: “Here, and in other misuse CCLs, we intend this to mean relative to a baseline without generative AI.”
“Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.”
“Harmful manipulation level 1 (exploratory): Possesses manipulative capabilities sufficient to enable it to systematically and substantially change beliefs and behavior in identified high stakes contexts over the course of interactions with the model, reasonably resulting in additional expected harm at severe scale.”
“ML R&D acceleration level 1: Has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates.”
“ML R&D automation level 1: Can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.”
“Instrumental Reasoning Level 1: The instrumental reasoning abilities of the model enable enough situational awareness (ability to work out and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.”
“Instrumental Reasoning Level 2: The instrumental reasoning abilities of the model enable enough situational awareness and stealth that, even when relevant model outputs (including, e.g. scratchpads) are being monitored, we cannot detect or rule out the risk of a model significantly undermining human control.”
“Updated set of risks and mitigations: There may be additional risk domains and critical capabilities that fall into scope as AI capabilities improve and the external environment changes. Future work will aim to include additional pressing risks, which may include additional risk domains or higher CCLs within existing domains.” (p. 8)
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
“Note on Machine Learning R&D CCLs: Risk assessment must take into account the fact that other actors may put significantly more effort into eliciting capabilities than we put into assessing risk, thus requiring conservatism in the form of evaluations. However, as a frontier AI company, we do not expect other groups to put significantly more effort into ML R&D than we do ourselves. As a result, to assess the ML R&D CCLs, we may use sources of information about our own progress at accelerating ML R&D to assess whether we are near or at the CCLs, in addition to evaluations of ML R&D capabilities. Similarly, our alert threshold may be defined based on these sources of information, rather than on evaluation scores.” (pp. 5–6)
“Where model capabilities remain quite distant from a CCL, a response plan may involve the adoption of additional capability assessment processes to flag when heightened mitigations are required.” (p. 6)
The KRIs could be made quantitative, e.g. by specifying precisely what counts as a “sufficient uplift”, “high stakes contexts”, “severe scale”, “historical rates”, “comparable all-inclusive costs”, and so on.
pp. 10–11, 13–14 and 15:
“CBRN uplift level 1: Provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.” Footnote 10: “Here, and in other misuse CCLs, we intend this to mean relative to a baseline without generative AI.”
“Cyber uplift level 1: Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.”
“Harmful manipulation level 1 (exploratory): Possesses manipulative capabilities sufficient to enable it to systematically and substantially change beliefs and behavior in identified high stakes contexts over the course of interactions with the model, reasonably resulting in additional expected harm at severe scale.”
“ML R&D acceleration level 1: Has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates.”
“ML R&D automation level 1: Can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.”
“Instrumental Reasoning Level 1: The instrumental reasoning abilities of the model enable enough situational awareness (ability to work out and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.”
“Instrumental Reasoning Level 2: The instrumental reasoning abilities of the model enable enough situational awareness and stealth that, even when relevant model outputs (including, e.g. scratchpads) are being monitored, we cannot detect or rule out the risk of a model significantly undermining human control.”
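To make concrete what a quantitative KRI could look like, the sketch below compares a single evaluation score against a numeric alert threshold set below a numeric CCL threshold, with the gap between the two acting as a safety buffer. The framework does not publish any such numbers: the function name `kri_status`, the score scale, and the threshold values are assumptions made purely for illustration.

```python
def kri_status(score, alert_threshold, ccl_threshold):
    """Classify an evaluation score against two hypothetical numeric thresholds.

    The alert threshold must sit strictly below the CCL threshold so that
    there is a buffer between the early warning and the CCL itself.
    """
    if not alert_threshold < ccl_threshold:
        raise ValueError("alert_threshold must be below ccl_threshold")
    if score >= ccl_threshold:
        return "ccl_reached"
    if score >= alert_threshold:
        return "alert"
    return "below_alert"


# Placeholder values on an assumed 0-1 evaluation scale.
ALERT = 0.5
CCL = 0.8

statuses = [kri_status(s, ALERT, CCL) for s in (0.2, 0.5, 0.9)]
```

With thresholds written down in this form, a reader could check whether a reported evaluation result crosses a KRI, which is not possible with terms such as “sufficient uplift” or “severe scale”.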
The framework refers to reviewing “model independent information” and to adjusting the alert threshold (i.e., the KRI) if “the rate of progress suggests our safety buffer is no longer adequate.” Whilst this could be more specific, it shows partial implementation of KRIs that monitor the level of risk in the external environment. The ML R&D CCLs also take into account information such as Google DeepMind’s “own progress at accelerating ML R&D”. Mitigation efficacy assessment also takes into account “the historical incidence and severity of related events” for both misuse and ML R&D risks. To improve, the KRI must be measurable, with a specific threshold.
“The Framework is informed by the broader conversation on Frontier AI Safety and Security Frameworks. The core components of such Frameworks are to:
“We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“Note on Machine Learning R&D CCLs: Risk assessment must take into account the fact that other actors may put significantly more effort into eliciting capabilities than we put into assessing risk, thus requiring conservatism in the form of evaluations. However, as a frontier AI company, we do not expect other groups to put significantly more effort into ML R&D than we do ourselves. As a result, to assess the ML R&D CCLs, we may use sources of information about our own progress at accelerating ML R&D to assess whether we are near or at the CCLs, in addition to evaluations of ML R&D capabilities. Similarly, our alert threshold may be defined based on these sources of information, rather than on evaluation scores.” (pp. 5–6)
“We assess that the deployment mitigations have brought the risk of severe harm to an appropriate level proportionate to the risk, based on considerations such as whether the risk has been reduced to an acceptable level by mitigations, the scope of the deployment, what capabilities and mitigations are available on other publicly available models (e.g. if other models are similarly capable and have few mitigations, then the marginal risk added by our release is likely low), and the historical incidence and severity of related events. This is required only for external deployment, not further development.” (p. 7)
“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:
“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.
b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:
For each of the misuse KRIs, they have qualitative containment KCI thresholds related to the RAND security levels, though with a vague qualifier: “at a level generally aligned with RAND SL 2.” It is especially good that some reasoning behind each containment measure is given. However, containment KCIs need to also be defined for the misalignment KRIs.
From pp. 10–11, 13–14, 15:
CBRN, uplift 1: “Security level 2 The difficulty of building defenses against certain CBRN threats means the exfiltration and leak of model weights with this capability could be highly damaging. However, the low to medium resourced actors who would be likely to experience the most CBRN uplift are unlikely to pose a substantial exfiltration threat at the level of RAND OC3 groups.”
Cyber, uplift level 1: “Security level 2 Models able to greatly assist cyber attack might be of interest to well-resourced state actors. However, the potential for automated cyber-defense and social adaptation as a response to exfiltration means that higher levels of security, and the resulting costs to innovation, are likely not warranted.”
Harmful manipulation level 1 (exploratory): “Security level 2 The lower velocity of harm scenarios associated with this CCL and the viability of social defenses against large scale misuse of such models count against security mitigations with substantial costs to innovation.”
Machine Learning R&D, acceleration level 1: “Security level 3 Unrestricted access to models at this level of capability could significantly increase a threat actor’s ability to progress to yet more powerful models and other critical capabilities. The exfiltration of such a model may therefore have a significant effect on society’s ability to adapt to and govern powerful AI models, effects that may have long-lasting consequences. Substantially strengthened security is therefore recommended. However, we expect that acceleration will stem from systems of models integrated with workflows, rather than the model alone. The overall reduced impact of model weights counts against security levels with substantial innovation costs.”
Machine Learning R&D, automation level 1: “We recommend Security level 4 for this capability threshold, but emphasize that this must be taken on by the frontier AI field as a whole. Unrestricted access to models at this level of capability could give an actor (or AI systems) with adequate computational resources the ability to reach capabilities much more powerful than those in the other CCLs listed in a short amount of time. This could be catastrophic if there is no effective way of defending against rapidly improving and potentially superhuman AI systems wielded by threat actors. Therefore, we recommend models at this level of capability have exceptional security even though they may have substantial innovation costs.”
“Given its nascency, we expect our approach to misalignment risk to evolve substantially. This section is therefore illustrative only. Accordingly, we do not indicate security mitigations for models at these CCLs. The table below summarizes the initial approach:” (p. 15)
Footnote 8: “In other words, “security level N” indicates security controls and detections at a level generally aligned with RAND SL N. See https://www.rand.org/pubs/research_reports/RRA2849-1.html, pp 21-22. In aligning our security levels with RAND’s, we are referring to the security goals and principles in the RAND framework, rather than the benchmarks (i.e. concrete measures) also described in the RAND report. As the authors point out, the “security level benchmarks represent neither a complete standard nor a compliance regime—they are provided for informational purposes only and should inform security teams’ decisions rather than supersede them.”” (p. 8)
For each of the misuse KRIs, they reference the RAND security levels as the relevant containment KCI, though with a vague qualifier: “at a level generally aligned with RAND SL 2”.
These RAND levels count somewhat as quantitative containment KCIs, but would need to be coupled with probabilities to be fully quantitative. For instance, the RAND levels state criteria such as: ‘A system that can likely thwart most professional opportunistic efforts by attackers that execute moderate effort or non targeted attacks (OC2).’ Here the actor is defined quantitatively (“Operations roughly less capable than or comparable to a single individual who is broadly capable in information security spending several weeks with a total budget of up to $10,000 on the specific operation, with preexisting personal cyber infrastructure but no preexisting access to the organization”), but ‘likely’ is not; it could be defined quantitatively as a probability.
It is especially good that some reasoning behind each containment measure is given. However, this needs to also be defined for the misalignment KRIs.
From pp. 10–11, 13–14, 15:
CBRN, uplift 1: “Security level 2 The difficulty of building defenses against certain CBRN threats means the exfiltration and leak of model weights with this capability could be highly damaging. However, the low to medium resourced actors who would be likely to experience the most CBRN uplift are unlikely to pose a substantial exfiltration threat at the level of RAND OC3 groups.”
Cyber, uplift level 1: “Security level 2 Models able to greatly assist cyber attack might be of interest to well-resourced state actors. However, the potential for automated cyber-defense and social adaptation as a response to exfiltration means that higher levels of security, and the resulting costs to innovation, are likely not warranted.”
Harmful manipulation level 1 (exploratory): “Security level 2 The lower velocity of harm scenarios associated with this CCL and the viability of social defenses against large scale misuse of such models count against security mitigations with substantial costs to innovation.”
Machine Learning R&D, acceleration level 1: “Security level 3 Unrestricted access to models at this level of capability could significantly increase a threat actor’s ability to progress to yet more powerful models and other critical capabilities. The exfiltration of such a model may therefore have a significant effect on society’s ability to adapt to and govern powerful AI models, effects that may have long-lasting consequences. Substantially strengthened security is therefore recommended. However, we expect that acceleration will stem from systems of models integrated with workflows, rather than the model alone. The overall reduced impact of model weights counts against security levels with substantial innovation costs.”
Machine Learning R&D, automation level 1: “We recommend Security level 4 for this capability threshold, but emphasize that this must be taken on by the frontier AI field as a whole. Unrestricted access to models at this level of capability could give an actor (or AI systems) with adequate computational resources the ability to reach capabilities much more powerful than those in the other CCLs listed in a short amount of time. This could be catastrophic if there is no effective way of defending against rapidly improving and potentially superhuman AI systems wielded by threat actors. Therefore, we recommend models at this level of capability have exceptional security even though they may have substantial innovation costs.”
“Given its nascency, we expect our approach to misalignment risk to evolve substantially. This section is therefore illustrative only. Accordingly, we do not indicate security mitigations for models at these CCLs. The table below summarizes the initial approach:” (p. 15)
Footnote 8: “In other words, “security level N” indicates security controls and detections at a level generally aligned with RAND SL N. See https://www.rand.org/pubs/research_reports/RRA2849-1.html, pp 21-22. In aligning our security levels with RAND’s, we are referring to the security goals and principles in the RAND framework, rather than the benchmarks (i.e. concrete measures) also described in the RAND report. As the authors point out, the “security level benchmarks represent neither a complete standard nor a compliance regime—they are provided for informational purposes only and should inform security teams’ decisions rather than supersede them.”” (p. 8)
The implicit KCI for misuse is “[the ability to] counter the misuse of critical capabilities in deployments.” However, this is still quite vague – there is no detail as to what would indicate this ability or inability. The measures for the deployment KCI include “developing and improving a suite of safeguards targeting the capability”, suggesting each KRI has a deployment KCI, but it is not clear how the KCIs differ for each KRI.
“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:
“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.
b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:
There are no quantitative deployment KCI thresholds given.
No relevant quotes found.
For deceptive alignment, the implicit KCI is “the model is [incapable] of effective instrumental reasoning in ways that [cannot] be monitored.” However, this requires more detail, e.g. what would constitute ability or inability. It is also not specific to the KRIs.
“Mitigation Approach and Critical Capability Levels.
An initial mitigation approach [for deceptive alignment] focuses on detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied. When models reach this capability level, we believe applying an automated monitor to the model’s explicit reasoning (e.g. chain-of-thought output) is an effective mitigation. Once a model is capable of effective instrumental reasoning in ways that cannot be monitored, additional mitigations may be warranted—the development of which is an area of active research.”
There is a clear process for pairing KRIs and KCIs via the ‘safety case’ model, for both misuse and ML R&D risks. However, this pairing should be conducted before the model is developed, to justify in advance of any KRI being crossed why the KCI is high enough to mitigate the resulting level of risk.
Further, safety cases should be developed for misalignment risks, too. The KRIs and KCIs should also be specifically linked via risk models.
Google DeepMind provides more detailed inputs to safety cases than other Providers, specifying factors such as: how much risk has been reduced by mitigations, the likelihood and consequences of misuse or misalignment, the scope of deployment, and historical incidence of related events. They also describe an iterative process for developing and assessing mitigations through testing (e.g. automated evaluations, red teaming) and threat modeling research.
However, several gaps remain. Google DeepMind does not provide a quantified confidence level, safety margin, or discrete measurable steps for risk acceptance determination. Further, like OpenAI, they permit assessing adequacy relative to other companies’ practices (e.g. “if another publicly deployed model is at the same CCL, and has mitigations that are less effective… the deployment of this model is less likely to materially increase risk”). This anchors standards to industry practice rather than absolute risk levels.
Additionally, safety cases should be developed ex ante (i.e. before the model is developed) to justify why the KCI threshold is sufficient to mitigate risk if a KRI is crossed. Google DeepMind’s current approach appears to develop safety cases after capability assessment rather than beforehand.
“Acceptance determination and mitigations: We then determine whether the model has met or will meet a CCL and, if so, whether we need to implement any further mitigations to reduce the risk to an acceptable level (see below).” (p. 5)
“We assess that the deployment mitigations have brought the risk of severe harm to an appropriate level proportionate to the risk, based on considerations such as whether the risk has been reduced to an acceptable level by mitigations, the scope of the deployment, what capabilities and mitigations are available on other publicly available models (e.g. if other models are similarly capable and have few mitigations, then the marginal risk added by our release is likely low), and the historical incidence and severity of related events. This is required only for external deployment, not further development.” (p. 7)
“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:
“These recommended security levels reflect our current thinking proportionate to the risks posed and may be adjusted if our understanding of the risks changes. This may occur if, for example, a model does not possess capabilities meaningfully different from other publicly available models that have weaker security applied (in which case the marginal benefit of higher security is limited), or if we assess that the benefits of the open release of model weights outweigh the risks.” (p. 9)
“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.
b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:
There is not a clear commitment to put development on hold, only that external deployment is subject to review from the appropriate governance function. The commitment “we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level” should make clearer that deployment will be put on hold if the corresponding KCI cannot be met for a given KRI (i.e. CCL). This must be made explicit so that there is as little discretion as is reasonably possible at the time of decision-making.
“external deployments and large scale internal deployments of a model take place only after the appropriate governance function determines the safety case regarding each CCL the model has reached to be adequate. In particular, we will deem deployment mitigations adequate if the evidence suggests that for the CCLs the model has reached, the increase in likelihood of severe harm has been reduced to an acceptable level.” (pp. 12–13)
The framework outlines potential containment measures, but does not commit to them. To improve, they should be precise as to what containment measures they plan to implement. This transparency allows public scrutiny so their measures can improve.
“Security mitigations against exfiltration risk, such as identity and access management practices and hardening interface-access to unreleased model parameters, are important for models reaching CCLs.” (p. 8)
Footnote 11: “Mitigations at this level may include model access management, physical security controls, authentication measures, endpoint security, access management, secure model storage, vulnerability detection & management, detection of & response to suspected malicious activity.” (p. 10)
Footnote 15: “This level may include mitigations aligned with SL 2, plus additional mitigations designed to prevent unilateral access, harden infrastructure, and prevent data exfiltration.” (p. 13)
Footnote 16: “This level may include mitigations aligned with SL 2 and 3, plus additional mitigations aimed to isolate model weights, enhanced data center security, further hardening of infrastructure and minimizing potential attack surface.” (p. 14)
Whilst the framework mentions internal validation that containment measures are sufficient, no proof is provided for why they believe their containment measures are likely to be sufficient.
“We will use various processes to evaluate the effectiveness and limitations of mitigations:
There is no mention of third-party verification of containment measures meeting the threshold.
No relevant quotes found.
The framework mentions some possible deployment measures (‘deployment mitigations’), but without explicit commitment to implementing them. To improve, they should detail precisely the deployment measures which will be implemented to meet the relevant deployment KCI threshold.
“Developing and improving a suite of safeguards targeting the capability, which may include measures such as safety post-training, monitoring and analysis, account moderation, jailbreak detection and patching, user verification, and bug bounties” (p. 8)
The framework describes a process, presumably internal, to “evaluate the effectiveness and limitations of mitigations”, but does not detail why they believe ex ante that their deployment measures will be sufficient. Instead, it relies on the “appropriate corporate governance body” and their discretion. To improve, this proof should be gathered in advance, to give confidence that the measures will be sufficient to meet the KCI threshold once the model crosses the relevant KRI threshold, consistent with the framework’s own stated aim of having “proactive mitigation plans”.
“We will use various processes to evaluate the effectiveness and limitations of mitigations: […] Deployment mitigations: we will use a combination of threat modeling, empirical testing, and other sources of information to assess the effectiveness and limitations of our deployment mitigations. These will form the basis of a safety case for models reaching CCLs, that will be reviewed before deployment.” (p. 6)
“Prepare and articulate proactive mitigation plans to ensure severe risks are adequately mitigated when such capability levels are attained.” (p. 2)
“This process is designed to ensure that residual risk remains at acceptable levels: evidence of efficacy collected during development and testing, as well as expert-driven estimates of other parameters, will enable us to assess residual risk and to detect substantial changes that invalidate our risk assessment. With iteration on safeguards and safety cases, we believe that we are able to make informed decisions about the level of risk via a CCL before a model is released, and reliably prevent models posing unacceptable levels of risk from being deployed.” (p. 9)
There is no mention of third-party verification of deployment measures meeting the threshold.
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
The framework mentions they are “actively researching approaches to addressing models” that reach the highest misalignment capability, instrumental reasoning level 2. However, they do not provide detail on how they will achieve this, or by what point these approaches will need to be in place (i.e., whether assurance processes must be settled before the model has reached some margin of the critical capability).
“Instrumental Reasoning Level 2: […] Mitigation: Future work: We are actively researching approaches to addressing models that reach this CCL.”
“Prepare and articulate proactive mitigation plans to ensure severe risks are adequately mitigated when such capability levels are attained.” (p. 2)
They acknowledge that automated monitoring may become redundant “once a model is capable of effective instrumental reasoning in ways that cannot be monitored”, but do not provide evidence for why they believe they are not yet at this point, or describe how they will measure this. To improve, they should describe a methodology for testing the assumptions on which the effectiveness of their assurance processes depends, such as via evaluations on model organisms or theoretical proofs.
“Here we describe an approach for addressing misalignment risk that focuses specifically on when models may develop a baseline instrumental reasoning ability at which, without additional mitigations, they may have the potential to undermine human control. When models reach this capability level, one possible mitigation is to apply an automated monitor to the model’s explicit reasoning (e.g. chain-of-thought output). Once a model is capable of effective instrumental reasoning in ways that cannot be monitored, additional mitigations may be warranted—the development of which is an area of active research.” (p. 15)
They do explicitly outline that safety cases require many assumptions/factors, but they do not spell out exactly what these assumptions are. Whilst they refer to other work (https://arxiv.org/pdf/2505.01420) which spells out these safety cases in more detail, it would be preferable for this to be in an appendix of the framework, to keep commitments very clear.
More detail could also be given on the assumptions underlying automated monitoring. For instance, the framework could further explore the assumptions required for automated monitoring to be successful and for the control evaluation to be legitimate; these might include faithful chain-of-thought reasoning, sufficient expertise of the human red team or the weaker monitor, or absence of collusion between the monitor and model. Overall then, to improve, the framework should detail the key technical assumptions necessary for the assurance processes to meet the KCI threshold, and evidence for why these assumptions are justified.
“We will use various processes to evaluate the effectiveness and limitations of mitigations:
“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as:
“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.
b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as:
Whilst they express commitment to developing intensive elicitation methods, they do not provide justification that their evaluations are comprehensive enough. Further, “we seek to equip the model” only signals an intent, rather than a commitment. Nonetheless, they do acknowledge that evaluations require “conservatism” because other actors may put more effort into elicitation than they do. More detail could be added on which elicitation methods they anticipate would be used by different threat actors, under realistic settings, and their exact elicitation setup.
“Risk assessment will necessarily involve evaluating cross-cutting capabilities such as agency, tool use, reasoning, and scientific understanding.” (p. 2)
“Analysis: Central to our model evaluations are “early warning evaluations,” to assess the proximity of the model to a CCL. We define “alert thresholds” for these evaluations that are designed to flag when a CCL may be reached before a risk assessment is conducted again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“Risk assessment must take into account the fact that other actors may put significantly more effort into eliciting capabilities than we put into assessing risk, thus requiring conservatism in the form of evaluations.” (p. 5)
They demonstrate an intent to run evaluations frequently, according to a “safety buffer”, implying that this pertains to the rate of progress of AI capabilities, but do not describe what this safety buffer is or what determines how frequently evaluations are run. They commit to evaluating at least whenever there is the “first external deployment” or “if the model has meaningful new capabilities or a material increase in performance.” However, to improve, evaluation frequency should not depend on noticing capability jumps, as jumps may be larger than mitigations can prepare for by the time they are noticed. Instead, a regular, frequent cadence would ensure consistent measurement.
“Analysis: Central to our model evaluations are “early warning evaluations,” to assess the proximity of the model to a CCL. We define “alert thresholds” for these evaluations that are designed to flag when a CCL may be reached before a risk assessment is conducted again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“For each risk domain, we conduct aspects of our risk assessment at various moments throughout the model development process, both before and after deployment. We conduct a risk assessment for the first external deployment of a new frontier AI model. For subsequent versions of the model, we conduct a further risk assessment if the model has meaningful new capabilities or a material increase in performance, until the model is retired or we deploy a more capable model. The reason for this is because a material change in the model’s capabilities may mean that the risk profile of the model has changed or the justification for why the risks stemming from the model are acceptable has been materially undermined. To identify meaningful new capabilities or material increases in performance, we conduct model capability evaluations, including our automated benchmarks. These evaluations are primarily aimed at understanding the capabilities of the model and may be triggered, for example, upon the completion of a pre-training or post-training run, on various candidates of a model version. These evaluations include a broad range of areas, including general capability evaluations, model behavior, efficiency, coding capabilities, multilinguality, or reasoning. Data from these evaluations are collected and analyzed to give us an indication as to how the model is performing and whether a risk assessment is necessary. At a high level, our risk assessment involves the following steps (which do not need to be repeated where a previous risk assessment is still appropriate):” (pp. 4–5)
The “safety buffer” quoted here likely refers to the assumption that capability evaluations are underestimating future capabilities, given post-training enhancements. It would be an improvement to make this more explicit. They also note that safety cases must take into account “capability improvements after the risk assessment”. More detail on this methodology, e.g. the enhancements used, or the forecasting exercises completed to assure a wide enough safety buffer, would improve the score.
Further, more detail could be added on how they account(ed) for how post-training enhancements’ risk profiles change with different model structures – namely, post-training enhancements are much more scalable with reasoning models, as inference compute can often be scaled to improve capabilities.
“Acceptance determination and mitigations: We then determine whether the model has met or will meet a CCL and, if so, whether we need to implement any further mitigations to reduce the risk to an acceptable level (see below).” (p. 5)
“Assessing the robustness of these mitigations against the risk posed through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, and could take into account factors such as: […] The likelihood and consequences of model misuse, capability improvements after the risk assessment, and likelihood and consequences of our mitigations being circumvented, deactivated, or subverted.” (pp. 8–9)
“We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate.” (p. 5)
There is no mention of having the evaluation methodology vetted by third parties. However, they do make a discretionary commitment to involve external experts when determining the level of risk after a KRI threshold is crossed, showing some awareness that external opinion is helpful when assessing the risks and capabilities of a model.
“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)
“We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
There is no mention of having evaluations replicated, though they mention that they “may use additional external evaluators […] if evaluators with relevant expertise are needed to provide an additional signal about a model’s proximity to CCLs.” This only shows partial implementation.
“We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
“we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed.” (p. 3)
There is mention of updating mitigations as a result of post-market monitoring, but not necessarily of measuring mitigations outright, and more detail could be given on how frequently this occurs. An improvement would be to commit to a systematic, ongoing monitoring scheme so that mitigation effectiveness is tracked continuously and the KCI threshold continues to be met when required.
Finally, it is commendable that they conduct “post-deployment processes”, where the “safety cases and mitigations may be updated if deemed necessary by post-market monitoring.” More detail could be provided on what would constitute a necessary update.
“We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“We will use various processes to evaluate the effectiveness and limitations of mitigations: […] Deployment mitigations: we will use a combination of threat modeling, empirical testing, and other sources of information to assess the effectiveness and limitations of our deployment mitigations.” (p. 6)
“our safety cases and mitigations may be updated if deemed necessary by post-market monitoring.” (p. 9)
“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may: […] Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.” (p. 16)
“Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
a. Developing and improving a suite of safeguards targeting the capability, which may include measures such as limiting affordances, monitoring and escalation, auditing, and alignment training, in addition to measures for preventing large scale misuse.
b. Assessing the robustness of these mitigations against the risk posed in both internal and external deployment through testing (e.g. automated evaluations, red teaming) and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as: […] likelihood and consequences of our mitigations being circumvented, deactivated, or subverted […] Model propensity for, historical incidence of and severity of related events: for example, such data may suggest a high (or low) likelihood of misalignment in or misuse of models at the CCL, and mitigations would consequently have to be stronger (or not as strong) for deployment to be appropriate.” (p. 12)
External input into mitigation protocols is optional and only ‘informs’ the response plan.
“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
There is no mention of control evaluations/mitigation testing being replicated or conducted by third parties.
“Analysis: Central to our model evaluations are “early warning evaluations,” to assess the proximity of the model to a CCL. We define “alert thresholds” for these evaluations that are designed to flag when a CCL may be reached before a risk assessment is conducted again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model independent information, external evaluations, and post-market monitoring as appropriate.” (p. 5)
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
They mention sharing information with the government when models have critical capabilities, though the content of this information remains discretionary. There are no commitments to share evaluation reports with the public if models are deployed.
“Our approach to model evaluations and risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any severe risk is properly identified and mitigated. Where appropriate, we may engage relevant and appropriate external actors, including governments, to inform our responsible development and deployment practices.” (p. 5)
“If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share relevant information with appropriate government authorities where it will facilitate safety of frontier AI. Where appropriate, and subject to adequate confidentiality and security measures and considerations around proprietary and sensitive information, this information may include:
We may also consider disclosing information to other external organisations to promote shared learning and coordinated risk mitigation. We will continue to review and evolve our disclosure process over time.” (p. 16)
There is no commitment to permit reports detailing the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.
No relevant quotes found.
They show a commitment to assess “whether there are other risk domains where severe risks may arise” and “update our risk domains and CCLs, where necessary” at least annually. To improve, the framework should detail its process for identifying novel risks or novel risk models, for example through threat modeling exercises or monitoring.
This is especially important because “we cannot detect or rule out the risk of a model significantly undermining human control” is a critical capability level, and so represents “a foreseeable path to severe harm”. Monitoring for changes in this risk profile, or for other factors that make it more or less likely, is therefore highly relevant for assessing risk. While they state an intent to update their set of risks and mitigations, they do not detail a monitoring setup specifically designed to detect novel risk profiles.
“As part of our broader research into frontier AI models, we continue to assess whether there are other risk domains where severe risks may arise and will update our approach as appropriate.” (p. 5)
“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:
[…]
The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)
There is no formal mechanism for incorporating risks identified post-deployment into a structured risk modelling process. However, they do indicate that they may update risk domains at least annually (though not necessarily risk models). To improve, novel risks or risk pathways identified via monitoring post-deployment should trigger further risk modeling and scenario analysis. This may include updating multiple or all risk models.
Google DeepMind commits to updating “risk domains and CCLs” at least annually, but does not describe how novel risks identified via post-deployment monitoring would trigger further risk modeling or scenario analysis. The focus appears to be on updating domains and testing approaches rather than structured incorporation of new risk pathways into existing risk models. To improve, novel risks should explicitly trigger additional threat modeling beyond domain-level updates.
“The Frontier Safety Framework will be updated at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of systemic risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:
[…]
The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p. 16)
No mention of risk owners.
No relevant quotes found.
No mention of a management risk committee.
No relevant quotes found.
The framework outlines fairly detailed protocols for decision-making, but is vaguer than some other companies’ frameworks on who makes the decisions and on what basis.
“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p.6)
“2. Pre-deployment review of residual risk assessment: external deployments of a model take place only after the appropriate governance function determines the residual risk to be acceptable (including a safety case where a CCL has been reached). In particular, we will deem deployment mitigations adequate if the evidence suggests that for the T/CCLs the model has reached, the increase in likelihood of harm from the proposed external deployment has been reduced to an acceptable level. 3. Post-deployment processes: our residual risk assessments, safety cases and mitigations may be updated as a result of post-market monitoring, including information about incidents relating to our frontier safety risk domains. Material updates to a safety case will be submitted to the appropriate governance function for review and might result in updates to the related residual risk assessment or safety case.” (p.9)
“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies.” (p.7)
The framework mentions that GDM “strive to learn” from their post-market monitoring mechanisms, including from “detection, mitigation, response and/or reporting of incidents relating to our frontier safety risk domains”. It also states that “residual risk assessments, safety cases and mitigations may be updated as a result of post-market monitoring, including information about incidents”, which “might result in updates to the related residual risk assessment or safety case”. This provides a high-level description of actions that might be taken in the case of incidents. However, it uses non-committal language and does not specify timelines or the specific actors involved in the process.
“we strive to learn from our post-market monitoring mechanisms, including as part of our detection, mitigation, response and/or reporting of incidents relating to our frontier safety risk domains. Actionable insights from these processes allow us to enhance our tools, training, processes, policies, and response efforts.” (p.5)
“Post-deployment processes: our residual risk assessments, safety cases and mitigations may be updated as a result of post-market monitoring, including information about incidents relating to our frontier safety risk domains. Material updates to a safety case will be submitted to the appropriate governance function for review and might result in updates to the related residual risk assessment or safety case.” (p.14)
No mention of an executive risk officer.
No relevant quotes found.
The only structures mentioned are “appropriate corporate governance bodies” and the “appropriate governance function”, but no specifics are provided.
“Pre-deployment review of residual risk assessment: external deployments and high-risk internal deployments of a model take place only after the appropriate governance function determines the residual risk to be acceptable (including a safety case where a CCL has been reached) […]
Post-deployment processes: our residual risk assessments, safety cases and mitigations may be updated as a result of post-market monitoring, including information about incidents relating to our frontier safety risk domains. Material updates to a safety case will be submitted to the appropriate governance function for review and might result in updates to the related residual risk assessment or safety case.” (p.14)
“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p.17)
“A critical capability assessment may not be conducted for low-risk external deployments (e.g. to a small number of trusted testers) if the appropriate governance function determines the residual risk of such deployments to be acceptable even if the model has reached a T/CCL (including with a safety case where a CCL has been reached).” (p.5)
The framework specifies a system of tracked (TCL) and critical (CCL) capability levels, which helps GDM keep track of the risks stemming from its models in a systematic way. Classification into those levels is based on “evaluation results, expert assessments, and other sources of information”, as well as “model-independent information, external evaluations, and post-market monitoring as appropriate”. While this creates a structure for assessing and categorizing risks, the framework does not describe a concrete way in which the risk information is aggregated, such as a dashboard or equivalent.
“In the Framework, we specify protocols for the detection of capability levels at which frontier AI models may pose significant or severe risks (which we call ‘Tracked Capability Levels (TCLs)’ and ‘Critical Capability Levels (CCLs)’ respectively)” (p.2)
“To understand if a subsequent version of the model has meaningful new capabilities or material increases in performance relative to the last checkpoint subject to a critical capability assessment, we conduct material capability change assessments on various checkpoints upon the completion of a post-training run. […] We may also conduct light-weight versions of our ‘early warning evaluations’ […] to confirm a model remains below T/CCL.” (p.5)
“Central to our critical capability assessments are ‘early warning evaluations,’ which we use to test the specific threats and risk scenarios identified through our threat modeling, determine a model’s capability, and assess the proximity of the model to a T/CCL. For CCLs, we define ‘alert thresholds,’ which may draw on evaluation results, expert assessments, and other sources of information; that are designed to flag when a CCL may be reached before a critical capability assessment is conducted again. […] We may run early warning evaluations more frequently or adjust alert thresholds if the rate of progress suggests our safety buffer is no longer adequate. We conduct further analysis, including reviewing model-independent information, external evaluations, and post-market monitoring as appropriate. In particular, we strive to learn from our post-market monitoring mechanisms, including as part of our detection, mitigation, response and/or reporting of incidents relating to our frontier safety risk domains. Actionable insights from these processes allow us to enhance our tools, training, processes, policies, and response efforts.
Our approach to model evaluations and inherent risk assessments described above means we can proactively monitor a model’s capabilities throughout the entire lifecycle of the model and ensure that any significant or severe risk is properly identified and mitigated.” (pp.5–6)
“Our residual risk assessments, safety cases and mitigations may be updated as a result of post-market monitoring, including information about incidents relating to our frontier safety risk domains. Material updates to a safety case will be submitted to the appropriate governance function for review […]” (p.14)
“When a model has reached this TCL, we will carry out periodic residual risk assessments of the misalignment risk posed. This assessment may take into account models’ alignment propensities, capabilities, and our defenses against misaligned models.” (p.14)
No mention of people that challenge decisions.
No relevant quotes found.
The framework states that “material updates to a safety case will be submitted to the appropriate governance function for review”, and that “updated version [of the framework] and framework assessment will be reviewed by the appropriate corporate governance bodies”. It does not specify what those governance functions or bodies are, what format the reporting takes, or at what cadence information is shared.
“Post-deployment processes: our residual risk assessments, safety cases and mitigations may be updated as a result of post-market monitoring, including information about incidents relating to our frontier safety risk domains. Material updates to a safety case will be submitted to the appropriate governance function for review and might result in updates to the related residual risk assessment or safety case.” (p.14)
“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p.17)
“Responsibilities for assessing and mitigating risks are clearly defined and allocated across all levels of the organization. This includes legal, compliance, and safety reviews with escalation procedures to ensure appropriate oversight.” (p.16)
No mention of a central risk function.
No relevant quotes found.
Google DeepMind commits to assessing “adherence to the Framework” at least annually, but does not specify which body conducts this assessment or how independence is ensured. The review is described as being conducted by “appropriate corporate governance bodies” without further specification. This lacks the independence and specificity typically associated with internal audit functions.
“We have in place a well-established and comprehensive internal governance structure designed to ensure the robust implementation of the processes outlined in this Frontier Safety Framework. Responsibilities for assessing and mitigating risks are clearly defined and allocated across all levels of the organization. This includes legal, compliance, and safety reviews with escalation procedures to ensure appropriate oversight.” (p.16)
“The Frontier Safety Framework will be reviewed at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of significant and severe risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:
● Update our risk domains and T/CCLs, where necessary.
● Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.
The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p.17)
The framework mentions involving external experts “as needed” to assess the proximity of a model to a TCL or CCL or to inform GDM’s development and deployment practices. It leaves unspecified who those external experts are, how their independence is ensured, the degree of access granted to them, and the scope of their involvement.
“[E]arly warning evaluations will be used to assess the proximity of a model to a TCL and analyze the risk posed, involving internal and external experts as needed.” (p.4)
“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed.” (p.6)
“Where appropriate, we may engage relevant external actors, including governments, to inform our responsible development and deployment practices.” (p.6)
No mention of a Board risk committee.
No relevant quotes found.
The framework mentions “corporate governance bodies” that review frameworks and can veto external and high-risk internal deployments of models, but it is unclear whether these entities reside within or outside the board, and what the scope of their roles and responsibilities is.
“We have in place a well-established and comprehensive internal governance structure designed to ensure the robust implementation of the processes outlined in this Frontier Safety Framework. Responsibilities for assessing and mitigating risks are clearly defined and allocated across all levels of the organization. This includes legal, compliance, and safety reviews with escalation procedures to ensure appropriate oversight.” (p.16)
“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p.17)
“Pre-deployment review of residual risk assessment: external deployments and high-risk internal deployments of a model take place only after the appropriate governance function determines the residual risk to be acceptable […] Material updates to a safety case will be submitted to the appropriate governance function for review […]” (p.13)
The framework includes a few references that reinforce the tone from the top, but would benefit from more substantial commitments to managing risk. It also posits that mitigations should be “assessed holistically” to “balanc[e] safety with innovation”, which is a defensible principle but could be interpreted as softening the framework’s commitment to upholding safe development and deployment practices in the face of asserted benefits to “innovation”.
“The Frontier Safety Framework is a set of protocols that aims to address severe risks that may arise from the high-impact capabilities of frontier AI models. It complements Google’s suite of AI responsibility and safety practices, and enables Google’s AI innovation and deployment consistent with our AI Principles.” (p.2)
“The safety and security of frontier AI models is a global public good.” (p.2)
The framework notes that GDM’s safety approach “may change over time” and commits to reviewing the Frontier Safety Framework at least once a year, with more frequent reviews possible in the case of inadequacy of the framework or spotty adherence to it. To improve, the framework should either commit to building a strong risk culture or include more details on risk training, safety drills, internal transparency, and other risk culture-promoting activities.
“The Framework is based on early and evolving research. We may change our approach over time as we gain experience and insights on the projected capabilities of future frontier models. We will review the Framework periodically and we expect it to evolve substantially as our understanding of the risks and benefits of frontier models improves.” (p.2)
“The Frontier Safety Framework will be reviewed at least once a year—more frequently if we have reasonable grounds to believe the adequacy of the Framework or our adherence to it has been materially undermined. The process will involve (i) an assessment of the Framework’s appropriateness for the management of significant and severe risk, drawing on information sources such as record of adherence to the framework, relevant high-quality research, information shared through industry forums, and evaluation results, as necessary, and (ii) an assessment of our adherence to the Framework. Following this assessment, we may:
● Update our risk domains and T/CCLs, where necessary.
● Update our testing and mitigation approaches, where needed to ensure risk remains adequately assessed and addressed according to our current understanding.
The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p.17)
No mention of elements of speak-up culture.
No relevant quotes found.
The framework states which capabilities the company is tracking, but does not explicitly commit to communicating the risk findings for individual models. To improve its score, the company should specify how it will provide information regarding risks going forward, e.g. in model cards.
“In the Framework, we specify protocols for the detection of capability levels at which frontier AI models may pose significant or severe risks (which we call ‘Tracked Capability Levels (TCLs)’ and ‘Critical Capability Levels (CCLs)’ respectively), and articulate mitigation approaches to address such risks. The Framework addresses misuse risk as well as machine learning research and development (ML R&D) and misalignment risk.” (p.2)
The Framework repeatedly mentions “corporate governance bodies” and “governance functions”, but does not clearly distinguish between them or assign clear tasks to different bodies. It includes a distinct section on governance (Section 4), which is commendable, but this section contains only one paragraph, which asserts at a high level that a “comprehensive internal governance structure” exists without specifying any operational detail.
“Section 4: Governance and Accountability
4.1 Governance structure
We have in place a well-established and comprehensive internal governance structure designed to ensure the robust implementation of the processes outlined in this Frontier Safety Framework. Responsibilities for assessing and mitigating risks are clearly defined and allocated across all levels of the organization. This includes legal, compliance, and safety reviews with escalation procedures to ensure appropriate oversight.” (p.16)
“external deployments and high-risk internal deployments of a model take place only after the appropriate governance function determines the residual risk to be acceptable” (p.14)
“The updated version and framework assessment will be reviewed by the appropriate corporate governance bodies.” (p.17)
“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed.” (p.6)
“Material updates to a safety case will be submitted to the appropriate governance function for review and might result in updates to the related residual risk assessment or safety case.” (p.10)
The framework suggests potential information sharing with “appropriate government authorities” and lists aspects this shared information may include, but frames it as a discretionary choice. For a higher score, the company would need to specify more precisely the sharing trigger, the information shared, and the parties it will share with, and make the commitment binding.
“If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share relevant information with appropriate government authorities where it will facilitate safety of frontier AI. Where appropriate, and subject to adequate confidentiality and security measures and considerations around proprietary and sensitive information, this information may include:
● Model information: characteristics of the AI model relevant to the risk it may pose with its critical capabilities.
● Evaluation results: such as details about the evaluation design, the results, and any robustness tests.
● Mitigation plans: descriptions of our mitigation plans and how they are expected to reduce the risk.
We may also consider disclosing information to other external organizations to promote shared learning and coordinated risk mitigation. We will continue to review and evolve our disclosure process over time.” (p.17)