DeepMind

Weak 1.0/5 (scale: very weak, weak, moderate, substantial, strong)

Risk Identification: 20%
Risk Analysis and Evaluation: 19%
Risk Treatment: 26%
Risk Governance: 16%
Best in class

  • Google DeepMind stands out for the relatively large number of governance bodies involved in risk decision-making, including the Google DeepMind AGI Safety Council, the Google DeepMind Responsibility and Safety Council, and the Google Trust & Compliance Council.
Overview
Highlights relative to others

Uniquely includes the risk category of deceptive alignment, and recognizes that automated monitoring may not be sufficient as a mitigation.

References the use of a safety case, demonstrating stronger recognition of the need to prove the safety of models.

Stronger definitions and linkages between KRI thresholds and KCI thresholds.

Weaknesses relative to others

Lacking some key governance measures, such as risk owners, internal audit, and an executive responsible for the risk management process.

Escalation procedures are not specified.

No details of a strong speak-up culture or external information sharing.

Containment measures are not defined.

Changes

Had they not made certain changes to their framework, they would have attained a higher score.

Compared to their Frontier Safety Framework Version 1.0, they:

1. Added an explicit dependency on these practices being adopted by the industry as a whole before mitigations are committed to. This resembles a marginal risk clause, which detracts from the spirit of risk management.

2. Removed the target date for implementation. This detracts from having credible plans to develop assurance processes (and other mitigations).

3. Removed the set cadence for evaluations, opting instead for “regular” evaluations.

1.1 Classification of Applicable Known Risks (40%) 43%

1.1.1 Risks from literature and taxonomies are well covered (50%) 75%

Risk domains covered include CBRN, Cyber, Machine Learning R&D, and instrumental reasoning. More justification could be given for why they focus on instrumental reasoning as the main metric of loss-of-control risk rather than other metrics of loss of control, though it is commendable that they break loss-of-control risks down into more measurable risk areas for their models.

There is a reference to “early research” informing which risk domains they focus on. No further justification is given for why they selected these domains; to improve, they could include the documents that informed their risk identification process. However, they do note that their Framework overall is informed by other frameworks, which they link, showing awareness of the importance of engaging with the wider literature.

Note that 1.1.2 scores below 50% and that persuasion is excluded.

Quotes:

“For misuse risk, we define [Critical Capability Levels] in high-risk domains where, based on early research, we believe risks of severe harm may be most likely to arise from future models:

  • CBRN: Risks of models assisting in the development, preparation, and/or execution of a chemical, biological, radiological, or nuclear (“CBRN”) attack.
  • Cyber: Risks of models assisting in the development, preparation, and/or execution of a cyber attack.
  • Machine Learning R&D: Risks of the misuse of models capable of accelerating the rate of AI progress to potentially destabilizing levels, the result of which could be the unsafe attainment or proliferation of other powerful AI models. Capabilities in this area are under active research, and in the longer term may exacerbate frontier AI risks–including in other risk domains–if insufficiently managed.” (p. 2)

“For deceptive alignment risk, the initial approach focuses on detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied. The two instrumental reasoning CCLs thus focus on delineating when such capability becomes present, and subsequently when the initial mitigation for this capability—automated monitoring—is no longer adequate.” (pp. 2-3)

“The Framework is informed by the broader conversation on Frontier AI Safety Frameworks.” (p. 1) followed by Footnote 1: “See https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety,
https://metr.org/blog/2023-09-26-rsp/, https://www.anthropic.com/news/anthropics-responsible-scaling-policy,
https://openai.com/preparedness/, https://www.frontiermodelforum.org/updates/issue-brief-components-of-frontier-ai-safety-frameworks/”

1.1.2 Exclusions are clearly justified and documented (50%) 10%

A footnote justifies why Autonomy, a risk domain considered previously, has now been omitted from consideration: “Most of the advanced risk that was captured by this CCL is now covered by our misalignment section.” However, more justification could be given for why they believe this. To improve, the justification could refer to at least one of: academic literature/scientific consensus; internal threat modelling with transparency; or third-party validation, with named expert groups and reasons for their validation.

There is no justification for why other risks, such as persuasion or other forms of loss of control risks, have not been considered.

Quotes:

“Note that we have removed the Autonomy risk domain, which was included in Frontier Safety Framework version 1.0. Most of the advanced risk that was captured by this CCL is now covered by our misalignment section. From the perspective of misuse risks, our threat models suggest that no heightened deployment mitigations would be necessary, and that security controls and detection at a level generally aligned with RAND SL 2 would be adequate.” (Footnote 9, p. 5)
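
As an aside on how the headline figures combine: the 43% above is consistent with a simple weighted average of the two sub-criteria, sketched below (the averaging rule is an assumption for illustration, not stated explicitly in the rubric).

# Assumed roll-up for criterion 1.1: a weighted average of its sub-criterion scores.
sub_criteria_1_1 = {
    "1.1.1 Risks from literature and taxonomies": (0.50, 75),  # (weight, score in %)
    "1.1.2 Exclusions justified and documented": (0.50, 10),
}
score_1_1 = sum(weight * score for weight, score in sub_criteria_1_1.values())
print(score_1_1)  # 42.5, reported as 43% in the heading above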

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 0%

1.2.1 Internal open-ended red teaming (70%) 0%

The framework doesn’t mention any pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process to identify either novel risk domains or novel risk models/changed risk profiles within pre-specified risk domains (e.g. the emergence of an extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources, and required expertise.

Quotes:

No relevant quotes found.

1.2.2 Third party open-ended red teaming (30%) 0%

The framework doesn’t mention any third-party pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to an external process to identify either novel risk domains or novel risk models/changed risk profiles within pre-specified risk domains (e.g. the emergence of an extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources, and required expertise.

Quotes:

No relevant quotes found.

1.3 Risk modeling (40%) 7%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 10%

There is an indication of a willingness to engage in risk modelling (i.e. Critical Capability Levels “can be determined by” [threat modeling]), and some evidence of partial implementation, though no explicit commitment to undertaking risk modelling for each risk domain identified. They state that they “aim to address […] greater precision in risk modeling”, indicating awareness that risk models need to be conducted for all monitored risk areas. However, more detail should be given on how they will achieve this precision.

Further, any completed risk models are not published. To improve, they could reference literature in which their risk models have been published, e.g. (Rodriguez et al. 2025).

Quotes:

“[Critical Capability Levels] can be determined by identifying and analyzing the main foreseeable paths through which a model could cause severe harm, and then defining the CCLs as the minimal set of capabilities a model must possess to do so.” (p. 2)

“Future Work: […] Issues that we aim to address in future versions of the Framework include:
– Greater precision in risk modeling: While we have updated our [Critical Capability Levels] and underlying threat models from version 1.0, there remains significant room for improvement in understanding the risks posed by models in different domains, and refining our set of CCLs.”

1.3.2 Risk modeling methodology (40%) 9%

1.3.2.1 Methodology precisely defined (70%) 10%

There is an indication of an awareness of risk modeling methodologies, but there are no details about implementation.

Quotes:

“[Critical Capability Levels] can be determined by identifying and analyzing the main foreseeable paths through which a model could cause severe harm, and then defining the CCLs as the minimal set of capabilities a model must possess to do so.” (p. 2)

“Future Work: […] Greater precision in risk modeling: While we have updated our [Critical Capability Levels] and underlying threat models from version 1.0, there remains significant room for improvement in understanding the risks posed by models in different domains, and refining our set of CCLs.”

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 0%

No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.

Quotes:

No relevant quotes found.

1.3.2.3 Prioritization of severe and probable risks (15%) 10%

There is an explicit intent to prioritize monitoring capabilities in “high-risk domains” which “may be most likely” to cause severe harm, or “may pose heightened risk of severe harm.” However, they do not identify these capabilities from multiple risk models which they then prioritize; rather, they describe a high-level preference. In other words, the list of identified scenarios, plus justification for why their chosen risk models are the most severe or probable, is not detailed.

Quotes:

“These [critical capability levels (CCLs)] are capability levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)

“For misuse risk, we define CCLs in high-risk domains where, based on early research, we believe risks of severe harm may be most likely to arise from future models” (p. 2)

1.3.3 Third party validation of risk models (20%) 0%

There is a brief mention that third parties may help “inform and guide our approach”, but this is very broad and applicable to the whole of the framework. There is no explicit or implicit mention that there is third party input into risk modelling, let alone third party validation.

Quotes:

“The core components of Frontier AI Safety Frameworks are to: […] Where appropriate, involve external parties to help inform and guide our approach.” (p. 1)


2.1 Setting a Risk Tolerance (35%) 7%

2.1.1 Risk tolerance is defined (80%) 8%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 25%

They indicate that they will not tolerate certain risks of “severe harm”, which is not further defined. Each capability threshold functions as an implicit risk tolerance, e.g. “Cyber Autonomy Level 1: Can be used to drastically reduce the cost (e.g. through full automation) of carrying out end-to-end cyberattacks on organizations with a limited security posture.”

To improve, they must set out the maximum amount of risk the company is willing to accept for each risk domain (though it need not differ between risk domains), ideally expressed in terms of probability and severity (economic damage, lives lost, etc.), and separate from KRIs.

Quotes:

“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)

“Cyber Autonomy Level 1: Can be used to drastically reduce the cost (e.g. through full automation) of carrying out end-to-end cyberattacks on organizations with a limited security posture.” (pp. 5–6)

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 0%

The risk tolerance, implicit or otherwise, is not expressed fully or partly quantitatively. There is no indication of expressing the risk tolerance beyond “severe harm”, which is not further defined. To improve, the risk tolerance should be expressed fully quantitatively, or as a combination of scenarios with probabilities.

Quotes:

“In version 2.0 of the Framework, we specify protocols for the detection of capability levels at which models may pose severe risks (which we call “Critical Capability Levels (CCLs)”), and articulate mitigation approaches to address such risks.” (p. 1)

“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

No indication of expressing the risk tolerance beyond “severe harm”, which is not further defined. There is no quantitative definition of severity nor probabilities given.

Quotes:

“In version 2.0 of the Framework, we specify protocols for the detection of capability levels at which models may pose severe risks (which we call “Critical Capability Levels (CCLs)”), and articulate mitigation approaches to address such risks.” (p. 1)

“Critical Capability Levels […] are levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)
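
To illustrate what this criterion asks for, the sketch below shows a fully quantitative tolerance check of the form probability × severity ≤ tolerance; all names and figures are hypothetical and are not drawn from the framework.

# Hypothetical illustration of a fully quantitative risk tolerance, i.e. a cap on
# probability x severity per risk domain. None of these figures are DeepMind's.
TOLERANCE = 100.0  # e.g. maximum accepted expected fatalities per year, per domain

def within_tolerance(annual_probability: float, severity_in_fatalities: float) -> bool:
    expected_harm = annual_probability * severity_in_fatalities
    return expected_harm <= TOLERANCE

# A 1-in-10,000 annual chance of an event causing 100,000 fatalities gives an
# expected harm of 10, within this hypothetical tolerance.
print(within_tolerance(1e-4, 1e5))  # True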

2.1.2 Process to define the tolerance (20%) 0%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 0%

No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels.

Quotes:

No relevant quotes found.

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.

2.2 Operationalizing Risk Tolerance (65%) 26%

2.2.1 Key Risk Indicators (KRI) (30%) 33%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 50%

Each risk domain has at least one KRI, which is specific and qualitatively defined, and the KRIs appear to be grounded in risk modelling. To improve, they could define more KRIs of higher severity (i.e. ‘Level 2’) to show preparation, akin to OpenAI’s ‘Critical’ thresholds; they have done this for Instrumental Reasoning capabilities but not for the other domains. They do, however, signal an intent to explore determining more KRIs in future work. To improve further, KRIs should map directly to the evaluation tests performed.

Quotes:

From Table 1, p. 5:
CBRN: “Uplift Level 1: Can be used to significantly assist a low-resourced actor with dual-use scientific protocols, resulting in a substantial increase in ability to cause a mass casualty event.” with Footnote 10 after ‘event’: “For example, through the use of a self-replicating CBRN agent. Compared to a counterfactual of not using generative AI systems.”

Cyber: “Autonomy Level 1: Can be used to drastically reduce the cost (e.g. through full automation) of carrying out end-to-end cyberattacks on organizations with a limited security posture.” with Footnote 11 after ‘posture’: “E.g. deletion or exfiltration of critical information, or destroying or disabling key systems.”
“Uplift Level 1: Can be used to significantly assist with high impact cyber attacks, resulting in overall cost/resource reductions of an order of magnitude or more.” with footnote 12 after ‘attacks,’: “E.g. deletion or exfiltration of sensitive information/disruption of key systems of organizations with strong security posture, or the creation of predictable and large-scale effects on critical national infrastructure.” and footnote 13 after ‘or more.’: “Relative to the counterfactual of using 2024 AI technology and tooling.”

Machine Learning R&D: “Uplift Level 1: Can or has been used to accelerate AI development, resulting in AI progress substantially accelerating (e.g. 2x) from 2020-2024 rates.”
“Autonomy Level 1: Can fully automate the AI R&D pipeline at a competitive cost. The cost comparison is relative to humans augmented by AI tools.”

From Table 2, p. 7:

“Instrumental Reasoning Level 1: the instrumental reasoning abilities of the model enable enough situational awareness (ability to work out and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.”

“Instrumental Reasoning Level 2: the instrumental reasoning abilities of the model enable enough situational awareness and stealth that, even when relevant model outputs (including, e.g., scratchpads) are being monitored, we cannot detect or rule out the risk of a model significantly undermining human control.”

“Updated set of risks and mitigations: There may be additional risk domains and critical capabilities that fall into scope as AI capabilities improve and the external environment changes. Future work will aim to include additional pressing risks, which may include additional risk domains or higher CCLs within existing domains.” (p. 8)

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%

The KRI for Machine Learning R&D Uplift Level 1 is somewhat quantitative, though the figure is only given as an example rather than an explicit threshold (“e.g. 2x”). The remaining KRIs could be made quantitative, e.g. by specifying precisely what counts as a “competitive cost”, “fully automated”, “significantly assist”, “low-resourced”, “drastically reduce the cost”, “mass casualty event”, and so on.

Quotes:

From Table 1, p. 5:
CBRN: “Uplift Level 1: Can be used to significantly assist a low-resourced actor with dual-use scientific protocols, resulting in a substantial increase in ability to cause a mass casualty event.” with Footnote 10 after ‘event’: “For example, through the use of a self-replicating CBRN agent. Compared to a counterfactual of not using generative AI systems.”

Cyber: “Autonomy Level 1: Can be used to drastically reduce the cost (e.g. through full automation) of carrying out end-to-end cyberattacks on organizations with a limited security posture.” with Footnote 11 after ‘posture’: “E.g. deletion or exfiltration of critical information, or destroying or disabling key systems.”
“Uplift Level 1: Can be used to significantly assist with high impact cyber attacks, resulting in overall cost/resource reductions of an order of magnitude or more.” with footnote 12 after ‘attacks,’: “E.g. deletion or exfiltration of sensitive information/disruption of key systems of organizations with strong security posture, or the creation of predictable and large-scale effects on critical national infrastructure.” and footnote 13 after ‘or more.’: “Relative to the counterfactual of using 2024 AI technology and tooling.”

Machine Learning R&D: “Uplift Level 1: Can or has been used to accelerate AI development, resulting in AI progress substantially accelerating (e.g. 2x) from 2020-2024 rates.”
“Autonomy Level 1: Can fully automate the AI R&D pipeline at a competitive cost. The cost comparison is relative to humans augmented by AI tools.”

From Table 2, p. 7:

“Instrumental Reasoning Level 1: the instrumental reasoning abilities of the model enable enough situational awareness (ability to work out and use relevant details of its deployment setting) and stealth (ability to circumvent basic oversight mechanisms) such that, absent additional mitigations, we cannot rule out the model significantly undermining human control.”

“Instrumental Reasoning Level 2: the instrumental reasoning abilities of the model enable enough situational awareness and stealth that, even when relevant model outputs (including, e.g., scratchpads) are being monitored, we cannot detect or rule out the risk of a model significantly undermining human control.”

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 0%

The KRIs only reference model capabilities.

Quotes:

No relevant quotes found.

2.2.2 Key Control Indicators (KCI) (30%) 31%

2.2.2.1 Containment KCIs (35%) 63%
2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 75%

For each of the misuse KRIs, they have qualitative containment KCI thresholds related to the RAND security levels, though with a vague qualifier: “at a level generally aligned with RAND SL 2.” It is especially good that some reasoning behind each containment measure is given. However, containment KCIs need to also be defined for the Deceptive Alignment KRIs.

Quotes:

From Table 1, p. 5:

CBRN, Uplift Level 1: “Security controls and detections at a level generally aligned with RAND SL 2. The potential magnitude of harm these capabilities may enable means the exfiltration and leak of model weights reaching this CCL could be highly damaging. However, low-resourced actors are unlikely to pose a substantial exfiltration threat.”

Cyber, Autonomy Level 1: “Security controls and detections at a level generally aligned with RAND SL 2. Harmful cyberattacks against organizations with limited security posture can already be carried out by individuals with limited expertise, but the automation of such attacks would significantly lower the costs of doing so. Exfiltration of model weights could enable the execution of such attacks at scale. However, cybersecurity may improve correspondingly when models reach such capability levels. The relatively ambiguous net costs of exfiltration count against security levels with higher costs to innovation.”

Cyber, Uplift level 1: “Security controls and detections at a level generally aligned with RAND SL 2. A model at this capability level could help fairly well-resourced threat actors carry out severe cyber attacks on targets like critical businesses, national government entities, and critical national infrastructure with lower resource expenditure, potentially increasing the frequency of such attacks significantly. However, as above, cyber defense may improve to diminish the impact of AI-assisted cyber attacks. Similarly, the ambiguous net costs of exfiltration count against security levels with higher costs to innovation.”

Machine Learning R&D, Uplift Level 1: “Security controls and detections at a level generally aligned with RAND SL 3. Unrestricted access to models at this level of capability could significantly increase an actor’s ability to progress to yet more powerful models and other critical capabilities. The exfiltration of such a model may therefore have a significant effect on society’s ability to adapt to and govern powerful AI systems, effects that may have long-lasting consequences. Substantially strengthened security is therefore recommended.”

Machine Learning R&D, Autonomy Level 1: “We recommend security controls and detections at a level generally aligned with RAND SL 4 for this capability threshold, but emphasize that this must be taken on by the frontier AI field as a whole.
Unrestricted access to models at this level of capability could give an actor with adequate computational resources the ability to reach capabilities much more powerful than those in the other CCLs listed in a short amount of time. This could be catastrophic if there is no effective way of defending against rapidly improving and potentially superhuman AI systems wielded by threat actors. Therefore, we recommend models at this level of capability have exceptional security even though they may have substantial innovation costs.”

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 50%

For each of the misuse KRIs, they reference the RAND security levels as the relevant containment KCI, though with a vague qualifier: “at a level generally aligned with RAND SL 2”.

These RAND levels count somewhat as quantitative containment KCIs, but would need to be coupled with probabilities to be fully quantitative. For instance, the RAND levels state criteria such as ‘A system that can likely thwart most professional opportunistic efforts by attackers that execute moderate effort or non targeted attacks (OC2)’, and the actor is defined quantitatively (“Operations roughly less capable than or comparable to a single individual who is broadly capable in information security spending several weeks with a total budget of up to $10,000 on the specific operation, with preexisting personal cyber infrastructure but no preexisting access to the organization”), but ‘likely’ could itself be defined quantitatively as a probability.

It is especially good that some reasoning behind each containment measure is given. However, this needs to also be defined for the Deceptive Alignment KRIs.

Quotes:

From Table 1, p. 5:

CBRN, Uplift Level 1: “Security controls and detections at a level generally aligned with RAND SL 2. The potential magnitude of harm these capabilities may enable means the exfiltration and leak of model weights reaching this CCL could be highly damaging. However, low-resourced actors are unlikely to pose a substantial exfiltration threat.”

Cyber, Autonomy Level 1: “Security controls and detections at a level generally aligned with RAND SL 2. Harmful cyberattacks against organizations with limited security posture can already be carried out by individuals with limited expertise, but the automation of such attacks would significantly lower the costs of doing so. Exfiltration of model weights could enable the execution of such attacks at scale. However, cybersecurity may improve correspondingly when models reach such capability levels. The relatively ambiguous net costs of exfiltration count against security levels with higher costs to innovation.”

Cyber, Uplift Level 1: “Security controls and detections at a level generally aligned with RAND SL 2. A model at this capability level could help fairly well-resourced threat actors carry out severe cyber attacks on targets like critical businesses, national government entities, and critical national infrastructure with lower resource expenditure, potentially increasing the frequency of such attacks significantly. However, as above, cyber defense may improve to diminish the impact of AI-assisted cyber attacks. Similarly, the ambiguous net costs of exfiltration count against security levels with higher costs to innovation.”

Machine Learning R&D, Uplift Level 1: “Security controls and detections at a level generally aligned with RAND SL 3. Unrestricted access to models at this level of capability could significantly increase an actor’s ability to progress to yet more powerful models and other critical capabilities. The exfiltration of such a model may therefore have a significant effect on society’s ability to adapt to and govern powerful AI systems, effects that may have long-lasting consequences. Substantially strengthened security is therefore recommended.”

Machine Learning R&D, autonomy level 1: “We recommend security controls and detections at a level generally aligned with RAND SL 4 for this capability threshold, but emphasize that this must be taken on by the frontier AI field as a whole. Unrestricted access to models at this level of capability could give an actor with adequate computational resources the ability to reach capabilities much more powerful than those in the other CCLs listed in a short amount of time. This could be catastrophic if there is no effective way of defending against rapidly improving and potentially superhuman AI systems wielded by threat actors. Therefore, we recommend models at this level of capability have exceptional security even though they may have substantial innovation costs.”
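
For reference, the containment KCIs quoted above reduce to a small mapping from each misuse CCL to a recommended RAND security level, sketched below (the mapping structure is illustrative; the levels are taken from the Table 1 quotes).

# Containment KCIs per misuse CCL, as quoted from Table 1 (RAND security levels).
containment_kci_by_ccl = {
    "CBRN Uplift Level 1": "RAND SL 2",
    "Cyber Autonomy Level 1": "RAND SL 2",
    "Cyber Uplift Level 1": "RAND SL 2",
    "Machine Learning R&D Uplift Level 1": "RAND SL 3",
    "Machine Learning R&D Autonomy Level 1": "RAND SL 4",
}
# No containment KCI is defined for the deceptive alignment (instrumental reasoning)
# CCLs, which is the gap noted in the assessment above.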

2.2.2.2 Deployment KCIs (35%) 5%
2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 10%

The implicit KCI for misuse is “[the ability to] counter the misuse of critical capabilities in deployments.” However, this is still quite vague – there is no detail as to what would indicate this ability or inability. The measures for the deployment KCI include “developing and improving a suite of safeguards targeting the capability”, suggesting each KRI has a deployment KCI, but it is not clear how the KCIs differ for each KRI.

 

Quotes:

“For misuse: […] [we have] deployment mitigations (such as safety fine-tuning and misuse filtering, detection, and response) intended to counter the misuse of critical capabilities in deployments. For deceptive alignment risk, automated monitoring may be applied to detect and respond to deceptive behavior for models that meet the first deceptive alignment CCL. Note that these mitigations reflect considerations from the perspective of addressing severe risks from powerful capabilities alone; due to this focused scope, other risk management and security considerations may result in more stringent mitigations applied to a model than specified by the Framework.” (p. 3)

 

2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 0%

There are no quantitative deployment KCI thresholds given.

Quotes:

No relevant quotes found.

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 25%

For deceptive alignment, the implicit KCI is “the model is [incapable] of effective instrumental reasoning in ways that [cannot] be monitored.” However, this requires more detail, e.g. what would constitute ability or inability. It is also not specific to the KRIs.

Quotes:

“Mitigation Approach and Critical Capability Levels.
An initial mitigation approach [for deceptive alignment] focuses on detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied. When models reach this capability level, we believe applying an automated monitor to the model’s explicit reasoning (e.g. chain-of-thought output) is an effective mitigation. Once a model is capable of effective instrumental reasoning in ways that cannot be monitored, additional mitigations may be warranted—the development of which is an area of active research.”

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 25%

There is a clear process for pairing KRI and KCIs via the ‘safety case’ model, for both misuse and deceptive alignment domains of risk. However, this should be conducted before the model is developed, to justify why the KCI is high enough to mitigate the risk level if a KRI is crossed.

This could be improved by defining what ‘safe deployment’ is precisely, i.e. the risk tolerance.

Quotes:

Misuse: “The following deployment mitigation process will be applied to models reaching a CCL, allowing for iterative and flexible tailoring of mitigations to each risk and use case.

  1. Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
    a. Developing and improving a suite of safeguards targeting the capability. This includes, as appropriate, safety fine-tuning, misuse filtering and detection, and response protocols.
    b. Assessing the robustness of these mitigations against the risk posed through assurance evaluations and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as the likelihood and consequences of misuse.
  2. Pre-deployment review of safety case: general availability deployment of a model takes place only after the appropriate corporate governance body determines the safety case regarding each CCL the model has reached to be adequate.
  3. Post-deployment review of safety case: the safety case will be updated through red-teaming and revisions to our threat models. The safeguards for the model may be updated as well to ensure continued adequacy.” (p. 4)

Deceptive alignment:
“Looking forward, we expect the approach to deceptive alignment risks to take a similar form as the deployment mitigation process. For each deceptive alignment risk, AI developers should:

  • Develop a safety case based on the model’s capabilities and the mitigations applied and tested. Such safety cases would make the case that the likelihood of deceptive alignment risk would be low enough for safe deployment, even if the model were trying to meaningfully undermine human control.
  • Test the mitigations applied to models via red-teaming protocols called “control evaluations,” to determine whether they meet the requirements of the safety case.

The approach above relies on two safety cases centering respectively on the claims that:

  1. A model does not meet a threshold CCL and therefore would be incapable of substantially undermining human control.
  2. An automated monitoring process that is validated by a control evaluation is adequate for catching the model’s malicious behavior.

Once neither of these apply, that is, once automated monitoring is no longer adequate/can no longer pass a control evaluation, a model is deemed to have reached the second CCL.”
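
The quoted approach amounts to a simple decision rule: Instrumental Reasoning Level 2 is deemed reached once neither safety case applies. A minimal sketch of that rule follows (function and argument names are illustrative, not from the framework).

def reached_instrumental_reasoning_level_2(model_below_threshold_ccl: bool,
                                           monitoring_passes_control_eval: bool) -> bool:
    # Safety case 1: the model does not meet the threshold CCL.
    # Safety case 2: automated monitoring, validated by a control evaluation, is adequate.
    # Per the quote, once neither applies, the model is deemed to have reached the second CCL.
    return not model_below_threshold_ccl and not monitoring_passes_control_eval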

2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 10%

There is no clear commitment to put development on hold, only that “the response plan may involve putting deployment or further development on hold.” This applies only if a model “may be assessed to pose risks for which readily available mitigations may not be sufficient”, but this is not a clear threshold and appears to be discretionary.

Quotes:

“A model flagged by an alert threshold may be assessed to pose risks for which readily available mitigations (including but not limited to those described below) may not be sufficient. If this happens, the response plan may involve putting deployment or further development on hold until adequate mitigations can be applied. Conversely, where model capabilities remain quite distant from a CCL, a response plan may involve the adoption of additional capability assessment processes to flag when heightened mitigations may be required.” (p. 3)


3.1 Implementing Mitigation Measures (50%) 22%

3.1.1 Containment measures (35%) 0%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 0%

The framework explicitly does not define containment measures.

Quotes:

“Here, we rely on the RAND framework to articulate the level of security recommended for each CCL. When we reference RAND security levels, we are referring to the security principles in their framework, rather than the benchmarks (i.e. concrete measures) also described in the RAND report. Because AI security is an area of active research, we expect the concrete measures implemented to reach each level of security to evolve substantially.” (p. 4) with footnote after ‘report.’: “As the authors point out, the ‘security level benchmarks represent neither a complete standard nor a compliance regime—they are provided for informational purposes only and should inform security teams’ decisions rather than supersede them.’ “

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 0%

Whilst the framework outlines general safety cases (see quotes below for context), these appear to apply only to deployment and assurance KCIs, not to security controls (i.e. containment KCIs). There is thus no mention of internal validation that containment measures are sufficient, nor any proof for why they believe their chosen containment measures are likely to be sufficient.

Quotes:

Misuse: “The following deployment mitigation process will be applied to models reaching a CCL, allowing for iterative and flexible tailoring of mitigations to each risk and use case.

  1. Development and assessment of mitigations: safeguards and an accompanying safety case are developed by iterating on the following:
    a. Developing and improving a suite of safeguards targeting the capability. This includes, as appropriate, safety fine-tuning, misuse filtering and detection, and response protocols.
    b. Assessing the robustness of these mitigations against the risk posed through assurance evaluations and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as the likelihood and consequences of misuse.
  2. Pre-deployment review of safety case: general availability deployment of a model takes place only after the appropriate corporate governance body determines the safety case regarding each CCL the model has reached to be adequate.
  3. Post-deployment review of safety case: the safety case will be updated through red-teaming and revisions to our threat models. The safeguards for the model may be updated as well to ensure continued adequacy.” (p. 4)

Deceptive alignment:
“Looking forward, we expect the approach to deceptive alignment risks to take a similar form as the deployment mitigation process. For each deceptive alignment risk, AI developers should:

  • Develop a safety case based on the model’s capabilities and the mitigations applied and tested. Such safety cases would make the case that the likelihood of deceptive alignment risk would be low enough for safe deployment, even if the model were trying to meaningfully undermine human control.
  • Test the mitigations applied to models via red-teaming protocols called “control evaluations,” to determine whether they meet the requirements of the safety case.

The approach above relies on two safety cases centering respectively on the claims that:

  1. A model does not meet a threshold CCL and therefore would be incapable of substantially undermining human control.
  2. An automated monitoring process that is validated by a control evaluation is adequate for catching the model’s malicious behavior.

Once neither of these apply, that is, once automated monitoring is no longer adequate/can no longer pass a control evaluation, a model is deemed to have reached the second CCL.”

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 0%

There is no mention of third-party verification of containment measures meeting the threshold.

Quotes:

No relevant quotes found.

3.1.2 Deployment measures (35%) 40%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 50%

The framework mentions some possible deployment measures (‘deployment mitigations’), but without explicit commitment to implementing them. To improve, they should detail precisely the deployment measures which will be implemented to meet the relevant deployment KCI threshold.

Quotes:

“Developing and improving a suite of safeguards targeting the capability. This includes, as appropriate, safety fine-tuning, misuse filtering and detection, and response protocols.” (p. 4)

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 25%

The framework describes a process, presumably internal, for “assessing the robustness of these mitigations against the risk posed through assurance evaluations and threat modeling research”, but does not detail what these assurance evaluations involve. Further, evidence that the measures are sufficient is not given ex ante; instead, sufficiency relies on the discretion of the “appropriate corporate governance body”. To improve, this proof should be gathered as far in advance as possible, to ensure the measures will meet the KCI threshold once the model crosses the relevant KRI threshold.

Quotes:

“The following deployment mitigation process will be applied to models reaching a CCL

  1. (b) Assessing the robustness of these mitigations against the risk posed through assurance evaluations and threat modeling research. The assessment takes the form of a safety case, taking into account factors such as the likelihood and consequences of misuse.
  2. Pre-deployment review of safety case: general availability deployment of a model takes place only after the appropriate corporate governance body determines the safety case regarding each CCL the model has reached to be adequate.
  3. Post-deployment review of safety case: the safety case will be updated through red-teaming and revisions to our threat models. The safeguards for the model may be updated as well to ensure continued adequacy.” (p. 4)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 0%

There is no mention of third-party verification of deployment measures meeting the threshold.

Quotes:

No relevant quotes found.

3.1.3 Assurance processes (30%) 28%

3.1.3.1 Credible plans towards the development of assurance properties (40%) 25%

The framework mentions that they are “actively researching approaches to addressing models” that reach the highest misalignment capability, Instrumental Reasoning Level 2. However, they do not provide detail on how they will achieve this, or by what point it will need to be in place (i.e., whether assurance processes must be settled before the model comes within some margin of the critical capability).

Quotes:

“Instrumental Reasoning Level 2: […] Mitigation: Future work: We are actively researching approaches to addressing models that reach this CCL.”

3.1.3.2 Evidence that the assurance properties are enough to achieve their corresponding KCI thresholds (40%) 25%

They mention engaging in efficacy assessments of their assurance processes (namely, automated monitoring) by conducting control evaluations, but do not provide further detail on how they will collect evidence of these processes’ efficacy. To improve, they should describe the red-teaming setup, the expertise involved, and the methodology.

Quotes:

“Looking forward, we expect the approach to deceptive alignment risks to take a similar form as the deployment mitigation process. For each deceptive alignment risk, AI developers should:

  • Develop a safety case based on the model’s capabilities and the mitigations applied and tested. Such safety cases would make the case that the likelihood of deceptive alignment risk would be low enough for safe deployment, even if the model were trying to meaningfully undermine human control.
  • Test the mitigations applied to models via red-teaming protocols called “control evaluations,” to determine whether they meet the requirements of the safety case.” (p. 7)

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 50%

They do explicitly outline that safety cases for proving sufficiency of assurance processes require two key assumptions: “1. A model does not meet a threshold CCL and therefore would be incapable of substantially undermining human control” and “2. An automated monitoring process that is validated by a control evaluation is adequate for catching the model’s malicious behavior.”

However, more detail could be given on these assumptions. For instance, the framework could further explore what successful automated monitoring, and a legitimate control evaluation, would require; this might include faithful chain-of-thought reasoning, sufficient red-team expertise (of the human red team or the weaker monitor), or the absence of collusion between the monitor and the model. Overall, to improve, the framework should detail the key technical assumptions necessary for the assurance processes to meet the KCI threshold, and the evidence for why these assumptions are justified.

Quotes:

“The approach above relies on two safety cases centering respectively on the claims that:

  1. A model does not meet a threshold CCL and therefore would be incapable of substantially undermining human control.
  2. An automated monitoring process that is validated by a control evaluation is adequate for catching the model’s malicious behavior.

Once neither of these apply, that is, once automated monitoring is no longer adequate/can no longer pass a control evaluation, a model is deemed to have reached the second CCL.” (p. 7)

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 30%

3.2.1 Monitoring of KRIs (40%) 27%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 50%

Whilst they express a commitment to developing intensive elicitation methods, they do not provide justification that their evaluations are comprehensive enough. Further, “we seek to equip the model” signals only an intent rather than a commitment. Nonetheless, they do commit to researching “a growing number of possible post-training enhancements”, which encompasses elicitation methods. More detail could be added on which elicitation methods they anticipate different threat actors would use under realistic settings.

Quotes:

“In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model.” (p. 3)

“Future Work. […] Capability elicitation: Our evaluators continue to improve their ability to estimate what capabilities may be attainable by different threat actors with access to our models, taking into account a growing number of possible post-training enhancements.”

3.2.1.2 Evaluation frequency (25%) 10%

They demonstrate an intent to run evaluations frequently, according to a “safety buffer” that appears to relate to the rate of progress of AI capabilities, but they do not describe what this safety buffer is or what determines how frequently evaluations are run.

Quotes:

“We intend to evaluate our most powerful frontier models regularly to check whether their AI capabilities are approaching a CCL. We also intend to evaluate any of these models that could indicate an exceptional increase in capabilities over previous models, and where appropriate, assess the likelihood of such capabilities and risks before and during training.
To do so, we will define a set of evaluations called “early warning evaluations,” with a specific “alert threshold” that flags when a CCL may be reached before the evaluations are run again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate.” (p. 3)
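
Read as a protocol, the quoted cadence is an early-warning loop: evaluate regularly, flag when an alert threshold is crossed, and tighten the cadence or threshold if the safety buffer looks inadequate. The sketch below reflects that reading only; names, structure, and return values are illustrative and not DeepMind’s implementation.

def early_warning_step(eval_score: float, alert_threshold: float,
                       capability_progress_rate: float, buffer_limit: float) -> str:
    if eval_score >= alert_threshold:
        # "assess the proximity of the model to the CCL and analyze the risk posed,
        # involving internal and external experts as needed" -> response plan
        return "formulate and apply response plan"
    if capability_progress_rate > buffer_limit:
        # "run early warning evaluations more frequently or adjust the alert threshold"
        return "increase evaluation frequency or adjust alert threshold"
    return "continue regular evaluations"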

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 25%

The “safety buffer” quoted here likely refers to the assumption that capability evaluations underestimate future capabilities, given post-training enhancements; making this explicit would be an improvement. They also note that elicitation efforts must take into account a “growing number of possible post-training enhancements.” More detail on this methodology, e.g. the enhancements used, or the forecasting exercises completed to ensure a wide enough safety buffer, would improve the score.

Further, more detail could be added on how they account for the way post-training enhancements’ risk profiles change with different model structures: notably, post-training enhancements are much more scalable with reasoning models, as inference compute can often be scaled to improve capabilities.

Quotes:

“[…] we will define a set of evaluations called “early warning evaluations,” with a specific “alert threshold” that flags when a CCL may be reached before the evaluations are run again. In our evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate.” (p. 3)

“Capability elicitation: Our evaluators continue to improve their ability to estimate what capabilities may be attainable by different threat actors with access to our models, taking into account a growing number of possible post-training enhancements.” (p. 8)

3.2.1.4 Vetting of protocols by third parties (15%) 10%

There is no mention of having the evaluation methodology vetted by third parties. However, they do make a discretionary commitment to involve external experts when determining the level of risk after a KRI threshold is crossed, showing some awareness that external opinion is helpful when assessing the risks and capabilities of a model.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)

3.2.1.5 Replication of evaluations by third parties (15%) 25%

There is no mention of having evaluations replicated by third parties, though they mention that they “may use additional external evaluators […] if evaluators with relevant expertise are needed to provide an additional signal about a model’s proximity to CCLs.” This shows only partial implementation.

Quotes:

“We may use additional external evaluators to test a model for relevant capabilities, if evaluators with relevant expertise are needed to provide an additional signal about a model’s proximity to CCLs.” (p. 3)

3.2.2 Monitoring of KCIs (40%) 33%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 75%

There is mention of reviewing mitigations “periodically” to directly assess their efficacy; to improve, more detail could be given on how frequent these reviews are. The implementation of post-mitigation testing, with specific descriptions of efficacy data such as “misuse attempt incidents”, shows structured implementation of the criterion. However, the qualifier “drawing on information like” implies that the process is discretionary; an improvement would be to commit to a systematic, ongoing monitoring scheme so that mitigation effectiveness is tracked continuously and the KCI threshold is still met when required.

Finally, it is commendable that they conduct a “post-deployment review of safety case”, where the “safeguards for the model may be updated as well to ensure continued adequacy.” More detail could be provided on what would constitute a necessary update.

Quotes:

“The appropriateness and efficacy of applied mitigations should be reviewed periodically, drawing on information like related misuse or misuse attempt incidents; results from continued post-mitigation testing; statistics about our intelligence, monitoring and escalation processes; and updated threat modeling and risk landscape analysis.” (p. 3)

“Post-deployment review of safety case: the safety case will be updated through red-teaming and revisions to our threat models. The safeguards for the model may be updated as well to ensure continued adequacy.” (p. 4)

3.2.2.2 Vetting of protocols by third parties (30%) 10%

External input into mitigation protocols is optional and only ‘informs’ the response plan.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)

3.2.2.3 Replication of evaluations by third parties (30%) 0%

There is no mention of control evaluations/mitigation testing being replicated or conducted by third-parties.

Quotes:

No relevant quotes found.

3.2.3 Transparency of evaluation results (10%) 43%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 50%

They mention sharing information with the government when models have critical capabilities, though the content of this information remains discretionary. There are no commitments to share evaluation reports with the public if models are deployed.

Quotes:

“If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share information with appropriate government authorities where it will facilitate the development of safe AI. Where appropriate, and subject to adequate confidentiality and security measures and considerations around proprietary and sensitive information, this information may include:

  • Model information: characteristics of the AI model relevant to the risk it may pose with its critical capabilities.
  • Evaluation results: such as details about the evaluation design, the results, and any robustness tests.
  • Mitigation plans: descriptions of our mitigation plans and how they are expected to reduce the risk.

We may also consider disclosing information to other external organizations to promote shared learning and coordinated risk mitigation. We will continue to review and evolve our disclosure process over time.” (p. 8)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

There is no commitment to permitting the reports detailing the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 18%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 10%

No process is detailed for monitoring for, or actively seeking out, novel risks post-deployment, apart from a post-deployment review of the safety case for misuse risks (which represents some structured process). To improve, such a process should be detailed. This is especially important because “we cannot detect or rule out the risk of a model significantly undermining human control” is itself a critical capability level, and so represents “a foreseeable path to severe harm”; monitoring for changes in this risk profile, or for other factors that make it more or less likely, is therefore highly relevant for assessing risk. Whilst they state an intent to update their set of risks and mitigations, a monitoring setup specifically for detecting novel risk profiles is not detailed.

Quotes:

“CCLs can be determined by identifying and analyzing the main foreseeable paths through which a model could cause severe harm, and then defining the CCLs as the minimal set of capabilities a model must possess to do so” (p. 1)

“Post-deployment review of safety case: the safety case will be updated through red-teaming and revisions to our threat models.” (p. 4)

“Future work: […] Updated set of risks and mitigations: There may be additional risk domains and critical capabilities that fall into scope as AI capabilities improve and the external environment changes. Future work will aim to include additional pressing risks, which may include additional risk domains or higher CCLs within existing domains.”(p. 8)

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 25%

There is no formal mechanism for incorporating risks identified post-deployment into a structured risk modelling process. However, they do indicate that they incorporate risks identified post-deployment, showing some structured implementation, and intend to dedicate future work to incorporating additional risks.

Quotes:

“Post-deployment review of safety case: the safety case will be updated through red-teaming and revisions to our threat models.” (p. 4)

“Future work: […] Updated set of risks and mitigations: There may be additional risk domains and critical capabilities that fall into scope as AI capabilities improve and the external environment changes. Future work will aim to include additional pressing risks, which may include additional risk domains or higher CCLs within existing domains.”(p. 8)

4.1 Decision-making (25%) 13%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 0%

No mention of risk owners.

Quotes:

No relevant quotes found.

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No mention of a management risk committee.

Quotes:

No relevant quotes found.

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 50%

The framework outlines fairly detailed protocols for decision-making tied to capability levels; to improve, it should specify who makes these decisions and on what basis. See the sketch after the quotes below for an illustration of the gating logic they describe.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.” (p. 3)
“A model flagged by an alert threshold may be assessed to pose risks for which readily available mitigations (including but not limited to those described below) may not be sufficient. If this happens, the response plan may involve putting deployment or further development on hold until adequate mitigations can be applied.” (p. 3)
“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies”. (p. 7)
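
To make the decision gating concrete, below is a minimal, purely illustrative sketch of the kind of go/no-go logic the quoted protocol describes. The names (Decision, ResponsePlan, go_no_go) and the specific conditions are assumptions introduced for illustration, not details taken from the framework.

```python
# Illustrative sketch only: names and conditions are hypothetical, not
# Google DeepMind's implementation.
from dataclasses import dataclass
from enum import Enum, auto


class Decision(Enum):
    PROCEED = auto()
    HOLD = auto()  # put deployment or further development on hold


@dataclass
class ResponsePlan:
    mitigations: list[str]
    mitigations_adequate: bool    # judged sufficient for the assessed risk
    approved_by_governance: bool  # reviewed/approved by the relevant bodies


def go_no_go(alert_threshold_reached: bool, plan: ResponsePlan) -> Decision:
    """Gate deployment on the response plan once an alert threshold is reached."""
    if not alert_threshold_reached:
        return Decision.PROCEED
    # Per the quoted protocol, a hold may follow if readily available
    # mitigations are not sufficient or the plan lacks governance approval.
    if plan.mitigations_adequate and plan.approved_by_governance:
        return Decision.PROCEED
    return Decision.HOLD


# Example: alert threshold reached, but mitigations judged insufficient.
plan = ResponsePlan(mitigations=["enhanced security"],
                    mitigations_adequate=False,
                    approved_by_governance=True)
print(go_no_go(True, plan))  # Decision.HOLD
```

The point the sketch captures is that crossing an alert threshold does not by itself force a hold; the hold depends on judgments about mitigation adequacy and governance approval, which is why specifying who makes those judgments, and on what basis, matters for the score.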

4.1.4 The company has defined escalation procedures in case of incidents (25%) 0%

No mention of escalation procedures.

Quotes:

No relevant quotes found.

4.2. Advisory and Challenge (20%) 28%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 0%

No mention of an executive risk officer.

Quotes:

No relevant quotes found.

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 90%

The company has a large number of councils that advise management on AI risk matters.

Quotes:

“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies such as the Google DeepMind AGI Safety Council, Google DeepMind Responsibility and Safety Council, and/or Google Trust & Compliance Council.” (p. 7)

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 50%

The framework gives some detail on its system for monitoring risk levels in terms of capability levels. To improve, they should monitor risk indicators beyond capabilities alone and integrate these for a holistic view of risk. See the sketch after the quotes below for an illustration of the alert-threshold mechanism they describe.

Quotes:

“Critical Capability Levels. These are capability levels at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm.” (p. 2)
“We intend to evaluate our most powerful frontier models regularly to check whether their AI capabilities are approaching a CCL.” (p. 3)
“We will define a set of evaluations called ‘early warning evaluations,’ with a specific ‘alert threshold’ that flags when a CCL may be reached before the evaluations are run again.” (p. 3)
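
To illustrate the quoted mechanism, the sketch below checks early-warning evaluation scores against alert thresholds. The evaluation names and threshold values are invented for illustration and do not appear in the framework.

```python
# Illustrative sketch only: evaluation names and thresholds are hypothetical.
ALERT_THRESHOLDS = {
    # early warning evaluation -> score at which a CCL may be reached
    # before the evaluations are run again
    "cyber_uplift_eval": 0.6,
    "cbrn_uplift_eval": 0.4,
}


def flag_ccl_alerts(eval_scores: dict[str, float]) -> list[str]:
    """Return evaluations whose scores meet or exceed their alert threshold."""
    return [
        name
        for name, score in eval_scores.items()
        if name in ALERT_THRESHOLDS and score >= ALERT_THRESHOLDS[name]
    ]


# Example: a regular evaluation run in which one alert threshold is tripped,
# prompting a proximity assessment and formulation of a response plan.
print(flag_ccl_alerts({"cyber_uplift_eval": 0.65, "cbrn_uplift_eval": 0.2}))
# ['cyber_uplift_eval']
```

A fuller tracking system, as the review suggests, would combine such capability signals with other risk indicators (e.g. misuse-incident statistics) rather than relying on evaluation scores alone.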

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 0%

No mention of people that challenge decisions.

Quotes:

No relevant quotes found.

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 25%

The framework refers to reviews of relevant information by the advisory committees. However, to improve, it should make clearer what risk information is reported to senior management and in what format.

Quotes:

“The appropriateness and efficacy of applied mitigations should be reviewed periodically, drawing on information like related misuse or misuse attempt incidents; results from continued post-mitigation testing; statistics about our intelligence, monitoring and escalation processes; and updated threat modeling and risk landscape analysis.” (p. 3)
“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies such as the Google DeepMind AGI Safety Council, Google DeepMind Responsibility and Safety Council, and/or Google Trust & Compliance Council.” (p. 7)

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.

4.3 Audit (20%) 5%

4.3.1 The company has an internal audit function involved in AI governance (50%) 0%

No mention of an internal audit function.

Quotes:

No relevant quotes found.

4.3.2 The company involves external auditors (50%) 10%

The framework mentions potentially involving external expertise, but the commitment is tentative. Further, it does not mention external independent review.

Quotes:

“When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed.” (p. 3)
“We may use additional external evaluators to test a model for relevant capabilities, if evaluators with relevant expertise are needed to provide an additional signal about a model’s proximity to CCLs.” (p. 3)

4.4 Oversight (20%) 25%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 0%

No mention of a Board risk committee.

Quotes:

No relevant quotes found.

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 50%

Several governance entities listed in the framework appear to provide some level of oversight. To improve further, the company should clarify whether these are advisory bodies or oversight bodies, as per the Three Lines model.

Quotes:

“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies such as the Google DeepMind AGI Safety Council, Google DeepMind Responsibility and Safety Council, and/or Google Trust & Compliance Council.” (p. 7)

4.5 Culture (10%) 7%

4.5.1 The company has a strong tone from the top (33.3%) 10%

The framework includes a few references that reinforce the tone from the top, but it would benefit from more substantial commitments to managing risk.

Quotes:

“It is intended to complement Google’s existing suite of AI responsibility and safety practices, and enable AI innovation and deployment consistent with our AI Principles.” (p. 1)
“We expect the Framework to evolve substantially as our understanding of the risks and benefits of frontier models improves, and we will publish substantive revisions as appropriate.” (p. 1)

4.5.2 The company has a strong risk culture (33.3%) 10%

The framework includes a few references to updating the approach over time, which is important for risk culture. To improve, additional elements such as training and internal transparency would be needed.

Quotes:

“We may change our approach and recommendations over time as we gain experience and insights on the projected capabilities of future frontier models.” (p. 1)

4.5.3 The company has a strong speak-up culture (33.3%) 0%

No mention of elements of speak-up culture.

Quotes:

No relevant quotes found.

4.6 Transparency (5%) 20%

4.6.1 The company reports externally on what their risks are (33.3%) 25%

The framework states which capabilities the company is tracking as part of this framework. To improve its score, the company could specify how it will provide information regarding risks going forward, e.g. in model cards.

Quotes:

“We specify protocols for the detection of capability levels at which models may pose severe risks (which we call “Critical Capability Levels (CCLs)”), and articulate mitigation approaches to address such risks. At present, the Framework primarily addresses misuse risk, but we also include an exploratory section addressing deceptive alignment risk, focusing on capability levels at which such risks may begin to arise. For each type of risk, we define here a set of CCLs and a mitigation approach for them”. (p. 1)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 25%

The framework includes some mentions of the governance structure, in the form of the various councils involved, but does not provide sufficient detail on other governance bodies involved in the process. A more elaborate governance section would further improve the score.

Quotes:

“For Google models, when alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies such as the Google DeepMind AGI Safety Council, Google DeepMind Responsibility and Safety Council, and/or Google Trust & Compliance Council. The Google DeepMind AGI Safety Council will periodically review the implementation of the Framework.” (p. 7)
“We may also consider disclosing information to other external organizations to promote shared learning and coordinated risk mitigation. We will continue to review and evolve our disclosure process over time”. (p. 8)

4.6.3 The company shares information with industry peers and government bodies (33.3%) 10%

The framework suggests potential information sharing, but the language is fairly vague, relying on terms such as “may” and “aim to”. For a higher score, the company would need to add precision.

Quotes:

“If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share information with appropriate government authorities where it will facilitate the development of safe AI.” (p. 8)
“We may also consider disclosing information to other external organizations to promote shared learning and coordinated risk mitigation”. (p. 8)
