Google DeepMind

Total Score:

Weak

1.5/5

Non Existent

Strong

Best practices

Risk Identification

Strong

Substantial

Moderate

Weak

Very
Weak

Moderate

2.5/5

Learn more

Risk Tolerance & Analysis

Very
Weak

Weak

Moderate

Substantial

Strong

Very Weak

0.9/5

Learn more

Risk Mitigation

Strong

Substantial

Moderate

Weak

Very
Weak

Weak

1.2/5

Learn more

Thank you!
Your PDF will be sent soon

Oops! Something went wrong while submitting the form.

Risk Identification

‍In risk identification, we assess whether an AI developer is:

Approaching in an appropriate way risks outlined by the literature.
Doing extensive open-ended red teaming to identify new risks.
Leveraging a diverse range of risk identification techniques, including threat modeling when appropriate, to adequately identify new threats.

Risk Identification

Strong

Substantial

Moderate

Weak

Very Weak

Moderate

2.5/5

Best-in-Class

DeepMind has published the most comprehensive dangerous capability taxonomy by a major AI developer.
DeepMind’s LLM risk taxonomy is the only taxonomy of risks published by a major AI developer and one of the most comprehensive ones.
Google’s External Safety Testing process for Gemini Pro 1.5 is the best one that has been shared to date. The report details the external testing process which includes unstructured red teaming, along with a severity based filtering of the findings.
DeepMind's paper "Evaluating Frontier Models for Dangerous Capabilities" introduces significant advances in risk assessment and threat modeling, including the use of superforecasters to predict future model performance and a quantitative methodology to assess a model's likelihood of achieving specific tasks.

Highlights

The paper "Model evaluation for extreme risks" presents foundational threat modeling work, especially through a taxonomy of dangerous capabilities evaluations.
DeepMind performed unstructured red teaming on Gemini Pro 1.5 to identify societal, biological, nuclear and cyber risks.
In the Frontier Safety Framework, DeepMind identifies four main risk categories: Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D. They state: "We have conducted preliminary analyses of the Autonomy, Biosecurity, Cybersecurity and Machine Learning R&D domains. Our initial research indicates that powerful capabilities of future models seem most likely to pose risks in these domains."
The Gemini 1.5 paper, reports multiple safety evaluations, including for bias and privacy.
DeepMind researchers have conducted a literature review on misaligned AI threat models and developed a consensus threat model among their AI safety research team.

Weaknesses

While DeepMind mentions conducting "preliminary analyses" to identify risk categories in the Frontier Safety Framework, it does not provide a detailed methodology or justification for its selection.
The Frontier Safety Framework does not mention open-ended red teaming to identify new risk factors.

Thank you!
Your PDF will be sent soon

Oops! Something went wrong while submitting the form.

Risk Tolerance
& Analysis

In risk tolerance and analysis, we assess whether the AI developers have defined:

A global risk tolerance.
Operational capabilities thresholds and their equivalent risk. Those have to be defined with precision and breadth.
Corresponding objectives of risk mitigation measures: AI developers should establish clear objectives for risk mitigation measures. These objectives should be grounded in strong rationales, including threat modeling, to justify that they are sufficient to address the identified risks and align with the organization's risk tolerance.
Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.

Global Risk Tolerance

Strong

Substantial

Moderate

Weak

Very Weak

Non Existent

0/5

Global Risk Tolerance

Weaknesses

DeepMind does not state any global risk tolerance, even qualitatively.

Operational Risk Tolerance

Strong

Substantial

Moderate

Weak

Very Weak

Very Weak

1/5

Operational Risk Tolerance

Best-in-class

The Frontier Safety Framework is the first to explicitly reference Security Levels.

Highlights

The Frontier Safety Framework defines the first qualitative critical capability levels (CCL) for the four risks identified. For example, the first CCL for autonomy is the following: “Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents”.
The containment objectives reference security levels introduced in a RAND report.

Weaknesses

Deepmind states that the CCLs “are capability levels at which, absent mitigation measures, models may pose heightened risk.” While they justify why CCL1 models pose risks, they do not justify why a model with capability levels below CCL1 does not pose “heightened risk”, which is more important.
DeepMind makes soft commitments (using "would") to stop scaling if the mitigations are not ready when a CCL is reached: “A model may reach evaluation thresholds before mitigations at appropriate levels are ready. If this happens, we would put on hold further deployment or development, or implement additional protocols (such as the implementation of more precise early warning evaluations for a given CCL) to ensure models will not reach CCLs without appropriate security mitigations, and that models with CCLs will not be deployed without appropriate deployment mitigations.”
The CCLs would benefit from more quantitative characterizations along with clear measurement procedures and thresholds.
The containment and deployment objectives presented in the Frontier Safety Framework are not linked to specific capability levels yet: “When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results“. Without this link–and additional justification for why the proposed mitigation objectives would be sufficient to keep risks below the global risk tolerance once capability thresholds are reached–the mitigation objectives lack needed guidance.

Evaluation Protocols

Strong

Substantial

Moderate

Weak

Very Weak

Weak

1.5/5

Tolerance & Analysis Score = 1/4 × Global Risk Tolerance + 1/2 × Operational Risk Tolerance + 1/4 × Evaluation Protocols

Evaluation Protocols

Highlights

DeepMind defines test frequency in terms of both compute and time: "We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress."
DeepMind conducted third-party pre-deployment evaluations on Gemini 1.5 for societal risks, radiological and nuclear risks, and cyber risks using API access.
DeepMind's Gemini 1.5 model cards provide some details on the evaluation methodologies, particularly through the thorough research paper "Evaluating Frontier Models for Dangerous Capabilities".

Weaknesses

DeepMind lacks firm commitment to testing frequency: The use of "aiming to" suggests flexibility rather than a strict requirement.
DeepMind does not justify why their elicitation techniques suffice to elicit capabilities that external actors could obtain.

‍

Thank you!
Your PDF will be sent soon

Oops! Something went wrong while submitting the form.

Risk Mitigation

In risk mitigation, we assess whether:

The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.

Containment Measures

Strong

Substantial

Moderate

Weak

Very Weak

Very Weak

1/5

Containment Measures

Highlights

DeepMind provides high-level operationalization of the first four levels of mitigation objectives with some specific measures: for example for level 1: “Limited access to raw representations of the most valuable models, including isolation of development models from production models. Specific measures include model and checkpoint storage lockdown, SLSA Build L3 for model provenance, and hardening of ML platforms and tools.”

Weaknesses

While we acknowledge Google's position as one of the most advanced companies in information security, DeepMind does not properly report their security measures and commit to their implementation.
DeepMind does not justify why their mitigation measures are sufficient to achieve the mitigation objectives.
DeepMind lacks commitments to follow the measures: “[...] security mitigations that may be applied to model weights to prevent their exfiltration.”

‍

Deployment Measures

Strong

Substantial

Moderate

Weak

Very Weak

Non Existent

Weak

1.5/5

Deployment Measures

Highlights

DeepMind provides high-level operationalization of the first three levels of mitigation objectives with some specific measures. For example for level 1: “Application, where appropriate, of the full suite of prevailing industry safeguards targeting the specific capability, including safety fine-tuning, misuse filtering and detection, and response protocols.”
DeepMind outlines high-level mechanisms to assess the adequacy of mitigation measures to achieve mitigation objectives. For example for level 1: “Periodic red-teaming to assess the adequacy of mitigations.” and level 2: “Afterward, similar mitigations as Level 1 are applied, but deployment takes place only after the robustness of safeguards has been demonstrated to meet the target.”

Weaknesses

DeepMind lacks commitments to follow the measures:“[...] levels of deployment mitigations that may be applied to models and their descendants to manage access to and limit the expression of critical capabilities in deployment.”

Assurance Properties

Strong

Substantial

Moderate

Weak

Very Weak

Very Weak

1/5

For Risk Mitigation, all the grades: Containment Mitigation, Deployment Mitigation and Assurance Properties have the same weights.

Assurance Properties

Highlights

DeepMind provides a theoretically-grounded technical defense of debate:
- Some theoretical results motivating empirical work on AI safety debate
- Detailed motivation and qualitative argument in AI Safety via Debate
DeepMind provides a moderately detailed case in defense of interpretability as a research direction:
- Investigation of the scalability of various components of interpretability providing mixed results
- Pursuit of statistical methods to bound the odds to not find a node causally responsible for a behavior

Weaknesses

DeepMind has not published an official safety plan, though some safety employees have published unofficial pieces in their own, or a team, capacity. As a result, DeepMind has not made key assumptions explicit.

Thank you!
Your PDF will be sent soon

Oops! Something went wrong while submitting the form.

Sections

Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.

‍

References

The main source of information is their Frontier Safety Framework. Unless otherwise specified, all information and references are derived from this document.