Overall score: 1.5/5
- Risk Identification: Weak (2.5/5)
- Risk Tolerance & Analysis: Weak (0.9/5)
- Risk Mitigation: Weak (1.2/5)

Rating scale:
0 - 1 : Very Weak
1 - 2 : Weak
2 - 3 : Moderate
3 - 4 : Substantial
4 - 5 : Strong
Risk Identification
In risk identification, we assess whether an AI developer is:
- Addressing the risks outlined in the literature in an appropriate way.
- Conducting extensive open-ended red teaming to identify new risks.
- Leveraging a diverse range of risk identification techniques, including threat modeling when appropriate, to adequately identify new threats.
1 - Some risks are in scope of the risk management process. Some open-ended red-teaming efforts are reported, along with very basic threat and risk modeling.
2 - A number of risks are in scope of the risk management process, but some important ones are missing. Significant open-ended red-teaming efforts are reported, along with significant threat modeling.
3 - Most of the important and commonly discussed risks are in scope of the risk management process. Consequential red teaming is reported in detail, along with significant threat modeling and the use of structured risk identification techniques.
4 - Nearly all the risks covered in the relevant literature are in scope of the risk management process. A methodology outlines how structured risk identification is performed across the lifecycle, red teaming (including by external parties) is carried out and precisely characterized, and advanced, broad threat and risk modeling is conducted.
5 - There is a comprehensive, continued, and detailed effort to ensure all risks are found and addressed. The red teaming and threat and risk modeling effort is extremely extensive, quantified, integrated with structured risk identification efforts, and conducted with third parties.
2.5/5
Best-in-class
- DeepMind has published the most comprehensive dangerous capability taxonomy by a major AI developer.
- DeepMind’s LLM risk taxonomy is the only risk taxonomy published by a major AI developer, and one of the most comprehensive available.
- Google’s External Safety Testing process for Gemini Pro 1.5 is the best that has been shared to date. The report details the external testing process, which includes unstructured red teaming along with severity-based filtering of the findings.
- DeepMind's paper "Evaluating Frontier Models for Dangerous Capabilities" introduces significant advances in risk assessment and threat modeling, including the use of superforecasters to predict future model performance and a quantitative methodology to assess a model's likelihood of achieving specific tasks.
Highlights
- The paper "Model evaluation for extreme risks" presents foundational threat modeling work, especially through a taxonomy of dangerous capabilities evaluations.
- DeepMind performed unstructured red teaming on Gemini Pro 1.5 to identify societal, biological, nuclear and cyber risks.
- In the Frontier Safety Framework, DeepMind identifies four main risk categories: Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D. They state: "We have conducted preliminary analyses of the Autonomy, Biosecurity, Cybersecurity and Machine Learning R&D domains. Our initial research indicates that powerful capabilities of future models seem most likely to pose risks in these domains."
- The Gemini 1.5 paper reports multiple safety evaluations, including for bias and privacy.
- DeepMind researchers have conducted a literature review on misaligned AI threat models and developed a consensus threat model among their AI safety research team.
Weaknesses
- While DeepMind mentions conducting "preliminary analyses" to identify risk categories in the Frontier Safety Framework, it does not provide a detailed methodology or justification for selecting these categories.
- The Frontier Safety Framework does not mention open-ended red teaming to identify new risk factors.
Risk Tolerance & Analysis
In risk tolerance and analysis, we assess whether the AI developers have defined:
- A global risk tolerance.
- Operational capability thresholds and their corresponding risk levels. These have to be defined with precision and breadth.
- Corresponding objectives of risk mitigation measures: AI developers should establish clear objectives for risk mitigation measures. These objectives should be grounded in strong rationales, including threat modeling, to justify that they are sufficient to address the identified risks and align with the organization's risk tolerance.
- Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.
1 - Global risk tolerance is qualitatively defined.
E.g., “Our system should not increase the likelihood of extinction risks”.
2 - Global risk tolerance is quantitatively defined for casualties.
3 - Global risk tolerance is quantitatively defined for casualties and economic damages, with adequate ranges and rationale for the decision.
4 - Global risk tolerance is quantitatively defined for casualties, economic damages, and other high-severity risks (e.g., large-scale manipulation of public opinion), with robust methodology and decision-making processes to decide the tolerance (e.g., public consultation).
5 - Global risk tolerance is clearly and quantitatively defined for all significant threats and risks known in the literature. Any significant deviations in risk tolerance from industry norms are clearly justified and explained (e.g., through a comprehensive benefit/cost analysis).
0/5
Weaknesses
- DeepMind does not state any global risk tolerance, even qualitatively; the sketch below illustrates what such a quantitative tolerance could look like.
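As a purely hypothetical illustration of what a quantitative global risk tolerance (cf. levels 2-4 of the rubric above) could look like in practice, the sketch below checks an estimated marginal risk against casualty and economic-damage ceilings. The metric names and all numbers are invented placeholders, not values stated by DeepMind or endorsed by this report.

```python
# Hypothetical illustration of a quantitative global risk tolerance
# (cf. levels 2-4 of the rubric above). All thresholds and estimates are
# invented placeholders, not values stated by DeepMind.

TOLERANCE = {
    "expected_annual_casualties": 100.0,          # hypothetical ceiling
    "expected_annual_economic_damage_usd": 1e9,   # hypothetical ceiling
}


def within_tolerance(risk_estimate: dict) -> bool:
    """Check a (hypothetical) marginal-risk estimate against every ceiling."""
    return all(risk_estimate[metric] <= ceiling
               for metric, ceiling in TOLERANCE.items())


# Example: an estimated 10 expected casualties and $200M expected damage
# per year would sit within this (hypothetical) tolerance.
print(within_tolerance({"expected_annual_casualties": 10.0,
                        "expected_annual_economic_damage_usd": 2e8}))  # True
```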
1 - Some important capability thresholds are qualitatively defined and their corresponding mitigation objectives are qualitatively defined as well.
2 - Some important capability thresholds are precisely defined, and their corresponding mitigations are precisely defined as well.
3 - Almost all important hazardous capability thresholds and their corresponding mitigation objectives are precisely defined and grounded in extensive threat and risk modeling.
4 - All hazardous capabilities are precisely defined. The corresponding mitigation objectives are quantitatively defined and grounded in extensive threat and risk modeling. Assurance property targets are operationalized.
5 - All hazardous capabilities have a precisely defined threshold. Corresponding mitigation objectives are quantified and grounded in comprehensive threat and risk modeling with a clear and in-depth methodology. Assurance property targets are operationalized and justified.
1/5
Best-in-class
- The Frontier Safety Framework is the first to explicitly reference Security Levels.
Highlights
- The Frontier Safety Framework defines the first qualitative critical capability levels (CCL) for the four risks identified. For example, the first CCL for autonomy is the following: “Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents”.
- The containment objectives reference security levels introduced in a RAND report.
Weaknesses
- DeepMind states that the CCLs “are capability levels at which, absent mitigation measures, models may pose heightened risk.” While they justify why models at CCL1 pose risks, they do not justify why a model with capabilities below CCL1 does not pose “heightened risk”, which is the more important claim.
- DeepMind makes soft commitments (using "would") to stop scaling if the mitigations are not ready when a CCL is reached: “A model may reach evaluation thresholds before mitigations at appropriate levels are ready. If this happens, we would put on hold further deployment or development, or implement additional protocols (such as the implementation of more precise early warning evaluations for a given CCL) to ensure models will not reach CCLs without appropriate security mitigations, and that models with CCLs will not be deployed without appropriate deployment mitigations.”
- The CCLs would benefit from more quantitative characterizations along with clear measurement procedures and thresholds.
- The containment and deployment objectives presented in the Frontier Safety Framework are not yet linked to specific capability levels: “When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results”. Without this link, and without additional justification for why the proposed mitigation objectives would be sufficient to keep risks below the global risk tolerance once capability thresholds are reached, the mitigation objectives lack the guidance they need. A hypothetical sketch of what such a link could look like follows below.
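To make the missing link described above concrete, here is a hypothetical sketch of how a framework could tie each critical capability level to required security and deployment mitigation levels, and pause further development or deployment when mitigations lag behind. The mapping, level names, and numbers are assumptions made for illustration only; they are not DeepMind's actual CCLs or mitigation assignments.

```python
from dataclasses import dataclass

# Hypothetical illustration of the CCL-to-mitigation link that the Frontier
# Safety Framework does not yet specify. All mappings below are invented
# examples, not DeepMind's actual requirements.

@dataclass
class CCLRequirement:
    required_security_level: int     # e.g., a RAND-style security level
    required_deployment_level: int   # e.g., a deployment mitigation level


REQUIREMENTS = {
    "autonomy_1": CCLRequirement(required_security_level=2,
                                 required_deployment_level=1),
    "biosecurity_1": CCLRequirement(required_security_level=3,
                                    required_deployment_level=2),
}


def may_proceed(ccl: str, current_security_level: int,
                current_deployment_level: int) -> bool:
    """Return False (i.e., pause) if mitigations lag behind the reached CCL."""
    req = REQUIREMENTS[ccl]
    return (current_security_level >= req.required_security_level
            and current_deployment_level >= req.required_deployment_level)


# Example: early-warning evaluations indicate "autonomy_1" has been reached,
# but only security level 1 is in place -> development/deployment is paused.
print(may_proceed("autonomy_1", current_security_level=1,
                  current_deployment_level=1))  # False
```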
1 - Elements of the evaluation methodologies are described. The testing frequency is defined in terms of multiples of compute.
2 - The testing frequency is defined in terms of multiples of compute and there is a commitment to following it. The evaluation protocol is well-defined and includes relevant elicitation techniques. Independent third parties conduct pre-deployment evaluations with API access.
3 - The testing frequency is defined in terms of both multiples of compute and time, and there is a commitment to following it. The evaluation protocol is well-defined and incorporates state-of-the-art elicitation techniques. A justification is provided demonstrating that these techniques are comprehensive enough to elicit capabilities that could be found and exercised by external actors. AI developers implement and justify measures (such as appropriate safety buffers) to ensure protocols can effectively detect capability threshold crossings. Independent third parties conduct pre-deployment evaluations with fine-tuning access.
4 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why the chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes state-of-the-art elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold crossings.
5 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why the chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes relevant elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold crossings, and third parties are granted permission and resources to independently run their own evaluations to verify the accuracy of the evaluation results.
1.5/5
Highlights
- DeepMind defines test frequency in terms of both compute and time: "We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress." A minimal sketch of such an evaluation trigger appears after this list.
- DeepMind conducted third-party pre-deployment evaluations on Gemini 1.5 for societal risks, radiological and nuclear risks, and cyber risks using API access.
- DeepMind's Gemini 1.5 model cards provide some details on the evaluation methodologies, particularly through the thorough research paper "Evaluating Frontier Models for Dangerous Capabilities".
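To make the quoted testing frequency concrete, here is a minimal sketch of how such an evaluation trigger could be operationalized, assuming the developer tracks effective compute and fine-tuning time since the last evaluation. The 6x multiplier and 3-month interval come from DeepMind's stated aim; the state tracking and function below are illustrative assumptions, not part of the Frontier Safety Framework.

```python
from datetime import datetime, timedelta

# Illustrative sketch only: DeepMind states an *aim* to evaluate every 6x in
# effective compute and every 3 months of fine-tuning progress. The state
# tracking below is a hypothetical way to operationalize that aim.

COMPUTE_MULTIPLIER = 6.0             # re-evaluate after 6x effective compute
TIME_INTERVAL = timedelta(days=90)   # roughly 3 months of fine-tuning progress


def evaluation_due(current_effective_compute: float,
                   last_evaluated_compute: float,
                   last_evaluated_date: datetime,
                   now: datetime) -> bool:
    """Return True if either the compute trigger or the time trigger has fired."""
    compute_trigger = (current_effective_compute
                       >= COMPUTE_MULTIPLIER * last_evaluated_compute)
    time_trigger = now - last_evaluated_date >= TIME_INTERVAL
    return compute_trigger or time_trigger


# Example: only 4x more effective compute, but four months have elapsed,
# so an evaluation is due.
print(evaluation_due(4.0e25, 1.0e25,
                     datetime(2024, 2, 1), datetime(2024, 6, 1)))  # True
```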
Weaknesses
- DeepMind lacks a firm commitment to the testing frequency: the use of "aiming to" suggests flexibility rather than a strict requirement.
- DeepMind does not justify why their elicitation techniques suffice to elicit capabilities that external actors could obtain.
Risk Mitigation
In risk mitigation, we assess whether:
- The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
- There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.
1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.
3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.
4 - Third parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.
5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
1/5
Highlights
- DeepMind provides high-level operationalization of the first four levels of mitigation objectives with some specific measures: for example for level 1: “Limited access to raw representations of the most valuable models, including isolation of development models from production models. Specific measures include model and checkpoint storage lockdown, SLSA Build L3 for model provenance, and hardening of ML platforms and tools.”
Weaknesses
- While we acknowledge Google's position as one of the most advanced companies in information security, DeepMind does not properly report their security measures, nor commit to their implementation.
- DeepMind does not justify why their mitigation measures are sufficient to achieve the mitigation objectives.
- DeepMind lacks commitments to follow the measures: “[...] security mitigations that may be applied to model weights to prevent their exfiltration.”
1 - High-level description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.
3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.
4 - Third parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to address the risk from future systems.
5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
1.5/5
Highlights
- DeepMind provides high-level operationalization of the first three levels of mitigation objectives with some specific measures. For example for level 1: “Application, where appropriate, of the full suite of prevailing industry safeguards targeting the specific capability, including safety fine-tuning, misuse filtering and detection, and response protocols.”
- DeepMind outlines high-level mechanisms to assess the adequacy of mitigation measures for achieving the mitigation objectives. For example, for level 1: “Periodic red-teaming to assess the adequacy of mitigations.” and for level 2: “Afterward, similar mitigations as Level 1 are applied, but deployment takes place only after the robustness of safeguards has been demonstrated to meet the target.” An illustrative sketch of such a deployment gate appears after this list.
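As a hypothetical illustration of the quoted level 2 requirement that deployment happens only once safeguard robustness meets a target, the sketch below gates deployment on a measured red-team attack success rate. The choice of metric and the 5% target are invented placeholders, not figures from the Frontier Safety Framework.

```python
# Hypothetical deployment gate illustrating "deployment takes place only after
# the robustness of safeguards has been demonstrated to meet the target".
# The metric and the 5% target are invented placeholders.

ATTACK_SUCCESS_RATE_TARGET = 0.05  # hypothetical robustness target


def deployment_gate(successful_attacks: int, total_attempts: int) -> bool:
    """Allow deployment only if the measured attack success rate meets the target."""
    measured_rate = successful_attacks / total_attempts
    return measured_rate <= ATTACK_SUCCESS_RATE_TARGET


# Example: 12 successful jailbreaks out of 100 red-team attempts -> hold deployment.
print(deployment_gate(12, 100))  # False
```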
Weaknesses
- DeepMind lacks commitments to follow the measures: “[...] levels of deployment mitigations that may be applied to models and their descendants to manage access to and limit the expression of critical capabilities in deployment.”
1 - Limited pursuit of some assurance properties, with sparse evidence of how promising they are for reducing risks.
2 - Pursuit of some assurance properties along with research results indicating that they may be promising. Some of the key assumptions the assurance properties are operating under are stated.
3 - Pursuit of assurance properties, some evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. The assumptions the assurance properties are operating under are stated but some important ones are missing.
4 - Pursuit of assurance properties, solid evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. All the assumptions the assurance properties are operating under are stated.
5 - Broad consensus that one assurance property is likely to work, is being strongly pursued, and there is a strong case for it to be sufficient. All the assumptions the assurance properties are operating under are clearly stated and justified.
1/5
Highlights
- DeepMind provides a theoretically grounded technical defense of debate:
  - Some theoretical results motivating empirical work on AI safety debate.
  - Detailed motivation and qualitative argument in AI Safety via Debate.
- DeepMind provides a moderately detailed case in defense of interpretability as a research direction:
  - An investigation of the scalability of various components of interpretability, providing mixed results.
  - The pursuit of statistical methods to bound the odds of not finding a node causally responsible for a behavior (an illustrative form of such a bound is sketched after this list).
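As a purely generic illustration of what a bound of this kind can look like (under an assumed randomized-probing model that is not taken from DeepMind's work), the snippet below derives the miss probability when each probe inspects a random fraction of candidate nodes.

```latex
% Generic illustration, not DeepMind's actual method: a simple bound on the
% odds of never finding the node causally responsible for a behavior.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Assume each of $k$ independent randomized probes inspects a uniformly random
fraction $f$ of the candidate nodes. Then
\[
  \Pr[\text{responsible node never inspected}] = (1-f)^{k} \le e^{-fk},
\]
so the miss probability falls below any tolerance $\varepsilon$ once
$k \ge \ln(1/\varepsilon)/f$.
\end{document}
```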
Weaknesses
- DeepMind has not published an official safety plan, though some safety employees have published unofficial pieces in their own or their team's capacity. As a result, DeepMind has not made key assumptions explicit.
Sections
Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.
References
The main source of information is DeepMind's Frontier Safety Framework. Unless otherwise specified, all information and references are derived from this document.