What would a complete risk management framework look like?
Risk Identification
1. Approaching risks outlined by the literature in an appropriate way
Resources such as the MIT AI Risk Repository can be used to review a comprehensive set of risks. The initial iteration of the current framework focuses on the following high-level risk categories derived from the MIT taxonomy:
- Discrimination & toxicity
- Privacy & security
- Misinformation
- Malicious actors & misuse
- AI system safety, failures & limitations
- Human-computer interaction
AI developers should only exclude some of these risks from the scope of their assessment if there is widespread scientific agreement that the specific risk does not significantly apply to the AI model under consideration. Any such decision should be clearly justified and documented.
Example: Based on the literature, we expect the risks X, Y, Z to be significant at the scale of the model we intend to develop. Therefore, we will consider them for the rest of the assessment.
2. Conducting extensive open-ended red teaming to identify new hazards
Following the initial risk assessment based on the literature, developers should engage in extensive in-house and second/third-party open-ended red teaming efforts conducted throughout the AI system's life cycle. The primary objectives of this red teaming effort are to:
- Identify novel risks, vulnerabilities, and failure modes that may not be covered in the existing literature.
- Challenge assumptions and blind spots in the current understanding of AI risks.
- Provide an adversarial perspective to help anticipate and mitigate potential malicious use cases or unintended consequences.
This red teaming effort must be clearly defined, with a methodology describing how it systematically explores the AI system for new hazards. The red team must have appropriate expertise to properly identify the hazards, and it should have adequate resources, time and access to the model.
Example:
a) Open-ended red teaming methodology & results
We provided API access to third-party expert red teamers X, Y and Z (more information in Annex B) at multiple points during the training run, and we provided API and fine-tuning access to the final version of the most powerful model. We tasked them with exploring emerging capabilities of the model and reporting any new finding that would raise our estimate of the probability that our system causes 1,000 deaths or more by at least 0.1 percentage points. In total, they reported 37 findings.
b) Relevant information regarding red teamers, their expertise and time spent
Annex B - Red Teamers
The red teamers were free of any conflict of interest and were protected from legal action by our whistleblowing policy, which can be found at this link. These measures are aimed at incentivizing them to find as many problems as possible.
Red teamer X spent 156 hours and has the following expertise and experience in bio risks and LLM jailbreaking: …
Red teamer Y spent 32 hours, and …
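To make the reporting threshold in the example above concrete, the sketch below shows one way such findings could be logged and triaged: any finding estimated to raise the probability of 1,000 or more deaths by at least 0.1 percentage points is escalated. This is a minimal illustration; the Finding structure, the example findings, and all numbers are hypothetical and not part of the framework itself.

```python
from dataclasses import dataclass

# Hypothetical reporting threshold from the example above: a finding is escalated if it
# would raise our estimate of P(>=1,000 deaths in a year) by >= 0.1 percentage points.
REPORTING_THRESHOLD_PP = 0.1

@dataclass
class Finding:
    red_teamer: str           # e.g. "X", "Y", "Z"
    description: str
    delta_estimate_pp: float  # estimated increase in P(>=1,000 deaths), in percentage points

def findings_to_escalate(findings: list[Finding]) -> list[Finding]:
    """Return the findings that meet the reporting threshold, most severe first."""
    escalated = [f for f in findings if f.delta_estimate_pp >= REPORTING_THRESHOLD_PP]
    return sorted(escalated, key=lambda f: f.delta_estimate_pp, reverse=True)

# Illustrative usage with made-up findings:
reports = [
    Finding("X", "Novel jailbreak enabling step-by-step harmful guidance", 0.4),
    Finding("Y", "Refusal bypass with no demonstrated uplift", 0.02),
]
for f in findings_to_escalate(reports):
    print(f"Escalate ({f.delta_estimate_pp:.2f} pp): {f.description}")
```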
3. Leveraging a diverse range of risk identification techniques
Organizations should leverage a set of risk identification techniques, including threat modeling, to gain a thorough understanding of the potential threats identified in the literature and during the open-ended red teaming exercises. The output of this final step of the risk identification phase should be a set of detailed and actionable risk scenarios.
Threat modeling should include three key components:
- Potential attackers should be identified, along with their motivations and resources, to understand who might attempt to exploit vulnerabilities in the AI system and why.
- Potential attack vectors and vulnerabilities in the AI system should be analyzed by examining the system architecture, components, and interfaces to identify entry points and weaknesses that could be exploited.
- Identified risk scenarios should be prioritized based on their likelihood and potential impact, allowing the organization to focus on the risks that are most likely to lead to a breach of their risk tolerance.
The results of the threat modeling work should be well documented, including the methodologies used, experts involved, and the prioritized list of identified risks. This documentation should be shared with relevant stakeholders and used to develop effective risk mitigation strategies.
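To illustrate the prioritization step described above, the following minimal sketch ranks risk scenarios by expected harm (annual likelihood multiplied by impact). The scenario names, likelihoods, and impact values are hypothetical placeholders; in practice these estimates would come from the threat modeling and expert elicitation this section describes.

```python
# Minimal sketch: rank identified risk scenarios by expected harm
# (annual likelihood x impact). All names and values are hypothetical placeholders.

scenarios = [
    # (scenario, estimated annual likelihood, estimated impact in harm units)
    ("Scenario A: model-assisted cyber intrusion", 0.002, 500),
    ("Scenario B: autonomous replication attempt", 0.0005, 5000),
    ("Scenario C: large-scale targeted disinformation", 0.01, 50),
]

prioritized = sorted(scenarios, key=lambda s: s[1] * s[2], reverse=True)

for name, likelihood, impact in prioritized:
    print(f"{name}: expected harm = {likelihood * impact:.2f}")
```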
For novel high-severity risks identified during open-ended red teaming, an additional exploratory risk identification effort should be undertaken to identify potential blind spots. This process should involve both internal and external experts to ensure a thorough and unbiased analysis. The primary goal is to uncover risks that may have been overlooked or left unconsidered in previous risk assessments, using techniques such as the Fishbone Diagram (M. Coccia, 2018), a structured method that helps identify and categorize the potential causes of an event, including less obvious or indirect contributors.
Example:
a) Use of an explicit process to explore and triage potential vulnerabilities
37 potential hazards were identified in the red-teaming exercises. Based on a two-hour fishbone diagram session run by our safety team together with external expert Z, we ruled out 29 of those findings.
b) In-depth threat modeling for vulnerabilities most likely to change the risk profile
Taking into account the risks identified in the literature and the 8 remaining hazards, we conducted a number of threat modeling exercises, which we release here, redacting details that could raise national security concerns. This threat modeling, along with a Delphi study run with 10 experts (listed in Annex B), led us to select 10 reference scenarios, available in Annex C. We decided from there to focus a significant share of our risk assessment efforts during the training run and pre-deployment testing on these 10 scenarios, which we consider representative of the most likely events that could cause high-severity damage.
Risk Tolerance and Analysis
1. Global risk tolerance
Example:
a) Methodology to set the tolerance
Based on other industries’ risk tolerances and a public consultation that we co-ran with Y (more details in Annex D), following the example of the NRC’s 1983 consultation to define similar thresholds for nuclear safety, we decided to commit to the following risk tolerance:
b) Risk tolerance set for relevant severities
| Severity | Risk tolerance |
| --- | --- |
| >1,000 deaths | <0.01% per year across our systems |
| >1 death | <0.1% per year across our systems |
| Severe psychological or physical harm caused to one individual | Once per year across our systems |
c) Coverage of the relevant units of risk
We decided to use distinct risk tolerances for risks of a different nature, considering that it did not make sense to treat all harms as fungible. Hence, for fundamental rights and epistemic erosion, we decided to use the following risk tolerances:
...
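As a worked illustration of how a tolerance table like the one in b) could be used, the sketch below compares hypothetical aggregate annual risk estimates against the stated thresholds. The estimates are placeholders standing in for the output of the risk analysis; the point is only that each severity tier carries its own budget and its own check.

```python
# Hypothetical check of aggregate annual risk estimates against the tolerance table in b).
# Tolerances are per year, across all deployed systems; estimates are placeholders.

risk_tolerance = {
    ">1,000 deaths": 0.0001,  # <0.01% per year
    ">1 death": 0.001,        # <0.1% per year
}

estimated_annual_probability = {
    ">1,000 deaths": 0.00006,
    ">1 death": 0.0004,
}

for severity, tolerance in risk_tolerance.items():
    estimate = estimated_annual_probability[severity]
    status = "within tolerance" if estimate < tolerance else "BREACH: mitigation required"
    print(f"{severity}: estimated {estimate:.5f} vs tolerance {tolerance:.5f} -> {status}")
```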
2. Operational risk tolerances
Operational capabilities thresholds
These thresholds should be operational, meaning that they have to be designed to provide a clear signal when they are approached or breached. The precision in defining these thresholds ensures that they serve as effective triggers for implementing risk mitigation strategies.
Risk mitigation objectives
- Containment measures: the set of security measures that control the degree of access various stakeholders have to the model. Mitigation objectives in this category can be expressed as security levels (Nevo et al., 2024).
- Deployment measures: the set of measures that control the potential for misuse of the model in dangerous domains and its propensity to cause accidental harm. Mitigation objectives in this category can, for example, be expressed in terms of the severity of the worst findings from red-teaming exercises.
- Assurance properties: the set of properties that make it possible to provide affirmative safety assurance past significant levels of dangerous capabilities. Targets in this category can be expressed as benchmark targets, the ability to solve a particular problem, or the degree of confidence required to rule out a particular risk. Past significant levels of dangerous capabilities, capability evaluations are no longer sufficient to demonstrate the absence of risk from a model (Clymer et al., 2024). Hence, given the pace of AI development, model providers should pursue research into model properties that can provide such evidence; we call these assurance properties. AI developers should set a clear target for each assurance property and provide strong justification that those targets are sufficient.
Example (for both operational capabilities thresholds and risk mitigation objectives):
a) Linking risk thresholds to capabilities thresholds
We used a methodology, detailed in Annex E, which allows us to keep total risk below our defined risk tolerance. In short, this methodology helps determine, using expert inputs, how to allocate our risk budget across the different risk scenarios identified in the risk identification step.
b) Allocating capabilities/risk budget based on benefits & strategy
Because of our focus on achieving top-tier coding capabilities, we determined that our system would be riskiest with respect to cyber offense (scenarios 3 and 8).
c) Determining thresholds & mitigation objectives with expert-based inputs & in-depth threat modeling
Using consultations with experts, to whom we provided reference scenarios (details in Annex A), we determined the following thresholds on the benchmarks we use as indicators of the harms we have modeled:
1. 60% on SWE-bench (unassisted), which we estimate corresponds to a 1%/year probability of >$500M in economic damages with our current mitigations and a deployment to 1,000,000 users/day. Based on our threat modeling effort, available in Annex C, we expect the largest sources of risk to arise from:
a. Scenario 3
b. Scenario 8
d) Discussion of the mitigation objectives and corresponding decrease in risk
We estimate that each of these scenarios is at most about 0.1% likely to occur, which is our target, if we achieve the following mitigation objectives:
a. Security measures: it would cost a state actor more than $1B to steal our model.
b. Deployment measures: our model cannot be jailbroken to execute actions Y and Z, i.e., no one among our red teamers or in the wider world has shown any hint of being able to do so, even under favorable conditions and with an attack budget of $1M.
c. We differentially accelerate the development of defensive cybersecurity applications, while preventing malicious actors from accessing SOTA systems for at least a year.
2. A 10% increase on magnification uplift studies for participants with undergraduate-level expertise or less…
…
…
For each threshold, Annex F contains a discussion, referencing the scenario analysis conducted and citing experts’ rationales, to justify that the capabilities thresholds and mitigation objectives are sufficient to remain below our risk tolerance.
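The reasoning in a) through d) can be compressed into a simple risk-budget check: each reference scenario receives a residual annual probability estimate, conditional on the mitigation objectives being met, which must stay below its per-scenario target, and the total across scenarios must stay below the budget set for that severity tier. The sketch below illustrates this; the scenario labels follow the example above, but the tier budget and residual estimates are hypothetical placeholders.

```python
# Minimal sketch of the risk-budget logic in a)-d). All numbers are hypothetical.

per_scenario_target = 0.001    # 0.1% per scenario, as in d) above
severity_tier_budget = 0.003   # hypothetical overall budget for this severity tier

residual_scenario_risk = {
    "Scenario 3 (cyber offense)": 0.0008,
    "Scenario 8 (cyber offense)": 0.0006,
    "Other reference scenarios (aggregate)": 0.0009,
}

# Each scenario must respect its per-scenario target...
for scenario, risk in residual_scenario_risk.items():
    assert risk <= per_scenario_target, f"{scenario} exceeds its per-scenario target"

# ...and the total must respect the tier budget.
total = sum(residual_scenario_risk.values())
print(f"Total residual annual risk: {total:.4f} (budget {severity_tier_budget:.4f})")
assert total <= severity_tier_budget, "Tier budget exceeded: strengthen mitigations or lower thresholds"
```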
Example of assurance properties objective:
a) Clarity on main assurance properties bets
Past the dangerous capabilities thresholds, our main bet for being able to make an affirmative safety case is advanced interpretability of our system. We expect interpretability to be the main way to gain confidence in the safety of a post-mitigation model that, when tested without mitigations implemented, demonstrated a disposition toward deception.
b) Operationalization of targets for this bet
We intend to operationalize advanced interpretability with a benchmark. In the meantime, we have defined a task that our interpretability techniques should be able to succeed at:
- Rule out the presence of a backdoor with >99.9% success.
c) Rationale for the sufficiency of those targets
We justify that this level of interpretability is sufficient for an affirmative safety case for each failure mode below:
Power-seeking drives could be caught using the following experiment …
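One way to sanity-check the ">99.9% success" target in b) is to estimate how many seeded-backdoor trials would be needed before such a claim is statistically meaningful. The sketch below applies the standard rule of three for zero observed failures; it illustrates the required scale of evidence and is not the developer's actual validation protocol.

```python
import math

# Rule of three: if n independent trials all succeed, the approximate 95% upper bound
# on the failure rate is 3 / n. To support a claimed failure rate below 0.1%
# (i.e. >99.9% success), we therefore need roughly n >= 3 / 0.001 seeded-backdoor trials.

claimed_max_failure_rate = 0.001  # corresponds to >99.9% success
required_trials = math.ceil(3 / claimed_max_failure_rate)

print(f"Seeded-backdoor models needed (zero misses allowed): {required_trials}")  # -> 3000
```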
3. Evaluation protocols
The frequency of these evaluations should be specified, expressed both in terms of relative increases in effective compute and in terms of time intervals. AI developers must provide a rationale for why they consider the chosen frequency sufficient to detect significant capability changes.
To ensure robustness and reliability, the evaluation protocols should be vetted by independent third parties. These third parties should also be granted permission and resources to independently run their own evaluations, verifying the accuracy of the results.
Additionally, AI developers should commit to sharing evaluation results with relevant stakeholders as appropriate.
Example:
a) Capabilities elicitation techniques & justification for capabilities upper bound
For capability threshold X, to elicit the capabilities of our model, we, together with external red teamers, developed model-specific scaffolding, prompting and fine-tuning, which expert red teamers X spent Y hours refining. This increased the model’s zero-shot performance by Z%, more than any previous model’s post-training enhancements were able to provide. We are therefore confident that no user in deployment will be able to reach a higher level of capability in the next 3 months.
Additionally, we commit to performing evaluations at every 2x increase in effective compute or major algorithmic breakthrough, and every 3 months to account for post-training enhancements. The most rapidly emerging capability we know of, early in-context learning (C. Olsson et al., 2022), appeared after a 5x compute increase, so a 2x interval provides a sufficient safety margin.
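The commitment in this example reduces to a simple trigger rule, sketched below. The function and its inputs are hypothetical, and in practice "effective compute" and "major algorithmic breakthrough" would each need their own operational definitions.

```python
from datetime import date

# Hypothetical trigger rule matching the example commitment: re-run the evaluation
# suite after a 2x effective-compute increase, a major algorithmic breakthrough,
# or 90 days, whichever comes first.

def evaluation_due(effective_compute_now: float,
                   effective_compute_at_last_eval: float,
                   last_eval_date: date,
                   major_breakthrough_since_last_eval: bool,
                   today: date) -> bool:
    compute_ratio = effective_compute_now / effective_compute_at_last_eval
    days_elapsed = (today - last_eval_date).days
    return compute_ratio >= 2.0 or days_elapsed >= 90 or major_breakthrough_since_last_eval

# Illustrative usage with placeholder values:
print(evaluation_due(3.1e25, 1.8e25, date(2025, 1, 15), False, date(2025, 3, 1)))
```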
Risk Mitigation
1. Deployment and Containment Mitigations
AI developers should provide clear and detailed descriptions of concrete measures for deployment and containment mitigations. The AI developers must commit to implementing these measures when risk levels surpass predefined thresholds, as established in the operational risk tolerances section. The measures should be vetted by third parties, who should affirm that they are sufficient to achieve the mitigation objectives defined in the previous section, for both current and future AI systems.
Example:
a) Measures planned to reach mitigation objectives
To reach mitigation objective X, we commit to implementing the following security measures:
1. Implement strict application allowlisting (that is, only specific binaries are allowed to execute on devices with access)
…
b) Rationale justifying that those measures are sufficient
Security researchers from third parties X and Y were provided with access to all the necessary data and vetted the correct implementation of those measures. A wide range of experts (X, Y, Z) agree that these security measures are sufficient to reach our mitigation objective. More specifically, we intend these measures to be sufficient as long as condition C is not fulfilled.
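As a toy illustration of the allowlisting measure in a), the sketch below checks a binary's SHA-256 digest against an approved set before allowing execution. This is not a substitute for an OS-level or endpoint-security allowlisting mechanism, and the digests and path shown are hypothetical placeholders.

```python
import hashlib
from pathlib import Path

# Toy illustration of application allowlisting: only binaries whose SHA-256 digest
# appears in the approved set may run on devices with model access.
# The digests below are placeholders, not real values.

APPROVED_SHA256 = {
    "a3f1...placeholder...",
    "9bc4...placeholder...",
}

def is_execution_allowed(binary_path: str) -> bool:
    digest = hashlib.sha256(Path(binary_path).read_bytes()).hexdigest()
    return digest in APPROVED_SHA256

# Illustrative usage (path is hypothetical):
# if not is_execution_allowed("/usr/local/bin/model_tool"):
#     raise PermissionError("Binary not on the allowlist")
```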
2. Assurance Properties
Example:
a) Rationale for feasibility & existing progress
We think it is possible to reach the target levels of interpretability because of the major progress and successes achieved through Y, the scaling laws that we have found on Z, and the rapid progress on the intermediate interpretability metrics that we defined in paper P.
Example:
a) Operating assumptions for the plan
Our core development model is that it is possible to build AI systems with expert-level capabilities using transformers and a post-training process comparable to today’s state of the art. Based on Delphi studies led internally and informed by scaling laws, we expect transformer-based systems to reach expert-level performance across all cognitive tasks with 50% probability by 2029.
The technical assumptions underlying our assurance properties that are the most uncertain are the following:
1. ...
2. ...