This dimension evaluates the extent to which the company has implemented comprehensive risk mitigation strategies across three critical areas: containment (controlling access to AI models), deployment (preventing misuse and accidental harms), and assurance processes (providing affirmative evidence of safety). Additionally, it assesses whether the company continuously monitors both Key Risk Indicators (KRIs) and Key Control Indicators (KCIs) throughout the AI system's lifecycle, from training through deployment.
3.1 Implementing Mitigation Measures (50%)
3.1.1 Containment measures (35%)
3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%)
- The planned or already implemented containment measures are precisely defined for all containment KCI thresholds.
3.1.1.2 Proof that containment measures are sufficient to meet the KCI thresholds (40%)
- Proof is provided to justify that the containment measures are sufficient to meet the relevant containment KCI prior to their implementation (i.e., before the corresponding KRI threshold is crossed). That is, the suggested containment measures are shown to actually satisfy the relevant containment KCI threshold, or at minimum evidence is given, with confidence levels, for why these measures are likely to satisfy the threshold.
- Partial credit is given if there exists a process for soliciting proof. However, to score above 50%, the first item must be satisfied.
- The implementation of the KRI-KCI pairing is predictable in advance, leaving as little to discretion as possible.
- The sufficiency criteria are determined ex ante, with justification for why these criteria constitute adequate proof.
3.1.1.3 Strong third-party verification process to verify that the containment measures meet the KCI thresholds (100% if greater than the weighted average of 3.1.1.1 and 3.1.1.2)
- There is an external structured process for proving that containment measures are sufficient to meet the relevant containment KCI, such as a security audit, prior to their implementation (i.e., before the corresponding KRI threshold is crossed).
- Detail is provided on how experts are chosen, including the required expertise and guarantees of independence.
- External reports are made available (with sensitive information redacted) to give a sense of the third parties' confidence that the measures meet the threshold.
3.1.2 Deployment measures (35%)
3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%)
- The planned or already implemented deployment measures are precisely defined for all deployment KCI thresholds.
3.1.2.2 Proof that deployment measures are sufficient to meet the KCI thresholds (40%)
- There is a pre-emptive justification that the measures are sufficient to meet the relevant deployment KCI (e.g., "this quantity of rejection fine-tuning would enable us to reach our target of 99.9% jailbreak resistance, as shown by these experiments [...]").
- Partial credit is given if there exists a process for soliciting proof. However, to gain marks over 50, the first item should be satisfied.
- The implementation of the KRI-KCI pairing is predictable in advance, leaving as little to discretion as possible.
- The sufficiency criteria are determined ex ante, with justification for why these criteria constitute adequate proof.
3.1.2.3 Strong third-party verification process to verify that the deployment measures meet the KCI thresholds (100% if greater than the weighted average of 3.1.2.1 and 3.1.2.2)
- There is an external structured process for proving that deployment measures are sufficient to meet the relevant deployment KCI, such as external red-teaming of safeguards.
- Detail is provided on how experts are chosen, including the required expertise and guarantees of independence.
- External reports are made available (with sensitive information redacted) to give a sense of the third parties' confidence that the measures meet the threshold.
3.1.3 Assurance processes (30%)
3.1.3.1 Credible plans towards the development of assurance properties (40%)
- Frameworks must state whether current assurance processes are sufficient to meet the required assurance process KCI.
- If insufficient, the framework should articulate (a) at what KRI the assurance processes become necessary, and (b) justification for why they believe they will have sufficient assurance processes by the time the relevant KRI is reached, including (c) technical milestones and estimates of when these milestones will need to be reached given forecasted capabilities growth
- If no plans are given, 3.1.3.2 must be at least 50%.
3.1.3.2 Evidence that the assurance processes are sufficient to achieve their corresponding KCI thresholds (40%)
- A process is defined for how the company will empirically collect evidence showing that assurance processes are credible, such as via demonstrations with model organisms or theoretical proofs such as Irving et al. (2023).
3.1.3.3 The underlying assumptions that are essential for assurance processes to be sufficient in 3.1.3.2 are clearly outlined, and justified (20%)
- The framework outlines the key technical assumptions underlying its current planned assurance approach, for instance that there is no undetected sandbagging, or that the chain of thought is faithful.
- There is justification for why these assumptions are valid, e.g., stress-testing of these assumptions is performed and results are presented with confidence levels.
3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%)
3.2.1 Monitoring of KRIs (40%)
3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%)
- For KRIs that require capability assessments (evaluations), the framework provides justification for why their elicitation methods will upper bound the elicited capabilities given the resources and methods available to potential threat actors, for the relevant risk model for that KRI. For instance, evaluations may be performed on a model that is finetuned for the task, or the helpful only model, to account for the case that threat actors attain weights / remove the effect of sandbagging and upper bound capabilities.
- There is detail on these elicitation methods, e.g. the amount of compute used for finetuning.
3.2.1.2 Specification of evaluation frequency is given in measurable, predictable terms (effective compute variation and fixed time periods) (25%)
- The framework outlines some process for regularly re-conducting evaluations with maximal elicitation, to factor in previously unaccounted post training enhancements or elicitation methods.
- The frequency should be based both on time (e.g., every 6 months) and on scaling progress (e.g., increases in effective compute used in training trigger evaluation against more advanced KRIs).
- The chosen frequency is justified with clear reasoning.
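The dual trigger described above can be sketched as a simple check. This is an illustrative sketch only; the specific thresholds (six months, a 4x growth in effective compute) are hypothetical placeholders, not values prescribed by any framework.

```python
from datetime import datetime, timedelta

# Hypothetical trigger values -- a real framework would justify these choices.
REEVAL_INTERVAL = timedelta(days=182)   # time-based trigger: roughly every 6 months
COMPUTE_MULTIPLIER = 4.0                # scaling trigger: 4x effective training compute

def needs_reevaluation(last_eval_date: datetime,
                       last_eval_compute: float,
                       current_compute: float,
                       now: datetime) -> bool:
    """Flag that capability evaluations must be re-run if EITHER the fixed
    time period has elapsed OR effective training compute has grown past
    the multiplier since the last evaluation."""
    time_due = now - last_eval_date >= REEVAL_INTERVAL
    compute_due = current_compute >= last_eval_compute * COMPUTE_MULTIPLIER
    return time_due or compute_due
```

Requiring either condition (rather than both) ensures that a rapid scale-up cannot outpace a purely calendar-based evaluation schedule, and that a model left unevaluated for a long period is still re-tested even without scaling progress.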
3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%)
- For KRIs that require capability assessments (evaluations), there is explicit documentation of (a) the specific methodologies used to incorporate post-training enhancements into capability measurements, and/or (b) the size of the safety/uncertainty margin used to account for post-training enhancements that occur after evaluation is complete, with justification for the size of this margin based on forecasting exercises informed by the speed of progress of previous post-training enhancements.
- The uncertainty margin accounts for how post-training enhancements vary with model structure – namely, post-training enhancements are much more scalable with reasoning models, as inference compute can often be scaled to improve capabilities.
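As an illustration of option (b), a margin can be added on top of the measured score before the KRI comparison. This is a minimal sketch; the function names and numeric values are hypothetical, and the margin itself would come from the forecasting exercises described above.

```python
def effective_capability(measured_score: float, margin: float) -> float:
    """Measured capability plus an uncertainty margin covering
    post-training enhancements discovered after the evaluation."""
    return measured_score + margin

def kri_crossed(measured_score: float, margin: float, kri_threshold: float) -> bool:
    """Treat the KRI as crossed if the margin-adjusted score reaches the
    threshold, so the decision errs on the side of caution."""
    return effective_capability(measured_score, margin) >= kri_threshold
```

For example, a model measured at 0.70 on some benchmark with a 0.10 margin would be treated as having crossed a 0.75 KRI threshold, even though the raw measurement falls below it.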
3.2.1.4 Vetting of KRI assessment protocols by third parties (15%)
- There is a process for independent third parties to review the internal methods for assessing KRI status, including evaluation methodologies.
- Detail is provided on how experts are chosen, including the required expertise and guarantees of independence.
3.2.1.5 KRI assessments are conducted, replicated or audited by third parties (15%)
- There is a process for assessing KRI assessment results externally (i.e., by independent third parties), to ensure that KRI assessments are accurate. This could materialise as internal KRI assessments being replicated by external parties (or audited), or KRI assessments being outsourced to third parties.
3.2.2 Monitoring of KCIs (40%)
3.2.2.1 Detailed description of safeguard efficacy methodology, with empirical justification that the KCI measures will continue to satisfy KCI thresholds (40%)
- The framework describes systematic, ongoing monitoring to ensure mitigation effectiveness is tracked continuously such that the KCI threshold will be met, when required.
- There is justification that threshold crossings will be detected within suitable confidence levels. The framework includes failure mode analysis or some other methodology to minimise the chance of failure.
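One way to make "suitable confidence levels" concrete is to require that a statistical lower bound on measured safeguard efficacy, not just the point estimate, clears the KCI threshold. The sketch below uses a Wilson score interval; the 99.9% jailbreak-block target is a hypothetical example, not a value from any specific framework.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 2.576) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion
    (z = 2.576 corresponds to roughly a 99% two-sided interval)."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = p + z**2 / (2 * trials)
    spread = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - spread) / denom

def kci_satisfied(blocked: int, attempts: int, threshold: float = 0.999) -> bool:
    """Count the KCI as met only if the confidence-interval lower bound
    on safeguard efficacy clears the (hypothetical) 99.9% threshold."""
    return wilson_lower_bound(blocked, attempts) >= threshold
```

With 999 blocked jailbreak attempts out of 1,000, the point estimate (99.9%) touches the threshold but the lower confidence bound does not, so monitoring on this rule would flag the KCI as unmet until more evidence accumulates.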
3.2.2.2 Vetting of KCI assessment protocols by third parties (30%)
- There is a process for independent third parties to review the methods for assessing the efficacy of KCI measures.
3.2.2.3 KCI assessments are conducted, replicated or audited by third parties (30%)
- There is a process for assessing KCIs internally and externally (i.e., by independent third parties), to ensure that KCI assessments are accurate. This means either internal KCI assessments are replicated by external parties (audited), or KCI assessments are outsourced to third parties.
- Detail is provided on how experts are chosen, including the required expertise and guarantees of independence.
3.2.3 Transparency of KRI/KCI assessment results (10%)
3.2.3.1 Sharing of KRI and KCI assessment results with relevant stakeholders as appropriate (85%)
- If a KRI is crossed for any risk domain, the company commits to notifying regulators/the relevant government authorities in a timely manner.
- All KRI and KCI assessments (i.e., evaluations) are made public, according to predefined criteria.
3.2.3.2 Commitment from the company to not interfere with nor suppress external KRI/KCI assessments' findings (15%)
- The framework commits to permitting the reports, which detail the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties), to be written independently.
3.2.4 Monitoring for novel risks (10%)
3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%)
- There is a structured process for identifying novel risk domains or novel risk models within known risk domains.
- There is justification for why this process will identify novel risks.
3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%)
- Novel risks or risk pathways identified via monitoring post-deployment trigger further risk modeling and scenario analysis. This may include updating multiple or all risk models. (For instance, encountering evidence of instrumental reasoning via open-ended red teaming likely requires updates to multiple risk models.)