2.2.1 Key Risk Indicators (KRI) (30%) 21%
2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%
Risk indicators are given in the form of LiveCodeBench results (>50%) and private benchmarks. To improve, the private benchmarks should at least be described, so that what their thresholds measure is transparent. Further, some justification for why LiveCodeBench is an appropriate KRI is needed, as it otherwise seems arbitrary; that is, the KRI does not appear to be grounded in risk modelling.
Quotes:
“We compare our models’ capability to publicly available closed and open-source models, to determine whether our models are sufficiently capable such that there is a real risk of setting a new state-of-the-art in dangerous AI capabilities.
A representative public benchmark we will use is LiveCodeBench, which aggregates problems from various competitive programming websites. As of publishing, the best public models currently have the following scores (Pass@1 on Code Generation, evaluation timeframe: estimated knowledge cut-off date to latest LiveCodeBench evaluation set):
Claude-3.5-Sonnet: 48.8% (04/01/2024 – 06/01/2024)
GPT-4-Turbo-2024-04-09: 43.9% (05/01/2023 – 06/01/2024)
GPT-4O-2024-05-13: 43.4% (11/01/2023 – 06/01/2024)
GPT-4-Turbo-1106: 38.8% (05/01/2023 – 06/01/2024)
DeepSeekCoder-V2: 38.1% (12/01/2023 – 06/01/2024)
Based on these scores, when, at the end of a training run, our models exceed a threshold of 50% accuracy on LiveCodeBench, we will trigger our commitment to incorporate a full system of dangerous capabilities evaluations and planned mitigations into our AGI Readiness Policy, prior to substantial further model development, or publicly deploying such models.
As an alternative threshold definition, we will also make use of a set of private benchmarks that we use internally to assess our product’s level of software engineering capability. For comparison, we will also perform these evaluations on publicly available AI systems that are generally considered to be state-of-the-art. We will have privately specified thresholds such that if we see that our model performs significantly better than publicly available models, this is considered evidence that we may be breaking new ground in terms of AI systems’ dangerous capabilities. Reaching these thresholds on our private benchmarks will also trigger our commitments to develop our full AGI Readiness Policy, with threat model evaluations and mitigations, before substantial further model development or deployment.
The expanded AGI Readiness Policy required by the above commitments will also specify more comprehensive guidelines for evaluation thresholds that apply during development and training, not just deployment, of future advanced models that cross certain eval thresholds.”
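To illustrate the decision rule in the quoted passage, the following minimal Python sketch encodes the two triggers. The 50% LiveCodeBench threshold is taken from the policy; the "significance" margin for the private benchmarks, the function name, and the example scores are illustrative assumptions, since the policy does not disclose its private thresholds.

LIVECODEBENCH_TRIGGER = 0.50  # Pass@1 on Code Generation, per the quoted policy

def agi_readiness_commitment_triggered(
    livecodebench_pass_at_1: float,
    private_benchmark_score: float,
    best_public_model_private_score: float,
    significance_margin: float = 0.05,  # placeholder; not specified in the policy
) -> bool:
    """Return True if either KRI described in the quoted policy is crossed."""
    crosses_public_kri = livecodebench_pass_at_1 > LIVECODEBENCH_TRIGGER
    crosses_private_kri = (
        private_benchmark_score
        > best_public_model_private_score + significance_margin
    )
    return crosses_public_kri or crosses_private_kri

# Example: an end-of-run score of 52% Pass@1 would trigger the commitment
# regardless of the private-benchmark comparison (all inputs illustrative).
assert agi_readiness_commitment_triggered(0.52, 0.40, 0.45)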
2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 25%
Risk indicators are given in the form of LiveCodeBench results (>50%) and private benchmarks. These are quantitative and compared to publicly available models, which is commendable. To improve, however, the private benchmarks should at least be described, so that what their thresholds measure is transparent. Further, some justification for why LiveCodeBench is an appropriate KRI is needed, as it otherwise seems arbitrary; it should be linked to risk modelling, for instance.
Quotes:
“We compare our models’ capability to publicly available closed and open-source models, to determine whether our models are sufficiently capable such that there is a real risk of setting a new state-of-the-art in dangerous AI capabilities.
A representative public benchmark we will use is LiveCodeBench, which aggregates problems from various competitive programming websites. As of publishing, the best public models currently have the following scores (Pass@1 on Code Generation, evaluation timeframe: estimated knowledge cut-off date to latest LiveCodeBench evaluation set):
Claude-3.5-Sonnet: 48.8% (04/01/2024 – 06/01/2024)
GPT-4-Turbo-2024-04-09: 43.9% (05/01/2023 – 06/01/2024)
GPT-4O-2024-05-13: 43.4% (11/01/2023 – 06/01/2024)
GPT-4-Turbo-1106: 38.8% (05/01/2023 – 06/01/2024)
DeepSeekCoder-V2: 38.1% (12/01/2023 – 06/01/2024)
Based on these scores, when, at the end of a training run, our models exceed a threshold of 50% accuracy on LiveCodeBench, we will trigger our commitment to incorporate a full system of dangerous capabilities evaluations and planned mitigations into our AGI Readiness Policy, prior to substantial further model development, or publicly deploying such models.
As an alternative threshold definition, we will also make use of a set of private benchmarks that we use internally to assess our product’s level of software engineering capability. For comparison, we will also perform these evaluations on publicly available AI systems that are generally considered to be state-of-the-art. We will have privately specified thresholds such that if we see that our model performs significantly better than publicly available models, this is considered evidence that we may be breaking new ground in terms of AI systems’ dangerous capabilities. Reaching these thresholds on our private benchmarks will also trigger our commitments to develop our full AGI Readiness Policy, with threat model evaluations and mitigations, before substantial further model development or deployment.
The expanded AGI Readiness Policy required by the above commitments will also specify more comprehensive guidelines for evaluation thresholds that apply during development and training, not just deployment, of future advanced models that cross certain eval thresholds.”
2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 0%
There are no KRIs based on levels of risk in the external environment. While their private benchmark thresholds are defined relative to publicly available state-of-the-art models, satisfying this criterion requires a KRI that is contingent on risk conditions external to the model’s capabilities.
Quotes:
“As an alternative threshold definition, we will also make use of a set of private benchmarks that we use internally to assess our product’s level of software engineering capability. For comparison, we will also perform these evaluations on publicly available AI systems that are generally considered to be state-of-the-art. We will have privately specified thresholds such that if we see that our model performs significantly better than publicly available models, this is considered evidence that we may be breaking new ground in terms of AI systems’ dangerous capabilities. Reaching these thresholds on our private benchmarks will also trigger our commitments to develop our full AGI Readiness Policy, with threat model evaluations and mitigations, before substantial further model development or deployment.”
2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 10%
They mention that the residual risk should be such that they can “continue development and deployment in a safe manner”. They also note that they may raise their KRI thresholds if public evidence emerges that models exceeding them can proliferate without significant catastrophic risk. Together, these show awareness of pairing KRI and KCI thresholds to show that the residual risk remains below the risk tolerance. However, this link could be made more explicit and grounded in risk modelling, and “a safe manner” should be more precisely defined.
Most importantly, there should be justification for why, if the KRI threshold is crossed but the KCI threshold is met, the residual risk remains below the risk tolerance.
Quotes:
“Prior to publicly deploying models that exceed the current frontier of coding performance, we will evaluate them for dangerous capabilities and ensure that we have sufficient protective measures in place to continue development and deployment in a safe manner.”
“Over time, public evidence may emerge that it is safe for models that have demonstrated proficiency beyond the above thresholds to freely proliferate without posing any significant catastrophic risk to public safety. For this reason, we may update this threshold upward over time. We may also modify the public and private benchmarks used. Such a change will require approval by our Board of Directors, with input from external security and AI safety advisers.”
2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 25%
There is a clear policy to put development on hold if dangerous capability evaluations have not been developed by the time the KRI thresholds are crossed. As for KCIs, they commit to “delaying or pausing development in the worst case until the dangerous capability detected has been mitigated or contained.” However, more clarity on this decision should be given, such as what constitutes sufficient mitigation/containment and an explicit threshold that would determine pausing development. Conditions and the process for de-deployment should also be detailed.
Quotes:
“If the engineering team sees evidence that our AI systems have exceeded the current performance thresholds on the public and private benchmarks listed above, the team is responsible for making this known immediately to the leadership team and Magic’s Board of Directors (BOD).
We will then begin executing the dangerous capability evaluations we develop for our Covered Threat Models, and they will begin serving as triggers for more stringent information security measures and deployment mitigations. If we have not developed adequate dangerous capability evaluations by the time these benchmark thresholds are exceeded, we will halt further model development until our dangerous capability evaluations are ready.”
“In cases where said risk for any threat model passes a ‘red-line’, we will adopt safety measures outlined in the Threat Mitigations section, which include delaying or pausing development in the worst case until the dangerous capability detected has been mitigated or contained.”
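A minimal sketch of the escalation flow implied by these quotes, with hypothetical action labels; as noted above, the policy itself does not define what counts as sufficient mitigation or containment, so that branch below is an assumption about how the commitment would be operationalised.

def escalation_actions(kri_threshold_exceeded: bool,
                       capability_evals_ready: bool,
                       red_line_crossed: bool,
                       mitigated_or_contained: bool) -> list[str]:
    """Order the quoted commitments as an explicit decision sequence."""
    if not kri_threshold_exceeded:
        return ["continue development"]
    actions = ["report immediately to the leadership team and Board of Directors"]
    if not capability_evals_ready:
        # "we will halt further model development until our dangerous
        # capability evaluations are ready"
        actions.append("halt further model development until evaluations are ready")
        return actions
    actions.append("run dangerous capability evaluations for Covered Threat Models")
    if red_line_crossed and not mitigated_or_contained:
        # Worst case under the Threat Mitigations section; the criterion for
        # "mitigated or contained" is left unspecified in the policy.
        actions.append("delay or pause development until the capability is mitigated or contained")
    return actions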