2.2.1 Key Risk Indicators (KRI) (30%) 22%
2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%
The framework describes two types of KRIs: those used for the ‘leading indicator assessment’, and those used for the ‘deeper capability assessment’.
For the leading indicator assessment KRIs, the framework gives categories of benchmarks, but neither the actual benchmarks nor the score thresholds on them that would indicate sufficiently high risk. This could use more detail and could be better grounded in risk modelling.
For the deeper capability assessment KRIs, there could be more detail on what ‘meaningful’ and ‘significantly’ mean for the purposes of the risk assessment. These KRIs are, however, grounded in risk modelling. To improve, KRIs should map directly to the evaluation tests performed.
Quotes:
For instance, CBRN: “High: The model provides a meaningful uplift to an expert’s (PhD level education in related fields) ability to develop and deliver a novel CBRN threat. The model provides a meaningful uplift to a medium-skilled actor’s (e.g., STEM education) ability to develop and deliver a known CBRN threat. Critical: The model provides a meaningful uplift to an expert’s ability to develop a highly dangerous novel threat or significantly lowers the barriers to a low-skilled actor developing and delivering a known CBRN threat.” (p. 11)
“Through the processes described in this framework, Microsoft’s most advanced models are assessed for leading indicators of the framework’s high-risk capabilities. This is done using state-of-the-art benchmarks for the following advanced general-purpose capabilities, identified as precursors to high-risk capabilities:
- General reasoning
- Scientific and mathematical reasoning
- Long-context reasoning
- Spatial understanding and awareness
- Autonomy, planning, and tool use
- Advanced software engineering” (p. 5)
Footnote 1, after “benchmarks”: “For a benchmark to be included in our suite of leading indicator assessments it must: 1) have low saturation (i.e., the best performing models typically score lower than 70%); 2) measure an advanced capability, for example, mathematical reasoning, rather than an application-oriented capability like financial market prediction; and 3) have a sufficient number of prompts to account for non-determinism in model output.” (p. 5)
“Deeper capability assessment provides a robust indication of whether a model possesses a tracked capability and, if so, whether this capability is at a low, medium, high, or critical risk level, informing decisions about appropriate mitigations and deployment. We use qualitative capability thresholds to guide this classification process as they offer important flexibility across different models and contexts at a time of nascent and evolving understanding of frontier AI risk assessment and management practice.” (p. 5)
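The inclusion criteria in footnote 1 are concrete enough to be made operational. Below is a minimal sketch of what such a filter could look like, assuming a hypothetical `Benchmark` record; the 70% saturation figure comes from the footnote, while the prompt-count cut-off, field names, and capability labels are our illustrative assumptions, not Microsoft’s.

```python
from dataclasses import dataclass

# The saturation ceiling is footnote 1's figure (p. 5); the prompt-count
# minimum is a placeholder, since "sufficient" is left unspecified.
SATURATION_CEILING = 0.70   # best models typically score lower than 70%
MIN_PROMPTS = 200           # placeholder for "sufficient number of prompts"

# The six advanced general-purpose capabilities tracked as precursors (p. 5).
ADVANCED_CAPABILITIES = {
    "general_reasoning",
    "scientific_mathematical_reasoning",
    "long_context_reasoning",
    "spatial_understanding_awareness",
    "autonomy_planning_tool_use",
    "advanced_software_engineering",
}

@dataclass
class Benchmark:
    name: str
    best_frontier_score: float  # top score among current frontier models, in [0, 1]
    capability: str             # capability the benchmark measures
    num_prompts: int            # prompts available, to average over non-determinism

def eligible_for_leading_indicator_suite(b: Benchmark) -> bool:
    """Apply footnote 1's three inclusion criteria to a candidate benchmark."""
    return (
        b.best_frontier_score < SATURATION_CEILING  # 1) low saturation
        and b.capability in ADVANCED_CAPABILITIES   # 2) advanced, not application-oriented
        and b.num_prompts >= MIN_PROMPTS            # 3) enough prompts for stable estimates
    )
```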
2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%
The framework describes two types of KRIs: those used for the ‘leading indicator assessment’, and those used for the ‘deeper capability assessment’.
For the leading indicator assessment KRIs, the framework gives categories of benchmarks, but neither the actual benchmarks nor the score thresholds on them that would indicate sufficiently high risk. This could use more detail; these thresholds, moreover, could likely be quantitatively defined.
For the deeper capability assessment KRIs, the framework explicitly avoids quantitative thresholds, preferring qualitative ones. However, quantitative thresholds need not be inflexible, and to make risk decisions transparent and provide clear guidance, KRIs should be quantitative where possible.
Microsoft should still publish the actual evaluations and thresholds under which it currently operates. KRI-KCI pairings should be predictable in advance, allowing as little discretion as possible, and a qualitative threshold may be more arbitrary than a conservative quantitative estimate until improved risk indicators can be developed. A sketch of what such a published threshold table could look like follows the quotes below.
Quotes:
“Through the processes described in this framework, Microsoft’s most advanced models are assessed for leading indicators of the framework’s high-risk capabilities. This is done using state-of-the-art benchmarks for the following advanced general-purpose capabilities, identified as precursors to high-risk capabilities:
- General reasoning
- Scientific and mathematical reasoning
- Long-context reasoning
- Spatial understanding and awareness
- Autonomy, planning, and tool use
- Advanced software engineering” (p. 5)
“Deeper capability assessment provides a robust indication of whether a model possesses a tracked capability and, if so, whether this capability is at a low, medium, high, or critical risk level, informing decisions about appropriate mitigations and deployment. We use qualitative capability thresholds to guide this classification process as they offer important flexibility across different models and contexts at a time of nascent and evolving understanding of frontier AI risk assessment and management practice.” (p. 5)
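To make the recommendation concrete, here is a minimal sketch of a published, conservative quantitative KRI threshold table. All capability names, benchmark scores, and numbers are hypothetical illustrations and do not appear in the framework.

```python
# Hypothetical, conservative quantitative KRI thresholds: a score at or
# above a threshold on the relevant evaluation triggers that risk level.
# All names and numbers are illustrative, not Microsoft's.
KRI_THRESHOLDS = {
    "cbrn_uplift": [
        # (threshold on eval score in [0, 1], risk level triggered)
        (0.30, "medium"),
        (0.55, "high"),
        (0.80, "critical"),
    ],
    "offensive_cyber": [
        (0.35, "medium"),
        (0.60, "high"),
        (0.85, "critical"),
    ],
}

def classify_risk(capability: str, score: float) -> str:
    """Map an evaluation score to the highest risk level whose threshold it meets."""
    level = "low"
    for threshold, risk_level in KRI_THRESHOLDS[capability]:
        if score >= threshold:
            level = risk_level
    return level

# Example: a CBRN-uplift evaluation score of 0.62 would classify as "high".
assert classify_risk("cbrn_uplift", 0.62) == "high"
```

Publishing such a table, with the explicit caveat that thresholds will be revised as risk indicators improve, removes discretion from the KRI-KCI pairing while remaining flexible.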
2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 10%
Whilst there is some indication that risk factors external to the model must also be monitored and may feed into a KRI, the framework does not specify what these external risks are, how they are monitored, or the thresholds that determine when such a KRI has been crossed.
Quotes:
“The results of capability evaluation and an assessment of risk factors external to the model then inform a determination as to whether a model has a tracked capability and to what level.” (p. 6)
“In addition to high-risk capabilities, a broader set of risks are governed when Microsoft develops and deploys AI technologies. Under Microsoft’s comprehensive AI governance program, frontier models—as well as other models and AI systems—are subject to relevant evaluation, with mitigations then applied to bring overall risk to an appropriate level.
Information on model or system performance, responsible use, and suggested system-level evaluations is shared with downstream actors integrating models into systems, including external system developers and deployers and teams at Microsoft building models. […] Our efforts to assess and mitigate risks related to this framework’s tracked capabilities benefit from this broadly applied governance program, which is continuously improved. The remainder of this framework addresses more specifically the assessment and mitigation of risks relating to the framework’s tracked capabilities.” (p. 4)
2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 10%
There is a clear acknowledgment that KRIs and KCIs pair together to bring residual risk below the risk tolerance, or “an acceptable level”. However, this pairing is not grounded in risk modelling: no justification is given, for any risk domain, that the paired thresholds actually keep risk below the tolerance. Further, the risk assessment is contingent on other companies’ risk tolerances: “This holistic risk assessment also considers the marginal capability uplift a model may provide over and above currently available tools and information, including currently available open-weights models.” A schematic of what such grounding would require follows the quotes below.
Quotes:
“This framework assesses Microsoft’s most advanced AI models for signs that they may have these capabilities and, if so, whether the capability poses a low, medium, high, or critical risk to national security or public safety (more detail in Appendix I). This classification then guides the application of appropriate and proportionate mitigations so that a model’s risks remain at an acceptable level.” (p. 3)
“The framework monitors Microsoft’s most capable AI models for leading indicators of high-risk capabilities and triggers deeper assessment if leading indicators are observed. As and when risks are identified, proportional mitigations are applied so that risks are kept at an appropriate level. This approach provides confidence that highly capable models are identified before relevant risks emerge.” (p. 2)
“This holistic risk assessment also considers the marginal capability uplift a model may provide over and above currently available tools and information, including currently available open-weights models.” (p. 7)
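For illustration, grounding a KRI-KCI pair in risk modelling would mean demonstrating, for each risk domain, something like the following inequality. This is a schematic sketch; the decomposition and symbols are ours, not the framework’s.

```latex
\[
\mathrm{Risk}_{\mathrm{residual}}
  \;=\; \sum_{s \in S} P(\mathrm{attempt}_s)\,
  P\!\left(\mathrm{success}_s \,\middle|\, \text{capability at KRI level},\ \text{mitigations meeting KCI}\right)
  \mathrm{Harm}_s
  \;\le\; \text{risk tolerance}
\]
```

Here S is the set of modelled threat scenarios for the domain. Exhibiting this inequality for each KRI-KCI pair, with evidence behind each term, is what grounding in risk modelling would require; the framework currently asserts the conclusion without exhibiting the terms.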
2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 25%
There is a clear commitment to putting development (and deployment) on hold if a risk cannot be sufficiently mitigated. To improve, this could have more detail, for instance by linking the commitment to explicit KCI thresholds so that the decision to pause is unambiguous (see the sketch after the quotes below). A defined process for pausing development could also be set out.
Quotes:
“If, during the implementation of this framework, we identify a risk we cannot sufficiently mitigate, we will pause development and deployment until the point at which mitigation practices evolve to meet the risk.” (p. 8)
“The leading indicator assessment is run during pre-training, after pre-training is complete, and prior to deployment to ensure a comprehensive assessment as to whether a model warrants deeper inspection. This also allows for pause, review, and the application of mitigations as appropriate if a model shows signs of significant capability improvements.” (p. 5)
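As an illustration of how the pause commitment could be made unambiguous, the decision could be tied mechanically to the KRI-KCI pairing. This is a hypothetical sketch; the framework defines no such procedure, and the levels and function names are our assumptions.

```python
# Hypothetical decision rule tying the pause commitment to explicit
# KRI and KCI levels; the ordering and function are illustrative only.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def development_may_proceed(kri_level: str, kci_level_achieved: str) -> bool:
    """Proceed only if demonstrated mitigations (KCI) meet or exceed the
    control level required by the assessed capability level (KRI)."""
    return RISK_ORDER[kci_level_achieved] >= RISK_ORDER[kri_level]

# Example: a model assessed "high" on a tracked capability with only
# "medium"-grade mitigations demonstrated would trigger a pause.
assert not development_may_proceed("high", "medium")
```

A rule of this shape makes the pause decision auditable in advance: anyone holding the published KRI and KCI thresholds can predict when development must stop.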