2.2.1 Key Risk Indicators (KRI) (30%) 22%
2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%
The framework describes two types of KRIs: those used for the ‘leading indicator assessment’, and those used for the ‘deeper capability assessment’.
For the leading indicator assessment KRIs, the framework gives categories of benchmarks, but neither the actual benchmarks nor the score thresholds on them that would indicate sufficiently high risk. This could use more detail and could be better grounded in risk modelling.
For the deeper capability assessment KRIs, there could be more detail on what ‘meaningful’ and ‘significantly’ mean for the purposes of the risk assessment. These KRIs are, however, grounded in risk modelling. To improve, KRIs should map directly to the evaluation tests performed.
Quotes:
For instance, CBRN: “High: The model provides a meaningful uplift to an expert’s (PhD level education in related fields) ability to develop and deliver a novel CBRN threat. The model provides a meaningful uplift to a medium-skilled actor’s (e.g., STEM education) ability to develop and deliver a known CBRN threat. Critical: The model provides a meaningful uplift to an expert’s ability to develop a highly dangerous novel threat or significantly lowers the barriers to a low-skilled actor developing and delivering a known CBRN threat.” (p. 11)
“Through the processes described in this framework, Microsoft’s most advanced models are assessed for leading indicators of the framework’s high-risk capabilities. This is done using state-of-the-art benchmarks for the following advanced general-purpose capabilities, identified as precursors to high-risk capabilities:
- General reasoning
- Scientific and mathematical reasoning
- Long-context reasoning
- Spatial understanding and awareness
- Autonomy, planning, and tool use
- Advanced software engineering” (p. 5)
Footnote 1, after “benchmarks”: “For a benchmark to be included in our suite of leading indicator assessments it must: 1) have low saturation (i.e., the best performing models typically score lower than 70%); 2) measure an advanced capability, for example, mathematical reasoning, rather than an application-oriented capability like financial market prediction; and 3) have a sufficient number of prompts to account for non-determinism in model output.” (p. 5)
“Deeper capability assessment provides a robust indication of whether a model possesses a tracked capability and, if so, whether this capability is at a low, medium, high, or critical risk level, informing decisions about appropriate mitigations and deployment. We use qualitative capability thresholds to guide this classification process as they offer important flexibility across different models and contexts at a time of nascent and evolving understanding of frontier AI risk assessment and management practice.” (p. 5)
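The inclusion criteria in footnote 1 are concrete enough to be made operational. Below is a minimal sketch of what such a filter could look like, assuming a hypothetical `Benchmark` record; the 70% saturation figure comes from the footnote, while the prompt-count cut-off, field names, and capability labels are our illustrative assumptions, not Microsoft’s.

```python
from dataclasses import dataclass

# The saturation ceiling is footnote 1's figure (p. 5); the prompt-count
# minimum is a placeholder, since "sufficient" is left unspecified.
SATURATION_CEILING = 0.70   # best models typically score lower than 70%
MIN_PROMPTS = 200           # placeholder for "sufficient number of prompts"

# The six advanced general-purpose capabilities tracked as precursors (p. 5).
ADVANCED_CAPABILITIES = {
    "general_reasoning",
    "scientific_mathematical_reasoning",
    "long_context_reasoning",
    "spatial_understanding_awareness",
    "autonomy_planning_tool_use",
    "advanced_software_engineering",
}

@dataclass
class Benchmark:
    name: str
    best_frontier_score: float  # top score among current frontier models, in [0, 1]
    capability: str             # capability the benchmark measures
    num_prompts: int            # prompts available, to average over non-determinism

def eligible_for_leading_indicator_suite(b: Benchmark) -> bool:
    """Apply footnote 1's three inclusion criteria to a candidate benchmark."""
    return (
        b.best_frontier_score < SATURATION_CEILING  # 1) low saturation
        and b.capability in ADVANCED_CAPABILITIES   # 2) advanced, not application-oriented
        and b.num_prompts >= MIN_PROMPTS            # 3) enough prompts for stable estimates
    )
```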
2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%
The framework describes two types of KRIs: those used for the ‘leading indicator assessment’, and those used for the ‘deeper capability assessment’.
For the leading indicator assessment KRIs, the framework gives categories of benchmarks, but neither the actual benchmarks nor the score thresholds on them that would indicate sufficiently high risk. This could use more detail; these thresholds, moreover, could likely be quantitatively defined.
For the deeper capability assessment KRIs, the framework explicitly avoids quantitative thresholds, preferring qualitative ones. However, quantitative thresholds need not be inflexible, and to make risk decisions transparent and provide clear guidance, KRIs should be quantitative where possible.
Microsoft should still publish the actual evaluations and thresholds under which it currently operates. KRI-KCI pairings should be predictable in advance, allowing as little discretion as possible, and a qualitative threshold may be more arbitrary than a conservative quantitative estimate until improved risk indicators can be developed. A sketch of what such a published threshold table could look like follows the quotes below.
Quotes:
“Through the processes described in this framework, Microsoft’s most advanced models are assessed for leading indicators of the framework’s high-risk capabilities. This is done using state-of-the-art benchmarks for the following advanced general-purpose capabilities, identified as precursors to high-risk capabilities:
- General reasoning
- Scientific and mathematical reasoning
- Long-context reasoning
- Spatial understanding and awareness
- Autonomy, planning, and tool use
- Advanced software engineering” (p. 5)
“Deeper capability assessment provides a robust indication of whether a model possesses a tracked capability and, if so, whether this capability is at a low, medium, high, or critical risk level, informing decisions about appropriate mitigations and deployment. We use qualitative capability thresholds to guide this classification process as they offer important flexibility across different models and contexts at a time of nascent and evolving understanding of frontier AI risk assessment and management practice.” (p. 5)
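To make the recommendation concrete, here is a minimal sketch of a published, conservative quantitative KRI threshold table. All capability names, benchmark scores, and numbers are hypothetical illustrations and do not appear in the framework.

```python
# Hypothetical, conservative quantitative KRI thresholds: a score at or
# above a threshold on the relevant evaluation triggers that risk level.
# All names and numbers are illustrative, not Microsoft's.
KRI_THRESHOLDS = {
    "cbrn_uplift": [
        # (threshold on eval score in [0, 1], risk level triggered)
        (0.30, "medium"),
        (0.55, "high"),
        (0.80, "critical"),
    ],
    "offensive_cyber": [
        (0.35, "medium"),
        (0.60, "high"),
        (0.85, "critical"),
    ],
}

def classify_risk(capability: str, score: float) -> str:
    """Map an evaluation score to the highest risk level whose threshold it meets."""
    level = "low"
    for threshold, risk_level in KRI_THRESHOLDS[capability]:
        if score >= threshold:
            level = risk_level
    return level

# Example: a CBRN-uplift evaluation score of 0.62 would classify as "high".
assert classify_risk("cbrn_uplift", 0.62) == "high"
```

Publishing such a table, with the explicit caveat that thresholds will be revised as risk indicators improve, removes discretion from the KRI-KCI pairing while remaining flexible.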
2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 10%
Whilst there is some indication that risk factors external to the model must also be monitored and may feed into a KRI, the framework does not specify what these external risks are, how they are monitored, or the thresholds that determine when such a KRI has been crossed.
Quotes:
“The results of capability evaluation and an assessment of risk factors external to the model then inform a determination as to whether a model has a tracked capability and to what level.” (p. 6)
“In addition to high-risk capabilities, a broader set of risks are governed when Microsoft develops and deploys AI technologies. Under Microsoft’s comprehensive AI governance program, frontier models—as well as other models and AI systems—are subject to relevant evaluation, with mitigations then applied to bring overall risk to an appropriate level.
Information on model or system performance, responsible use, and suggested system-level evaluations is shared with downstream actors integrating models into systems, including external system developers and deployers and teams at Microsoft building models. […] Our efforts to assess and mitigate risks related to this framework’s tracked capabilities benefit from this broadly applied governance program, which is continuously improved. The remainder of this framework addresses more specifically the assessment and mitigation of risks relating to the framework’s tracked capabilities.” (p. 4)
2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 10%
There is a clear acknowledgment that KRIs and KCIs pair together to bring residual risk below the risk tolerance, or “an acceptable level”. However, this pairing is not grounded in risk modelling: no justification is given, for any risk domain, that the paired thresholds actually keep risk below the tolerance. Further, the risk assessment is contingent on other companies’ risk tolerances: “This holistic risk assessment also considers the marginal capability uplift a model may provide over and above currently available tools and information, including currently available open-weights models.” A schematic of what such grounding would require follows the quotes below.
Quotes:
“This framework assesses Microsoft’s most advanced AI models for signs that they may have these capabilities and, if so, whether the capability poses a low, medium, high, or critical risk to national security or public safety (more detail in Appendix I). This classification then guides the application of appropriate and proportionate mitigations so that a model’s risks remain at an acceptable level.” (p. 3)
“The framework monitors Microsoft’s most capable AI models for leading indicators of high-risk capabilities and triggers deeper assessment if leading indicators are observed. As and when risks are identified, proportional mitigations are applied so that risks are kept at an appropriate level. This approach provides confidence that highly capable models are identified before relevant risks emerge.” (p. 2)
“This holistic risk assessment also considers the marginal capability uplift a model may provide over and above currently available tools and information, including currently available open-weights models.” (p. 7)
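For illustration, grounding a KRI-KCI pair in risk modelling would mean demonstrating, for each risk domain, something like the following inequality. This is a schematic sketch; the decomposition and symbols are ours, not the framework’s.

```latex
\[
\mathrm{Risk}_{\mathrm{residual}}
  \;=\; \sum_{s \in S} P(\mathrm{attempt}_s)\,
  P\!\left(\mathrm{success}_s \,\middle|\, \text{capability at KRI level},\ \text{mitigations meeting KCI}\right)
  \mathrm{Harm}_s
  \;\le\; \text{risk tolerance}
\]
```

Here S is the set of modelled threat scenarios for the domain. Exhibiting this inequality for each KRI-KCI pair, with evidence behind each term, is what grounding in risk modelling would require; the framework currently asserts the conclusion without exhibiting the terms.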
2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 25%
There is a clear commitment to putting development (and deployment) on hold if a risk cannot be sufficiently mitigated. To improve, this could have more detail, for instance by linking the commitment to explicit KCI thresholds so that the decision to pause is unambiguous (see the sketch after the quotes below). A defined process for pausing development could also be set out.
Quotes:
“If, during the implementation of this framework, we identify a risk we cannot sufficiently mitigate, we will pause development and deployment until the point at which mitigation practices evolve to meet the risk.” (p. 8)
“The leading indicator assessment is run during pre-training, after pre-training is complete, and prior to deployment to ensure a comprehensive assessment as to whether a model warrants deeper inspection. This also allows for pause, review, and the application of mitigations as appropriate if a model shows signs of significant capability improvements.” (p. 5)
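As an illustration of how the pause commitment could be made unambiguous, the decision could be tied mechanically to the KRI-KCI pairing. This is a hypothetical sketch; the framework defines no such procedure, and the levels and function names are our assumptions.

```python
# Hypothetical decision rule tying the pause commitment to explicit
# KRI and KCI levels; the ordering and function are illustrative only.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def development_may_proceed(kri_level: str, kci_level_achieved: str) -> bool:
    """Proceed only if demonstrated mitigations (KCI) meet or exceed the
    control level required by the assessed capability level (KRI)."""
    return RISK_ORDER[kci_level_achieved] >= RISK_ORDER[kri_level]

# Example: a model assessed "high" on a tracked capability with only
# "medium"-grade mitigations demonstrated would trigger a pause.
assert not development_may_proceed("high", "medium")
```

A rule of this shape makes the pause decision auditable in advance: anyone holding the published KRI and KCI thresholds can predict when development must stop.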