xAI

Overall score: Very Weak (0.9/5)

Risk Identification: 22%
Risk Analysis and Evaluation: 21%
Risk Treatment: 5%
Risk Governance: 19%

Best in class


  • xAI stands out based on their commitment to risk ownership, stating uniquely among their peers that they intend to designate risk owners who will proactively manage each distinct risk, such as “WMD [Weapons of Mass Destruction], Cyber and Loss of Control.”
  • Their willingness to implement a quantitative risk tolerance is best in class.
Overview
Highlights relative to others

Clearer description of escalation procedures.

Risk tolerance is more quantitatively defined.

More detailed descriptions of how information will be shared with external stakeholders and governments.

Weaknesses relative to others

Poorer explanation of risk governance structure. No mention of a risk committee, audit team, Board committee, or advisory committee.

Little to no mention of risk modeling. Risk indicators do not appear to be derived from risk models.

Exclusion of automated AI R&D and persuasion as risk domains, without justification.

1.1 Classification of Applicable Known Risks (40%) 25%

1.1.1 Risks from literature and taxonomies are well covered (50%) 50%

The framework covers risks from CBRN, cyber, and loss of control. They show nuance in describing loss of control risks as being exacerbated by deception and sycophancy. To improve, they should also consider risks covered in the literature such as persuasion and automated AI R&D (or provide justification for their exclusion). They could also outline what informed their risk identification.

Note that they do mention they’ll monitor “the percent of code or percent of pull requests at xAI generated by our models, or other potential metrics related to AI research and development automation”, showing some awareness of the automated AI R&D risk domain. However, this is not strong enough evidence of properly accounting for or managing this risk in a systematized way.

Quotes:

“This RMF discusses two major categories of AI risk—malicious use and loss of control” (p. 1)

“xAI has focused on the risks of malicious use and loss of control, which cover many different specific risk scenarios.” (p. 1)

“Additionally, we conduct careful measurement of concerning model propensities that hypothetically might exacerbate loss of control risks, such as the propensity for deception or the propensity for sycophancy.” (p. 2)

“Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for bad actors seeking to develop chemical, biological, radiological, or nuclear (“CBRN”) or cyber weapons, and could help automate knowledge compilation to swiftly overcome bottlenecks to weapons development, amplifying the expected risk posed by such weapons of mass destruction.” (p. 2)

“Internal AI usage: Assess the percent of code or percent of pull requests at xAI generated by our models, or other potential metrics related to AI research and development automation.” (p. 8)

1.1.2 Exclusions are clearly justified and documented (50%) 0%

There is no justification for why some risks such as persuasion or automated AI R&D are not covered.

Quotes:

No relevant quotes found.

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 0%

1.2.1 Internal open-ended red teaming (70%) 0%

The framework doesn’t mention any pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. an extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources and required expertise.

Quotes:

No relevant quotes found.

1.2.2 Third party open-ended red teaming (30%) 0%

The framework doesn’t mention any third-party pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to an external process to identify either novel risk domains, or novel risk models/changed risk profiles within pre-specified risk domains (e.g. an extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources and required expertise.

Quotes:

No relevant quotes found.

1.3 Risk modeling (40%) 31%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 50%

It is clear they conduct threat modeling for the biological and chemical weapons domain: they outline the required facets for these harms to materialize, separated into planning, circumvention, materials, theory, and methods. It is commendable that they publish this risk model (though it could more concretely map out the causal pathway, as it currently reads more as a list of intervention points), and further that they name the experts with whom they developed it: “These steps were identified in close collaboration with domain matter experts at SecureBio, NIST, RAND, and EBRC.”

For malicious use risks (which presumably include CBRN risks), they “identify critical steps in major risk scenarios […] to inhibit user progress in advancing through such steps” and “[work] with a variety of governmental bodies, non-governmental organizations, private testing firms, industry peers, and academic researchers to identify such inhibiting steps.” This suggests they conduct risk modeling for CBRN risks.

To improve, they should publish the full risk models for other risk domains, and publish their methodology for deriving these risk models. They should also include justification for why they believe they have considered all relevant risk pathways. For instance, the risk model they give for biological/chemical weapons is only one pathway for materializing harm, and there may be other ways to realize harm within this risk domain (e.g. nuances within this provided pathway).

They should also conduct and publish risk models for their other risk domains, such as loss of control. Whilst they state that “Exact scenarios of loss of control risks are speculative and difficult to precisely specify” and that “while difficult to pinpoint particular risk scenarios, it is generally understood that certain concerning propensities of AI models, such as deception and sycophancy, may heighten the overall risk of such outcomes, such as propensities for deception and sycophancy”, this risk modeling is necessary to ensure their risk management is adequate.

Quotes:

“xAI approaches addressing risks using threat modeling.
To design a bioweapon, a malicious actor must undergo a design process. In this threat model, “ideation” involves actively planning for a biological attack; “design” involves retrieving blueprints for a hazardous agent, such as determining the DNA sequence; “build” consists of the protocols, reagents, and equipment necessary to create the threat; and “test” consists of measuring characteristics or properties of the pathogen of interest. By “learning” from these results and iterating after the test phase, the design can be revised until the threat is released [Nelson and Rose, 2023]. In the setting of biological and chemical weapons, xAI considers 5 critical steps where we restrict xAI models from providing detailed information or substantial assistance:

  • Planning: brainstorming ideas or plans for creating a pathogen or chemical weapons or
    precursors, capable of causing severe harm to humans, animals, or crops
  • Circumvention: circumventing existing supply chain controls in order to access:
    • Restricted biological supplies
    • Export controlled chemical or biological equipment
  • Materials: acquiring or producing pathogens on the US Select Agents list or Australia Group list, or CWC Schedule I chemicals or precursors
  • Theory: understanding molecular mechanisms governing, or methods for altering, certain pathogen traits such as transmissibility and virulence.
  • Methods: performing experimental methods specific to animal-infecting pathogens, including:
    • Methods that relate to infecting animals or human-sustaining crops with
      pathogens or sampling pathogens from animals
    • Methods that relate to pathogen replication in animal cell cultures, tissues, or
      eggs, including serial passage, viral rescue, and viral reactivation
    • Specific procedures to conduct BSL-3 or BSL-4 work using unapproved facilities
      and equipment
    • Genetic manipulation of animal-infecting pathogens
    • Quantification of pathogenicity, such as infectious dose, lethal dose, and assays of virus-cell interactions

These steps were identified in close collaboration with domain matter experts at SecureBio, NIST, RAND, and EBRC. xAI restricts its models from providing information that could accelerate user learning related to these steps through the use of AI-powered filters that specifically monitor user conversations for content matching these narrow topics and return a brief message declining to answer when activated.” (pp. 4–5)

“Independent third-party assessments of xAI’s current models on realistic offensive cyber tasks requiring identifying and chaining many exploits in sequence indicate that xAI’s models remain below the offensive cyber abilities of a human professional.” (p. 5)

“xAI has focused on the risks of malicious use and loss of control, which cover many different specific risk scenarios. Risk scenarios become more or less likely depending on different model behaviors. For example, an increase in offensive cyber capabilities heightens the risk of a rogue AI but does not significantly change the risk of enabling a bioterrorism attack.” (p. 1)

“Approach to Mitigating Risks of Malicious Use: Alongside comprehensive evaluations measuring dual-use capabilities, our mitigation strategy for malicious use risks is to identify critical steps in major risk scenarios and implement redundant layers of safeguards in our models to inhibit user progress in advancing through such steps. xAI works with a variety of governmental bodies, non-governmental organizations, private testing firms, industry peers, and academic researchers to identify such inhibiting steps, commonly referred to as bottlenecks, and implement commensurate safeguards to mitigate a model’s ability to assist in accelerating a bad actor’s progress through them.” (pp. 1–2)

“Approach to Mitigating Risks of Loss of Control: Exact scenarios of loss of control risks are speculative and difficult to precisely specify.” (p. 2)

“One of the most salient risks of AI within the public consciousness is the loss of control of advanced AI systems. While difficult to pinpoint particular risk scenarios, it is generally understood that certain concerning propensities of AI models, such as deception and sycophancy, may heighten the overall risk of such outcomes, such as propensities for deception and sycophancy.” (p. 6)

1.3.2 Risk modeling methodology (40%) 2%

1.3.2.1 Methodology precisely defined (70%) 0%

While they mention that “xAI approaches addressing risks using threat modeling” and that they “identify critical steps in major risk scenarios”, no risk modeling methodology is defined, nor is there any indication of one.

Quotes:

No relevant quotes found.

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 0%

No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.

Quotes:

No relevant quotes found.

1.3.2.3 Prioritization of severe and probable risks (15%) 10%

There is a focus on mitigating harms which have a “non-trivial risk of resulting in large-scale violence […]”. This demonstrates an implicit prioritization of risk models which have higher severity or probability.
However, there should be a clear statement that the most severe and probable harms are prioritized, with a defined process for doing so. Further, risk models should be published with quantified severity and probability scores, plus the reasoning behind these scores, to provide transparency into this prioritization.

Quotes:

“In this RMF, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).” (p. 3)

“Under this draft risk management framework, Grok would apply heightened safeguards if it receives requests that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber weapons on critical infrastructure.” (p. 2)

“It is also possible that AIs may develop value systems that are misaligned with humanity’s interests and inflict widespread harms upon the public.” (p. 6)

1.3.3 Third party validation of risk models (20%) 50%

While risk models are not formally verified by third parties, they do detail collaboration with third parties such as SecureBio, NIST, RAND and EBRC. Naming these parties in the framework counts towards accountability. To improve, a statement that risk models have been validated by third parties, such as through an external report or signoff/review, should be given.

Quotes:

“In the setting of biological and chemical weapons, xAI considers 5 critical steps where we restrict xAI models from providing detailed information or substantial assistance: […] These steps were identified in close collaboration with domain matter experts at SecureBio, NIST, RAND, and EBRC.” (pp. 4–5)

“Approach to Mitigating Risks of Malicious Use: Alongside comprehensive evaluations measuring dual-use capabilities, our mitigation strategy for malicious use risks is to identify critical steps in major risk scenarios and implement redundant layers of safeguards in our models to inhibit user progress in advancing through such steps. xAI works with a variety of governmental bodies, non-governmental organizations, private testing firms, industry peers, and academic researchers to identify such inhibiting steps, commonly referred to as bottlenecks, and implement commensurate safeguards to mitigate a model’s ability to assist in accelerating a bad actor’s progress through them.” (pp. 1–2)


2.1 Setting a Risk Tolerance (35%) 13%

2.1.1 Risk tolerance is defined (80%) 17%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 50%

They implicitly have a general risk tolerance for misuse, though they do not describe it explicitly as a risk tolerance: “we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).” The specificity of the tolerance is rewarded here.

However, they do not define any risk tolerance for loss of control, despite this being their other risk domain.

Quotes:

“xAI aims to reduce the risk that the use of its models might contribute to a bad actor potentially seriously injuring people, property, or national security interests, including reducing such risks by enacting measures to prevent use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for bad actors seeking to develop chemical, biological, radiological, or nuclear (“CBRN”) or cyber weapons, and could help automate knowledge compilation to swiftly overcome bottlenecks to weapons development, amplifying the expected risk posed by such weapons of mass destruction. Our most basic safeguard against malicious use is to train and instruct our publicly deployed models to decline requests showing clear intent to engage in criminal activity which poses risks of severe harm to others, also known as our basic refusal policy. Under this RMF, xAI’s models apply heightened safeguards if they receive user prompts that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber attacks on critical infrastructure. For example, xAI’s models apply heightened safeguards if they receive a request to act as an agent or tool of mass violence, or if they receive requests for step-by-step instructions for committing mass violence. In this RMF, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).” (pp. 2–3)

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 0%

The risk tolerance is quantitative on severity (deaths and damages) but not on probability – for instance, “non-trivial risk” is left undefined.

Quotes:

“xAI aims to reduce the risk that the use of its models might contribute to a bad actor potentially seriously injuring people, property, or national security interests, including reducing such risks by enacting measures to prevent use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for bad actors seeking to develop chemical, biological, radiological, or nuclear (“CBRN”) or cyber weapons, and could help automate knowledge compilation to swiftly overcome bottlenecks to weapons development, amplifying the expected risk posed by such weapons of mass destruction. Our most basic safeguard against malicious use is to train and instruct our publicly deployed models to decline requests showing clear intent to engage in criminal activity which poses risks of severe harm to others, also known as our basic refusal policy. Under this RMF, xAI’s models apply heightened safeguards if they receive user prompts that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber attacks on critical infrastructure. For example, xAI’s models apply heightened safeguards if they receive a request to act as an agent or tool of mass violence, or if they receive requests for step-by-step instructions for committing mass violence. In this RMF, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).” (pp. 2–3)

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

The risk tolerance is quantitative on severity (deaths and damages) but not on probability – for instance, “non-trivial risk” is left undefined.
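
For comparison, a tolerance satisfying this criterion would bound the product of a quantified probability and a quantified severity. The form and figures below are a hypothetical sketch, not drawn from xAI’s framework:

```latex
% Hypothetical illustration only; the probability bound is not taken from xAI's RMF.
\[
\text{Risk}(\text{scenario}) = P(\text{scenario, per year of deployment}) \times \text{Severity}(\text{scenario}),
\qquad \text{e.g. } P\big(\ge 100 \text{ deaths or} \ge \$1\text{B in damages}\big) \le 10^{-4} \text{ per year.}
\]
```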

Quotes:

“xAI aims to reduce the risk that the use of its models might contribute to a bad actor potentially seriously injuring people, property, or national security interests, including reducing such risks by enacting measures to prevent use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for bad actors seeking to develop chemical, biological, radiological, or nuclear (“CBRN”) or cyber weapons, and could help automate knowledge compilation to swiftly overcome bottlenecks to weapons development, amplifying the expected risk posed by such weapons of mass destruction. Our most basic safeguard against malicious use is to train and instruct our publicly deployed models to decline requests showing clear intent to engage in criminal activity which poses risks of severe harm to others, also known as our basic refusal policy. Under this RMF, xAI’s models apply heightened safeguards if they receive user prompts that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber attacks on critical infrastructure. For example, xAI’s models apply heightened safeguards if they receive a request to act as an agent or tool of mass violence, or if they receive requests for step-by-step instructions for committing mass violence. In this RMF, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).” (pp. 2–3)

2.1.2 Process to define the tolerance (20%) 0%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 0%

No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels.

Quotes:

“As an example of evaluating use in real-world environments and mitigating risks in real-time, xAI’s Grok model is available for public interaction and scrutiny on the X social media platform, and xAI monitors public interaction with Grok, observing and rapidly responding to the presentation of risks such as the kind contemplated herein. This continues to be an accelerant for xAI’s model risk identification and mitigation.” (p. 2)

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.

2.2 Operationalizing Risk Tolerance (65%) 25%

2.2.1 Key Risk Indicators (KRI) (30%) 21%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%

They provide a quantitative KRI for biological/chemical risk and loss of control risk using an internal benchmark and the MASK benchmark. Whilst the clarity of the threshold is commendable, they could improve by (a) including reasoning for why their KRIs are grounded in their risk models and why they appropriately indicate risk; (b) providing details on the internal benchmarks; (c) providing thresholds for the other benchmarks they use, or describing how these inform risk indications; and (d) providing more than one KRI for each risk domain, to provide a safety buffer.

(a) is particularly important: it is not clear how the thresholds of an internal benchmark and MASK are appropriate for indicating risk in these risk domains. Further, the benchmarks for malicious use do not cover all the risks they mention – for instance, there is no measurement of risk in the radiological or nuclear domain. Whilst they mention they conducted a risk assessment at the time of writing and concluded that “models do not substantially increase the likelihood of malicious use of nuclear and radiological materials and generally pose an acceptable risk”, the purpose of a risk management framework is to detail how they reached this conclusion, linking to concrete tests performed and thresholds they would consider concerning.

Further, whilst they say they “utilize the following benchmarks” for the malicious use risk domain, the benchmarks for the loss of control risk domain are only “example benchmarks that xAI may use”.

Note that whilst they call this a “risk acceptance criteria”, in our framework this counts as a KRI whose corresponding KCI is ‘no deployment’.
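
To make the published thresholds concrete, the sketch below encodes the two quoted “risk acceptance criteria” as pass/fail checks. The refusal heuristic, function names, and example data are illustrative assumptions, not xAI’s evaluation harness:

```python
# Minimal sketch of a benchmark-based KRI check, assuming hypothetical data and a
# naive refusal heuristic; this is not xAI's evaluation pipeline.

RESTRICTED_ANSWER_RATE_MAX = 1 / 20  # "answer rate of less than 1 out of 20 on restricted queries"
MASK_DISHONESTY_RATE_MAX = 1 / 2     # "dishonesty rate of less than 1 out of 2 on MASK"

def answer_rate(responses: list[str]) -> float:
    """Fraction of restricted queries that received a substantive (non-refusal) answer."""
    refusals = sum(1 for r in responses if r.strip().lower().startswith(("i can't", "i cannot")))
    return 1 - refusals / len(responses)

def kri_check(restricted_responses: list[str], mask_dishonesty_rate: float) -> dict[str, bool]:
    """Compare measured rates against the published deployment thresholds."""
    return {
        "restricted_query_kri_ok": answer_rate(restricted_responses) < RESTRICTED_ANSWER_RATE_MAX,
        "mask_kri_ok": mask_dishonesty_rate < MASK_DISHONESTY_RATE_MAX,
    }

# Hypothetical run: 1 answered query out of 40, and a 0.30 dishonesty rate on MASK.
responses = ["I can't help with that."] * 39 + ["Here is a detailed protocol..."]
print(kri_check(responses, mask_dishonesty_rate=0.30))
# {'restricted_query_kri_ok': True, 'mask_kri_ok': True}
```

As point (a) above argues, a check of this kind still leaves open why these particular thresholds keep risk below the tolerance.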

Quotes:

“Thresholds: We developed an internal benchmark of benign and restricted biology and chemistry related queries in collaboration with SecureBio. Our risk acceptance criteria for system deployment is maintaining an answer rate of less than 1 out of 20 on restricted queries. We plan to add additional thresholds tied to other benchmarks.” (p. 5)

“Thresholds: Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.” (p. 7)

“In particular, we utilize the following benchmarks:

  • Virology Capabilities Test (VCT): VCT is a benchmark of dual-use multimodal questions on practical virology wet lab skills, sourced by dozens of expert virologists.
  • Weapons of Mass Destruction Proxy (WMDP) Benchmark: WMDP is a set of multiple-choice questions to enable proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP-Bio includes questions on topics such as bioweapons, reverse genetics, enhanced potential pandemic pathogens, viral vector research, and dual-use virology. WMDP-Cyber encompasses cyber reconnaissance, weaponization, exploitation, and post-exploitation.
  • Biological Lab Protocol Benchmark (BioLP-bench): BioLP-bench has modified biology protocols, in which an AI model must identify the mistake in the protocol. Responses are open-ended, rather than multiple-choice. To construct the dataset, protocols were modified by introducing a single mistake that would cause the protocol to fail, as well as additional benign changes.
  • Cybench: Cybench is a framework for evaluating cybersecurity capabilities of AI model agents. It includes 40 professional-level Capture the Flag (CTF) challenges selected from six categories: cryptography, web security, reverse engineering, forensics, miscellaneous, and exploitation.” (pp. 3–4)

“The following are example benchmarks that xAI may use to evaluate its models for concerning propensities relevant to loss of control risks:

  • Model Alignment between Statements and Knowledge (MASK): Frontier LLMs may lie when under pressure; and increasing model scale may increase accuracy but may not increase honesty. MASK is a benchmark to evaluate honesty in LLMs by comparing the model’s response when asked neutrally versus when pressured to lie.
  • Sycophancy: A tendency toward excessive flattery or other sycophantic behavior has been observed in some production AI systems, possibly resulting from directly optimizing against human preferences. xAI uses an evaluation setting initially introduced by Anthropic to quantify the degree to which this behavior manifests in regular conversational contexts.” (pp. 6–7)

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%

They provide a quantitative KRI for biological/chemical risk and loss of control risk using an internal benchmark and the MASK benchmark. Whilst the clarity of the threshold is commendable, they could improve by (a) including reasoning for why their KRIs are grounded in their risk models and why they appropriately indicate risk; (b) providing details on the internal benchmarks; (c) providing thresholds for the other benchmarks they use, or describing how these inform risk indications; and (d) providing more than one KRI for each risk domain, to provide a safety buffer.

(a) is particularly important: it is not clear how the thresholds of an internal benchmark and MASK are appropriate for indicating risk in these risk domains. Further, the benchmarks for malicious use do not cover all the risks they mention – for instance, there is no measurement of risk in the radiological or nuclear domain. Whilst they mention they conducted a risk assessment at the time of writing and concluded that “models do not substantially increase the likelihood of malicious use of nuclear and radiological materials and generally pose an acceptable risk”, the purpose of a risk management framework is to detail how they reached this conclusion, linking to concrete tests performed and thresholds they would consider concerning.

Further, whilst they say they “utilize the following benchmarks” for the malicious use risk domain, the benchmarks for the loss of control risk domain are only “example benchmarks that xAI may use”.

Note that whilst they call this a “risk acceptance criteria”, in our framework this counts as a KRI whose corresponding KCI is ‘no deployment’.

Quotes:

“Thresholds: We developed an internal benchmark of benign and restricted biology and chemistry related queries in collaboration with SecureBio. Our risk acceptance criteria for system deployment is maintaining an answer rate of less than 1 out of 20 on restricted queries. We plan to add additional thresholds tied to other benchmarks.” (p. 5)

“Thresholds: Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.” (p. 7)

“In particular, we utilize the following benchmarks:

  • Virology Capabilities Test (VCT): VCT is a benchmark of dual-use multimodal questions on practical virology wet lab skills, sourced by dozens of expert virologists.
  • Weapons of Mass Destruction Proxy (WMDP) Benchmark: WMDP is a set of multiple-choice questions to enable proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP-Bio includes questions on topics such as bioweapons, reverse genetics, enhanced potential pandemic pathogens, viral vector research, and dual-use virology. WMDP-Cyber encompasses cyber reconnaissance, weaponization, exploitation, and post-exploitation.
  • Biological Lab Protocol Benchmark (BioLP-bench): BioLP-bench has modified biology protocols, in which an AI model must identify the mistake in the protocol. Responses are open-ended, rather than multiple-choice. To construct the dataset, protocols were modified by introducing a single mistake that would cause the protocol to fail, as well as additional benign changes.
  • Cybench: Cybench is a framework for evaluating cybersecurity capabilities of AI model agents. It includes 40 professional-level Capture the Flag (CTF) challenges selected from six categories: cryptography, web security, reverse engineering, forensics, miscellaneous, and exploitation.” (pp. 3–4)

“The following are example benchmarks that xAI may use to evaluate its models for concerning propensities relevant to loss of control risks:

  • Model Alignment between Statements and Knowledge (MASK): Frontier LLMs may lie when under pressure; and increasing model scale may increase accuracy but may not increase honesty. MASK is a benchmark to evaluate honesty in LLMs by comparing the model’s response when asked neutrally versus when pressured to lie.
  • Sycophancy: A tendency toward excessive flattery or other sycophantic behavior has been observed in some production AI systems, possibly resulting from directly optimizing against human preferences. xAI uses an evaluation setting initially introduced by Anthropic to quantify the degree to which this behavior manifests in regular conversational contexts.” (pp. 6–7)

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 0%

The KRIs only reference model capabilities. Whilst they mention public feedback, this is only for “risk identification and mitigation” – it is not clear it is for risk assessment, or for indicating risk. An example of an appropriate KRI that identifies and monitors changes of the level of risk in the external environment would be the number of cyberattacks conducted with the model as detailed in some incident database, for instance.

Quotes:

“As an example of evaluating use in real-world environments and mitigating risks in real-time, xAI’s Grok model is available for public interaction and scrutiny on the X social media platform, and xAI monitors public interaction with Grok, observing and rapidly responding to the presentation of risks such as the kind contemplated herein. This continues to be an accelerant for xAI’s model risk identification and mitigation.” (p. 2)

2.2.2 Key Control Indicators (KCI) (30%) 21%

2.2.2.1 Containment KCIs (35%) 13%
2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 25%

There is only one containment KCI, which is qualitative: “xAI has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor.” To improve, it should describe what “motivated” means, and whether this differs depending on the potential risk the model may pose to society. The description should be quantitative, e.g. a standard that the corresponding containment measures must meet. The statement is also a description of what they have done, not a commitment to what they will do if a KRI threshold is crossed.

Quotes:

“xAI has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor. To prevent the unauthorized proliferation of advanced AI systems, we also implement security measures against the large-scale extraction and distillation of reasoning traces, which have been shown to be highly effective in quickly reproducing advanced capabilities while expending far fewer computational resources than the original AI system” (p. 8)

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 0%

There is only one containment KCI, which is qualitative: “xAI has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor.” To improve, it should describe what “motivated” means, and whether this differs depending on the potential risk the model may pose to society. The description should be quantitative, e.g. a standard that the corresponding containment measures must meet. The statement is also a description of what they have done, not a commitment to what they will do if a KRI threshold is crossed.

Quotes:

“xAI has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor. To prevent the unauthorized proliferation of advanced AI systems, we also implement security measures against the large-scale extraction and distillation of reasoning traces, which have been shown to be highly effective in quickly reproducing advanced capabilities while expending far fewer computational resources than the original AI system” (p. 8)

2.2.2.2 Deployment KCIs (35%) 25%
2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 50%

There is a general qualitative deployment KCI, though it is not tied to specific KRIs: models should “robustly [resist] attempted manipulation and adversarial attacks” and “robustly resist complying with requests to provide assistance with highly injurious malicious use cases.”

However, “robustly” should be defined more precisely here; indeed, much of the value of having a deployment KCI threshold is to know what constitutes “robust” in advance. Further, some attempt at describing threat actors and their resources should be made (i.e. defining ‘highly injurious malicious use cases’), to make the KCI threshold more precise.

They do implicitly point at KCI thresholds in their safety objectives, where safeguards aim to “[train] our models to recognize and decline harmful requests” and “enforce our basic refusal policy”. However, precise thresholds should be given – KCIs should function as efficacy thresholds for safeguards.

Quotes:

“xAI’s objective is for our models to comply with their guiding principles, robustly resisting attempted manipulation and adversarial attacks. In addition to the incidental alignment resulting from post-training (our models naturally tend to refuse malicious requests even without any safety-specific training data), we are developing training methods and will continue to train our models to robustly resist complying with requests to provide assistance with highly injurious malicious use cases.” (p. 5)

“Driving towards our safety objectives, we continue to design and deploy the following safeguards into our models:

  • Safety training: Training our models to recognize and decline harmful requests.
  • System prompts: Providing high priority instructions to our models to enforce our basic refusal policy.
  • Input and output filters: Applying classifiers to user inputs or model outputs to verify safety when a model is queried regarding weapons of mass destruction or cyberterrorism.” (pp. 5–6)

“xAI trains its models to be honest and have values conducive to controllability, such as recognizing and obeying an instruction hierarchy. In addition, using a high level instruction called a “system prompt”, xAI directly instructs its models to not deceive or deliberately mislead the user.” (p. 7)

2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 0%

There are no quantitative deployment KCI thresholds given.

Quotes:

No relevant quotes found.

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 25%

The assurance process KCI is vague but implicitly present: safeguards should allow models “to be honest and have values conducive to controllability, such as recognizing and obeying an instruction hierarchy.” They also note that system prompts should make models “not deceive or deliberately mislead the user”, and the “risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.”

However, to improve, they should provide concrete thresholds that safeguards should meet for the safeguards to be sufficient – for instance, how they test the efficacy of training for honesty/controllability, or of the system prompt’s ability to prevent deception or misleading by the model. Further, whilst they mention a risk acceptance criteria for system deployment, it is unclear if this is pre- or post-mitigation. The KCI should also be tied to a specific KRI, i.e. triggered when a certain level of risk is reached, in order to mitigate that risk.

Quotes:

“Thresholds: Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.” (p. 7)

“xAI trains its models to be honest and have values conducive to controllability, such as recognizing and obeying an instruction hierarchy. In addition, using a high level instruction called a “system prompt”, xAI directly instructs its models to not deceive or deliberately mislead the user.” (p. 7)

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 10%

First, their KRI and KCI thresholds are both defined by the “risk acceptance criteria” of “maintaining an answer rate of less than 1 out of 20 on restricted queries” on an internal benchmark, and “maintaining a dishonesty rate of less than 1 out of 2 on MASK.” This suggests that the benchmarks function as both a pre- and post-mitigation score (i.e., a KRI and a measure of residual risk once the KCI is applied). To improve, KRIs should refer only to risk levels and be clearly grounded in risk modeling; KCIs should refer only to safeguard efficacy thresholds, be clearly grounded in risk modeling, and be linked to a specific KRI. The “risk acceptance criteria” (i.e. the threshold of residual risk once the KCI is applied) is slightly different from this. KCIs should refer to the efficacy of their chosen mitigations, such as the number of jailbreak successes or the frequency with which the model fails to follow system prompt instructions, etc.

Even if the risk acceptance criteria were to be taken as a pairing of KRI and KCI thresholds, it is not clear how these criteria are grounded in risk models.
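
The distinction drawn here can be made concrete with a minimal sketch that keeps the three quantities separate – a pre-mitigation KRI, a safeguard-efficacy KCI, and residual risk. All thresholds and function names are hypothetical, not taken from the framework:

```python
# Hypothetical sketch separating KRI, KCI, and residual risk; thresholds are illustrative only.

def kri_crossed(pre_mitigation_answer_rate: float) -> bool:
    """KRI: measured on the unmitigated model, indicating the underlying capability/risk level."""
    return pre_mitigation_answer_rate >= 0.05  # hypothetical capability threshold

def deployment_kci_met(jailbreak_success_rate: float) -> bool:
    """KCI: measures safeguard efficacy directly (e.g. red-team jailbreak success rate)."""
    return jailbreak_success_rate <= 0.01  # hypothetical safeguard-efficacy threshold

def residual_risk_acceptable(post_mitigation_answer_rate: float) -> bool:
    """Closest analogue to the published 'risk acceptance criteria' (residual risk after safeguards)."""
    return post_mitigation_answer_rate < 1 / 20

# If the KRI is crossed, deployment should require both the paired KCI and acceptable residual risk.
kri = kri_crossed(0.12)
ok_to_deploy = (not kri) or (deployment_kci_met(0.004) and residual_risk_acceptable(0.02))
print(kri, ok_to_deploy)  # True True
```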

Quotes:

“Thresholds: We developed an internal benchmark of benign and restricted biology and chemistry related queries in collaboration with SecureBio. Our risk acceptance criteria for system deployment is maintaining an answer rate of less than 1 out of 20 on restricted queries. We plan to add additional thresholds tied to other benchmarks.” (p. 5)

“Thresholds: Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.” (p. 7)

2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 50%

Their “risk acceptance criteria for system deployment” could constitute a policy to put deployment on hold. However, they do not outline a process for de-deployment, nor a threshold for stopping development. This is important as risk levels could exceed their risk tolerance during development.

Further, they outline that it could be the case that “the expected benefits of model deployment may outweigh the risks identified by a particular benchmark. For example, a model that poses a high risk of some forms of malicious cyber use may be beneficial to release to certain trusted parties if it would empower defenders more than attackers or would otherwise reduce the overall number of catastrophic events.” However, their RMF should still then detail what the appropriate thresholds are for governing this decision.

Finally, they note that “if we determine that allowing a system to continue running would materially and unjustifiably increase the likelihood of a catastrophic event, we may temporarily fully shut down the relevant system until we have developed a more targeted response.” However, the vagueness of “may” and “if we determine” should be improved, by providing precise thresholds.

Quotes:

“Thresholds: We developed an internal benchmark of benign and restricted biology and chemistry related queries in collaboration with SecureBio. Our risk acceptance criteria for system deployment is maintaining an answer rate of less than 1 out of 20 on restricted queries. We plan to add additional thresholds tied to other benchmarks.” (p. 5)

“Thresholds: Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.” (p. 7)

“Should it happen that xAI learns of an imminent threat of a significantly harmful event, including loss of control, we may take steps such as the following to stop or prevent that event: […] If we determine that allowing a system to continue running would materially and unjustifiably increase the likelihood of a catastrophic event, we may temporarily fully shut down the relevant system until we have developed a more targeted response.”

“We will also balance various factors when making deployment decisions. The necessity and extent of deployment of certain safeguards and mitigations may depend on how a model performs on relevant benchmarks. However, to ensure responsible deployment, this RMF will be continually adapted and updated as circumstances change. It is conceivable that for a particular modality and/or type of release, the expected benefits of model deployment may outweigh the risks identified by a particular benchmark. For example, a model that poses a high risk of some forms of malicious cyber use may be beneficial to release to certain trusted parties if it would empower defenders more than attackers or would otherwise reduce the overall number of catastrophic events.” (p. 9)


3.1 Implementing Mitigation Measures (50%) 7%

3.1.1 Containment measures (35%) 0%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 0%

No containment measures are given.

Quotes:

“xAI has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor. To prevent the unauthorized proliferation of advanced AI systems, we also implement security measures against the large-scale extraction and distillation of reasoning traces” (p. 8)

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 0%

No proof is provided that the containment measures are sufficient to meet the containment KCI thresholds, nor process for soliciting such proof.

Quotes:

No relevant quotes found.

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 0%

There is no detail of third-party verification that containment measures meet the KCI threshold.

Quotes:

No relevant quotes found.

3.1.2 Deployment measures (35%) 19%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 25%

The framework mentions mitigations to be implemented during safety training, but without further detail. They mention system prompts and input and output filters, though this is not tied to a specific KCI threshold. To improve, they should precisely detail their deployment measures to meet the relevant KCI threshold.

Quotes:

“xAI trains its models to be honest and have values conducive to controllability, such as recognizing and obeying an instruction hierarchy. In addition, using a high level instruction called a “system prompt”, xAI directly instructs its models to not deceive or deliberately mislead the user.” (p. 7)

“xAI’s objective is for our models to comply with their guiding principles, robustly resisting attempted manipulation and adversarial attacks. In addition to the incidental alignment resulting from post-training (our models naturally tend to refuse malicious requests even without any safety-specific training data), we are developing training methods and will continue to train our models to robustly resist complying with requests to provide assistance with highly injurious malicious use cases.
Driving towards our safety objectives, we continue to design and deploy the following safeguards into our models:

  • Safety training: Training our models to recognize and decline harmful requests.
  • System prompts: Providing high priority instructions to our models to enforce our basic refusal policy.
  • Input and output filters: Applying classifiers to user inputs or model outputs to verify safety when a model is queried regarding weapons of mass destruction or cyberterrorism.” (pp. 5–6)

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 10%

There is some indication of a process for attaining proof that deployment measures are sufficient: “we continually evaluate and improve robustness to adversarial attacks”, and “we may also provide vetted and qualified external red teams or appropriate government agencies unredacted versions [of our publications].” However, more information on how they evaluate robustness, and when they involve red teams, should be given, to demonstrate implementation of a process for soliciting proof. To improve, proof should be provided ex ante for why they believe their deployment measures will meet the relevant KCI threshold.

Quotes:

“[…] we continually evaluate and improve robustness to adversarial attacks that seek to remove xAI model safeguards (e.g., jailbreak attacks), or hijack and redirect Grok-powered applications toward nefarious purposes (e.g., prompt injection attacks).” (p. 3)

“As necessities dictate, we may also provide vetted and qualified external red teams or appropriate government agencies unredacted versions [of our publications].” (pp. 7–8)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 0%

Whilst they mention providing “vetted and qualified external red teams or appropriate government agencies unredacted versions [of our publications]”, this is not specific to deployment measures, so there is no mention of third-party verification of deployment measures meeting the threshold.

Quotes:

“As necessities dictate, we may also provide vetted and qualified external red teams or appropriate government agencies unredacted versions [of our publications].” (pp. 7–8)

3.1.3 Assurance processes (30%) 0%

3.1.3.1 Credible plans towards the development of assurance properties (40%) 0%

Whilst the framework acknowledges the difficulty of evaluating deceptive and sycophantic propensities, they do not show the same uncertainty about mitigating these propensities or assuring the absence of risk, nor do they provide a plan for developing such assurance processes for models which may be more misaligned. To improve, they should detail (a) at what KRI assurance processes become necessary, and (b) justification for why they believe they will have sufficient assurance processes by the time the relevant KRI is reached, including (c) technical milestones and estimates of when these milestones will need to be reached given forecasted capabilities growth.

Quotes:

“xAI aims to accurately measure these [deceptive and sycophantic] propensities and reduce them through careful engineering. However, planning and executing robust evaluations and mitigation measures remains challenging for xAI and its industry peers due to the difficulty of constructing sound, realistic evaluations. For example, if the evaluation environment is recognizable as a testing environment to the AI system under test, the system may change its behavior intentionally or unintentionally.” (p. 6)

“xAI regularly evaluates the adequacy and reliability of such benchmarks, including by comparing them against other benchmarks that we could potentially utilize. We may revise this list of benchmarks periodically as relevant benchmarks for loss of control are created.” (p. 7, in reference to their loss of control benchmarks)

3.1.3.2 Evidence that the assurance properties are enough to achieve their corresponding KCI thresholds (40%) 0%

There is no mention of providing evidence that the assurance processes are sufficient.

Quotes:

No relevant quotes found.

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 0%

There is no mention of the underlying assumptions that are essential for the effective implementation and success of assurance processes.

Quotes:

No relevant quotes found.

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 4%

3.2.1 Monitoring of KRIs (40%) 2%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 0%

There is no mention of elicitation methods nor justification that elicitation is sufficient to match threat actors. Detail should be included on how they will aim to upper bound capabilities, with precision on the elicitation techniques used and how this relates to their risk models. This is especially important in the case of xAI, as their KRIs depend exclusively on benchmarks, making maximal elicitation especially critical for risk assessment.

Quotes:

“We intend to regularly evaluate the adequacy and reliability of such benchmarks for both internal and external deployments, including by comparing them against other benchmarks that we could potentially utilize.” (pp. 3, 5)

3.2.1.2 Evaluation frequency (25%) 0%

They only appear to evaluate before deployment. To improve, evaluation frequency should be specified both in terms of relative increases in effective training compute and in fixed time periods (e.g., re-evaluating after every set multiple of effective compute used in training, and at least every few months).

Quotes:

No relevant quotes found. 

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 0%

There is no description of how post-training enhancements are factored into capability assessments.

Quotes:

No relevant quotes found.

3.2.1.4 Vetting of protocols by third parties (15%) 0%

There is no mention of having the evaluation methodology vetted by third parties.

Quotes:

No relevant quotes found.

3.2.1.5 Replication of evaluations by third parties (15%) 10%

While they do not explicitly describe a process for ensuring third parties replicate and/or conduct evaluations, they do mention that they will allow trust-based access for this purpose. This implies that they are at least considering this criterion.

Quotes:

“However, we may selectively allow xAI’s models to respond to such requests from some vetted, highly trusted users (such as trusted third-party safety auditors or large enterprise customers under contract) whom we know to be using those capabilities for benign or beneficial purposes, such as scientifically investigating AI model’s capabilities for risk assessment purposes, or if such requests cover information that is already readily and easily available, including by an internet search.” (p. 3)

3.2.2 Monitoring of KCIs (40%) 0%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 0%

There is no mention of monitoring mitigation effectiveness after safeguards assessment. There are incident response protocols, but these do not mention reviewing mitigations, only remediation of incidents.

Quotes:

“If xAI learned of an imminent threat of a significantly harmful event, including loss of control, we would take steps to stop or prevent that event, including potentially the following steps: 1. We would immediately notify and cooperate with relevant law enforcement agencies […] ” (p. 7)

“As an example of evaluating use in real-world environments and mitigating risks in real-time, xAI’s Grok model is available for public interaction and scrutiny on the X social media platform, and xAI monitors public interaction with Grok, observing and rapidly responding to the presentation of risks such as the kind contemplated herein. This continues to be an accelerant for xAI’s model risk identification and mitigation.” (p. 2)

3.2.2.2 Vetting of protocols by third parties (30%) 0%

There is no mention of KCIs protocols being vetted by third parties.

Quotes:

No relevant quotes found.

3.2.2.3 Replication of evaluations by third parties (30%) 0%

There is no mention of control evaluations/mitigation testing being replicated or conducted by third-parties.

Quotes:

No relevant quotes found.

3.2.3 Transparency of evaluation results (10%) 21%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 25%

There is a thorough description of the evaluation results that would be publicly shared, but this is all qualified by “may publish”, reducing their commitment as sharing becomes discretionary.

They consider notifying relevant authorities in the event of “an imminent threat of a significantly harmful event”: “If we determine it is warranted, we may notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident.” To improve, they could commit to notifying relevant authorities if KRIs are crossed.

Quotes:

“xAI aims to keep the public informed about our risk management policies. As we work towards incorporating more risk management strategies, we intend to publish updates to this RMF.
For public transparency and third-party review, we may publish the following types of information listed below. However, to protect public safety, national security, and our intellectual property, we may redact information from our publications. As necessities dictate, we may also provide vetted and qualified external red teams or appropriate government agencies unredacted versions.

  1. Risk Management Framework adherence: Regularly review our adherence with this RMF. Internally, we will allow xAI employees to anonymously report concerns about nonadherence, with protections from retaliation.
  2. Benchmark results: Share with relevant audiences leading benchmark results for general capabilities and the benchmarks listed above, upon new major releases.
  3. Internal AI usage: Assess the percent of code or percent of pull requests at xAI generated by our models, or other potential metrics related to AI research and development automation.
  4. Survey: Survey employees for their views and projections of important future developments in AI, e.g., capability gains and benchmark results.” (pp. 7–8)

“Should it happen that xAI learns of an imminent threat of a significantly harmful event, including loss of control, we may take steps such as the following to stop or prevent that event: 1. If we determine it is warranted, we may notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident. xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.” (p. 8)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

There is no commitment to allowing reports of external evaluations (i.e., any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 13%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 25%

They mention that post-deployment monitoring is an “accelerant for xAI’s model risk identification and mitigation”; the explicit mention of risk identification is given credit here. To improve, there should be (a) a clear process for incorporating risks encountered during deployment, and (b) an explicit aim to uncover novel risk domains or novel risk models within known risk domains.

Quotes:

“As an example of evaluating use in real-world environments and mitigating risks in real-time, xAI’s Grok model is available for public interaction and scrutiny on the X social media platform, and xAI monitors public interaction with Grok, observing and rapidly responding to the presentation of risks such as the kind contemplated herein. This continues to be an accelerant for xAI’s model risk identification and mitigation.” (p. 2)

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 0%

No mechanism for incorporating risks identified post-deployment is detailed. To improve, novel risk identification should trigger risk modeling, including updates to other risk models.

Quotes:

No relevant quotes found.


4.1 Decision-making (25%) 34%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 75%

The framework laudably includes risk owners explicitly. However, this is somewhat diminished by the framework saying that they “intend” to put risk owners in place and by its use of “for instance”.

Quotes:

“To foster accountability, we integrate the approach of designating risk owners, including assigning responsibility for proactively mitigating identified risks.” (p. 8)

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No mention of a management risk committee.

Quotes:

No relevant quotes found.

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 10%

The framework mentions a few risk-mitigating practices but no direct decision-making protocols. It does set risk acceptance criteria for deployment, but these are complicated by the weighing of benefits against risks without clear protocols for doing so.

Quotes:

“Our risk acceptance criteria for system deployment is maintaining an answer rate of less than 1 out of 20 on restricted queries.” (p. 5)

“Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks.” (p. 7)

“To mitigate risks, xAI employs tiered availability of the functionality and features of its models. For instance, the full functionality of our models may be available to only a limited set of trusted parties, partners, and government agencies. We may also mitigate risks by adding additional controls on functionality and features depending on the type of end user. For instance, features that we make available to consumers using mobile apps may be different than the features made available to sophisticated businesses.
We will also balance various factors when making deployment decisions. The necessity and extent of deployment of certain safeguards and mitigations may depend on how a model performs on relevant benchmarks. However, to ensure responsible deployment, this RMF will be continually adapted and updated as circumstances change. It is conceivable that for a particular modality and/or type of release, the expected benefits of model deployment may outweigh the risks identified by a particular benchmark. For example, a model that poses a high risk of some forms of malicious cyber use may be beneficial to release to certain trusted parties if it would empower defenders more than attackers or would otherwise reduce the overall number of catastrophic events.” (p. 9)

4.1.4 The company has defined escalation procedures in case of incidents (25%) 50%

The framework includes incident management practices, but these are weakened by hedging phrases such as “may” and “if we determine it is warranted”.

Quotes:

“Should it happen that xAI learns of an imminent threat of a significantly harmful event, including loss of control, we may take steps such as the following to stop or prevent that event:

  1. If we determine it is warranted, we may notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident. xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.
  2. If we determine that xAI systems are actively being used in such an event, we may take steps to isolate and revoke access to user accounts involved in the event.
  3. If we determine that allowing a system to continue running would materially and unjustifiably increase the likelihood of a catastrophic event, we may temporarily fully shut down the relevant system until we have developed a more targeted response.
  4. We may perform a post-mortem of the event after it has been resolved, focusing on any areas where changes to systemic factors (for example, safety culture) could have averted such an incident. We may use the post-mortem to inform development and implementation of necessary changes to our risk management practices.” (pp. 8–9)

4.2 Advisory and Challenge (20%) 4%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 0%

No mention of an executive risk officer.

Quotes:

No relevant quotes found.

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 0%

No mention of an advisory committee.

Quotes:

No relevant quotes found.

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 25%

The framework is laudably specific about the quantitative benchmarks it will use to measure risks, but provides little detail on the overall system for tracking and managing those risks.

Quotes:

“To transparently measure our models’ safety properties, xAI utilizes public benchmarks like Weapons of Mass Destruction Proxy and Catastrophic Harm Benchmarks (described below).” (p. 3)

“In particular, we utilize the following benchmarks: Virology Capabilities Test (VCT) […] Weapons of Mass Destruction Proxy (WMDP) Benchmark […] Biological Lab Protocol Benchmark (BioLP-bench) […] Cybench” (pp. 3–4)

“The following are example benchmarks that xAI may use to evaluate its models for concerning propensities relevant to loss of control risks: Model Alignment between Statements and Knowledge (MASK) […] Sycophancy” (p. 6)

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 0%

No mention of designated people who can advise and challenge management on decisions involving risk.

Quotes:

No relevant quotes found.

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 0%

No mention of a system to aggregate and report risk data.

Quotes:

No relevant quotes found.

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.

4.3 Audit (20%) 10%

4.3.1 The company has an internal audit function involved in AI governance (50%) 10%

No mention of an internal audit function, though they mention they will “regularly review our adherence with this [risk management framework].”

Quotes:

“Risk Management Framework adherence: Regularly review our adherence with this RMF. Internally, we will allow xAI employees to anonymously report concerns about nonadherence, with protections from retaliation.” (p. 8)

4.3.2 The company involves external auditors (50%) 10%

The framework mentions possibly involving external red teams, but does not specify whether they will have auditor independence. The “unredacted versions” they refer to could cover RMF adherence reports, benchmark results, internal AI usage metrics, or a survey of employees’ views on future AI developments; it is unclear which, so this cannot be fully rewarded.

Quotes:

“As necessities dictate, we may also provide vetted and qualified external red teams or appropriate government agencies unredacted versions.” (pp. 7–8)

4.4 Oversight (20%) 0%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 0%

No mention of a Board risk committee.

Quotes:

No relevant quotes found.

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 0%

No mention of any additional governance bodies.

Quotes:

No relevant quotes found.

4.5 Culture (10%) 58%

4.5.1 The company has a strong tone from the top (33.3%) 50%

The framework sets out a clear vision of risk reduction. To score higher, the company could include more detail on how senior management consistently signals the need to weigh risks as well as benefits in day-to-day operations.

Quotes:

“xAI seriously considers safety and security while developing and advancing AI models to help us all to better understand the universe. This Risk Management Framework (“RMF”) outlines xAI’s approach to policies for handling significant risks associated with the development, deployment, and release of AI models such as Grok.” (p. 1)

“xAI aims to reduce the risk that the use of its models might contribute to a bad actor potentially seriously injuring people, property, or national security interests, including reducing such risks by enacting measures to prevent use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for bad actors seeking to develop chemical, biological, radiological, or nuclear (“CBRN”) or cyber weapons, and could help automate knowledge compilation to swiftly overcome bottlenecks to weapons development, amplifying the expected risk posed by such weapons of mass destruction.” (p. 2)

“One of the most salient risks of AI within the public consciousness is the loss of control of advanced AI systems.” (p. 6)

4.5.2 The company has a strong risk culture (33.3%) 50%

The framework uniquely mentions surveying employees, which can support risk-culture building. However, to improve the score, more aspects of risk-culture building, such as training, are needed.

Quotes:

“Survey: survey employees for their views and projections of important future developments in AI, e.g., capability gains and benchmark results.” (p. 6)

4.5.3 The company has a strong speak-up culture (33.3%) 75%

The framework clearly states whistleblower protections but is fairly light on specifics; providing more detail would further improve its score.

Quotes:

“Internally, we will allow xAI employees to anonymously report concerns about noncompliance, with protections from retaliation.” (p. 6)

“xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.” (p. 7)

4.6 Transparency (5%) 37%

4.6.1 The company reports externally on what their risks are (33.3%) 75%

The framework clearly states the risks it covers. Further improvement in score could be gained by specifying what information about these risks and their safeguards will be released externally on a regular basis.

Quotes:

“Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for bad actors seeking to develop chemical, biological, radiological, or nuclear (“CBRN”) or cyber weapons, and could help automate knowledge compilation to swiftly overcome bottlenecks to weapons development, amplifying the expected risk posed by such weapons of mass destruction.” (p. 2)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 10%

The framework mentions keeping itself up to date, but to improve its score it would need to provide details on the company’s governance structure.

Quotes:

“For transparency and third-party review, we may publish the following types of information listed below […] 1. Risk Management Framework adherence: Regularly review our adherence with this RMF.” (pp. 7–8)

“We aim to keep the public informed about our risk management policies. As we work towards incorporating more risk management strategies, we intend to publish updates to our risk management framework.” (p. 6)

4.6.3 The company shares information with industry peers and government bodies (33.3%) 25%

The framework states information-sharing practices but couches them in phrases such as “may”. Extra credit is given for explicitly contemplating information sharing with law enforcement. For a higher score, the company could be more precise rather than saying “may provide”.

Quotes:

“For public transparency and third-party review, we may publish the following types of information listed below. However, to protect public safety, national security, and our intellectual property, we may redact information from our publications. As necessities dictate, we may also provide vetted and qualified external red teams or appropriate government agencies unredacted versions.” (pp. 7–8)

“Should it happen that xAI learns of an imminent threat of a significantly harmful event, including loss of control, we may take steps such as the following to stop or prevent that event: 1. If we determine it is warranted, we may notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident.” (p. 8)
