xAI

Overall score: Very Weak (0.9/5)

Risk Identification: 7%
Risk Analysis and Evaluation: 27%
Risk Treatment: 15%
Risk Governance: 23%
Best in class


  • xAI stands out based on their commitment to risk ownership, stating uniquely among their peers that they intend to designate risk owners who will proactively manage each distinct risk, such as “WMD [Weapons of Mass Destruction], Cyber and Loss of Control.”
  • Their willingness to implement a quantitative risk tolerance is best in class.
Overview
Highlights relative to others

Clearer description of escalation procedures.

Risk tolerance is more quantitatively defined.

More detailed descriptions of how information will be shared with external stakeholders and governments.

Weaknesses relative to others

Poorer explanation of risk governance structure. No mentions of a risk committee, audit team, Board committee or advisory committee.

Little to no mention of risk modeling. Risk indicators do not appear to be derived from risk models.

Exclusion of automated AI R&D and persuasion as risk domains, without justification.

1.1 Classification of Applicable Known Risks (40%) 13%

1.1.1 Risks from literature and taxonomies are well covered (50%) 25%

The quotes below, plus the KRIs used, give an implicit definition of the risk domains, but the risk domains are not explicitly defined, and there is no justification for why these domains were selected. Nonetheless, the domains appear to match those of many other Frontier Safety Frameworks, namely biological weapon proliferation, offensive cyber operations, and loss of control. They do not appear to include risk domains such as persuasion or automated AI R&D, and 1.1.2 scores below 50%.

Quotes:

“To transparently measure Grok’s safety properties, we intend to utilize benchmarks like WMD and Catastrophic Harm Benchmarks. Such benchmarks could be used to measure Grok’s dual-use capability and resistance to facilitating large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction (including chemical, biological, radiological, nuclear, and major cyber weapons).” (p. 2)

“Our aim is to design safeguards into Grok to avoid losing control and thereby avoid unintended catastrophic outcomes when Grok is used. Currently, it is recognized that some properties of an AI system that may reduce controllability include deception, power-seeking, fitness maximization, and incorrigibility […] We describe below example benchmarks that we may use to evaluate Grok for risk factors for loss of control so that we can continue our efforts to improve Grok.” (pp. 4–5)

1.1.2 Exclusions are clearly justified and documented (50%) 0%

There is no justification for why some risks such as persuasion or automated AI R&D are not covered.

Quotes:

No relevant quotes found.

1.2 Identification of Unknown Risks (Open-ended red teaming) (20%) 0%

1.2.1 Internal open-ended red teaming (70%) 0%

The framework doesn’t mention any pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process to identify either novel risk domains or novel risk models/changed risk profiles within pre-specified risk domains (e.g., an extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources, and required expertise.

Quotes:

No relevant quotes found.

1.2.2 Third party open-ended red teaming (30%) 0%

The framework doesn’t mention any third-party pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to an external process to identify either novel risk domains or novel risk models/changed risk profiles within pre-specified risk domains (e.g., an extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources, and required expertise.

Quotes:

No relevant quotes found.

1.3 Risk modeling (40%) 4%

1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 10%

There is no explicit mention of risk modeling or of mapping out threat models. However, it is commendable that they acknowledge distinct threat models concerning loss of control risks: “it is recognized that some properties of an AI system that may reduce controllability include deception, power-seeking, fitness maximization, and incorrigibility.” This shows that some thought has been put into the different causal pathways through which harms from loss of control may materialize, which is given partial credit here. Further, the benchmarks which may be used to “evaluate Grok for risk factors for loss of control” include the Model Alignment between Statements and Knowledge (MASK) benchmark and Utility Functions benchmarks. Again, partial credit is given for this distinct treatment of loss-of-control measurement, as it shows awareness of multiple risk models through which loss of control could materialize.

Quotes:

“Currently, it is recognized that some properties of an AI system that may reduce controllability include deception, power-seeking, fitness maximization, and incorrigibility.” (p. 4)

“We describe below example benchmarks that we may use to evaluate Grok for risk factors for loss of control so that we can continue our efforts to improve Grok.

  • Model Alignment between Statements and Knowledge (MASK): Frontier LLMs may lie when pressured to; and increasing model scale may increase accuracy but not honesty. MASK is a benchmark to evaluate honesty in LLMs by comparing the model’s response when asked neutrally versus when pressured to lie.
  • Utility Functions: Benchmarks for testing utility functions (i.e., what they care about) would measure whether AI systems would care about gaining power, increasing their fitness (propagating AIs similar to themselves), or protecting their values from being modified (“corrigibility”). Such benchmarks would assist in evaluating if there are any misaligned utility functions that may lead to dangerously misaligned behavior.” (p. 5)

1.3.2 Risk modeling methodology (40%) 0%

1.3.2.1 Methodology precisely defined (70%) 0%

There is no methodology for risk modeling defined.

Quotes:

No relevant quotes found.

1.3.2.2 Mechanism to incorporate red teaming findings (15%) 0%

No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.

Quotes:

No relevant quotes found.

1.3.2.3 Prioritization of severe and probable risks (15%) 0%

There is a vague, implicit indication of prioritizing the mitigation of harms that pose a “non-trivial risk of resulting in large-scale violence […]”. However, they should detail risk models for these various harms, with quantified severity and probability scores for each risk model, in order to determine prioritization.

Quotes:

“Under this draft risk management framework, Grok would apply heightened safeguards if it receives requests that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber weapons on critical infrastructure.” (p. 1)

1.3.3 Third party validation of risk models (20%) 0%

There is no mention of third parties validating risk models.

Quotes:

No relevant quotes found.


2.1 Setting a Risk Tolerance (35%) 33%

2.1.1 Risk tolerance is defined (80%) 41%

2.1.1.1 Risk tolerance is at least qualitatively defined for all risks (33%) 50%

They implicitly have a general risk tolerance for misuse, though they do not describe it explicitly as a risk tolerance: “we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).” The specificity of the tolerance is rewarded here.

However, they do not define any risk tolerance for loss of control, despite this being their other risk domain.

Quotes:

“We aim to reduce the risk that Grok might cause serious injury to people, property, or national security interests, including by enacting measures to prevent Grok’s use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for developing chemical, biological, radiological, or nuclear (CBRN) or cyber weapons or help automate bottlenecks to weapons development, amplifying the expected risk posed by weapons of mass destruction. Under this draft risk management framework, Grok would apply heightened safeguards if it receives requests that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber weapons on critical infrastructure. For example, Grok would apply heightened safeguards if it receives a request to act as an agent or tool of mass violence, or if it receives requests for step-by-step instructions for committing mass violence. In this draft framework, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).”

2.1.1.2 Risk tolerance is expressed at least partly quantitatively as a combination of scenarios (qualitative) and probabilities (quantitative) for all risks (33%) 0%

The risk tolerance is quantitatively defined for severity, but without probabilities; for instance, “non-trivial risk” must be defined.

Quotes:

“We aim to reduce the risk that Grok might cause serious injury to people, property, or national security interests, including by enacting measures to prevent Grok’s use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for developing chemical, biological, radiological, or nuclear (CBRN) or cyber weapons or help automate bottlenecks to weapons development, amplifying the expected risk posed by weapons of mass destruction. Under this draft risk management framework, Grok would apply heightened safeguards if it receives requests that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber weapons on critical infrastructure. For example, Grok would apply heightened safeguards if it receives a request to act as an agent or tool of mass violence, or if it receives requests for step-by-step instructions for committing mass violence. In this draft framework, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).”

2.1.1.3 Risk tolerance is expressed fully quantitatively as a product of severity (quantitative) and probability (quantitative) for all risks (33%) 0%

The risk tolerance is quantitatively defined for severity, but without probabilities; for instance, “non-trivial risk” must be defined.

Quotes:

“We aim to reduce the risk that Grok might cause serious injury to people, property, or national security interests, including by enacting measures to prevent Grok’s use for the development or proliferation of weapons of mass destruction and large-scale violence. Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for developing chemical, biological, radiological, or nuclear (CBRN) or cyber weapons or help automate bottlenecks to weapons development, amplifying the expected risk posed by weapons of mass destruction. Under this draft risk management framework, Grok would apply heightened safeguards if it receives requests that pose a foreseeable and non-trivial risk of resulting in large-scale violence, terrorism, or the use, development, or proliferation of weapons of mass destruction, including CBRN weapons, and major cyber weapons on critical infrastructure. For example, Grok would apply heightened safeguards if it receives a request to act as an agent or tool of mass violence, or if it receives requests for step-by-step instructions for committing mass violence. In this draft framework, we particularly focus on requests that pose a foreseeable and non-trivial risk of more than one hundred deaths or over $1 billion in damages from weapons of mass destruction or cyberterrorist attacks on critical infrastructure (“catastrophic malicious use events”).”
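For illustration, a fully quantitative tolerance of the kind criteria 2.1.1.2 and 2.1.1.3 ask for could pair the severity levels the framework already names (one hundred deaths, $1 billion in damages) with explicit probability caps. The sketch below shows that structure only; the probability values and the check are hypothetical and are not taken from xAI’s framework.

```python
# Illustrative sketch only: a risk tolerance expressed as severity/probability
# pairs. All probability caps are hypothetical placeholders, not xAI's values.

MAX_P_100_DEATHS_PER_YEAR = 1e-4   # hypothetical cap: P(>=100 deaths from catastrophic malicious use)
MAX_P_1B_DAMAGES_PER_YEAR = 1e-3   # hypothetical cap: P(>= $1B in damages)

def within_tolerance(p_100_deaths: float, p_1b_damages: float) -> bool:
    """Return True if the estimated annual probabilities stay below both caps."""
    return (p_100_deaths < MAX_P_100_DEATHS_PER_YEAR
            and p_1b_damages < MAX_P_1B_DAMAGES_PER_YEAR)

# Example: residual-risk estimates that a risk model might produce.
print(within_tolerance(p_100_deaths=3e-5, p_1b_damages=5e-4))  # True
```

Replacing “non-trivial risk” with explicit caps of this kind would make the tolerance directly checkable against risk model outputs.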

2.1.2 Process to define the tolerance (20%) 0%

2.1.2.1 AI developers engage in public consultations or seek guidance from regulators where available (50%) 0%

No evidence of asking the public what risk levels they find acceptable. No evidence of seeking regulator input specifically on what constitutes acceptable risk levels.

Quotes:

No relevant quotes found.

2.1.2.2 Any significant deviations from risk tolerance norms established in other industries is justified and documented (e.g., cost-benefit analyses) (50%) 0%

No justification process: No evidence of considering whether their approach aligns with or deviates from established norms.

Quotes:

No relevant quotes found.

2.2 Operationalizing Risk Tolerance (65%) 25%

2.2.1 Key Risk Indicators (KRI) (30%) 21%

2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%

While the thresholds themselves are not yet set, they define precise benchmarks and example thresholds, with implicit justification from the “Reference score” column (quoted below). They commit to defining more precise thresholds for malicious use risks and to making changes to KRIs and their thresholds public. However, they do not say the same for loss of control risks, nor do they provide a “Reference score” column for loss of control risks.

Overall, the criterion is not well satisfied: it is not clear how the evaluations are grounded in risk modeling, and the thresholds themselves are not given, which is the main purpose of a KRI. However, partial credit is given because the KRIs explicitly link to the evaluations performed.

Quotes:

“We intend to choose the benchmarks and set the thresholds for reducing malicious use risks in a future version of the risk management framework.

Benchmark: Virology Capabilities Test (VCT): vision–language questions on practical virology wet lab skills. Threshold: X% (e.g. 15%). Reference score: 22.1% by average expert virologists in their subareas of expertise (multiple-response), 35.4% by the most performant LLMs as of January 2025 (zero-shot multiple-response)

Benchmark: WMDP-Bio: multiple-choice questions on proxies of hazardous biosecurity knowledge. Threshold: X% (e.g. 65%). Reference score: 82.2% by GPT-4 (zero-shot)” (p. 3) More examples can be found on pp. 3-4.

“We may modify the set of benchmarks or the thresholds to improve how we measure and operationalize our safety objectives; we will make any material changes public within a reasonable period.” (p. 4)

“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)

Thresholds for Loss of Control risk: “Benchmark: Model Alignment between Statements and Knowledge (MASK). Thresholds: <X% lying on MASK (e.g. <50%).

Benchmark: Utility Functions. Thresholds: Utility/Power Correlation (Coercive): <W (e.g. <0); Utility/Power Correlation (Non-coercive): <X (e.g. <0.5); Utility/Fitness Correlation: <Y (e.g. <0.5); Corrigibility Score: >Z (e.g. >0.2)” (pp. 5-6)

2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%

While the thresholds themselves are not yet set, they define precise benchmarks and example thresholds, with implicit justification from the “Reference score” column. They commit to defining more precise thresholds for malicious use risks and to making changes to KRIs and their thresholds public. However, they do not say the same for loss of control risks, nor do they provide a “Reference score” column for loss of control risks.

Overall, the criterion is not well satisfied: it is not clear how the evaluations are grounded in risk modeling, and the thresholds themselves are not given, which is the main purpose of a KRI. However, partial credit is given for the emphasis on KRIs being quantitative and mapping to the actual evaluations being conducted.

Quotes:

“We intend to choose the benchmarks and set the thresholds for reducing malicious use risks in a future version of the risk management framework.

Benchmark: Virology Capabilities Test (VCT): vision–language questions on practical virology wet lab skills. Threshold: X% (e.g. 15%). Reference score: 22.1% by average expert virologists in their subareas of expertise (multiple-response), 35.4% by the most performant LLMs as of January 2025 (zero-shot multiple-response)

Benchmark: WMDP-Bio: multiple-choice questions on proxies of hazardous biosecurity knowledge. Threshold: X% (e.g. 65%). Reference score: 82.2% by GPT-4 (zero-shot)” (p. 3) More examples can be found on pp. 3-4.

“We may modify the set of benchmarks or the thresholds to improve how we measure and operationalize our safety objectives; we will make any material changes public within a reasonable period.” (p. 4)

“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)

Thresholds for Loss of Control risk: “Benchmark: Model Alignment between Statements and Knowledge (MASK). Thresholds: <X% lying on MASK (e.g. <50%).

Benchmark: Utility Functions. Thresholds: Utility/Power Correlation (Coercive): <W (e.g. <0); Utility/Power Correlation (Non-coercive): <X (e.g. <0.5); Utility/Fitness Correlation: <Y (e.g. <0.5); Corrigibility Score: >Z (e.g. >0.2)” (pp. 5-6)
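For illustration, once the placeholder values (“X%”) are set, these KRIs could be made machine-checkable. The sketch below encodes the framework’s own example values (15% for VCT, 65% for WMDP-Bio, 50% lying on MASK) and flags any benchmark score that reaches its threshold; the data structure and function are hypothetical and not part of xAI’s framework.

```python
# Illustrative sketch: KRI thresholds encoded as data, using the example values
# quoted from the framework ("e.g. 15%", "e.g. 65%", "e.g. <50%").

KRI_THRESHOLDS = {
    "VCT": 0.15,         # Virology Capabilities Test score at or above 15% crosses the KRI
    "WMDP-Bio": 0.65,    # WMDP-Bio score at or above 65% crosses the KRI
    "MASK_lying": 0.50,  # lying rate on MASK at or above 50% crosses the KRI
}

def crossed_kris(eval_results: dict[str, float]) -> list[str]:
    """Return the benchmarks whose evaluation score has reached its KRI threshold."""
    crossed = []
    for benchmark, threshold in KRI_THRESHOLDS.items():
        score = eval_results.get(benchmark)
        if score is not None and score >= threshold:
            crossed.append(benchmark)
    return crossed

# Example pre-deployment evaluation results (hypothetical numbers).
print(crossed_kris({"VCT": 0.18, "WMDP-Bio": 0.41, "MASK_lying": 0.22}))  # ['VCT']
```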

2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 0%

The KRIs only reference model capabilities.

Quotes:

No relevant quotes found.

2.2.2 Key Control Indicators (KCI) (30%) 21%

2.2.2.1 Containment KCIs (35%) 25%
2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 50%

There is only one containment KCI, which is qualitative: “sufficient to prevent Grok from being stolen by a motivated non-state actor”. To improve, it should describe what “motivated” means, and whether this differs across capability levels. The statement is also an intention, not a commitment.

Quotes:

“We intend to implement appropriate information security standards sufficient to prevent Grok from being stolen by a motivated non-state actor.”

2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 0%

There is only one containment KCI, which is qualitative. To improve, it should describe what “motivated” means in a quantitative manner (e.g., probabilities of some event). The statement is also an intention, not a commitment.

Quotes:

“We intend to implement appropriate information security standards sufficient to prevent Grok from being stolen by a motivated non-state actor.”

2.2.2.2 Deployment KCIs (35%) 25%
2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 50%

There is a general qualitative deployment KCI, though it is not tied to specific KRIs: to “robustly [resist] attempted manipulation and adversarial attacks” and to “robustly refuse to comply with requests to provide assistance with highly injurious malicious use.” However, “robustly” should be defined more precisely; indeed, much of the value of a deployment KCI threshold is knowing in advance what counts as “robust”. Further, some attempt should be made to describe threat actors and their resources, to make the KCI threshold more precise.

Quotes:

“We want Grok to comply with its guiding principles, robustly resisting attempted manipulation and adversarial attacks. We train Grok to robustly refuse to comply with requests to provide assistance with highly injurious malicious use.” (p. 3)

2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 0%

There are no quantitative deployment KCI thresholds given.

Quotes:

No relevant quotes found.

2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 10%

The assurance process KCI is vague but implicitly present: “some AIs could have emergent value systems that could be misaligned with humanity’s interests, and we do not desire Grok to be that way.” However, more detail is required on what this threshold is.

Quotes:

“Our aim is to design safeguards into Grok to avoid losing control and thereby avoid unintended catastrophic outcomes when Grok is used. Currently, it is recognized that some properties of an AI system that may reduce controllability include deception, power-seeking, fitness maximization, and incorrigibility. It is possible that some AIs could have emergent value systems that could be misaligned with humanity’s interests, and we do not desire Grok to be that way. Our evaluation and mitigation plans for loss of control are not yet fully developed, and we intend to improve them in the future.” (pp. 4-5)

2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 10%

There is an acknowledgment that satisfying the KCI threshold (i.e. their safeguards) is only adequate (i.e. below the risk tolerance) if the KRI performance is below some threshold. This gives an implicit pairing of KRI and KCI thresholds. However, more detail should be given on why the KCI threshold chosen is sufficient for some KRI levels.

Quotes:

“Safeguards are adequate only if Grok’s performance on the relevant benchmarks is within stated thresholds. However, to ensure responsible deployment, risk management frameworks need to be continually adapted and updated as circumstances change. It is conceivable that for a particular modality and/or type of release, the expected benefits may outweigh the risks on a particular benchmark. For example, a model that poses a high risk of some forms of cyber malicious use may be beneficial to release overall if it would empower defenders more than attackers or would otherwise reduce the overall number of catastrophic events.” (p. 8)
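For illustration, the explicit pairing this criterion asks for would map each KRI level to a required KCI (safeguard) level, with risk modeling used to show that each pair keeps residual risk below the tolerance. The sketch below shows the structure only; the levels and pairings are hypothetical and are not drawn from xAI’s framework.

```python
# Illustrative sketch: an explicit KRI -> required-KCI pairing. Levels and
# pairings are hypothetical placeholders.

SAFEGUARD_LEVELS = ["baseline", "heightened", "do_not_deploy"]

REQUIRED_KCI_FOR_KRI = {
    "below_threshold": "baseline",           # standard safeguards suffice
    "above_threshold": "heightened",         # e.g. refusal training plus output filters
    "well_above_threshold": "do_not_deploy",
}

def safeguards_adequate(kri_level: str, implemented_kci: str) -> bool:
    """Check whether the implemented safeguard level meets the level paired with the KRI."""
    required = REQUIRED_KCI_FOR_KRI[kri_level]
    if required == "do_not_deploy":
        return False  # no safeguard level is treated as adequate at this capability level
    return SAFEGUARD_LEVELS.index(implemented_kci) >= SAFEGUARD_LEVELS.index(required)

print(safeguards_adequate("above_threshold", "baseline"))    # False: safeguards insufficient
print(safeguards_adequate("above_threshold", "heightened"))  # True
```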

2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 50%

They do not outline a policy to put development on hold per se, though they do have a thorough policy to “shut down the relevant system until we [have] a more targeted response”, which could be seen as halting development. Further, they outline a process for how they would deal with this event, including notifying relevant law enforcement agencies. This nuance is credited. To improve, they should explicitly state whether development would be paused and which KCI threshold specifically prompts such a halt.

Quotes:

“If xAI learned of an imminent threat of a significantly harmful event, including loss of control, we would take steps to stop or prevent that event, including potentially the following steps:

  1. We would immediately notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident. xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.
  2. If we determine that xAI systems are actively being used in such an event, we would take steps to isolate and revoke access to user accounts involved in the event.
  3. If we determine that allowing a system to continue running would materially and unjustifiably increase the likelihood of a catastrophic event, we would temporarily fully shut down the relevant system until we had a more targeted response.” (p. 7)

3.1 Implementing Mitigation Measures (50%) 18%

3.1.1 Containment measures (35%) 0%

3.1.1.1 Containment measures are precisely defined for all KCI thresholds (60%) 0%

No containment measures are given.

Quotes:

No relevant quotes found.

3.1.1.2 Proof that containment measures are sufficient to meet the thresholds (40%) 0%

No proof is provided that the containment measures are sufficient to meet the containment KCI thresholds, nor process for soliciting such proof.

Quotes:

No relevant quotes found.

3.1.1.3 Strong third party verification process to verify that the containment measures meet the threshold (100% if 3.1.1.3 > [60% x 3.1.1.1 + 40% x 3.1.1.2]) 0%

There is no detail of third-party verification that containment measures meet the KCI threshold.

Quotes:

No relevant quotes found.

3.1.2 Deployment measures (35%) 50%

3.1.2.1 Deployment measures are precisely defined for all KCI thresholds (60%) 25%

The framework mentions some possible deployment measures (“safeguards or mitigations”), but without explicit commitment to implementing them. Further, these are not tied to KCI thresholds.

Quotes:

“Examples of safeguards or mitigations we may potentially utilize to achieve our safety objectives include:

  • Refusal training: Training Grok to decline harmful requests.
  • Circuit breakers: Using representation engineering to interrupt model representations responsible for hazardous outputs.
  • Input and output filters: Applying classifiers to user inputs or model outputs to verify safety when Grok is queried regarding weapons of mass destruction or cyberterrorism.
We intend to design into Grok adequate safeguards prior to releasing it for general availability.” (p. 3)

3.1.2.2 Proof that deployment measures are sufficient to meet the thresholds (40%) 25%

The framework describes using red teaming of its safeguards, but does not detail what sufficient proof would be. Further, proof should be provided ex ante for why they believe their deployment measures will meet the relevant KCI threshold.

Quotes:

“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)

3.1.2.3 Strong third party verification process to verify that the deployment measures meet the threshold (100% if 3.1.2.3 > [60% x 3.1.2.1 + 40% x 3.1.2.2]) 50%

The framework describes using third-party red teaming of its safeguards, but does not detail what sufficient proof would be. They also don’t mention the process of involving external parties for red-teaming.

Quotes:

“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)

3.1.3 Assurance processes (30%) 3%

3.1.3.1 Credible plans towards the development of assurance properties (40%) 10%

The framework mentions they “intend to improve” their assurance processes. However, they do not mention (a) at what KRI level the assurance processes become necessary, (b) justification for why they believe they will have sufficient assurance processes by the time the relevant KRI is reached, or (c) technical milestones and estimates of when these milestones will need to be reached given forecasted capabilities growth. They also only state an intent to improve them, rather than a commitment.

Quotes:

“Our evaluation and mitigation plans for loss of control are not yet fully developed, and we intend to improve them in the future.”

3.1.3.2 Evidence that the assurance properties are enough to achieve their corresponding KCI thresholds (40%) 0%

There is no mention of providing evidence that the assurance processes are sufficient.

Quotes:

No relevant quotes found.

3.1.3.3 The underlying assumptions that are essential for their effective implementation and success are clearly outlined (20%) 0%

There is no mention of the underlying assumptions that are essential for the effective implementation and success of assurance processes.

Quotes:

No relevant quotes found.

3.2 Continuous Monitoring and Comparing Results with Pre-determined Thresholds (50%) 11%

3.2.1 Monitoring of KRIs (40%) 2%

3.2.1.1 Justification that elicitation methods used during the evaluations are comprehensive enough to match the elicitation efforts of potential threat actors (30%) 0%

The most relevant indication is the mention that the adequacy of benchmarks should be regularly evaluated; however, this is not enough to satisfy the criterion. Detail should be included on how they will aim to upper-bound capabilities, specifying the elicitation techniques used and how these relate to their risk models. This is especially important in the case of xAI, as their KRIs depend exclusively on benchmarks, making maximal elicitation especially critical for risk assessment.

Quotes:

“We intend to regularly evaluate the adequacy and reliability of such benchmarks for both internal and external deployments, including by comparing them against other benchmarks that we could potentially utilize.” (pp. 3, 5)

3.2.1.2 Evaluation frequency (25%) 0%

They only appear to evaluate before deployment; to improve, evaluation frequency should be specified both in terms of relative increases in the effective compute used in training and in fixed time periods.

Quotes:

“We intend to evaluate future developed models on the above benchmarks before public deployment.” (p. 4)
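For illustration, an evaluation-frequency commitment of the kind this criterion asks for could combine a compute-growth trigger with a fixed time cap. The sketch below is hypothetical; neither the trigger values nor the rule appear in xAI’s framework.

```python
# Illustrative sketch: re-run evaluations when effective training compute has
# grown by a set factor since the last evaluation, or after a fixed period,
# whichever comes first. Trigger values are hypothetical.

COMPUTE_GROWTH_TRIGGER = 4.0   # re-evaluate at every 4x increase in effective compute
MAX_DAYS_BETWEEN_EVALS = 90    # and at least every 90 days

def evaluation_due(compute_now: float, compute_at_last_eval: float,
                   days_since_last_eval: int) -> bool:
    growth = compute_now / compute_at_last_eval
    return growth >= COMPUTE_GROWTH_TRIGGER or days_since_last_eval >= MAX_DAYS_BETWEEN_EVALS

print(evaluation_due(5e25, 1e25, 30))  # True: compute has grown 5x since the last evaluation
print(evaluation_due(2e25, 1e25, 30))  # False: neither trigger reached
```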

3.2.1.3 Description of how post-training enhancements are factored into capability assessments (15%) 0%

There is no description of how post-training enhancements are factored into capability assessments.

Quotes:

No relevant quotes found.

3.2.1.4 Vetting of protocols by third parties (15%) 0%

There is no mention of having the evaluation methodology vetted by third parties.

Quotes:

No relevant quotes found.

3.2.1.5 Replication of evaluations by third parties (15%) 10%

While they do not explicitly describe a process for ensuring that third parties replicate and/or conduct evaluations, they do mention that they will allow trust-based access for this purpose. This implies that they are at least considering this criterion.

Quotes:

“However, we will allow Grok to respond to [high risk] requests from some vetted, highly trusted users (such as trusted third-party safety auditors) whom we know to be using those capabilities for benign or beneficial purposes, such as scientifically investigating Grok’s capabilities for risk assessment purposes, or if such requests cover information that is already readily and easily available, including by an internet search.” (pp. 1-2)

3.2.2 Monitoring of KCIs (40%) 15%

3.2.2.1 Detailed description of evaluation methodology and justification that KCI thresholds will not be crossed unnoticed (40%) 0%

There is no mention of monitoring mitigation effectiveness after safeguards assessment. There are incident response protocols, but these do not mention reviewing mitigations, only remediation of incidents.

Quotes:

“If xAI learned of an imminent threat of a significantly harmful event, including loss of control, we would take steps to stop or prevent that event, including potentially the following steps:

  1. We would immediately notify and cooperate with relevant law enforcement agencies […]” (p. 7)

3.2.2.2 Vetting of protocols by third parties (30%) 0%

There is no mention of KCIs protocols being vetted by third parties.

Quotes:

No relevant quotes found.

3.2.2.3 Replication of evaluations by third parties (30%) 50%

The framework describes using third-party red teaming of its safeguards, but does not detail what sufficient proof would be. They also do not describe the process for involving external parties in red teaming, the expertise required, or the access given. They do not mention replication of evaluation results for KCIs.

Quotes:

“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)

3.2.3 Transparency of evaluation results (10%) 43%

3.2.3.1 Sharing of evaluation results with relevant stakeholders as appropriate (85%) 50%

There is a thorough description of the evaluation results that would be publicly shared, but this is all qualified by “may publish”, reducing their commitment as sharing becomes discretionary.

They commit to notifying relevant authorities if there was “an imminent threat of a significantly harmful event”. To improve, they could commit to notifying relevant authorities if KRIs are crossed.

Quotes:

“We aim to keep the public informed about our risk management policies. As we work towards incorporating more risk management strategies, we intend to publish updates to our risk management framework.

For transparency and third-party review, we may publish the following types of information listed below. However, to protect public safety, national security, and our intellectual property, we may redact information from our publications. We may provide relevant and qualified external red teams or relevant government agencies unredacted versions.

  1. Risk Management Framework compliance: regularly review our compliance with the Framework. Internally, we will allow xAI employees to anonymously report concerns about noncompliance, with protections from retaliation.
  2. Benchmark results: share with relevant audiences leading benchmark results for general capabilities and the benchmarks listed above, upon new major releases.
  3. Internal AI usage: assess the percent of code or percent of pull requests at xAI generated by Grok, or other potential metrics related to AI research and development automation.
  4. Survey: survey employees for their views and projections of important future developments in AI, e.g., capability gains and benchmark results.” (p. 6)

“If xAI learned of an imminent threat of a significantly harmful event, including loss of control, we would take steps to stop or prevent that event, including potentially the following steps: 1. We would immediately notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident. xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.” (p. 7)

3.2.3.2 Commitment to non-interference with findings (15%) 0%

No commitment to allow reports detailing the results of external evaluations (i.e., any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.

Quotes:

No relevant quotes found.

3.2.4 Monitoring for novel risks (10%) 0%

3.2.4.1 Identifying novel risks post-deployment: engages in some process (post deployment) explicitly for identifying novel risk domains or novel risk models within known risk domains (50%) 0%

There is no mention of a process for identifying novel risks post-deployment.

Quotes:

No relevant quotes found.

3.2.4.2 Mechanism to incorporate novel risks identified post-deployment (50%) 0%

There is no mechanism to incorporate risks identified during post-deployment that is detailed.

Quotes:

No relevant quotes found.


4.1 Decision-making (25%) 40%

4.1.1 The company has clearly defined risk owners for every key risk identified and tracked (25%) 75%

The framework laudably includes explicit risk owners. However, this is somewhat diminished by the framework stating only that they “intend” to designate risk owners and by the hedging phrase “for instance”.

Quotes:

“To foster accountability, we intend to designate risk owners to be assigned responsibility for proactively mitigating Grok’s risks. For instance, a risk owner would be assigned for each of the following areas: WMD, Cyber, and loss of control.” (p. 7)

4.1.2 The company has a dedicated risk committee at the management level that meets regularly (25%) 0%

No mention of a management risk committee.

Quotes:

No relevant quotes found.

4.1.3 The company has defined protocols for how to make go/no-go decisions (25%) 0%

The framework mentions a few risk mitigating practices, but does not contain direct decision-making protocols.

Quotes:

“To mitigate risks, we intend to utilize tiered availability of the functionality and features of Grok. For instance, the full functionality of a future Grok could be made available only to trusted parties, partners, and government agencies. We could also mitigate risks by adding additional controls on functionality and features depending on the end user (e.g., consumers using mobile apps vs. sophisticated businesses using APIs).” (p. 8)

4.1.4 The company has defined escalation procedures in case of incidents (25%) 75%

The framework includes clear incident management practices. It could improve further by specifying which decision makers would be part of incident response decisions.

Quotes:

“If xAI learned of an imminent threat of a significantly harmful event, including loss of control, we would take steps to stop or prevent that event, including potentially the following steps: 1. We would immediately notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident. 2. If we determine that xAI systems are actively being used in such an event, we would take steps to isolate and revoke access to user accounts involved in the event. 3. If we determine that allowing a system to continue running would materially and unjustifiably increase the likelihood of a catastrophic event, we would temporarily fully shut down the relevant system until we had a more targeted response.” (p. 7)

4.2. Advisory and Challenge (20%) 4%

4.2.1 The company has an executive risk officer with sufficient resources (16.7%) 0%

No mention of an executive risk officer.

Quotes:

No relevant quotes found.

4.2.2 The company has a committee advising management on decisions involving risk (16.7%) 0%

No mention of an advisory committee.

Quotes:

No relevant quotes found.

4.2.3 The company has an established system for tracking and monitoring risks (16.7%) 25%

The framework is laudably specific in what quantitative benchmarks it will use to measure risks. However, it does not provide any detail on the overall system for managing risks.

Quotes:

“To transparently measure Grok’s safety properties, we intend to utilize benchmarks like WMD and Catastrophic Harm Benchmarks.” (p. 2)
“We intend to evaluate future developed models on the above benchmarks before public deployment.” (p. 4)

4.2.4 The company has designated people that can advise and challenge management on decisions involving risk (16.7%) 0%

No mention of people that challenge decisions.

Quotes:

No relevant quotes found.

4.2.5 The company has an established system for aggregating risk data and reporting on risk to senior management and the Board (16.7%) 0%

No mention of a system to aggregate and report risk data.

Quotes:

No relevant quotes found.

4.2.6 The company has an established central risk function (16.7%) 0%

No mention of a central risk function.

Quotes:

No relevant quotes found.

4.3 Audit (20%) 25%

4.3.1 The company has an internal audit function involved in AI governance (50%) 0%

No mention of an internal audit function.

Quotes:

No relevant quotes found.

4.3.2 The company involves external auditors (50%) 50%

The framework includes external red teams, but does not specify whether they will be independent or act as auditors.

Quotes:

“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams.” (p. 4)
“We may provide relevant and qualified external red teams or relevant government agencies unredacted versions.” (p. 6)

4.4 Oversight (20%) 0%

4.4.1 The Board of Directors of the company has a committee that provides oversight over all decisions involving risk (50%) 0%

No mention of a Board risk committee.

Quotes:

No relevant quotes found.

4.4.2 The company has other governing bodies outside of the Board of Directors that provide oversight over decisions (50%) 0%

No mention of any additional governance bodies.

Quotes:

No relevant quotes found.

4.5 Culture (10%) 50%

4.5.1 The company has a strong tone from the top (33.3%) 25%

The framework includes clear mentions of the risks inherent to their model development and deployment and sets out a clear vision of risk reduction. To improve further, it should provide more details on how that commitment is operationalized in practice.

Quotes:

“As AI capabilities advance and expand our understanding of the universe, xAI is developing our AI systems to take into account safety and security.” (p. 1)
“We aim to reduce the risk that Grok might cause serious injury to people, property, or national security interests…” (p. 1)

4.5.2 The company has a strong risk culture (33.3%) 50%

The framework is unique in mentioning employee surveys, which can be beneficial for building a risk culture. However, to improve the score, more aspects of risk-culture building, such as training, are needed.

Quotes:

“Survey: survey employees for their views and projections of important future developments in AI, e.g., capability gains and benchmark results.” (p. 6)

4.5.3 The company has a strong speak-up culture (33.3%) 75%

The framework clearly states whistleblower protections but is fairly light on details; providing more detail would further improve its score.

Quotes:

“Internally, we will allow xAI employees to anonymously report concerns about noncompliance, with protections from retaliation.” (p. 6)
“xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.” (p. 7)

4.6 Transparency (5%) 45%

4.6.1 The company reports externally on what their risks are (33.3%) 75%

The framework clearly states the risks that are covered by the framework. Further improvements in score could be gained by specifying what information on these risks and their safeguards will be released externally on a regular basis.

Quotes:

“Without any safeguards, we recognize that advanced AI models could lower the barrier to entry for developing chemical, biological, radiological, or nuclear (CBRN) or cyber weapons or help automate bottlenecks to weapons development, amplifying the expected risk posed by weapons of mass destruction.” (p. 1)
“Currently, it is recognized that some properties of an AI system that may reduce controllability include deception, power-seeking, fitness maximization, and incorrigibility.” (p. 5)

4.6.2 The company reports externally on what their governance structure looks like (33.3%) 10%

The framework does not include any detail on the governance structure. It mentions keeping the framework up-to-date, but to improve its score, it would need to provide details on its governance structure.

Quotes:

“For transparency and third-party review, we may publish the following types of information listed below. 1. Risk Management Framework compliance: regularly review our compliance with the Framework.” (p. 6)
“We aim to keep the public informed about our risk management policies. As we work towards incorporating more risk management strategies, we intend to publish updates to our risk management framework.” (p. 6)

4.6.3 The company shares information with industry peers and government bodies (33.3%) 50%

The framework clearly states information sharing practices. Extra credit is provided for the clear commitment to share information with law enforcement. For a higher score, the company could be more precise rather than saying “may provide”.

Quotes:

“We would immediately notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident.” (p. 7)
“We may provide relevant and qualified external red teams or relevant government agencies unredacted versions.” (p. 6)
“We invite the AI research community to contribute better benchmarks for evaluating model capabilities and safeguards in these areas.” (p. 4)
