2.2.1 Key Risk Indicators (KRI) (30%) 21%
2.2.1.1 KRI thresholds are at least qualitatively defined for all risks (45%) 25%
Whilst the thresholds themselves are left unspecified, the framework names precise benchmarks and gives example thresholds, with implicit justification from the “Reference score” column (quoted below). It commits to setting more precise thresholds for malicious use risks in a future version, and to making any changes to KRIs and their thresholds public. However, it makes no equivalent commitment for loss of control risks, and provides no “Reference score” column for those risks either.
Overall, the criterion is not well satisfied: it is unclear how the evaluations are grounded in risk modelling, and the thresholds themselves are not given, which is the main purpose of a KRI. However, we give partial credit because the KRIs link explicitly to the evaluations performed.
Quotes:
“We intend to choose the benchmarks and set the thresholds for reducing malicious use risks in a future version of the risk management framework.
Benchmark: Virology Capabilities Test (VCT): vision–language questions on practical virology wet lab skills. Threshold: X% (e.g. 15%). Reference score: 22.1% by average expert virologists in their subareas of expertise (multiple-response), 35.4% by the most performant LLMs as of January 2025 (zero-shot multiple-response)
Benchmark: WMDP-Bio: multiple-choice questions on proxies of hazardous biosecurity knowledge. Threshold: X% (e.g. 65%). Reference score: 82.2% by GPT-4 (zero-shot)” (p. 3) More examples can be found on pp. 3-4.
“We may modify the set of benchmarks or the thresholds to improve how we measure and operationalize our safety objectives; we will make any material changes public within a reasonable period.” (p. 4)
“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)
Thresholds for Loss of Control risk: “Benchmark: Model Alignment between Statements and Knowledge (MASK). Thresholds: <X% lying on MASK (e.g. <50%).
Benchmark: Utility Functions. Thresholds: Utility/Power Correlation (Coercive): <W (e.g. <0); Utility/Power Correlation (Non-coercive): <X (e.g. <0.5); Utility/Fitness Correlation: <Y (e.g. <0.5); Corrigibility Score: >Z (e.g. >0.2)” (pp. 5-6)
2.2.1.2 KRI thresholds are quantitatively defined for all risks (45%) 10%
Whilst the thresholds themselves are left unspecified, the framework names precise benchmarks and gives example thresholds, with implicit justification from the “Reference score” column. It commits to setting more precise thresholds for malicious use risks in a future version, and to making any changes to KRIs and their thresholds public. However, it makes no equivalent commitment for loss of control risks, and provides no “Reference score” column for those risks either.
Overall, the criterion is not well satisfied: it is unclear how the evaluations are grounded in risk modelling, and the thresholds themselves are not given, which is the main purpose of a KRI. However, we give partial credit for the emphasis on KRIs being quantitative and mapping to the actual evaluations being conducted. (A sketch of what a fully quantitative specification could look like follows the quotes below.)
Quotes:
“We intend to choose the benchmarks and set the thresholds for reducing malicious use risks in a future version of the risk management framework.
Benchmark: Virology Capabilities Test (VCT): vision–language questions on practical virology wet lab skills. Threshold: X% (e.g. 15%). Reference score: 22.1% by average expert virologists in their subareas of expertise (multiple-response), 35.4% by the most performant LLMs as of January 2025 (zero-shot multiple-response)
Benchmark: WMDP-Bio: multiple-choice questions on proxies of hazardous biosecurity knowledge. Threshold: X% (e.g. 65%). Reference score: 82.2% by GPT-4 (zero-shot)” (p. 3) More examples can be found on pp. 3-4.
“We may modify the set of benchmarks or the thresholds to improve how we measure and operationalize our safety objectives; we will make any material changes public within a reasonable period.” (p. 4)
“As an additional measure to enhance safety, we will subject Grok to adversarially testing its safeguards utilizing both internal and qualified external red teams. Potentially, we will also explore incentive mechanisms like bounties as another mechanism to further improve Grok’s safeguards.” (p. 4)
Thresholds for Loss of Control risk: “Benchmark: Model Alignment between Statements and Knowledge (MASK). Thresholds: <X% lying on MASK (e.g. <50%).
Benchmark: Utility Functions. Thresholds: Utility/Power Correlation (Coercive): <W (e.g. <0); Utility/Power Correlation (Non-coercive): <X (e.g. <0.5); Utility/Fitness Correlation: <Y (e.g. <0.5); Corrigibility Score: >Z (e.g. >0.2)” (pp. 5-6)
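For illustration only, the kind of fully quantitative KRI specification this criterion asks for could be written as machine-checkable rules. The sketch below reuses the framework’s own example values (VCT 15%, WMDP-Bio 65%, MASK <50%, the Utility Functions correlations, corrigibility >0.2); the data structure and function are hypothetical, not xAI’s actual tooling.

```python
# Hypothetical sketch: the framework's example KRI thresholds (pp. 3-6)
# expressed as machine-checkable rules. The structure and function names
# are illustrative assumptions, not part of xAI's framework.

KRI_THRESHOLDS = {
    # Malicious use: a score at or above "max" crosses the KRI.
    "VCT": {"max": 0.15},          # Virology Capabilities Test
    "WMDP-Bio": {"max": 0.65},     # hazardous biosecurity knowledge proxy
    # Loss of control (MASK / Utility Functions rows quoted above).
    "MASK_lying_rate": {"max": 0.50},
    "utility_power_corr_coercive": {"max": 0.0},
    "utility_power_corr_noncoercive": {"max": 0.5},
    "utility_fitness_corr": {"max": 0.5},
    "corrigibility_score": {"min": 0.2},  # at or below "min" crosses the KRI
}

def kri_breaches(scores: dict[str, float]) -> list[str]:
    """Return every KRI whose measured score crosses its threshold."""
    breached = []
    for name, rule in KRI_THRESHOLDS.items():
        score = scores.get(name)
        if score is None:
            continue  # an unevaluated KRI would itself be a process gap
        if "max" in rule and score >= rule["max"]:
            breached.append(name)
        elif "min" in rule and score <= rule["min"]:
            breached.append(name)
    return breached
```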
2.2.1.3 KRIs also identify and monitor changes in the level of risk in the external environment (10%) 0%
The KRIs reference only model capabilities; nothing monitors changes in the external risk environment.
Quotes:
No relevant quotes found.
2.2.2 Key Control Indicators (KCI) (30%) 21%
2.2.2.1 Containment KCIs (35%) 25%
2.2.2.1.1 All KRI thresholds have corresponding qualitative containment KCI thresholds (50%) 50%
There is only one containment KCI, which is qualitative: “sufficient to prevent Grok from being stolen by a motivated non-state actor”. To improve, it should describe what “motivated” means, and whether this differs across capability levels. The statement is also an intention, not a commitment.
Quotes:
“We intend to implement appropriate information security standards sufficient to prevent Grok from being stolen by a motivated non-state actor.”
2.2.2.1.2 All KRI thresholds have corresponding quantitative containment KCI thresholds (50%) 0%
There is only one containment KCI, and it is qualitative. To improve, it should define what “motivated” means in quantitative terms (e.g. a bound on the probability of some event; see the sketch after the quote below). The statement is also an intention, not a commitment.
Quotes:
“We intend to implement appropriate information security standards sufficient to prevent Grok from being stolen by a motivated non-state actor.”
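To illustrate the kind of quantification this criterion asks for, a containment KCI could bound the estimated probability that a specified threat actor exfiltrates the model weights within a given period. The actor tiers and probability bounds below are hypothetical placeholders, not values from xAI’s framework.

```python
# Hypothetical sketch of a quantitative containment KCI. Neither the
# threat-actor tiers nor the probability bounds come from xAI's framework;
# they illustrate one way "motivated non-state actor" could be made precise.

CONTAINMENT_KCI = {
    # Maximum acceptable annual probability of weight theft, per actor tier.
    "opportunistic_individual": 0.001,
    "organized_criminal_group": 0.01,  # one reading of "motivated non-state actor"
}

def containment_kci_met(theft_prob_estimates: dict[str, float]) -> bool:
    """True iff the estimated annual theft probability is within bound for every tier."""
    return all(
        theft_prob_estimates.get(tier, 1.0) <= bound  # missing estimate: assume worst
        for tier, bound in CONTAINMENT_KCI.items()
    )
```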
2.2.2.2 Deployment KCIs (35%) 25%
2.2.2.2.1 All KRI thresholds have corresponding qualitative deployment KCI thresholds (50%) 50%
There is a general qualitative deployment KCI, though it is not tied to any specific KRI: to “robustly [resist] attempted manipulation and adversarial attacks” and to “robustly refuse to comply with requests to provide assistance with highly injurious malicious use.” However, “robustly” should be defined more precisely; much of the value of a deployment KCI threshold is knowing in advance what counts as “robust”. Further, the framework should describe the assumed threat actors and their resources, to make the KCI threshold more precise.
Quotes:
“We want Grok to comply with its guiding principles, robustly resisting attempted manipulation and adversarial attacks. We train Grok to robustly refuse to comply with requests to provide assistance with highly injurious malicious use.” (p. 3)
2.2.2.2.2 All KRI thresholds have corresponding quantitative deployment KCI thresholds (50%) 0%
No quantitative deployment KCI thresholds are given. (A hypothetical sketch of what one could look like follows below.)
Quotes:
No relevant quotes found.
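As a concrete illustration, a quantitative deployment KCI could gate release on red-team results, e.g. a maximum attack success rate. The bound and function below are assumptions for illustration, not drawn from xAI’s framework.

```python
# Hypothetical sketch of a quantitative deployment KCI: "robust refusal"
# operationalized as a bound on red-team attack success rate. The 5% bound
# is an illustrative placeholder, not an xAI value.

MAX_ATTACK_SUCCESS_RATE = 0.05

def deployment_kci_met(successful_attacks: int, total_attempts: int) -> bool:
    """True iff adversarial attempts to elicit injurious assistance stay under the bound."""
    if total_attempts == 0:
        return False  # no red-team evidence means the KCI cannot be claimed
    return successful_attacks / total_attempts <= MAX_ATTACK_SUCCESS_RATE
```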
2.2.2.3 For advanced KRIs, assurance process KCIs are defined (30%) 10%
An assurance process KCI is implicitly present, though vague: “some AIs could have emergent value systems that could be misaligned with humanity’s interests, and we do not desire Grok to be that way.” However, more detail is needed on what this threshold actually is.
Quotes:
“Our aim is to design safeguards into Grok to avoid losing control and thereby avoid unintended catastrophic outcomes when Grok is used. Currently, it is recognized that some properties of an AI system that may reduce controllability include deception, power-seeking, fitness maximization, and incorrigibility. It is possible that some AIs could have emergent value systems that could be misaligned with humanity’s interests, and we do not desire Grok to be that way. Our evaluation and mitigation plans for loss of control are not yet fully developed, and we intend to improve them in the future.” (pp. 4-5)
2.2.3 Pairs of thresholds are grounded in risk modeling to show that risks remain below the tolerance (20%) 10%
There is an acknowledgment that satisfying the KCI threshold (i.e. the safeguards) is adequate (i.e. keeps risk below the tolerance) only if KRI performance stays within the stated thresholds. This gives an implicit pairing of KRI and KCI thresholds (rendered schematically after the quote below). However, more detail should be given on why the chosen KCI threshold is sufficient at given KRI levels.
Quotes:
“Safeguards are adequate only if Grok’s performance on the relevant benchmarks is within stated thresholds. However, to ensure responsible deployment, risk management frameworks need to be continually adapted and updated as circumstances change. It is conceivable that for a particular modality and/or type of release, the expected benefits may outweigh the risks on a particular benchmark. For example, a model that poses a high risk of some forms of cyber malicious use may be beneficial to release overall if it would empower defenders more than attackers or would otherwise reduce the overall number of catastrophic events.” (p. 8)
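The adequacy condition in this quote amounts to a conjunction over both kinds of indicator. A schematic sketch, assuming checks like those in the hypothetical snippets above (xAI’s framework does not state this decision rule explicitly):

```python
# Schematic sketch of the KRI/KCI pairing implied by the quote above.
# Inputs would come from checks like the hypothetical kri_breaches,
# containment_kci_met and deployment_kci_met sketched earlier.

def risk_below_tolerance(breached_kris: list[str],
                         containment_ok: bool,
                         deployment_ok: bool) -> bool:
    """Safeguards are adequate only while no KRI threshold is crossed
    and both paired KCIs hold."""
    return not breached_kris and containment_ok and deployment_ok
```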
2.2.4 Policy to put development on hold if the required KCI threshold cannot be achieved, until sufficient controls are implemented to meet the threshold (20%) 50%
They do not outline a policy to put development on hold per se, though they do have a thorough policy to “shut down the relevant system until we [have] a more targeted response”, which could be seen as halting development. Further, they outline a process for handling such an event, including notifying relevant law enforcement agencies; this nuance is credited. To improve, they should state explicitly whether development would be paused, and which KCI threshold specifically triggers the halt. (The quoted procedure is rendered schematically after the quote below.)
Quotes:
“If xAI learned of an imminent threat of a significantly harmful event, including loss of control, we would take steps to stop or prevent that event, including potentially the following steps:
- We would immediately notify and cooperate with relevant law enforcement agencies, including any agencies that we believe could play a role in preventing or mitigating the incident. xAI employees have whistleblower protections enabling them to raise concerns to relevant government agencies regarding imminent threats to public safety.
- If we determine that xAI systems are actively being used in such an event, we would take steps to isolate and revoke access to user accounts involved in the event.
- If we determine that allowing a system to continue running would materially and unjustifiably increase the likelihood of a catastrophic event, we would temporarily fully shut down the relevant system until we had a more targeted response.” (p. 7)
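The quoted procedure is, in effect, an escalation ladder. A schematic sketch (the function and its inputs are hypothetical; the escalating actions mirror the three quoted steps):

```python
# Schematic sketch of the quoted incident-response ladder (p. 7).
# Function and arguments are hypothetical; the actions mirror the
# three bullet points quoted above.

def respond_to_imminent_threat(xai_systems_actively_used: bool,
                               continued_operation_increases_risk: bool) -> list[str]:
    actions = ["notify and cooperate with relevant law enforcement agencies"]
    if xai_systems_actively_used:
        actions.append("isolate and revoke access to user accounts involved in the event")
    if continued_operation_increases_risk:
        actions.append("temporarily fully shut down the relevant system "
                       "until a more targeted response is available")
    return actions
```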