1.3.2.1 Methodology precisely defined (70%) 50%
The methodology for the overall threat modeling process is defined. To improve, more detail is required: e.g., whilst Meta mentions that they "map the potential causal pathways that could produce [catastrophic outcomes]", they could provide greater granularity by identifying the individual steps of each pathway to the threat scenario more precisely (using techniques such as event trees or fault trees), and by describing how they elicit information from experts to inform their risk models.
Quotes:
“We start by identifying a set of catastrophic outcomes we must strive to prevent, and then map the potential causal pathways that could produce them. When developing these outcomes, we’ve considered the ways in which various actors, including state level actors, might use/misuse frontier AI. We describe threat scenarios that would be potentially sufficient to realize the catastrophic outcome, and we define our risk thresholds based on the extent to which a frontier AI would uniquely enable execution of any of our threat scenarios.” (p. 10)
“We design assessments to simulate whether our model would uniquely enable these scenarios, and identify the enabling capabilities the model would need to exhibit to do so. Our first set of evaluations are designed to identify whether all of these enabling capabilities are present, and if the model is sufficiently performant on them. If so, this would prompt further evaluation to understand whether the model could uniquely enable the threat scenario […] It is important to note that the pathway to realize a catastrophic outcome is often extremely complex, involving numerous external elements beyond the frontier AI model. Our threat scenarios describe an essential part of the end-to-end pathway. By testing whether our model can uniquely enable a threat scenario, we’re testing whether it uniquely enables that essential part of the pathway. If it does not, then we know that our model cannot be used to realize the catastrophic outcome, because this essential part is still a barrier. If it does and cannot be further mitigated, we assign the model to the critical threshold.
This would also trigger a new threat modelling exercise to develop additional threat scenarios along the causal pathway so that we can ascertain whether the catastrophic outcome is indeed realizable, or whether there are still barriers to realizing the catastrophic outcome.” (p. 11)
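The evaluation flow described in the quote above (enabling capabilities present → further evaluation → unique enablement → mitigation → critical threshold) can be sketched as a small decision procedure. This is an illustrative reading of the quoted process only, not Meta's implementation; all names and the encoding are hypothetical:

```python
from enum import Enum, auto

class Outcome(Enum):
    BELOW_THRESHOLD = auto()     # an essential pathway step remains a barrier
    MITIGATED = auto()           # enabling capability neutralized by mitigations
    CRITICAL_THRESHOLD = auto()  # model assigned to the critical threshold

def assess_model(capabilities_present: bool,
                 uniquely_enables: bool,
                 mitigable: bool) -> Outcome:
    """Hypothetical sketch of the evaluation flow quoted on p. 11."""
    if not capabilities_present:
        # First evaluations: enabling capabilities absent or underperformant.
        return Outcome.BELOW_THRESHOLD
    if not uniquely_enables:
        # Model does not uniquely enable the essential part of the pathway,
        # so the catastrophic outcome cannot be realized through it.
        return Outcome.BELOW_THRESHOLD
    if mitigable:
        return Outcome.MITIGATED
    # Uniquely enables the essential step and cannot be further mitigated;
    # per the quote, this also triggers a new threat modelling exercise.
    return Outcome.CRITICAL_THRESHOLD
```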
“Threat modelling is a structured process of identifying how different threat actors could leverage frontier AI to produce specific – and in this instance catastrophic – outcomes. This process identifies the potential causal pathways for realizing the catastrophic outcome.
Threat scenarios describe how different threat actors might achieve a catastrophic outcome. Threat scenarios may be described in terms of the tasks a threat actor would use a frontier AI model to complete, the particular capabilities they would exploit, or the tools they might use in conjunction to realize the catastrophic outcome.” (p. 20)
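The step-level granularity recommended above (event trees or fault trees) could look like the following minimal sketch, in which a causal pathway is decomposed into ordered conditional steps whose expert-elicited probabilities multiply along the branch. The step names and figures are invented for illustration; this is not Meta's risk model:

```python
from math import prod

# Hypothetical event-tree branch: each step of the pathway to a threat
# scenario carries an expert-elicited conditional success probability.
pathway = [
    ("threat actor obtains model access",      0.8),   # invented figure
    ("model uplifts a key technical step",     0.3),   # invented figure
    ("actor completes external pathway steps", 0.05),  # invented figure
]

def pathway_probability(steps):
    """Probability of traversing the whole branch: product of step probabilities."""
    return prod(p for _, p in steps)

print(f"{pathway_probability(pathway):.4f}")  # 0.8 * 0.3 * 0.05 = 0.0120
```

Decomposing pathways this way makes explicit both where the model provides uplift and which external steps remain barriers.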
1.3.2.3 Prioritization of severe and probable risks (15%) 25%
There is an explicit intent to prioritize “the most urgent catastrophic outcomes” amongst all the identified causal pathways (i.e., risk models). For a risk to be monitored, they also require that the risk pathway deriving from the model is plausible and catastrophic; the latter criterion prioritizes severity, whilst the former prioritizes nonzero probability. It is commendable that this prioritization occurs over the full space of risk models, rather than from prespecified risk domains.
However, importantly, neither the full list of identified scenarios, nor the justification for why the chosen risk models are the most severe or probable, nor the severity and probability scores of deprioritised risk models, is detailed. To improve, they could reference their risk modelling work in the framework, such as Wan et al. (2024).
Quotes:
“We start by identifying a set of catastrophic outcomes we must strive to prevent, and then map the potential causal pathways that could produce them. When developing these outcomes, we’ve considered the ways in which various actors, including state level actors, might use/misuse frontier AI. We describe threat scenarios that would be potentially sufficient to realize the catastrophic outcome, and we define our risk thresholds based on the extent to which a frontier AI would uniquely enable execution of any of our threat scenarios.
[…]
An outcomes-led approach also enables prioritization. This systematic approach will allow us to identify the most urgent catastrophic outcomes – i.e., within the domains of cybersecurity and chemical and biological weapons – and focus our efforts on avoiding these outcomes rather than spreading efforts across a wide range of theoretical risks from particular capabilities that may not plausibly be presented by the technology we are actually building.” (p. 10)
“For this Framework specifically, we seek to consider risks that satisfy all four criteria:
Plausible: It must be possible to identify a causal pathway for the catastrophic outcome, and to define one or more simulatable threat scenarios along that pathway.
Catastrophic: The outcome would have large scale, devastating, and potentially irreversible harmful effects.
Net new: The outcome cannot currently be realized as described (e.g. at that scale / by that threat actor / for that cost) with existing tools and resources.
Instantaneous or irremediable: The outcome is such that once realized, its catastrophic impacts are immediately felt, or inevitable due to a lack of feasible measures to remediate.” (p. 12)
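As an illustration only, the four quoted inclusion criteria (p. 12) can be expressed as a simple conjunctive predicate over candidate risks; the criterion names come from the quote, while the encoding and field values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class CandidateRisk:
    # Fields mirror the four quoted criteria; their values would come from
    # threat modelling, not from this sketch.
    plausible: bool     # causal pathway + simulatable threat scenarios exist
    catastrophic: bool  # large-scale, devastating, potentially irreversible
    net_new: bool       # not realizable as described with existing tools
    irremediable: bool  # impacts immediate, or infeasible to remediate

def in_scope(risk: CandidateRisk) -> bool:
    """A risk is considered under the Framework only if all four criteria hold."""
    return all([risk.plausible, risk.catastrophic,
                risk.net_new, risk.irremediable])
```

Note that the conjunction makes the filter strictly narrowing: failing any single criterion (e.g. a catastrophic but already-realizable outcome, which is not net new) excludes the risk from scope.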