1.3.1 The company uses risk models for all the risk domains identified and the risk models are published (with potentially dangerous information redacted) (40%) 25%
In the “Risk Identification and Analysis” section, the framework first sets out its risk analysis methodology, then the hazards (i.e., risk domains) it focuses on, and states that for each harm in a given risk domain, the pre-mitigation risk level will be determined by estimating the likelihood, severity and observability of the harm.
They define their risk assessment methodology with their scoring table (Table 1). Because this involves scoring factors such as duration, detectability and frequency, it likely involves some modelling of how threats may be realized. However, it is not clear how they arrive at scores for each component of this risk analysis (e.g. duration, detectability, frequency). To improve, scores should be informed by risk modelling that includes causal pathways to harm with discrete, measurable steps, and the methodology for this risk modelling should be precisely defined.
Whilst this is notable, as it means there is a structured methodology for arriving at risk determinations, it is too high-level to count as risk modelling. To improve, the company should break down causal pathways of harm step by step, with distinct threat scenarios (see the sketch below), in order to inform the likelihood/severity/observability scores. In addition, these risk models and threat scenarios should then be published.
However, they do give differential ‘model risk’ scores, depending on the model’s use case, expected level of capability, and autonomy. This pre-emptive assessment of potential manifestations of harm shows some awareness of risk modeling, which is rewarded here.
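To make the recommendation above concrete, the following is a minimal sketch of how a causal pathway to harm could be decomposed into discrete, measurable steps whose estimates feed a likelihood score. The step names, probabilities, and the `Step`/`ThreatScenario` structures are invented here for exposition; nothing in this sketch is drawn from NVIDIA’s framework.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Step:
    """One discrete, measurable step in a causal pathway to harm."""
    description: str
    p_success: float  # estimated probability the step is achieved, ideally tied to an evaluation

@dataclass
class ThreatScenario:
    """A distinct threat scenario: an ordered sequence of steps leading to a harm."""
    name: str
    steps: list[Step]

    def likelihood(self) -> float:
        # Treating the steps as sequential and independent (a simplifying assumption),
        # the pathway likelihood is the product of the step probabilities. This is the
        # kind of quantity that could inform the framework's "likelihood" factor.
        return prod(step.p_success for step in self.steps)

# Purely illustrative numbers for a hypothetical misuse pathway.
scenario = ThreatScenario(
    name="Hypothetical misuse pathway",
    steps=[
        Step("Actor obtains access to the model", 0.5),
        Step("Model provides meaningful uplift for the harmful task", 0.2),
        Step("Actor executes the task despite existing safeguards", 0.1),
    ],
)
print(f"{scenario.name}: estimated likelihood {scenario.likelihood():.3f}")
```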
Quotes:
“Each risk criteria have discrete thresholds between 1 and 5 that are used to determine a model’s risk category. The [Preliminary Risk Assessment] will assign a model risk (MR) score between 1 and 5 based on the highest MR score within this criteria. Below is a nonexhaustive list of attributes used to define the MR score. The MR score is correlated to the maximum permissible harm relative to our trustworthy AI principles. High risk models require more intensive scrutiny, increased oversight and face stricter development and operational constraints.” (p. 2)
“NVIDIA’s Trustworthy AI Principles are derived from human rights and legal principles. These principles are used as a foundation for defining a broad range of potential risks that a product may be exposed to. Based on the description of a product’s architecture and development workflows it should be possible to identify possible hazards, estimate the level of risk for each hazard and categorize the cumulative risk relative to our trustworthy AI principles.
We defined risk as the potential for an event to lead to an undesired outcome, measured in terms of its likelihood (probability), its impact (severity) and its ability to be controlled or detected (controllability). The risk associated with each hazard is scored between 1 and 64, with the higher value indicating a higher risk.
Risk = likelihood x severity x observability
Risk = frequency x (duration + speed of onset) x (detectability + predictability)
A hazard that has a non-zero but very low probability of occurring, that is transient in nature, occurs gradually, easy to detect and localized has the lowest risk score. In contrast, a hazard that has a high probability of occurring, is permanent in nature, occurs instantaneously and randomly due to latent faults has the highest risk score.” (p. 6)
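For illustration only, the quoted scoring arithmetic can be read as a simple multiplicative index. The sketch below is not NVIDIA’s implementation: the `risk_score` helper and the 1–4 integer range assumed for each composite factor are placeholders chosen so that the product spans the stated 1–64 range; the framework’s own sub-scales (it decomposes severity into duration + speed of onset, and observability into detectability + predictability) are set out in its Table 1 and are not reproduced here.

```python
# Minimal sketch of the quoted risk-scoring arithmetic. The 1-4 range for each
# composite factor is an assumption made for illustration, chosen so that the
# product spans the 1-64 range stated in the framework.

def risk_score(likelihood: int, severity: int, observability: int) -> int:
    """Risk = likelihood x severity x observability."""
    for factor in (likelihood, severity, observability):
        if not 1 <= factor <= 4:
            raise ValueError("each factor is assumed to be an integer from 1 to 4")
    return likelihood * severity * observability

# A low-probability, transient, gradual, easy-to-detect hazard sits at the
# bottom of the range; a high-probability, permanent, instantaneous,
# unpredictable hazard sits at the top.
print(risk_score(likelihood=1, severity=1, observability=1))   # 1
print(risk_score(likelihood=4, severity=4, observability=4))   # 64
```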
1.3.2 Risk modeling methodology (40%) 9%
1.3.2.1 Methodology precisely defined (70%) 0%
They define their risk assessment methodology with their scoring table (Table 1). However, it is not clear how they arrive at scores for each component of this risk analysis (e.g. duration, detectability, frequency). Indeed, no risk modeling methodology is defined for actually mapping out how harms may be realized.
Quotes:
“We defined risk as the potential for an event to lead to an undesired outcome, measured in terms of its likelihood (probability), its impact (severity) and its ability to be controlled or detected (controllability). The risk associated with each hazard is scored between 1 and 64, with the higher value indicating a higher risk.
Risk = likelihood x severity x observability
Risk = frequency x (duration + speed of onset) x (detectability + predictability)
A hazard that has a non-zero but very low probability of occurring, that is transient in nature, occurs gradually, easy to detect and localized has the lowest risk score. In contrast, a hazard that has a high probability of occurring, is permanent in nature, occurs instantaneously and randomly due to latent faults has the highest risk score.” (p. 6)
See Table 1 in the Framework, on page 7.
1.3.2.2 Mechanism to incorporate red teaming findings (15%) 10%
Whilst there is mention of incorporating hazards identified during red-teaming, showing awareness that red-teaming may uncover new risks to consider and analyse, this only covers risks that were prespecified but previously absent. To improve, open-ended red-teaming should be conducted, and when novel risks or risk pathways are discovered, this should trigger new risk modelling of other affected risk domains.
Quotes:
“For frontier models we need to consider speculative risks that may or may not be present in the model. To help detect specific adversarial capabilities, models will be stress-tested against extreme but plausible scenarios that may lead to systemic risks. This approach ensures that both known and emergent hazards are taken into account.” (p. 7)
1.3.2.3 Prioritization of severe and probable risks (15%) 50%
There is a clear prioritization of risk domains (‘hazards’, in NVIDIA’s terms) by severity and likelihood, as well as controllability. These are applied across the full space of risk models.
To improve, probability and severity scores (qualitative or quantitative) should be published for different risk models, with justification given for these scores.
It is commendable that they further broke down severity and added observability, showing nuance.
Quotes:
“We defined risk as the potential for an event to lead to an undesired outcome, measured in terms of its likelihood (probability), its impact (severity) and its ability to be controlled or detected (controllability). The risk associated with each hazard is scored between 1 and 64, with the higher value indicating a higher risk.
Risk = likelihood x severity x observability
Risk = frequency x (duration + speed of onset) x (detectability + predictability)
A hazard that has a non-zero but very low probability of occurring, that is transient in nature, occurs gradually, easy to detect and localized has the lowest risk score. In contrast, a hazard that has a high probability of occurring, is permanent in nature, occurs instantaneously and randomly due to latent faults has the highest risk score.” (p. 6)
“Based on the description of a product’s architecture and development workflows it should be possible to identify possible hazards, estimate the level of risk for each hazard and categorize the cumulative risk relative to our trustworthy AI principles.” (p. 6)