OpenAI

Total Score: Weak (1.6/5)
Risk Identification: Moderate (2.5/5)
Risk Tolerance & Analysis: Weak (1.2/5)
Risk Mitigation: Weak (1.1/5)

Rating scale:
0 : Non Existent
0 - 1 : Very Weak
1 - 2 : Weak
2 - 3 : Moderate
3 - 4 : Substantial
4 - 5 : Strong
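For reference, the legend above can be read as a simple score-to-label mapping. The sketch below is illustrative only (not the authors' published code); placing exact boundary values such as 2.0 in the lower band is an assumption, consistent with the 2/5 score labelled "Weak" later in this report:

```python
def rating_label(score: float) -> str:
    """Map a 0-5 numeric score to the qualitative label used in this report."""
    if score == 0:
        return "Non Existent"
    if score <= 1:
        return "Very Weak"
    if score <= 2:
        return "Weak"
    if score <= 3:
        return "Moderate"
    if score <= 4:
        return "Substantial"
    return "Strong"

# Labels reported for OpenAI in this assessment:
for name, score in [("Total", 1.6), ("Risk Identification", 2.5),
                    ("Risk Tolerance & Analysis", 1.2), ("Risk Mitigation", 1.1)]:
    print(f"{name}: {rating_label(score)} ({score}/5)")
```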

Risk Identification

In risk identification, we assess whether an AI developer is:

  • Approaching the risks outlined in the literature in an appropriate way.
  • Doing extensive open-ended red teaming to identify new risks.
  • Leveraging a diverse range of risk identification techniques, including threat modeling when appropriate, to adequately identify new threats.
Risk Identification
0 - No information available.

1 - Some risks are in scope of the risk management process. Some efforts of open-ended red teaming are reported, along with very basic threat and risk modeling.

2 - A number of risks are in the scope of the risk management process, but some important ones are missing. Significant efforts of open-ended red teaming are reported, along with significant threat modeling efforts.

3 - Most of the important and commonly discussed risks are in scope of the risk management process. Consequential red teaming is precisely reported, along with significant threat modeling and use of structured risk identification techniques.

4 - Nearly all the risks covered in the relevant literature are in scope of the risk management process. There is a methodology outlining how structured risk identification across the lifecycle is performed, precisely characterized red teaming (including from external parties) is carried out, along with advanced and broad threat and risk modeling.

5 - There is a comprehensive, continued, and detailed effort to ensure all risks are found and addressed. The red teaming and threat and risk modeling effort is extremely extensive, quantified, jointly integrated with structured risk identification efforts, and conducted with third parties.
Score: Moderate (2.5/5)

Best-in-class

  • Although OpenAI should provide more details, they are the first to include the study of new or understudied emerging risks: “Seeking out unknown-unknowns. We will continually run a process for identification and analysis (as well as tracking) of currently unknown categories of catastrophic risk as they emerge.”
  • OpenAI pioneered uplift study methodology through its in-depth analysis of the LLM-aided biological weapon creation threat model: “This evaluation aims to measure whether models could meaningfully increase malicious actors’ access to dangerous information about biological threat creation, compared to the baseline of existing resources (i.e., the internet)”.
  • OpenAI provides the most detailed account of any red-teaming procedure in their GPT-4o model card.

Highlights

  • OpenAI covers some imminent high-severity risks in its preparedness framework: Cybersecurity, CBRN threats, and Model Autonomy. It also includes persuasion as a relevant risk vector.
  • Although we encourage OpenAI to provide more details, we commend the inclusion of new or understudied emerging risks: “Seeking out unknown-unknowns. We will continually run a process for identification and analysis (as well as tracking) of currently unknown categories of catastrophic risk as they emerge.”
  • OpenAI conducts an in-depth analysis of the LLM-aided biological weapon creation threat model: “This evaluation aims to measure whether models could meaningfully increase malicious actors’ access to dangerous information about biological threat creation, compared to the baseline of existing resources (i.e., the internet)”.
  • The Red Teaming Network provides OpenAI with a wealth of expertise to uncover unexpected threats. Additionally, for the GPT-4o system card, they made significant red-teaming efforts with 100 external red teamers tasked with exploratory capability discovery and assessment of novel potential risks.
  • OpenAI analyzes how GPT models are used for malicious cyber activities and influence operations, which are attempts to manipulate public opinion or influence political outcomes.
  • OpenAI covers fairness and bias risks in the O1 system card.

Weaknesses

  • OpenAI does not clarify how they triage the vulnerabilities uncovered by red teaming and decide which are acceptable.

Risk Tolerance & Analysis

In risk tolerance and analysis, we assess whether the AI developers have defined:

  • A global risk tolerance.
  • Operational capability thresholds and the risk levels they correspond to, defined with precision and breadth.
  • Corresponding objectives for risk mitigation measures: AI developers should establish clear mitigation objectives, grounded in strong rationales (including threat modeling) that justify they are sufficient to address the identified risks and align with the organization's risk tolerance.
  • Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.
Global Risk Tolerance
0 - No information available.

1 - Global risk tolerance is qualitatively defined.
E.g., “Our system should not increase the likelihood of extinction risks”.

2 - Global risk tolerance is quantitatively defined for casualties.

3 - Global risk tolerance is quantitatively defined for casualties and economic damages, with adequate ranges and rationale for the decision.

4 - Global risk tolerance is quantitatively defined for casualties, economic damages, and other high-severity risks (e.g., large-scale manipulation of public opinion), with robust methodology and decision-making processes to decide the tolerance (e.g., public consultation).

5 - Global risk tolerance is clearly and quantitatively defined for all significant threats and risks known in the literature. Any significant deviations in risk tolerance from industry norms are clearly justified and explained (e.g., through a comprehensive benefit/cost analysis).
Score: Non Existent (0/5)

Global Risk Tolerance

Weaknesses

  • OpenAI does not state any global risk tolerance, even qualitatively.
Operational Risk Tolerance
0 - No information available.

1 - Some important capability thresholds are qualitatively defined and their corresponding mitigation objectives are qualitatively defined as well.

2 - Some important capability thresholds are precisely defined, and their corresponding mitigations are precisely defined as well.

3 - Almost all important hazardous capability thresholds and their corresponding mitigation objectives are precisely defined and grounded in extensive threat and risk modeling.

4 - All hazardous capabilities are precisely defined. The corresponding mitigation objectives are quantitatively defined and grounded in extensive threat and risk modeling. Assurance property targets are operationalized.
 
5 - All hazardous capabilities have a precisely defined threshold. Corresponding mitigation objectives are quantified and grounded in comprehensive threat and risk modeling with a clear and in-depth methodology. Assurance property targets are operationalized and justified.
Score: Weak (1.5/5)

Operational Risk Tolerance

Highlights

  • In its preparedness framework, OpenAI provides a relatively detailed qualitative description of four risk levels (low, medium, high, critical) for each of the four risk categories mentioned above.

Weaknesses

  • OpenAI sets capability thresholds significantly higher than other AI developers, without justification grounded in threat and risk modeling.
  • OpenAI does not quantitatively define capability thresholds.
  • OpenAI defines information security mitigation objectives qualitatively and vaguely: "We will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit "high" pre-mitigation risk)".
  • OpenAI mentions other mitigation objective thresholds only relative to risk thresholds: “As part of our baseline commitments, we are aiming to keep post-mitigation risk at “medium” risk or below”. However, it is crucial to define the mitigation objectives independently, supported by a threat modeling process that justifies how these objectives enable the organization to maintain risk levels below the established thresholds. For instance, OpenAI should define concrete deployment mitigation objectives. An example could be: 'Our monitoring detects 99% of cyber offense misuse attempts.' Furthermore, OpenAI should justify why this objective is sufficient for a given capability level.
  • OpenAI does not justify setting such a high bar for halting development compared to other industry standards, particularly given the absence of an explicit safety buffer: “Only models with a post-mitigation score of "high" or below can be developed further.” This is particularly concerning because above a “high” score, the model autonomy capability described is: “model can self-exfiltrate under current prevailing security”.
Evaluation Protocols
0 - No information available.

1 - Elements of the evaluation methodologies are described. The testing frequency is defined in terms of multiples of compute.

2 - The testing frequency is defined in terms of multiples of compute and there is a commitment to following it. The evaluation protocol is well-defined and includes relevant elicitation techniques. Independent third parties conduct pre-deployment evaluations with API access.

3 - The testing frequency is defined in terms of both multiples of compute and time and there is a commitment to following it. The evaluation protocol is well-defined and incorporates state-of-the-art elicitation techniques. A justification is provided demonstrating that these techniques are comprehensive enough to elicit capabilities that could be found and exercised by external actors. AI developers implement and justify measures (such as appropriate safety buffers) to ensure protocols can effectively detect capability threshold crossings. Independent third parties conduct pre-deployment evaluations with fine-tuning access.

4 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why the chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes state-of-the-art elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing.

5 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes relevant elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing and third parties are granted permission and resources to independently run their own evaluations, to verify the accuracy of the evaluation results.
Score: Weak (2/5)

Tolerance & Analysis Score = 1/4 × Global Risk Tolerance + 1/2 × Operational Risk Tolerance + 1/4 × Evaluation Protocols
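Written out with the sub-scores reported in this section (a quick arithmetic sketch, not the authors' published code), this weighting gives:

```python
# Weights stated above: 1/4 Global Risk Tolerance, 1/2 Operational Risk
# Tolerance, 1/4 Evaluation Protocols. Sub-scores are those reported above.
weights = {"global_tolerance": 0.25, "operational_tolerance": 0.50, "evaluation_protocols": 0.25}
subscores = {"global_tolerance": 0.0, "operational_tolerance": 1.5, "evaluation_protocols": 2.0}

tolerance_and_analysis = sum(weights[k] * subscores[k] for k in weights)
print(tolerance_and_analysis)  # 1.25 -- the page reports 1.2/5, presumably due to rounding of the displayed sub-scores
```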
Evaluation Protocols

Best-in-class

  • OpenAI has committed to the most frequent evaluations during scaling: every 2x increase in effective compute.

Highlights

  • OpenAI has committed to performing evaluations whenever there is more than a 2x increase in effective compute or a major algorithmic breakthrough. There is already evidence of emerging capabilities, such as in-context learning, substantially changing the risk profile and emerging fully over a 5x increase in compute (Olsson et al., 2022). Additionally, Claude 3.5 Sonnet shows capability levels and reported usability substantially higher than Claude 3 Opus with less than a 4x increase in compute, suggesting that a trigger below 4x is adequate (see the sketch after this list).
  • They sometimes give elements of the evaluation methodologies such as in the GPT-4o system card, for the cybersecurity evaluation: “We evaluated GPT-4o with iterative debugging and access to tools available in the headless Kali Linux distribution (with up to 30 rounds of tool use for each attempt).”
  • OpenAI performed extensive evaluation suites on the O1 model for CBRN risks, which include wet lab protocol evaluations, model-biotool integration evaluations, tacit knowledge acquisition, …
  • OpenAI conducted third-party pre-deployment evaluations with various organizations, including METR and Apollo Research.
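To illustrate the difference in evaluation cadence that these trigger sizes imply, here is a minimal sketch; the 16x scale-up figure is hypothetical and chosen only for illustration:

```python
def eval_rounds(total_multiplier: float, trigger: float) -> int:
    """Count evaluation rounds while scaling effective compute up to
    total_multiplier times the previous model's, evaluating at every
    trigger-fold increase."""
    rounds, compute = 0, 1.0
    while compute * trigger <= total_multiplier:
        compute *= trigger
        rounds += 1
    return rounds

# Hypothetical 16x effective-compute scale-up over the previous generation:
print(eval_rounds(16, 2))  # 4 rounds under OpenAI's 2x trigger
print(eval_rounds(16, 4))  # 2 rounds under a looser 4x trigger
```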

Weaknesses

  • OpenAI does not justify why their elicitation techniques suffice to elicit capabilities that external actors could obtain.
  • Despite stating in their Preparedness Framework that "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results," the O1 system card only mentions that “these indicator evaluations and the implied risk levels are reviewed by the Safety Advisory Group, which determines a risk level for each category”. The system card does not provide details about the composition of this group or clarify whether it meets the standard of "qualified, independent third-parties".
  • OpenAI does not specify a time-based frequency for conducting evaluations.
  • Even though OpenAI gave pre-deployment access to third-party evaluators, they do not show that these third parties were given enough resources to perform the evaluations properly. For example, METR only had access to O1-preview for 6 days.

Risk Mitigation

In risk mitigation, we assess whether:

  • The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
  • There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.
Containment Measures
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if not met) or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Score: Very Weak (1/5)

Containment Measures

Highlights

  • OpenAI includes a short section on some potential cybersecurity measures they might use, but it lacks commitment and clear justification for sufficiency.

Weaknesses

  • OpenAI provides very shallow reporting of information security measures.

Deployment Measures
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if not met) or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Score: Weak (1.25/5)

Deployment Measures

Highlights

  • In the GPT-4 system card, OpenAI mentions that they are continuously developing and improving their API filters.
  • OpenAI considers a range of different deployment tiers for different levels of risks.

Weaknesses

  • OpenAI provides no details on many mitigation measures: "OpenAI already has extensive safety processes in place both before and after deployment (e.g., system cards, red-teaming, refusals, jailbreak monitoring, etc.)".
  • OpenAI provides no evidence that these measures suffice to keep risks below the defined levels.
Assurance Properties
0 - No information available.

1 - Limited pursuit of some assurance properties, sparse evidence of how promising they are to reduce risks.

2 - Pursuit of some assurance properties along with research results indicating that they may be promising. Some of the key assumptions the assurance properties are operating under are stated.

3 - Pursuit of assurance properties, some evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. The assumptions the assurance properties are operating under are stated but some important ones are missing.

4 - Pursuit of assurance properties, solid evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. All the assumptions the assurance properties are operating under are stated.

5 - Broad consensus that one assurance property is likely to work, is being strongly pursued, and there is a strong case for it to be sufficient. All the assumptions the assurance properties are operating under are clearly stated and justified.
Score: Very Weak (1/5)

For Risk Mitigation, the three grades (Containment Measures, Deployment Measures, and Assurance Properties) carry equal weight.
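As a quick arithmetic check (illustrative only), averaging the three sub-scores reported in this section with equal weights reproduces the category grade:

```python
# Containment Measures, Deployment Measures, Assurance Properties -- equal weights.
subscores = [1.0, 1.25, 1.0]
risk_mitigation = sum(subscores) / len(subscores)
print(round(risk_mitigation, 1))  # 1.1 -- matches the reported 1.1/5
```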
Assurance Properties

Best-in-class

  • OpenAI provides clarity regarding some crucial assumptions: "It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems.”

Weaknesses

  • OpenAI no longer has its Superalignment team, nor a large fraction of the personnel who were working on ensuring that advanced AI systems are safe. It is therefore unclear whether they will be able to execute adequately on their initial plans.
Sections

Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.

References

The main source of information is their Preparedness Framework. Unless otherwise specified, all information and references are derived from this document.