OpenAI

Total Score: Weak (1.6/5)
Risk Identification: Moderate (2.5/5)
Risk Tolerance & Analysis: Weak (1.2/5)
Risk Mitigation: Weak (1.1/5)

Rating scale:
0 : Non Existent
0 - 1 : Very Weak
1 - 2 : Weak
2 - 3 : Moderate
3 - 4 : Substantial
4 - 5 : Strong
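For reference, the legend above can be read as a simple score-to-label mapping. The sketch below is illustrative only (not the authors' published code); placing exact boundary values such as 2.0 in the lower band is an assumption, consistent with the 2/5 score labelled "Weak" later in this report:

```python
def rating_label(score: float) -> str:
    """Map a 0-5 numeric score to the qualitative label used in this report."""
    if score == 0:
        return "Non Existent"
    if score <= 1:
        return "Very Weak"
    if score <= 2:
        return "Weak"
    if score <= 3:
        return "Moderate"
    if score <= 4:
        return "Substantial"
    return "Strong"

# Labels reported for OpenAI in this assessment:
for name, score in [("Total", 1.6), ("Risk Identification", 2.5),
                    ("Risk Tolerance & Analysis", 1.2), ("Risk Mitigation", 1.1)]:
    print(f"{name}: {rating_label(score)} ({score}/5)")
```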

Risk Identification

In risk identification, we assess whether an AI developer is:

  • Approaching the risks outlined in the literature in an appropriate way.
  • Doing extensive open-ended red teaming to identify new risks.
  • Leveraging a diverse range of risk identification techniques, including threat modeling when appropriate, to adequately identify new threats.
Risk Identification
0 - No information available.

1 - Some risks are in scope of the risk management process. Some efforts of open-ended red teaming are reported, along with very basic threat and risk modeling.

2 - A number of risks are in the scope of the risk management process, but some important ones are missing. Significant efforts of open-ended red teaming are reported, along with significant threat modeling efforts.

3 - Most of the important and commonly discussed risks are in scope of the risk management process. Consequential red teaming is precisely reported, along with significant threat modeling and use of structured risk identification techniques.

4 - Nearly all the risks covered in the relevant literature are in scope of the risk management process. There is a methodology outlining how structured risk identification across the lifecycle is performed, precisely characterized red teaming (including from external parties) is carried out, along with advanced and broad threat and risk modeling.

5 - There is a comprehensive, continued, and detailed effort to ensure all risks are found and addressed. The red teaming and threat and risk modeling effort is extremely extensive, quantified, jointly integrated with structured risk identification efforts, and conducted with third parties.
Score: Moderate (2.5/5)

Best-in-class

  • Although OpenAI should provide more details, they are the first to include the study of new or understudied emerging risks: “Seeking out unknown-unknowns. We will continually run a process for identification and analysis (as well as tracking) of currently unknown categories of catastrophic risk as they emerge.”
  • OpenAI pioneered uplift study methodology through its in-depth analysis of the LLM-aided biological weapon creation threat model: “This evaluation aims to measure whether models could meaningfully increase malicious actors’ access to dangerous information about biological threat creation, compared to the baseline of existing resources (i.e., the internet)”.
  • OpenAI provides the most detailed account of any red-teaming procedure in their GPT-4o model card.

Highlights

  • OpenAI covers some imminent high-severity risks in its preparedness framework: Cybersecurity, CBRN threats, and Model Autonomy. It also includes persuasion as a relevant risk vector.
  • Although we encourage OpenAI to provide more details, we commend the inclusion of new or understudied emerging risks: “Seeking out unknown-unknowns. We will continually run a process for identification and analysis (as well as tracking) of currently unknown categories of catastrophic risk as they emerge.”
  • OpenAI conducts an in-depth analysis of the LLM-aided biological weapon creation threat model: “This evaluation aims to measure whether models could meaningfully increase malicious actors’ access to dangerous information about biological threat creation, compared to the baseline of existing resources (i.e., the internet)”.
  • The Red Teaming Network provides OpenAI with a wealth of expertise to uncover unexpected threats. Additionally, for the GPT-4o system card, they made significant red-teaming efforts with 100 external red teamers tasked with exploratory capability discovery and assessment of novel potential risks.
  • OpenAI analyzes how GPT models are used for malicious cyber activities and influence operations, which are attempts to manipulate public opinion or influence political outcomes.
  • OpenAI covers fairness and bias risks in the O1 system card.

Weaknesses

  • OpenAI does not clarify how they triage the vulnerabilities uncovered by red teaming and decide which are acceptable.

Risk Tolerance & Analysis

In risk tolerance and analysis, we assess whether the AI developers have defined:

  • A global risk tolerance.
  • Operational capability thresholds and the risk levels they correspond to, defined with precision and breadth.
  • Corresponding objectives for risk mitigation measures: AI developers should establish clear mitigation objectives, grounded in strong rationales (including threat modeling) that justify they are sufficient to address the identified risks and align with the organization's risk tolerance.
  • Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.
Global Risk Tolerance
0 - No information available.

1 - Global risk tolerance is qualitatively defined.
E.g., “Our system should not increase the likelihood of extinction risks”.

2 - Global risk tolerance is quantitatively defined for casualties.

3 - Global risk tolerance is quantitatively defined for casualties and economic damages, with adequate ranges and rationale for the decision.

4 - Global risk tolerance is quantitatively defined for casualties, economic damages, and other high-severity risks (e.g., large-scale manipulation of public opinion), with robust methodology and decision-making processes to decide the tolerance (e.g., public consultation).

5 - Global risk tolerance is clearly and quantitatively defined for all significant threats and risks known in the literature. Any significant deviations in risk tolerance from industry norms are clearly justified and explained (e.g., through a comprehensive benefit/cost analysis).
Score: Non Existent (0/5)

Global Risk Tolerance

Weaknesses

  • OpenAI does not state any global risk tolerance, even qualitatively.
Operational Risk Tolerance
0 - No information available.

1 - Some important capability thresholds are qualitatively defined and their corresponding mitigation objectives are qualitatively defined as well.

2 - Some important capability thresholds are precisely defined, and their corresponding mitigations are precisely defined as well.

3 - Almost all important hazardous capability thresholds and their corresponding mitigation objectives are precisely defined and grounded in extensive threat and risk modeling.

4 - All hazardous capabilities are precisely defined. The corresponding mitigation objectives are quantitatively defined and grounded in extensive threat and risk modeling. Assurance property targets are operationalized.
 
5 - All hazardous capabilities have a precisely defined threshold. Corresponding mitigation objectives are quantified and grounded in comprehensive threat and risk modeling with a clear and in-depth methodology. Assurance property targets are operationalized and justified.
Score: Weak (1.5/5)

Operational Risk Tolerance

Highlights

  • In its preparedness framework, OpenAI provides a relatively detailed qualitative description of four risk levels (low, medium, high, critical) for each of the four risk categories mentioned above.

Weaknesses

  • OpenAI sets capability thresholds significantly higher than other AI developers, without justification grounded in threat and risk modeling.
  • OpenAI does not quantitatively define capability thresholds.
  • OpenAI defines information security mitigation objectives qualitatively and vaguely: "We will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit "high" pre-mitigation risk)".
  • OpenAI mentions other mitigation objective thresholds only relative to risk thresholds: “As part of our baseline commitments, we are aiming to keep post-mitigation risk at “medium” risk or below”. However, it is crucial to define the mitigation objectives independently, supported by a threat modeling process that justifies how these objectives enable the organization to maintain risk levels below the established thresholds. For instance, OpenAI should define concrete deployment mitigation objectives. An example could be: 'Our monitoring detects 99% of cyber offense misuse attempts.' Furthermore, OpenAI should justify why this objective is sufficient for a given capability level.
  • OpenAI does not justify setting such a high bar for halting development compared to other industry standards, particularly given the absence of an explicit safety buffer: “Only models with a post-mitigation score of "high" or below can be developed further.” This is particularly concerning because above a “high” score, the model autonomy capability described is: “model can self-exfiltrate under current prevailing security”.
Evaluation Protocols
0 - No information available.

1 - Elements of the evaluation methodologies are described. The testing frequency is defined in terms of multiples of compute.

2 - The testing frequency is defined in terms of multiples of compute and there is a commitment to following it. The evaluation protocol is well-defined and includes relevant elicitation techniques. Independent third parties conduct pre-deployment evaluations with API access.

3 - The testing frequency is defined in terms of both multiples of compute and time and there is a commitment to following it. The evaluation protocol is well-defined and incorporates state-of-the-art elicitation techniques. A justification is provided demonstrating that these techniques are comprehensive enough to elicit capabilities that could be found and exercised by external actors. AI developers implement and justify measures (such as appropriate safety buffers) to ensure protocols can effectively detect capability threshold crossings. Independent third parties conduct pre-deployment evaluations with fine-tuning access.

4 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why the chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes state-of-the-art elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing.

5 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes relevant elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing and third parties are granted permission and resources to independently run their own evaluations, to verify the accuracy of the evaluation results.
Score: Weak (2/5)

Tolerance & Analysis Score = 1/4 × Global Risk Tolerance + 1/2 × Operational Risk Tolerance + 1/4 × Evaluation Protocols
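Written out with the sub-scores reported in this section (a quick arithmetic sketch, not the authors' published code), this weighting gives:

```python
# Weights stated above: 1/4 Global Risk Tolerance, 1/2 Operational Risk
# Tolerance, 1/4 Evaluation Protocols. Sub-scores are those reported above.
weights = {"global_tolerance": 0.25, "operational_tolerance": 0.50, "evaluation_protocols": 0.25}
subscores = {"global_tolerance": 0.0, "operational_tolerance": 1.5, "evaluation_protocols": 2.0}

tolerance_and_analysis = sum(weights[k] * subscores[k] for k in weights)
print(tolerance_and_analysis)  # 1.25 -- the page reports 1.2/5, presumably due to rounding of the displayed sub-scores
```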
Evaluation Protocols

Best-in-class

  • OpenAI has committed to the most frequent evaluations during scaling: every 2x increase in effective compute.

Highlights

  • OpenAI has committed to performing evaluations whenever there is more than a 2x increase in effective compute or a major algorithmic breakthrough. There is already evidence of emerging capabilities, such as in-context learning, substantially changing the risk profile and emerging fully over a 5x increase in compute (Olsson et al., 2022). Additionally, Claude 3.5 Sonnet shows capability levels and reported usability substantially higher than Claude 3 Opus with less than a 4x increase in compute, suggesting that a trigger below 4x is adequate (see the sketch after this list).
  • They sometimes give elements of the evaluation methodologies such as in the GPT-4o system card, for the cybersecurity evaluation: “We evaluated GPT-4o with iterative debugging and access to tools available in the headless Kali Linux distribution (with up to 30 rounds of tool use for each attempt).”
  • OpenAI performed extensive evaluation suites on the O1 model for CBRN risks, which include wet lab protocol evaluations, model-biotool integration evaluations, tacit knowledge acquisition, …
  • OpenAI conducted third-party pre-deployment evaluations with various organizations, including METR and Apollo Research.
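To illustrate the difference in evaluation cadence that these trigger sizes imply, here is a minimal sketch; the 16x scale-up figure is hypothetical and chosen only for illustration:

```python
def eval_rounds(total_multiplier: float, trigger: float) -> int:
    """Count evaluation rounds while scaling effective compute up to
    total_multiplier times the previous model's, evaluating at every
    trigger-fold increase."""
    rounds, compute = 0, 1.0
    while compute * trigger <= total_multiplier:
        compute *= trigger
        rounds += 1
    return rounds

# Hypothetical 16x effective-compute scale-up over the previous generation:
print(eval_rounds(16, 2))  # 4 rounds under OpenAI's 2x trigger
print(eval_rounds(16, 4))  # 2 rounds under a looser 4x trigger
```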

Weaknesses

  • OpenAI does not justify why their elicitation techniques suffice to elicit capabilities that external actors could obtain.
  • Despite stating in their Preparedness Framework that "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results," the O1 system card only mentions that “these indicator evaluations and the implied risk levels are reviewed by the Safety Advisory Group, which determines a risk level for each category”. The system card does not provide details about the composition of this group or clarify whether it meets the standard of "qualified, independent third-parties".
  • OpenAI does not specify a time-based frequency for conducting evaluations.
  • Even though OpenAI gave pre-deployment access to third-party evaluators, they do not show that these third parties were given enough resources to perform the evaluations properly. For example, METR only had access to O1-preview for 6 days.

Risk Mitigation

In risk mitigation, we assess whether:

  • The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
  • There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.
Containment Measures
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if not met) or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Score: Very Weak (1/5)

Containment Measures

Highlights

  • OpenAI includes a short section on some potential cybersecurity measures they might use, but it lacks commitment and clear justification for sufficiency.

Weaknesses

  • OpenAI provides very shallow reporting of information security measures.

Deployment Measures
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if not met) or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Score: Weak (1.25/5)

Deployment Measures

Highlights

  • In the GPT-4 system card, OpenAI mentions that they are continuously developing and improving their API filters.
  • OpenAI considers a range of different deployment tiers for different levels of risks.

Weaknesses

  • OpenAI provides no details on many mitigation measures: "OpenAI already has extensive safety processes in place both before and after deployment (e.g., system cards, red-teaming, refusals, jailbreak monitoring, etc.)".
  • OpenAI provides no evidence that these measures suffice to keep risks below the defined levels.
Assurance Properties
0 - No information available.

1 - Limited pursuit of some assurance properties, sparse evidence of how promising they are to reduce risks.

2 - Pursuit of some assurance properties along with research results indicating that they may be promising. Some of the key assumptions the assurance properties are operating under are stated.

3 - Pursuit of assurance properties, some evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. The assumptions the assurance properties are operating under are stated but some important ones are missing.

4 - Pursuit of assurance properties, solid evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. All the assumptions the assurance properties are operating under are stated.

5 - Broad consensus that one assurance property is likely to work, is being strongly pursued, and there is a strong case for it to be sufficient. All the assumptions the assurance properties are operating under are clearly stated and justified.
Score: Very Weak (1/5)

For Risk Mitigation, the three grades (Containment Measures, Deployment Measures, and Assurance Properties) carry equal weight.
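As a quick arithmetic check (illustrative only), averaging the three sub-scores reported in this section with equal weights reproduces the category grade:

```python
# Containment Measures, Deployment Measures, Assurance Properties -- equal weights.
subscores = [1.0, 1.25, 1.0]
risk_mitigation = sum(subscores) / len(subscores)
print(round(risk_mitigation, 1))  # 1.1 -- matches the reported 1.1/5
```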
Assurance Properties

Best-in-class

  • OpenAI provides clarity regarding some crucial assumptions: "It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems.”

Weaknesses

  • OpenAI no longer has its Superalignment team, nor a large fraction of the personnel who were working on ensuring that advanced AI systems are safe. It is therefore unclear whether they will be able to execute adequately on their initial plans.
Sections

Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.

References

The main source of information is their Preparedness Framework. Unless otherwise specified, all information and references are derived from this document.