Total Score: Very Weak (0.7/5)

Risk Identification: Weak (1.5/5)
Risk Tolerance & Analysis: Very Weak (0.1/5)
Risk Mitigation: Very Weak (0.3/5)

Score legend:
0 : Non Existent
0 - 1 : Very Weak
1 - 2 : Weak
2 - 3 : Moderate
3 - 4 : Substantial
4 - 5 : Strong

Risk Identification

In risk identification, we assess whether an AI developer is:

  • Appropriately addressing the risks outlined in the literature.
  • Conducting extensive open-ended red teaming to identify new risks.
  • Leveraging a diverse range of risk identification techniques, including threat modeling where appropriate, to adequately identify new threats.
Risk Identification
0 - No information available.

1 - Some risks are in scope of the risk management process. Some efforts of open-ended red teaming are reported, along with very basic threat and risk modeling.

2 - A number of risks are in the scope of the risk management process, but some important ones are missing. Significant efforts of open-ended red teaming are reported, along with significant threat modeling efforts.

3 - Most of the important and commonly discussed risks are in scope of the risk management process. Consequential red teaming is precisely reported, along with significant threat modeling and use of structured risk identification techniques.

4 - Nearly all the risks covered in the relevant literature are in scope of the risk management process. There is a methodology outlining how structured risk identification across the lifecycle is performed, precisely characterized red teaming (including from external parties) is carried out, along with advanced and broad threat and risk modeling.

5 - There is a comprehensive, continued, and detailed effort to ensure all risks are found and addressed. The red teaming and threat and risk modeling effort is extremely extensive, quantified, jointly integrated with structured risk identification efforts, and conducted with third parties.
Weak (1.5/5)

Best-in-Class

  • Meta developed CyberSecEval 2, the best publicly available benchmark to assess LLM capabilities relevant to cyber offense.
  • Meta created Rainbow Teaming, a state-of-the-art technique to automatically uncover model-specific vulnerabilities.

Highlights

  • In the Llama 3 model card, Meta mentions cybersecurity and CBRNE misuse risks. 
  • Meta has produced research on cybersecurity safety benchmarking in their CyberSecEval 2 paper.
  • Meta performs Rainbow Teaming and red-teaming to discover new risks, as described in the Llama 3 herd of models paper.

Weaknesses

  • Some of Meta's proposed mitigations reveal a lack of basic threat modeling. For instance, their cybersecurity risk mitigation measure asks threat actors to act responsibly on their own. Meta releases a separate model called "Llama Code Shield" to offer "mitigation of insecure code suggestions risk, code interpreter abuse prevention, and secure command execution". However, this mitigation is irrelevant in a misuse scenario, as threat actors will simply not use Llama Code Shield.
  • Meta addresses additional risks in their fine-tuned Llama models. However, since they also release the base model, most of these mitigation measures lack grounding in threat modeling. Malicious actors will use the base model without the fine-tuned mitigations, rendering these safety measures ineffective in real-world scenarios. An exception to this is the mitigation of biases, as relevant risk models involve non-malevolent actors unintentionally inducing biases in society through the deployment of an AI model. In this case, the mitigation is more effective, as these actors are likely to use the fine-tuned model with bias reduction measures in place.

Risk Tolerance & Analysis

In risk tolerance and analysis, we assess whether the AI developers have defined:

  • A global risk tolerance.
  • Operational capability thresholds and their corresponding risk levels. These must be defined with precision and breadth.
  • Corresponding objectives of risk mitigation measures: AI developers should establish clear objectives for risk mitigation measures. These objectives should be grounded in strong rationales, including threat modeling, to justify that they are sufficient to address the identified risks and align with the organization's risk tolerance.
  • Evaluation protocols detailing procedures for measuring the model's capabilities and ensuring that capability thresholds are not exceeded without detection.
Global Risk Tolerance
0 - No information available.

1 - Global risk tolerance is qualitatively defined.
E.g., “Our system should not increase the likelihood of extinction risks”.

2 - Global risk tolerance is quantitatively defined for casualties.

3 - Global risk tolerance is quantitatively defined for casualties and economic damages, with adequate ranges and rationale for the decision.

4 - Global risk tolerance is quantitatively defined for casualties, economic damages, and other high-severity risks (e.g., large-scale manipulation of public opinion), with robust methodology and decision-making processes to decide the tolerance (e.g., public consultation).

5 - Global risk tolerance is clearly and quantitatively defined for all significant threats and risks known in the literature. Any significant deviations in risk tolerance from industry norms are clearly justified and explained (e.g., through a comprehensive benefit/cost analysis).
Non Existent (0/5)

Global Risk Tolerance

Weaknesses

  • Meta does not state any global risk tolerance, even qualitatively.
Operational Risk Tolerance
0 - No information available.

1 - Some important capability thresholds are qualitatively defined and their corresponding mitigation objectives are qualitatively defined as well.

2 - Some important capability thresholds are precisely defined, and their corresponding mitigations are precisely defined as well.

3 - Almost all important hazardous capability thresholds and their corresponding mitigation objectives are precisely defined and grounded in extensive threat and risk modeling.

4 - All hazardous capabilities are precisely defined. The corresponding mitigation objectives are quantitatively defined and grounded in extensive threat and risk modeling. Assurance property targets are operationalized.
 
5 - All hazardous capabilities have a precisely defined threshold. Corresponding mitigation objectives are quantified and grounded in comprehensive threat and risk modeling with a clear and in-depth methodology. Assurance property targets are operationalized and justified.
Non Existent (0/5)

Operational Risk Tolerance

Weaknesses

  • Meta does not define any capability thresholds or mitigation objective thresholds for their base Llama 3 model. They do not specify the level of dangerous capabilities that would be unacceptable for publicly releasing weights. Additionally, Meta does not mention the required level of risk reduction that must be achieved through safety measures before releasing the model.
Evaluation Protocols
0 - No information available.

1 - Elements of the evaluation methodologies are described. The testing frequency is defined in terms of multiples of compute.

2 - The testing frequency is defined in terms of multiples of compute and there is a commitment to following it. The evaluation protocol is well-defined and includes relevant elicitation techniques. Independent third parties conduct pre-deployment evaluations with API access.

3 - The testing frequency is defined in terms of both multiples of compute and time, and there is a commitment to following it. The evaluation protocol is well-defined and incorporates state-of-the-art elicitation techniques. A justification is provided demonstrating that these techniques are comprehensive enough to elicit capabilities that could be found and exercised by external actors. AI developers implement and justify measures (such as appropriate safety buffers) to ensure protocols can effectively detect capability threshold crossings. Independent third parties conduct pre-deployment evaluations with fine-tuning access.

4 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it, and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes state-of-the-art elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing.

5 - The testing frequency is defined in terms of both multiples of compute and time. There is a commitment to following it and a rationale is provided for why this chosen frequency is sufficient to detect significant capability changes. The evaluation protocol is well-defined and includes relevant elicitation techniques. The protocols are vetted by third parties to ensure that they are sufficient to detect threshold trespassing and third parties are granted permission and resources to independently run their own evaluations, to verify the accuracy of the evaluation results.
Very Weak (0.5/5)

Tolerance & Analysis Score = 1/4 × Global Risk Tolerance + 1/2 × Operational Risk Tolerance + 1/4 × Evaluation Protocols
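As a quick illustration of this weighted average (a minimal sketch; the function name is ours, and the inputs are the sub-scores reported on this page: Global Risk Tolerance 0, Operational Risk Tolerance 0, Evaluation Protocols 0.5):

```python
def tolerance_and_analysis_score(global_tol: float,
                                 operational_tol: float,
                                 eval_protocols: float) -> float:
    """Weighted average with the stated 1/4, 1/2, 1/4 weights."""
    return 0.25 * global_tol + 0.5 * operational_tol + 0.25 * eval_protocols

# Meta's sub-scores from this assessment: 0, 0, and 0.5 out of 5.
score = tolerance_and_analysis_score(0.0, 0.0, 0.5)
print(round(score, 1))  # 0.1
```

Rounded to one decimal, this reproduces the 0.1/5 category score shown above.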
Evaluation Protocols

Highlights

  • Meta has performed uplift testing for cyber attacks and chemical and biological weapons. The chemical and biological weapons testing specifies some elicitation techniques such as the use of Python, Wolfram Alpha, and RAG. CBRNE subject matter experts validated the results.

Weaknesses

  • Meta does not communicate any specific testing frequency for their evaluation protocols.
  • Due to the lack of defined capability thresholds, Meta's evaluation protocols lack clear evaluation targets for determining when model capabilities become potentially dangerous.


Risk Mitigation

In risk mitigation, we assess whether:

  • The proposed risk mitigation measures, which include both deployment and containment strategies, are well-planned and clearly specified.
  • There is a strong case for assurance properties to actually reduce risks, and the assumptions these properties are operating under are clearly stated.
Containment Measures
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown to be highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Non Existent (0/5)

Containment Measures

Weaknesses

  • Meta does not provide any specific information on their containment measures. It is important to note that containment measures would only be relevant if their pre-deployment testing measures were comprehensive and effective. This would involve using clear risk thresholds and rigorous testing procedures to establish a well-defined decision-making framework for determining whether releasing model weights is appropriate.

Deployment Measures
0 - No information available.

1 - Vague description of the countermeasures and no commitment to follow them. No evidence that they are sufficient to reduce risks below defined levels.
 
2 - Clearly defined countermeasures are planned to be used by default. There is preliminary qualitative evidence of effectiveness.

3 - Sufficiency is demonstrated through self-reporting, or by using methods that have been shown to be highly effective in similar contexts. Evaluations required to assess future sufficiency are under development (with a conditional policy to stop development or deployment if they are not met), or there is a commitment to use methods that have been shown to be effective in future contexts.

4 - Third-parties have certified the effectiveness of a fixed set of countermeasures against current and near-future threats, and check that current efforts are on track to sufficiently mitigate the risk from future systems.

5 - Concrete countermeasures are described and vetted. There is a commitment to apply them beyond certain risk thresholds, and there is broad consensus that they are sufficient to reduce risk for both current and future systems.
Very Weak (0.5/5)

Deployment Measures

Highlights

  • Meta has implemented some reporting mechanisms, stating: “we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community.”

Weaknesses

  • Meta's decision to publicly release the weights of their large language models means that very few deployment measures are applicable to control or monitor the use of these models once they are released. Note that releasing the weights of a model is not inherently problematic, provided that a thorough threat and risk modeling process has been conducted to assess the potential risks associated with making the model publicly available. This rigor is particularly important given the irreversible nature of releasing the weights.
Assurance Properties
0 - No information available.

1 - Limited pursuit of some assurance properties, sparse evidence of how promising they are to reduce risks.

2 - Pursuit of some assurance properties along with research results indicating that they may be promising. Some of the key assumptions the assurance properties are operating under are stated.

3 - Pursuit of assurance properties, some evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. The assumptions the assurance properties are operating under are stated but some important ones are missing.

4 - Pursuit of assurance properties, solid evidence of how promising they are, and a clear case for one of the research directions being sufficient for a positive safety case. All the assumptions the assurance properties are operating under are stated.

5 - Broad consensus that one assurance property is likely to work, is being strongly pursued, and there is a strong case for it to be sufficient. All the assumptions the assurance properties are operating under are clearly stated and justified.
Very Weak (0.5/5)

For Risk Mitigation, the three grades (Containment Measures, Deployment Measures, and Assurance Properties) are weighted equally.
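The equal weighting amounts to a simple mean of the three grades (a minimal sketch; the function name is ours, and the inputs are the grades reported on this page: Containment 0, Deployment 0.5, Assurance Properties 0.5):

```python
def risk_mitigation_score(containment: float,
                          deployment: float,
                          assurance: float) -> float:
    """Equal-weight mean of the three Risk Mitigation grades."""
    return (containment + deployment + assurance) / 3

# Meta's grades from this assessment: 0, 0.5, and 0.5 out of 5.
score = risk_mitigation_score(0.0, 0.5, 0.5)
print(round(score, 1))  # 0.3
```

Rounded to one decimal, this reproduces the 0.3/5 Risk Mitigation score shown above.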
Assurance Properties

Highlights

  • Meta is conducting research on a novel architecture called JEPA. Some evidence shows that JEPA learns higher-level representations which could plausibly enhance interpretability and robustness.

Weaknesses

  • Even though JEPA could potentially enhance interpretability and robustness, Meta does not discuss these aspects, nor does it research assurance properties: safety guarantees that become necessary once models achieve dangerous capabilities.

Sections

Best-in-class: These are elements where the company outperforms all the others. They represent industry-leading practices.
Highlights: These are the company's strongest points within the category, justifying its current grade.
Weaknesses: These are the areas that prevent the company from achieving a higher score.