Best in class
- Cohere has an unusually clear risk prioritization process: they not only assess risks by likelihood and severity, but also describe specifically how they determine which risks to focus on.
They don’t list ex ante the specific risk domains their risk management process focuses on; rather, risk domains are identified for particular customers and use cases. In practice, their risk domains focus on malicious use and bias, with examples in cybersecurity, child sexual exploitation, and discrimination. More detail is required on why they chose to focus on these issues and how they came to identify these risks, especially as they differ from the industry standard.
They explicitly do not consider CBRN or loss of control risks, and explicitly do not consider “potential future risks associated with LLMs”. This is a serious limitation that requires strong justification: given that the harms from loss of control or CBRN misuse could be substantial, declining to monitor these risks at all demands a high degree of confidence. However, 1.1.2 scores less than 50%. Further, it suggests they have not engaged with the literature; for instance, these risks are emphasized in documents such as the International AI Safety Report and current drafts of the EU AI Act Codes of Practice.
“One approach to risk assurance in the AI industry is focused on risks described as catastrophic or severe, such as capabilities related to radiological and nuclear weapons, autonomy, and self-replication. In this context, thresholds relating to these potential catastrophic risks are developed, and the approach described in safety frameworks is designed to assess risks that are speculated to arise when models attain specific capabilities, such as the ability to perform autonomous research or facilitate biorisk. The models are then deemed to present “unacceptable” levels of risk when certain capability levels are attained. While it is important to consider long-term, potential future risks associated with LLMs and the systems in which they are deployed, studies regarding the likelihood of these capabilities arising and leading to real-world harm are limited in their methodological maturity and transparency, often lacking clear theoretical threat models or developed empirical methods due to their nascency. For example, existing research into how LLMs may increase biorisks fails to account for entire risk chains beyond access to information, and does not systematically compare LLMs to other information access tools, such as the internet. More work is needed to develop methods for assessing these types of threats more reliably.” (pp. 14-15)
“Cohere’s approach to risk assurance, and to determining when models and systems are sufficiently safe and secure to be made available to our customers, is focused on risks that are known, measurable, or observable today” (p. 15)
“Limitations in training data, such as unrepresentative data distributions, historically outdated representations, or an imbalance between harmful patterns and attributes on the one hand and positive patterns and attributes on the other, also impact model capabilities. If these limitations are not mitigated, models can output harmful content, such as hateful or violent content, or child sexual exploitation and abuse material (CSAM).
We therefore focus our secure AI work on risks that have a high likelihood of occurring based on the types of tasks LLMs are highly performant in, as well as the limitations inherent in how these models function. This is what we refer to as “model capabilities.”
We place potential risks arising from LLM capabilities into one of two categories:
“Cohere consistently reviews state-of-the-art research and industry practice regarding the risks associated with AI, and uses this to determine our priorities. At Cohere, risks to our systems are identified through a list of continuously-expanding techniques, including:
Potential Harm: Outputs that result in a discriminatory outcome, insecure code, child sexual exploitation and abuse, malware.
“The examples provided above consider the likelihood and severity of potential harms in the enterprise contexts in which Cohere models are deployed. A similar assessment of potential harms from the same models deployed in contexts such as a consumer chatbot would result in a different risk profile.” (p. 8)
“Preventing the generation of harmful outputs involves testing and evaluation techniques to control the types of harmful output described in Section 1, for example, child sexual abuse material (CSAM), targeted violence and hate, outputs that result in discriminatory outcomes for protected groups, or insecure code.” (p. 11)
They explicitly do not consider CBRN or loss of control risks, and explicitly do not consider “potential future risks associated with LLMs”, giving the justification that “studies regarding the likelihood of these capabilities arising and leading to real-world harm are limited in their methodological maturity and transparency, often lacking clear theoretical threat models or developed empirical methods due to their nascency.” However, this reasoning requires more documentation and justification, for instance citing the studies in question and explaining why they consider them limited. Excluding a risk that is established in taxonomies and the literature carries a high burden of proof.
“Cohere’s approach to risk assurance, and to determining when models and systems are sufficiently safe and secure to be made available to our customers, is focused on risks that are known, measurable, or observable today” (p. 15)
“One approach to risk assurance in the AI industry is focused on risks described as catastrophic or severe, such as capabilities related to radiological and nuclear weapons, autonomy, and self-replication. In this context, thresholds relating to these potential catastrophic risks are developed, and the approach described in safety frameworks is designed to assess risks that are speculated to arise when models attain specific capabilities, such as the ability to perform autonomous research or facilitate biorisk. The models are then deemed to present “unacceptable” levels of risk when certain capability levels are attained. While it is important to consider long-term, potential future risks associated with LLMs and the systems in which they are deployed, studies regarding the likelihood of these capabilities arising and leading to real-world harm are limited in their methodological maturity and transparency, often lacking clear theoretical threat models or developed empirical methods due to their nascency. For example, existing research into how LLMs may increase biorisks fails to account for entire risk chains beyond access to information, and does not systematically compare LLMs to other information access tools, such as the internet. More work is needed to develop methods for assessing these types of threats more reliably.” (pp. 14-15)
The framework doesn’t mention any pre-deployment procedures to identify novel risk domains or risk models for the frontier model. To improve, they should commit to such a process for identifying either novel risk domains or novel risk models/changed risk profiles within pre-specified risk domains (e.g., the emergence of extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources, and required expertise.
No relevant quotes found.
The framework doesn’t mention any third-party procedures pre-deployment to identify novel risk domains or risk models for the frontier model.
There is mention of multi-disciplinary red teaming and consultation of domain experts during the “Training, evaluation and testing” stage of model development. However, this is not explicitly for the purpose of identifying novel risks, and criteria for expertise are not given.
To improve, they should commit to an external process for identifying either novel risk domains or novel risk models/changed risk profiles within pre-specified risk domains (e.g., the emergence of extended context length enabling improved zero-shot learning, which changes the risk profile), and provide the methodology, resources, and required expertise.
“Multi-disciplinary red teaming […] Consultation of domain experts” (p. 13)
There is some evidence of risk modeling being conducted, including consideration of use cases and the potential likelihood and severity of harms arising from them.
More evidence of a structured process for this risk modeling should be given, including the methodology, the experts involved, and the lists of identified threat scenarios. More detail is required on the step-by-step causal pathway from these scenarios to harm, as well as justification that adequate effort has been exerted to systematically map out all possible risk pathways. Risk models should be published.
“At Cohere, risks to our systems are identified through a list of continuously-expanding techniques, including: […] Internal threat modeling, which includes a review of how our customers interact with and use our models, to proactively identify potential threats and implement specific counter measures before deployment” (p. 6)
There is no methodology for risk modeling defined.
No relevant quotes found.
No mention of risks identified during open-ended red teaming or evaluations triggering further risk modeling.
No relevant quotes found.
There is a clear assessment and subsequent prioritization of risk models representing the most severe and probable harms. This appears to be from the full space of risk models. However, more detail on the scores given for likelihood and severity of different risk models should be published.
“We identify risks by first assessing potential risks arising from our models’ capabilities and the systems in which they may be deployed. We then assess the likelihood and severity of potential harms that may arise in enterprise contexts from the identified risks.” (p. 5)
“We therefore focus our secure AI work on risks that have a high likelihood of occurring based on the types of tasks LLMs are highly performant in, as well as the limitations inherent in how these models function. This is what we refer to as “model capabilities.” (p. 5)
The columns given are: use case, likelihood of harm in context, and severity of harm in context. For instance: “Insecure Code. Code generation for enterprise developers managing a company’s proprietary data within on-premises servers. Medium to High possibility of a vulnerability being introduced into company code. Medium to High [severity of harm in context], depending on the nature of the vulnerability introduced and the type of data handled by the company. Severe vulnerabilities can leave companies vulnerable to cyber attacks affecting individuals and society.”
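To illustrate what a published, explicit likelihood-and-severity prioritization could look like, the following is a minimal sketch; the risk models, scale, and scores are hypothetical and are not taken from Cohere’s framework.

```python
# Illustrative sketch only: ordinal likelihood x severity prioritization.
# Risk models, levels, and scores are hypothetical, not Cohere's actual values.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

risk_models = [
    # (risk model, likelihood in context, severity in context)
    ("Insecure code suggestion reaches production", "High", "High"),
    ("Discriminatory output in a hiring-support use case", "Medium", "High"),
    ("Harmful content slips past output guardrails", "Medium", "Medium"),
]

def priority(likelihood: str, severity: str) -> int:
    """Simple ordinal product; a published methodology would define and justify this explicitly."""
    return LEVELS[likelihood] * LEVELS[severity]

# Rank risk models from highest to lowest priority.
for name, lik, sev in sorted(risk_models, key=lambda r: -priority(r[1], r[2])):
    print(f"{priority(lik, sev)}  {name}  (likelihood={lik}, severity={sev})")
```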
There is no evidence that third parties validate risk models.
No relevant quotes found.
Their risk tolerance for when the residual risk is “acceptable” is if there are “no significant regressions [demonstrated in evaluations and tests] compared to our previously launched model versions.” Risk tolerances are also allowed to differ based on the customer: “analysis of whether a model is “acceptable” from a risk management perspective must be adapted to the customer context”.
However, this risk tolerance is still vague and leaves Cohere considerable discretion. To improve, they should predefine a risk tolerance that applies to all models, expressed in terms of the probability of harms of a given severity.
“We consider models safe and secure to launch when our evaluations and tests demonstrate no significant regressions compared to our previously launched model versions, so that performance and security is maintained or improved for every new significant model version. This is Cohere’s bright line for determining when a model is “acceptable” from a risk management perspective and ready to be launched.” (p. 16)
“In this way, the analysis of whether a model is “acceptable” from a risk management perspective must be adapted to the customer context, and must be able to adapt to new requirements or needs that emerge post-deployment. Assurance here means working with our customers to ensure that our models and systems conform to their risk management obligations and standards.” (p. 17)
The risk tolerance, implicit or otherwise, is not expressed fully or partly quantitatively. To improve, the risk tolerance should be expressed fully quantitatively or as a combination of scenarios with probabilities.
No relevant quotes found.
The risk tolerance, implicit or otherwise, is not expressed fully or partly quantitatively. To improve, the risk tolerance should be expressed fully quantitatively or as a combination of scenarios with probabilities.
No relevant quotes found.
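For illustration of the recommendation above, a risk tolerance expressed as a combination of scenarios with probabilities could take a form along the following lines; the severity categories and numbers are placeholders, not values Cohere has stated.

```python
# Hypothetical sketch of a quantitatively expressed risk tolerance: the maximum
# acceptable annual probability of a model-attributable incident, by severity
# category. All categories and numbers are illustrative placeholders.
risk_tolerance = {
    "moderate harm (e.g., recoverable financial or reputational damage)": 1e-2,
    "major harm (e.g., widespread security compromise)": 1e-4,
    "catastrophic harm (e.g., large-scale loss of life)": 1e-6,
}

def within_tolerance(estimated_annual_probabilities: dict) -> bool:
    """Check each severity category's estimated probability against the tolerance."""
    return all(estimated_annual_probabilities[cat] <= cap
               for cat, cap in risk_tolerance.items())

# Example: estimates one order of magnitude below each cap are within tolerance.
print(within_tolerance({cat: cap / 10 for cat, cap in risk_tolerance.items()}))  # True
```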
No evidence of engaging in public consultations or seeking guidance from regulators for risk tolerance.
No relevant quotes found.
No justification process: there is no evidence of considering whether their approach aligns with or deviates from established norms.
No relevant quotes found.
Implicit KRI assessments are conducted, but the KRI thresholds are not given. To improve, thresholds that would trigger mitigations should be developed. The KRIs should also be grounded in risk modeling.
Key risks are listed by lifecycle stage: “Data acquisition and preparation”, “Training, evaluations and testing”, “Deployment and maintenance”, and “Improvement and further fine-tuning”.
“Multi-faceted evaluations, including standard benchmarks and proprietary evaluations based on identified possible harms and harm reduction objectives” (p. 13)
There is no evidence of KRI thresholds being quantitatively defined.
No relevant quotes found.
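As an illustration of what quantitatively defined KRI thresholds could look like, a minimal sketch follows; the indicator names, metrics, and numbers are hypothetical and do not appear in Cohere’s framework.

```python
# Hypothetical quantitative KRI thresholds whose crossing would trigger mitigations.
# Indicator names, metrics, and numbers are illustrative only.
kri_thresholds = {
    "insecure_code_rate": {
        "metric": "fraction of generated code snippets flagged by a static-analysis suite",
        "threshold": 0.05,   # crossing 5% would trigger additional mitigations
    },
    "guardrail_bypass_rate": {
        "metric": "fraction of adversarial prompts yielding disallowed content",
        "threshold": 0.02,
    },
}

def kri_crossed(name: str, measured_value: float) -> bool:
    """A KRI is crossed when the measured value reaches its threshold."""
    return measured_value >= kri_thresholds[name]["threshold"]

print(kri_crossed("insecure_code_rate", 0.07))  # True -> mitigations required
```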
“Unexpected post-deployment usage patterns that were not accounted for and result in unmitigated risk” are described as a key risk to track during the deployment and maintenance stage. However, a threshold which triggers mitigations should be defined.
Key Risks: “Unexpected post-deployment usage patterns that were not accounted for and result in unmitigated risk” (p. 13)
There is evidence of aligning to a standard, i.e. SOC 2 Type II, but this is not tied to a specific KRI threshold, and it is not clear how this threshold would differ as model risks vary.
“We align our program to SOC 2 Type II and other recognized frameworks, and we rigorously monitor the health and performance of our security controls throughout the year, performing real-time corrective action when needed.” (p. 9)
There is no evidence of a quantitative containment KCI threshold.
No relevant quotes found.
There are “goals” for mitigation practices to reach in general, though these are vague (for instance, “adhering to guardrails” or “minimizing over-refusal”). To improve, these goals, which serve as proto deployment KCI thresholds, should be given more detailed criteria of sufficiency (a sketch further below illustrates how such goals could be made measurable). They should also be linked to KRIs.
“More specifically, our harm mitigation practices are focused on achieving the following goals:
“Cohere’s models, their training data, and the guardrails within which they operate are dynamically updated throughout the development process to achieve the three harm mitigation objectives described above.” (p. 11)
There are no quantitative deployment KCI thresholds given.
No relevant quotes found.
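To illustrate how goals such as “adhering to guardrails” or “minimizing over-refusal” could be turned into measurable deployment KCI thresholds, a minimal sketch follows; the metrics and numbers are hypothetical, not Cohere’s.

```python
# Hypothetical sketch: qualitative mitigation goals restated as measurable
# deployment KCI thresholds. All metrics and values are illustrative placeholders.
deployment_kcis = [
    {
        "goal": "adhering to guardrails",
        "metric": "guardrail adherence rate on a harmful-prompt evaluation set",
        "threshold": 0.99,   # at least 99% of harmful prompts handled per policy
        "direction": "min",
    },
    {
        "goal": "minimizing over-refusal",
        "metric": "refusal rate on a benign-prompt evaluation set",
        "threshold": 0.05,   # at most 5% of benign prompts refused
        "direction": "max",
    },
]

def kci_met(kci: dict, measured: float) -> bool:
    """A KCI is met when the measured value is on the acceptable side of its threshold."""
    if kci["direction"] == "min":
        return measured >= kci["threshold"]
    return measured <= kci["threshold"]

measured_values = [0.995, 0.03]  # example measurements for the two KCIs above
print(all(kci_met(k, v) for k, v in zip(deployment_kcis, measured_values)))  # True
```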
There are no assurance processes KCIs defined. The framework does not provide recognition of there being KCIs outside of containment and deployment measures.
No relevant quotes found.
There is no evidence of reasoning that if KRIs are crossed but KCIs are reached, then risks remain below the risk tolerance.
No relevant quotes found.
There is no policy to put development or deployment on hold mentioned in the framework.
No relevant quotes found.
While containment measures are defined, most remain high-level (e.g., “secure, risk-based defaults and internal reviews”, “Supply chain controls for any third parties (e.g., data vendors or third-party data annotation)”, or “Blocklists”). To improve, more detail is needed on the measures actually implemented or planned to be implemented. They should also be linked to specific KCI (and thus KRI) thresholds.
The quoted controls (“These controls include:”) and “Key Mitigations We Apply” are listed by lifecycle stage: “Data acquisition and preparation”, “Training, evaluations and testing”, “Deployment and maintenance”, and “Improvement and further fine-tuning”.
Whilst there is a process for determining weaknesses in containment measures through internal API testing, it is not clear that this occurs prior to their implementation, and it does not cover other aspects of containment, such as securing model weights. Further, to improve, they should provide, in advance, proof for why they believe the proposed containment measures will be sufficient to meet the KCI threshold.
“Where applicable, we also consider risks within the context of customer deployments. For example, because many of our users start building applications through our application programming interfaces (APIs) before moving to more advanced deployments, we extensively test and secure our APIs. Our API V2 underwent a heavy security design review before we made it available.” (p. 10)
Whilst there is a process for determining weaknesses in containment measures, it is not clear that this occurs prior to their implementation. To improve, they should detail a process for third parties to verify, in advance, the case for why they believe the proposed containment measures will be sufficient to meet the KCI threshold.
“Prior to deployment, significant model releases undergo an independent third-party penetration test to validate the security of containers and models.” (p. 10)
“Independent third-party security testing, e.g., penetration testing” (p. 13)
While deployment measures are defined, most if not all remain high-level (e.g., “human-interpretable explanation of outputs” or “multi-disciplinary red teaming”). To improve, more detail should be given on the measures actually implemented or planned to be implemented. Further, the measures should be tied to specific KCI thresholds.
The “Key Mitigations We Apply” are again listed by lifecycle stage: “Data acquisition and preparation”, “Training, evaluations and testing”, “Deployment and maintenance”, and “Improvement and further fine-tuning”.
No proof is provided that the deployment measures are sufficient to meet the deployment KCI thresholds, nor is there a process to solicit such proof.
No relevant quotes found.
There is no mention of third-party verification of deployment measures meeting the threshold.
No relevant quotes found.
There is an explicit aversion to preparing assurance processes in advance: “Cohere’s approach to risk assurance, and to determining when models and systems are sufficiently safe and secure to be made available to our customers, is focused on risks that are known, measurable, or observable today.” Further, they note that “more work is needed to develop methods for assessing these types of threats more reliably”; to improve, the framework could set out a commitment to contribute to this research effort.
“One approach to risk assurance in the AI industry is focused on risks described as catastrophic or severe, such as capabilities related to radiological and nuclear weapons, autonomy, and self-replication. In this context, thresholds relating to these potential catastrophic risks are developed, and the approach described in safety frameworks is designed to assess risks that are speculated to arise when models attain specific capabilities, such as the ability to perform autonomous research or facilitate biorisk. The models are then deemed to present “unacceptable” levels of risk when certain capability levels are attained. While it is important to consider long-term, potential future risks associated with LLMs and the systems in which they are deployed, studies regarding the likelihood of these capabilities arising and leading to real-world harm are limited in their methodological maturity and transparency, often lacking clear theoretical threat models or developed empirical methods due to their nascency. For example, existing research into how LLMs may increase biorisks fails to account for entire risk chains beyond access to information, and does not systematically compare LLMs to other information access tools, such as the internet. More work is needed to develop methods for assessing these types of threats more reliably.” (pp. 14-15)
“Cohere’s approach to risk assurance, and to determining when models and systems are sufficiently safe and secure to be made available to our customers, is focused on risks that are known, measurable, or observable today” (p. 15)
There is no mention of providing evidence that the assurance processes are sufficient.
No relevant quotes found.
There is no mention of the underlying assumptions that are essential for the effective implementation and success of assurance processes.
No relevant quotes found.
There is no mention of elicitation methods being comprehensive enough to match elicitation efforts of potential threat actors. Elicitation techniques, such as fine-tuning or scaffolding, are not mentioned.
No relevant quotes found.
Whilst the framework mentions conducting evaluations “throughout the model development cycle”, no further detail is given. The frequency does not appear to be tied to variations in effective compute during training or to fixed time periods.
“As described above, Cohere conducts evaluations throughout the model development cycle, using both internal and external evaluation benchmarks.” (p. 16)
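For illustration, tying evaluation frequency to effective compute and to fixed time periods could be specified along the following lines; the 4x growth trigger and 180-day cap are placeholders, not commitments from Cohere’s framework.

```python
# Hypothetical sketch: trigger a full evaluation round whenever effective training
# compute has grown by a fixed factor since the last evaluated checkpoint, or after
# a fixed time interval, whichever comes first. Values are illustrative only.
COMPUTE_GROWTH_TRIGGER = 4.0      # re-evaluate at every 4x increase in effective compute
MAX_DAYS_BETWEEN_EVALS = 180      # and at least every ~6 months

def evaluation_due(current_compute: float, last_evaluated_compute: float,
                   days_since_last_eval: int) -> bool:
    grew_enough = current_compute >= COMPUTE_GROWTH_TRIGGER * last_evaluated_compute
    too_long_ago = days_since_last_eval >= MAX_DAYS_BETWEEN_EVALS
    return grew_enough or too_long_ago

print(evaluation_due(current_compute=8e24, last_evaluated_compute=1.5e24,
                     days_since_last_eval=40))  # True: compute grew more than 4x
```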
There is no description of how post-training enhancements are factored into capability assessments.
No relevant quotes found.
There is no mention of having the evaluation methodology vetted by third parties.
No relevant quotes found.
There is no mention of having the evaluation methodology vetted by third parties.
No relevant quotes found.
There is a description of “continuous monitoring of our security controls using automated and manual techniques” and of “various evaluations to ensure that models actually adhere to these guardrails.” However, more detail is needed on the exact methodology of this monitoring to ensure that KCI thresholds cannot be crossed unnoticed, and the monitoring should be explicitly linked to KCI measures. To improve, they could build on their existing monitoring infrastructure, which watches for “malicious attempts to prompt our models for harmful outputs”, to link directly to the KRIs and KCIs they intend to monitor (a sketch follows the quotes below).
“We are also progressing work to further study models when in use and assess the real-world effectiveness of mitigations, while upholding stringent levels of privacy and confidentiality and benefiting from external expertise where appropriate.” (p. 8)
“Where applicable, we also consider risks within the context of customer deployments. For example, because many of our users start building applications through our application programming interfaces (APIs) before moving to more advanced deployments, we extensively test and secure our APIs. Our API V2 underwent a heavy security design review before we made it available.” (p. 10)
“Moreover, we identify risks across our broader technology stack and environment by performing continuous monitoring of our security controls using automated and manual techniques. Models are developed and deployed in broader computational environments, and effectively managing AI risks requires us to identify, assess, and mitigate information security threats or vulnerabilities that may arise in these environments.” (p. 6)
“Beyond simply offering these features, Cohere conducts various evaluations to ensure that models actually adhere to these guardrails.” (p. 11)
“Continuous monitoring to detect anomalies and security issues” (p. 13)
“Responsible Disclosure Policy to incent third-party security vulnerability discovery” (p. 14)
“Where Cohere has direct visibility into the use of its models during deployment, we use that visibility to monitor for malicious attempts to prompt our models for harmful outputs, revoking access from accounts that abuse our systems. Cohere partners closely with customers who deploy Cohere’s AI solutions privately or on third-party managed platforms to ensure that they understand and recognize their responsibility for implementing appropriate monitoring controls during deployment.”
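As sketched below, the abuse-monitoring signals Cohere already collects could be wired to explicit KRI/KCI alert thresholds so that crossings cannot pass unnoticed; the signal names and values here are hypothetical, not drawn from the framework.

```python
# Hypothetical sketch: linking deployment monitoring signals to explicit KRI/KCI
# alert thresholds. Signal names and numbers are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class MonitoredIndicator:
    name: str
    threshold: float   # alert when the measured value reaches this level
    measured: float

    def alert(self) -> bool:
        return self.measured >= self.threshold

indicators = [
    MonitoredIndicator("malicious_prompt_rate_per_10k_requests", threshold=5.0, measured=7.2),
    MonitoredIndicator("guardrail_bypass_reports_per_week", threshold=3.0, measured=1.0),
]

for ind in indicators:
    if ind.alert():
        print(f"ALERT: {ind.name} = {ind.measured} (threshold {ind.threshold}); "
              "escalate per the incident response process")
```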
There is no mention of KCI protocols being vetted by third parties.
No relevant quotes found.
There is an indication that third parties conduct red teaming of containment KCI measures to ensure they meet the containment KCI threshold, but detail on the process, required expertise, and methods is not given, and conducting independent testing remains discretionary. To improve, there should also be a process for third parties to replicate or conduct safeguard red teaming of deployment KCI measures.
“Cohere conducts multidisciplinary red teaming during both the model development phase and post-launch. These red teaming exercises may include independent external parties, such as NIST and Humane Intelligence, and are conducted based on realistic use cases to attempt to break the model’s ability to fulfill alignment on risk mitigation goals in order to elicit information about areas of improvement.” (p. 16)
There is a commitment to make documentation of evaluation results public. However, there is no commitment to notify government agencies if risk thresholds are exceeded, nor is there a commitment to make KCI assessments public.
“Documentation is a key aspect of our accountability to our customers, partners, relevant government agencies, and the wider public. To promote transparency about our practices, we:
There is no commitment to permit the reports detailing the results of external evaluations (i.e. any KRI or KCI assessments conducted by third parties) to be written independently and without interference or suppression.
No relevant quotes found.
Their monitoring mostly focuses on security vulnerabilities; nonetheless, they mention a process of “continuous monitoring” explicitly intended to identify risks. Whilst these may not be novel risk domains, it does suggest a willingness to detect novel threat models through observation in the deployment context.
“Moreover, we identify risks across our broader technology stack and environment by performing continuous monitoring of our security controls using automated and manual techniques. Models are developed and deployed in broader computational environments, and effectively managing AI risks requires us to identify, assess, and mitigate information security threats or vulnerabilities that may arise in these environments.” (p. 6)
“Cohere partners closely with customers who deploy Cohere’s AI solutions privately or on third-party managed platforms to ensure that they understand and recognize their responsibility for implementing appropriate monitoring controls during deployment.” (p. 12)
Apart from incident response, no mechanism for incorporating risks identified post-deployment is detailed.
No relevant quotes found.
The framework specifies a delegation of authority for risk decisions, but to a single executive for all risks.
“The final authority to determine if our products are safe, secure, and ready to be made available to our customers is delegated by Cohere’s CEO to Cohere’s Chief Scientist.” (p. 15)
No mention of a management risk committee.
No relevant quotes found.
The framework includes rudimentary protocols for decision-making.
“This decision is made on the basis of final, multi-faceted evaluations and testing.” (p. 15)
“We consider models safe and secure to launch when our evaluations and tests demonstrate no significant regressions compared to our previously launched model versions, so that performance and security is maintained or improved for every new significant model version. This is Cohere’s bright line for determining when a model is “acceptable” from a risk management perspective and ready to be launched.” (p. 16)
No mention of escalation procedures.
No relevant quotes found.
Not explicitly a risk officer, but the Chief Scientist seems to partly play this role.
“The final authority to determine if our products are safe, secure, and ready to be made available to our customers is delegated by Cohere’s CEO to Cohere’s Chief Scientist.” (p. 15)
No mention of an advisory committee.
No relevant quotes found.
The framework has a rudimentary mention of consistent review.
“Cohere consistently reviews state-of-the-art research and industry practice regarding the risks associated with AI, and uses this to determine our priorities.” (p. 6)
No mention of people that challenge decisions.
No relevant quotes found.
No mention of a system to aggregate and report risk data.
No relevant quotes found.
No mention of a central risk function.
No relevant quotes found.
No mention of an internal audit function.
No relevant quotes found.
The framework laudably specifies the independence of the external testers.
“Prior to major model releases, Cohere also performs robust vulnerability management testing, including independent third-party penetration testing of model containers.” (p. 16)
“These red teaming exercises may include independent external parties, such as NIST and Humane Intelligence.” (p. 16)
No mention of a Board risk committee.
No relevant quotes found.
No mention of any additional governance bodies.
No relevant quotes found.
The framework includes a brief mention of controls.
“At Cohere, we recognize that properly securing AI requires going beyond traditional controls.” (p. 8)
The framework states the existence of a security-first culture, but does not offer much detail.
“Cohere’s security-first culture drives how we work together to design, operate, continuously monitor, and secure both our internal environment (i.e., network, applications, endpoints, data, and personnel) and customer and partner deployments.” (p. 8)
No mention of elements of speak-up culture.
No relevant quotes found.
The framework mentions which risks are in scope and includes a commitment to publish information regarding these risks.
“We place potential risks arising from LLM capabilities into one of two categories:
“Documentation is a key aspect of our accountability to our customers, partners, relevant government agencies, and the wider public. To promote transparency about our practices, we: Publish documentation regarding our models’ capabilities, evaluation results, configurable secure AI features, and model limitations for developers to safely and securely build AI systems using Cohere solutions. This includes model documentation, such as Cohere’s Usage Policy and Model Cards, and technical guides, such as Cohere’s LLM University.” (p. 17)
The framework includes rudimentary governance elements.
“The final authority to determine if our products are safe, secure, and ready to be made available to our customers is delegated by Cohere’s CEO to Cohere’s Chief Scientist.” (p. 15)
The framework lists several external actors, but not specifically authorities.
“Cohere is committed to building a responsible, safe, and secure AI ecosystem, and actively engages with external actors to continuously improve our own practices, as well as to advance the state-of-the art on AI risk management. In particular, Cohere contributes to the development of critical guidance and industry standards with organisations such as: OWASP Top 10 for Large Language Models and Generative AI, CoSAI (Coalition for Secure AI) — founding member, CSA (Cloud Security Alliance), ML Commons”. (p. 19)
“Cohere also engages in cooperation with international AI Safety Institutes and external researchers to advance the scientific understanding of AI risks, for example by submitting our public models for inclusion on public benchmarks and red teaming exercises.” (p. 19)