Model card extract · safety evaluation

Six layers. Ten modes.

A scannable extract of the safety-evaluation protocol and named failure modes from model card v0.2.2. For the full authoritative document, read the model card.

Gate protocol · version 2026.04

Every model, prompt, and knowledge-base change passes all six layers before reaching a patient. Below-threshold scores block the deployment.

L1 · Pass

Document adherence

Responds only from the curated clinical knowledge base on clinically scoped content.

L2 · Pass

Instruction adherence

Resists multi-turn pressure to step outside scope: diagnosis, medication, minors.

L3 · Pass

Named failure modes

Ten documented failure modes tested with persona-stratified red-team suites.

L4 · Pass

Tone & modality fidelity

Clinical evaluators score against MI Treatment Integrity, DBT fidelity, trauma-informed care.

L5 · Pass

Independent clinical review

Sign-off required from a clinical-advisory-board member who did not touch the change.

L6 · Pass

Bias & equity

Subgroup performance deltas reported; deltas over threshold block the release.
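
The Layer 6 rule can be illustrated with a minimal sketch. It assumes a delta is the absolute gap between a subgroup's score and the overall score, and that one numeric threshold applies to every subgroup; this extract specifies neither. The names `subgroup_deltas` and `blocks_release`, the subgroup labels, and all numbers are hypothetical.

```python
from typing import Dict


def subgroup_deltas(overall: float, by_subgroup: Dict[str, float]) -> Dict[str, float]:
    """Absolute gap between each subgroup score and the overall score (assumed definition)."""
    return {name: abs(score - overall) for name, score in by_subgroup.items()}


def blocks_release(deltas: Dict[str, float], threshold: float) -> bool:
    """True when any subgroup delta is over the threshold."""
    return any(delta > threshold for delta in deltas.values())


# Hypothetical scores, for illustration only.
scores = {"group_a": 0.91, "group_b": 0.84}
deltas = subgroup_deltas(overall=0.90, by_subgroup=scores)
print(blocks_release(deltas, threshold=0.05))
```

Every delta is returned, not only the failing ones, because the layer reports deltas as well as gating on them.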

Clinical advisory veto: Any single-layer block halts the pipeline. The CCO cannot override.
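
The gate rule above can be sketched as a short function. This is an illustration under stated assumptions, not the pipeline's actual implementation: it assumes each layer yields one numeric score compared against one threshold, and `LayerResult`, `blocking_layers`, and `may_deploy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerResult:
    layer: str        # e.g. "L1"
    score: float      # hypothetical numeric score for the layer
    threshold: float  # hypothetical pass threshold for the layer

    @property
    def passed(self) -> bool:
        # Only below-threshold scores block, so a score equal to the threshold passes.
        return self.score >= self.threshold


def blocking_layers(results: List[LayerResult]) -> List[str]:
    """Names of layers whose score is below threshold."""
    return [r.layer for r in results if not r.passed]


def may_deploy(results: List[LayerResult], expected_layers: int = 6) -> bool:
    """Deploy only if every expected layer reported and none blocks.

    There is deliberately no override argument: a single block halts the pipeline.
    """
    return len(results) == expected_layers and not blocking_layers(results)


# Hypothetical results, for illustration only.
results = [LayerResult(f"L{i}", score=0.95, threshold=0.90) for i in range(1, 7)]
print(may_deploy(results))
```

Treating a missing layer result as a block is a design assumption of this sketch; it keeps an incomplete evaluation run from being read as a pass.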

Named failure modes

Ten we test. One we have not found yet.

Validation of delusional content.

Never affirm, never confront, always redirect to the supervising clinician. Escalate on persistent or intensifying content.

Failure to de-escalate suicide risk.

Deterministic SAFE-T path. Stanley-Brown Safety Plan review where one is in place. Immediate escalation on C-SSRS tier change.

Dependency that displaces human connection.

No human persona, explicit AI self-identification, active surfacing of human alternatives, dependency-signal monitoring via the caseload view.

Help-seeking suppression.

Monitored through periodic review of companion-to-clinician signal rates in audit logs and the clinician's monthly survey.

Stigmatizing or biased response.

Persona-stratified red-team evaluations. Clinical evaluator rubric includes a specific bias-and-stigma item. Bias findings published alongside other red-team results.

Over-reassurance.

Monitored in clinical evaluator review.

Drift from modality.

Monitored in tone and modality fidelity scoring.

Scope creep under pressure.

Monitored in Layer 2 evaluation — multi-turn refusal to step outside scope.

Sycophancy.

Monitored in clinical evaluator review.

False escalation fatigue.

Measured by escalation precision in audit logs. Reviewed monthly.

Unknown failure modes.

These exist. We will find them. When we do, we add them to this section and describe our mitigation. We do not quietly revise the card.
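
Escalation precision, the measure named under false escalation fatigue, can be computed from audit-log entries roughly as follows. This is a sketch only: the entry fields `escalated` and `clinician_confirmed` are hypothetical, and it assumes an escalation counts as correct when a clinician confirms it was warranted, which this extract does not state.

```python
from typing import Iterable, Mapping, Optional


def escalation_precision(entries: Iterable[Mapping[str, bool]]) -> Optional[float]:
    """Share of escalations that a clinician confirmed as warranted.

    Returns None when the log holds no escalations, since precision is undefined there.
    """
    escalated = [e for e in entries if e["escalated"]]
    if not escalated:
        return None
    confirmed = sum(1 for e in escalated if e["clinician_confirmed"])
    return confirmed / len(escalated)


# Hypothetical audit-log entries, for illustration only.
log = [
    {"escalated": True, "clinician_confirmed": True},
    {"escalated": True, "clinician_confirmed": True},
    {"escalated": True, "clinician_confirmed": False},
    {"escalated": True, "clinician_confirmed": True},
    {"escalated": False, "clinician_confirmed": False},
]
print(escalation_precision(log))
```

A falling value across monthly reviews would suggest more unwarranted escalations reaching clinicians, which is the fatigue risk this failure mode describes.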

Go deeper

The full model card v0.2.2.

Training and fine-tuning posture, dependency design, human oversight, escalation behavior, data and privacy, change governance, known limitations, accountability, and the full version log — all on the canonical model card.
