Wiki/.claude/workflows/incident.md
Corentin JOGUET 460f7effe0
feat(workflows): create 5 BYAN workflows for agent collaboration
Workflows (markdown playbooks) to orchestrate the 4 specialized agents:

- README.md: index + shared conventions + future BYAN web integration
- build-story.md: full cycle to deliver 1 Phase 2 story (bridge-dev → bridge-tester → review → CI → deploy staging → business validation)
- sync-bidirec.md: event-driven Docmost ↔ Baserow sync (idempotency + anti-loop via X-Bridge-Origin)
- release.md: semver release process (E2E staging → tag → approval → deploy prod → watch 30 min)
- incident.md: SEV1/2/3 response + blameless post-mortem + runbooks
- bump-deps.md: Dependabot PRs + major bumps + Docmost/Baserow upstream

Each workflow specifies: trigger, actors (agents + humans), ordered sequence
with outputs, blocking human gates, rollback, comm templates.

Workflows are declarative playbooks for the main Claude instance, which
orchestrates the agents sequentially via the Agent tool. To be migrated later
to BYAN web workflow runs once the BYAN runtime is fixed.

Complete team for formation-hub:
- 4 specialized agents (bridge-dev, bridge-tester, acadenice-devops, docmost-fork-dev)
- 5 workflows orchestrating their collaboration
2026-05-07 19:30:48 +02:00


# Workflow: INCIDENT RESPONSE
Process for handling incidents in production. See doc 18, section 9.
## Trigger
Any one of the following:
- Automatic alert (UptimeRobot, monitoring, failed healthcheck)
- User report (Slack, email, ticket)
- Abnormal logs detected
## Severities
| Level | Definition | Target response |
|-------|------------|-----------------|
| **SEV1 (CRITICAL)** | Complete service outage or data loss in progress | < 15 min |
| **SEV2 (WARNING)** | Major degradation, part of the service unavailable, data loss avoided | < 4 business hours |
| **SEV3 (INFO)** | Isolated bug, workaround available | < 24 business hours |
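The severity table can be encoded so that triage is applied the same way every time. The sketch below is only an illustration of that table: the names `SeverityPolicy`, `POLICIES`, `classify` and the three boolean inputs are assumptions, not part of any existing tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_target: timedelta
    # Assumed field: SEV2/SEV3 targets are stated in business hours in the table.
    business_hours_only: bool


# Values copied from the severity table above.
POLICIES = {
    "SEV1": SeverityPolicy("CRITICAL", timedelta(minutes=15), False),
    "SEV2": SeverityPolicy("WARNING", timedelta(hours=4), True),
    "SEV3": SeverityPolicy("INFO", timedelta(hours=24), True),
}


def classify(service_down: bool, data_loss_in_progress: bool,
             major_degradation: bool) -> str:
    """Map observed symptoms to a severity level (hypothetical helper)."""
    if service_down or data_loss_in_progress:
        return "SEV1"
    if major_degradation:
        return "SEV2"
    # Anything else is treated as an isolated bug.
    return "SEV3"
```

For example, `POLICIES[classify(False, False, True)].response_target` gives the 4-hour target for a major degradation.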
## Actors
- **Corentin** (primary on-call)
- **Yan** (backup on-call)
- **acadenice-devops** (investigation + restore)
- **bridge-dev** (if code bug)
- **bridge-tester** (post-fix regression test)
## Sequence: SEV1 (service down)
```
[1] DETECT (automatic or manual)
    - UptimeRobot/Slack/email alert
    - Confirm the scope: what is down, since when, what has been lost
    - Output: situation understood
[2] TRIAGE (15 min, Corentin on-call)
    - Is SEV1 severity confirmed?
    - Notify Yan + Ludo if data loss
    - Announce in the #ops channel + status banner if user-facing:
      "[SEV1] formation-hub - investigating, ETA <unknown>"
    - Output: team alerted
[3] INVESTIGATE (acadenice-devops)
    - Check containers: docker compose ps
    - Check healthcheck: ./scripts/healthcheck.sh
    - Check logs: docker compose logs --tail=200 <service>
    - Check metrics: CPU, memory, disk
    - Check deps: are Postgres and Redis reachable?
    - Output: root cause identified or strong hypothesis
[4] MITIGATE (acadenice-devops + bridge-dev if code)
    - Depending on root cause:
        * Service down: restart container, check resources
        * DB corruption: restore recent backup
        * Code bug: roll back to previous version (see release.md)
        * Compromise: rotate secrets, isolate env
        * Disk full: clean up logs/backups, upsize
    - Output: service restored
[5] VERIFY (Corentin + acadenice-devops)
    - Full healthcheck: 4/4 OK
    - Smoke test: ./scripts/smoke-test.sh
    - Test a real user flow
    - Output: confirmation that prod is restored
[6] COMMUNICATE (Corentin)
    - Slack/Teams: "[SEV1 RESOLVED] formation-hub - back online. Cause: ..."
    - Email all if data loss: GDPR compliance
    - Update status banner: removed
    - Output: team and users informed
[7] POST-MORTEM (within 7 days, Corentin + Yan)
    - Create doc: docs/post-mortems/YYYY-MM-DD-<title>.md
    - Blameless format (focus on the system, not the person)
    - Sections: Timeline / Impact / Root cause / AI / Lessons learned
    - Action items (AI): owner + due date
    - Share with the team
    - Update runbooks if recurring pattern
    - Output: post-mortem published + AIs opened
```
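The command-based checks of step [3] could be gathered in one pass so their output can be attached to the incident thread. This is a minimal sketch, assuming it runs from the directory holding the compose file and `./scripts/healthcheck.sh`; the helper names `investigation_commands` and `run_investigation` are hypothetical.

```python
import subprocess


def investigation_commands(service: str, tail: int = 200) -> list[list[str]]:
    """Return the step [3] commands as argv lists (no shell involved)."""
    return [
        ["docker", "compose", "ps"],
        ["./scripts/healthcheck.sh"],
        ["docker", "compose", "logs", f"--tail={tail}", service],
    ]


def run_investigation(service: str) -> dict[str, str]:
    """Run each check and collect its output; a failing check never aborts the rest."""
    results: dict[str, str] = {}
    for argv in investigation_commands(service):
        key = " ".join(argv)
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=60)
            results[key] = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing binary/script or a hung command is itself a useful finding.
            results[key] = f"failed to run: {exc}"
    return results
```

Calling `run_investigation("bridge")` (the service name here is only an example) returns one entry per command, including an explicit failure note when a command cannot be started.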
## Sequence: SEV2 (degradation)
Same as SEV1 but without the < 15 min urgency. Target response: 4 business hours. No email to all unless the incident is user-facing.
## Sequence: SEV3 (isolated bug)
```
[1] Triage via a GitHub/Forgejo issue with label `bug` + severity `low`
[2] Assign to bridge-dev for a fix in the next release
[3] If a workaround is available: document it in the ticket
[4] No post-mortem (unless recurring pattern)
```
## Comm template SEV1/2 during incident
```
[SEV1] formation-hub - Service degraded
Symptom: <what exactly>
Started: <when>
Investigating: <where we are>
ETA: <estimated restore time or "investigating">
Channel: #ops
```
Post an update at least every 30 minutes.
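Filling the template by hand under pressure is error-prone, so a small helper may be useful. The sketch below renders the in-incident message and computes when the next update is due; `render_update` and `next_update_due` are hypothetical names, and the timestamp format is an assumption.

```python
from datetime import datetime, timedelta

# Cadence taken from the rule above: at least one update every 30 minutes.
UPDATE_INTERVAL = timedelta(minutes=30)


def render_update(severity: str, symptom: str, started: datetime,
                  status: str, eta: str = "investigating",
                  channel: str = "#ops") -> str:
    """Render the in-incident comm template as a single message."""
    return "\n".join([
        f"[{severity}] formation-hub - Service degraded",
        f"Symptom: {symptom}",
        f"Started: {started:%Y-%m-%d %H:%M}",  # assumed timestamp format
        f"Investigating: {status}",
        f"ETA: {eta}",
        f"Channel: {channel}",
    ])


def next_update_due(last_update: datetime) -> datetime:
    """Latest time by which the next update should be posted."""
    return last_update + UPDATE_INTERVAL
```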
## Comm template SEV1 resolved
```
[SEV1 RESOLVED] formation-hub - back online
Duration down: <X>h<Y>m
Root cause: <one-liner>
Impact: <users affected, data loss yes/no>
Post-mortem: docs/post-mortems/YYYY-MM-DD-<title>.md (published within 7 days)
```
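The `Duration down` field can be derived from the detection and restore timestamps instead of being estimated. A minimal sketch, assuming both timestamps are in the same timezone; `format_downtime` is a hypothetical helper.

```python
from datetime import datetime


def format_downtime(started: datetime, restored: datetime) -> str:
    """Format the outage length as '<X>h<Y>m', truncated to whole minutes."""
    total_minutes = int((restored - started).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("restored must not be earlier than started")
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes}m"
```

An outage from 19:30 to 21:05 on the same day is reported as `1h35m`.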
## Post-mortem template
`docs/post-mortems/YYYY-MM-DD-<incident-title>.md`:
```markdown
# Post-mortem: <incident title>
## Timeline (local time)
- HH:MM detection
- HH:MM triage
- HH:MM mitigation start
- HH:MM service restored
- HH:MM root cause confirmed
## Impact
- Downtime duration: Xh Ym
- Users affected: N
- Data loss: yes/no; if yes, how much and what
- Estimated cost: XX€ (if quantifiable)
## Root cause
<one paragraph: what broke + why>
## Why didn't our monitoring alert earlier?
<honest analysis - detection blind spot?>
## Action items
- [ ] AI 1: <description> (owner @who, due YYYY-MM-DD)
- [ ] AI 2: ...
## Lessons learned
<what to retain to avoid recurrence>
## Blameless statement
This incident is not any one person's fault. It reflects missing system safeguards. The AIs above aim to add those safeguards.
```
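The file name and the 7-day publication deadline can both be computed from the incident date. The sketch below assumes the `<title>` part is a lowercase ASCII slug, which this playbook does not specify; `slugify`, `post_mortem_path` and `publication_deadline` are hypothetical helpers.

```python
import re
import unicodedata
from datetime import date, timedelta
from pathlib import PurePosixPath


def slugify(title: str) -> str:
    """Lowercase ASCII slug with hyphens (assumed naming convention)."""
    ascii_title = (unicodedata.normalize("NFKD", title)
                   .encode("ascii", "ignore").decode("ascii"))
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def post_mortem_path(incident_date: date, title: str) -> PurePosixPath:
    """Build docs/post-mortems/YYYY-MM-DD-<title>.md for an incident."""
    name = f"{incident_date.isoformat()}-{slugify(title)}.md"
    return PurePosixPath("docs/post-mortems") / name


def publication_deadline(incident_date: date) -> date:
    """The post-mortem is due within 7 days of the incident."""
    return incident_date + timedelta(days=7)
```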
## Related runbooks (to be created in Phase 1)
In `docs/runbooks/`:
- `runbook-docmost-down.md`
- `runbook-baserow-down.md`
- `runbook-disk-full.md`
- `runbook-postgres-corrupted.md`
- `runbook-restore-from-backup.md`
- `runbook-rotate-secrets.md`
Runbook format:
```
# Runbook: <INCIDENT_TYPE>
## Symptoms
## Diagnosis (steps)
## Resolution (steps)
## Future prevention
## Rollback / escalation
```
## On-call rotation
Phase 0/1: **Corentin = primary on-call**, Yan = backup.
If we hire in the future:
- Weekly rotation
- Weekly handoff with recap
- On-call compensation (day off or bonus)
## Limites
- No strict SLA for Phase 1 (internal tool, not 24/7 critical). Best effort.
- No public status page in Phase 1 (info via internal Slack is enough).
- Phase 3+: if we open the tool to external customers, consider an SLA + status page.
## Notes
- After a SEV1/2 incident: update doc 18 section 6 (runbooks) if a pattern is detected
- After 3 similar incidents in 1 month: strategic escalation (architecture refactor, additional resources, etc.)
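The "3 similar incidents in 1 month" rule can be checked mechanically against an incident log. This is only a sketch under two loud assumptions: "similar" is approximated by a shared category label, and "1 month" by a rolling 30-day window. `needs_escalation` is a hypothetical helper.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=30)  # assumption: "1 month" as a rolling 30 days
THRESHOLD = 3                # from the note above


def needs_escalation(incidents: list[tuple[date, str]], today: date) -> set[str]:
    """Return the categories with at least THRESHOLD incidents inside the window."""
    cutoff = today - WINDOW
    counts: dict[str, int] = {}
    for when, category in incidents:
        if cutoff <= when <= today:
            counts[category] = counts.get(category, 0) + 1
    return {category for category, n in counts.items() if n >= THRESHOLD}
```

With three `disk-full` entries in the last 30 days the function returns `{"disk-full"}`; older incidents are ignored.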