Wiki/docs/18-plan-operations.md
Corentin JOGUET 668576cdc4 chore: initial commit — formation-hub conception phase
Design complete (Phase 0) for formation-hub Acadenice:

- 19 docs: Merise Agile + UML + GitOps + plans (tests/deploy/ops/api)
  see docs/00-readme.md for the full index
- Docker Compose stack (Docmost + Baserow + Postgres + Redis + MinIO local FS)
  compose.yml + compose.staging.yml + compose.prod.yml
- CI/CD GitHub Actions skeleton (ci, deploy-staging, deploy-prod)
- Bridge service skeleton (Hono + TS + Biome + Vitest + zod + pino)
- GitHub templates: PR + 3 issue types + CODEOWNERS + dependabot.yml
- Ops scripts: healthcheck, daily backup, post-deploy smoke test
- LICENSE AGPL-3.0 + SECURITY.md + CONTRIBUTING.md + CHANGELOG.md
- draw.io infra architecture diagram (XML importable into diagrams.net)

Structural decisions recorded:
- Scope CFA + Agency with a pivot PERSONNE entity supporting multiple roles (ADR-001)
- Composite stack: Docmost AGPL + Baserow MIT + custom bridge (ADR-001)
- Path B: near-unified UX via custom Tiptap node views (ADR-002)
- Monorepo trunk-based development (ADR-003)
- Separate Postgres instances for Docmost and Baserow (ADR-004)
- Bridge stack: Node 22 + Hono (ADR-005)
- New repo preferred over a Docmost fork
- Prod-like from day 1 (not an MVP)
2026-05-07 12:16:19 +02:00

# Operations plan (RUN)
> Post-launch operations strategy: monitoring, alerting, backups, DR, incident response, runbooks.
> Audience: Corentin (ops owner), Yan (backup), future freelancer.
## 1. Overview: RUN responsibilities
```mermaid
flowchart TB
subgraph "Daily"
D1[Check uptime monitoring]
D2[Check error logs]
D3[Review daily backups]
end
subgraph "Weekly"
W1[Audit Dependabot bumps]
W2[Check disk/CPU capacity]
W3[Review ops issues / PRs]
end
subgraph "Monthly"
M1[Test backup restore]
M2[Review security alerts]
M3[Audit access list]
M4[Capacity planning review]
end
subgraph "On Incident"
I1[Detect / Page]
I2[Triage]
I3[Mitigate / Restore]
I4[Post-mortem]
end
```
## 2. Monitoring
### 2.1 Monitoring stack (minimal in Phase 1 → complete in Phase 3)
| Phase | Tool | Role | Cost |
|-------|------|------|------|
| **Phase 1** | UptimeRobot (free) | HTTP healthcheck every 5 min on wiki + baserow | 0€ |
| **Phase 2** | + Uptime Kuma self-hosted | Finer granularity, custom dashboards | 0€ (on the prod VPS or a dedicated VPS) |
| **Phase 3** | + Prometheus + Grafana | System + app metrics, fine-grained alerting | ~5€/month (extra resources) |
| **Phase 3** | + Loki | Centralized container logs | ~5€/month |
| **Phase 4** | + Sentry self-hosted or SaaS | App error tracking, stack traces | 0€-25€/month |
### 2.2 Monitored endpoints (Phase 1)
| Endpoint | Frequency | Target SLA |
|----------|-----------|-----------|
| `https://wiki.acadenice.fr` (HTTP 200) | 5 min | uptime >= 99% |
| `https://baserow.acadenice.fr/api/_health/` | 5 min | uptime >= 99% |
| `https://bridge.acadenice.fr/api/health` (Phase 2+) | 5 min | uptime >= 99% |
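For a manual check that complements the external monitor, the probes above could be reproduced locally. The following is a minimal sketch, not the project's actual healthcheck script: the function names are hypothetical, and it assumes that only an HTTP 200 response counts as "up" for every endpoint.

```bash
#!/usr/bin/env bash
# Sketch of a local HTTP healthcheck (hypothetical helper names).

# Map an HTTP status code to "up" or "down".
# Assumption: only 200 counts as up.
classify_status() {
  if [ "$1" = "200" ]; then echo "up"; else echo "down"; fi
}

# Probe one URL and print "<state> <code> <url>".
# curl reports code 000 when the connection itself fails.
check_endpoint() {
  local url=$1 code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || true
  echo "$(classify_status "$code") $code $url"
}

# Probe every URL given as argument; return non-zero if any is down.
check_all() {
  local rc=0 url line
  for url in "$@"; do
    line=$(check_endpoint "$url")
    echo "$line"
    case $line in down*) rc=1 ;; esac
  done
  return "$rc"
}
```

Usage: `check_all https://wiki.acadenice.fr https://baserow.acadenice.fr/api/_health/` prints one line per endpoint and exits non-zero if any of them is down.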
### 2.3 Key metrics (Phase 3+)
System:
- CPU usage (alert if > 80% sustained for 5 min)
- Memory (alert if > 85%)
- Disk (alert if > 80%)
- Network in/out
Application:
- p95 latency per endpoint (bridge)
- HTTP 5xx error rate (alert if > 1%)
- Throughput (requests/sec)
- Redis queue depth (Baserow Celery)
- Active Postgres connections (alert if > 80% of pool size)
Business (custom):
- Hours entries per day (sentinel: a sudden drop suggests a bug in time entry)
- Assignments created per week
- Projects in progress
- Trainer capacity exceeded (alert if > 0)
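Until Prometheus is in place, a threshold such as the disk one can be evaluated with standard tools. This is a sketch under assumptions: the helper names are hypothetical, 80% is the disk alert threshold listed above, and 95% is the paging threshold from the Phase 1 alert list.

```bash
#!/usr/bin/env bash
# Sketch of a disk threshold check (hypothetical helper names).

# Print the usage percentage (integer, no % sign) of the filesystem holding a path.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Succeed when value is strictly greater than limit.
exceeds() {
  [ "$1" -gt "$2" ]
}

# Classify a usage percentage: > 95 is CRITICAL, > 80 is WARNING, otherwise OK.
disk_level() {
  local pct=$1
  if exceeds "$pct" 95; then echo "CRITICAL"
  elif exceeds "$pct" 80; then echo "WARNING"
  else echo "OK"; fi
}
```

Usage: `disk_level "$(disk_usage_pct /opt/formation-hub)"`. The comparison is strict, so exactly 80% still reports `OK`.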
## 3. Alerting
### 3.1 Channels
| Channel | Severity | Target |
|---------|----------|-------|
| Email Corentin + Yan | All levels | corentin@acadenice.fr, yan@acadenice.fr |
| Slack/Teams #ops | warning + critical | Internal channel |
| SMS (Twilio or OVH) | critical only | Corentin (primary on-call) |
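The routing in this table can be expressed as a small lookup. A sketch only: `channels_for` and the channel labels `email`, `slack`, `sms` are hypothetical names, and actual delivery (SMTP, webhook, SMS provider) is left out.

```bash
#!/usr/bin/env bash
# Sketch of severity-to-channel routing (hypothetical names).

# Print the notification channels for a severity, one per line.
channels_for() {
  case $1 in
    CRITICAL) printf '%s\n' email slack sms ;;
    WARNING)  printf '%s\n' email slack ;;
    INFO)     printf '%s\n' email ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

An alert dispatcher could loop over `channels_for "$severity"` and call one sender per channel.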
### 3.2 Severities
| Level | Definition | Expected response |
|--------|-----------|------------------|
| **CRITICAL** | Service down / data loss in progress | < 15 min |
| **WARNING** | Performance degradation / capacity near limit | < 4 business hours |
| **INFO** | Audits, releases, successful backups | weekly review |
### 3.3 Initial alerts (Phase 1)
```
[CRITICAL] HTTP 5xx > 5% over 5 min → page Corentin
[CRITICAL] Service down (uptime check fails 3x) → page Corentin + Yan
[CRITICAL] Disk > 95% → page
[WARNING] CPU > 80% sustained 10 min → email
[WARNING] Memory > 85% → email
[WARNING] Trainer capacity exceeded → email pedagogical admin
[INFO] Daily backup ran (success/fail) → log + email on failure
```
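The "uptime check fails 3x" rule avoids paging on a single transient failure. One way to sketch it in a cron-driven check is a counter of consecutive failures kept in a state file. The file path and the function name below are assumptions for illustration, not part of the existing scripts.

```bash
#!/usr/bin/env bash
# Sketch: page only after 3 consecutive failed checks (hypothetical names and path).

STATE_FILE=${STATE_FILE:-/tmp/formation-hub-uptime.count}
FAIL_THRESHOLD=3

# Record one check result ("ok" or "fail").
# Prints "page" once the consecutive-failure count reaches the threshold, else "quiet".
record_check() {
  local result=$1 count=0
  if [ -f "$STATE_FILE" ]; then
    count=$(cat "$STATE_FILE")
  fi
  if [ "$result" = "ok" ]; then
    count=0                      # any success resets the streak
  else
    count=$((count + 1))
  fi
  echo "$count" > "$STATE_FILE"
  if [ "$count" -ge "$FAIL_THRESHOLD" ]; then echo "page"; else echo "quiet"; fi
}
```

As written, the function keeps answering `page` on every further failure after the third one; deduplicating repeated pages would be a separate concern.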
## 4. Backups: 3-2-1 strategy
**3** copies of the data, on **2** different media, with **1** offsite.
### 4.1 Backup targets
| What | Frequency | Tool | Local | Remote |
|------|-----------|-------|-------|---------|
| Postgres docmost | Daily 03:00 | `pg_dump` + gzip | `/opt/formation-hub/backups/local/` | S3-compatible (OVH/Backblaze) |
| Postgres baserow embedded | Daily 03:00 | `pg_dumpall` + gzip | same | same |
| Docmost files (uploads) | Daily 03:00 | `tar.gz` | same | same |
| Baserow data dir | Daily 03:00 | `tar.gz` | same | same |
| `.env.prod` (encrypted) | On change | gpg + push to vault | (none) | Out-of-band vault |
### 4.2 Retention
| Type | Local | Remote |
|------|-------|---------|
| Daily | 30 days rolling | 90 days rolling |
| Weekly (Friday) | 12 weeks | 12 months |
| Monthly (1st) | 12 months | 5 years |
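To apply a tiered retention, each backup first has to be assigned a tier. The sketch below shows one possible classification. It rests on several assumptions: the function names are hypothetical, a backup taken on the 1st counts as monthly even when that day is a Friday, 12 months is approximated as 365 days, and GNU `date` is available (as on the Debian host).

```bash
#!/usr/bin/env bash
# Sketch: classify a backup date into a retention tier (hypothetical names).
# Requires GNU date for the -d option.

# Print "monthly" for the 1st of the month, "weekly" for Fridays, else "daily".
backup_tier() {
  local day dow
  day=$(date -d "$1" +%d)
  dow=$(date -d "$1" +%u)   # 1 = Monday ... 5 = Friday ... 7 = Sunday
  if [ "$day" = "01" ]; then echo "monthly"
  elif [ "$dow" = "5" ]; then echo "weekly"
  else echo "daily"; fi
}

# Local retention in days per tier, taken from the table above
# (12 weeks = 84 days; 12 months approximated as 365 days).
local_retention_days() {
  case $1 in
    daily)   echo 30 ;;
    weekly)  echo 84 ;;
    monthly) echo 365 ;;
    *)       return 1 ;;
  esac
}
```

Usage: `local_retention_days "$(backup_tier 2026-05-08)"` gives the number of days a backup from that date should be kept locally.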
### 4.3 Backup scripts
`scripts/backup.sh`:
```bash
#!/usr/bin/env bash
set -euo pipefail

DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/opt/formation-hub/backups/local
mkdir -p "$BACKUP_DIR"
cd /opt/formation-hub

# Postgres docmost
docker compose -f compose.yml -f compose.prod.yml exec -T docmost-db \
  pg_dump -U docmost docmost | gzip > "$BACKUP_DIR/docmost-db-$DATE.sql.gz"

# Postgres baserow (embedded: exec inside the baserow container)
docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  pg_dumpall -U postgres | gzip > "$BACKUP_DIR/baserow-db-$DATE.sql.gz"

# Files
docker compose -f compose.yml -f compose.prod.yml exec -T docmost \
  tar czf - /app/data/storage > "$BACKUP_DIR/docmost-files-$DATE.tar.gz"
docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  tar czf - /baserow/data > "$BACKUP_DIR/baserow-data-$DATE.tar.gz"

# Remote sync via rclone (configured separately)
rclone copy "$BACKUP_DIR/" s3:acadenice-formation-hub-backup/ --include "*-$DATE.*"

# Local retention (delete files older than 30 days).
# Note: this prunes every file uniformly; the weekly and monthly tiers of 4.2 are not handled here.
find "$BACKUP_DIR" -type f -mtime +30 -delete
```
`/etc/cron.d/formation-hub-backup`:
```
0 3 * * * corentin /opt/formation-hub/scripts/backup.sh >> /var/log/formation-hub-backup.log 2>&1
```
### 4.4 Monthly restore test
`scripts/restore-test.sh` runs on the 1st of each month in an isolated environment:
1. Provision an ephemeral test VPS
2. Restore the most recent backup
3. Run smoke tests
4. Verify integrity (checksum, row counts)
5. On failure: CRITICAL alert + log
6. Destroy the test VPS
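The integrity step (step 4) can start with checks that need no database at all: the gzip stream must be intact and the archive must match a checksum recorded at backup time. A minimal sketch, assuming checksums are stored next to each archive as `<file>.sha256`; that convention and the function names are hypothetical and not part of `backup.sh` as shown.

```bash
#!/usr/bin/env bash
# Sketch: archive-level integrity checks (hypothetical names and .sha256 convention).

# Write a SHA-256 checksum file next to a backup archive.
write_checksum() {
  local file=$1
  (cd "$(dirname "$file")" && sha256sum "$(basename "$file")" > "$(basename "$file").sha256")
}

# Verify an archive: the gzip stream is intact and the recorded checksum matches.
verify_backup() {
  local file=$1
  gzip -t "$file" || return 1
  (cd "$(dirname "$file")" && sha256sum -c "$(basename "$file").sha256" > /dev/null) || return 1
}
```

Row counts still have to be compared after the actual restore; this only catches truncated or corrupted archives early.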
## 5. Disaster recovery
### 5.1 DR scenarios
| Scenario | Probability | Impact | Plan |
|----------|-------------|--------|------|
| VPS down (provider issue) | Low | Service down 0-4h | Wait for the provider OR manually fail over to the backup VPS |
| Postgres corruption | Low | Data loss < 24h | Restore from the daily backup |
| Full compromise (rootkit) | Very low | Data theft | Wipe + reinstall + restore data + full audit + GDPR breach notification |
| Provider discontinues service | Very low | Service must migrate | Migrate to another provider; up to 1 week of downtime acceptable |
| Human error (rm -rf) | Medium | Variable | Daily backup + soft delete in DB |
### 5.2 RTO / RPO targets (reminder from the requirements spec)
- **RTO** (Recovery Time Objective): 4h max
- **RPO** (Recovery Point Objective): 24h max (daily backup)
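The 24h RPO only holds if a recent backup actually exists. A freshness check can confirm it. This is a sketch with a hypothetical function name; it assumes GNU `find` and the `docmost-db-` style file prefixes produced by `scripts/backup.sh`.

```bash
#!/usr/bin/env bash
# Sketch: verify that a backup younger than the RPO exists (hypothetical name).
# Requires GNU find for -quit.

# Succeed when at least one file with the given prefix in dir
# was modified less than max_hours ago.
backup_is_fresh() {
  local dir=$1 prefix=$2 max_hours=$3 found
  found=$(find "$dir" -type f -name "${prefix}*" -mmin "-$((max_hours * 60))" -print -quit)
  [ -n "$found" ]
}
```

Usage: `backup_is_fresh /opt/formation-hub/backups/local docmost-db- 24 || echo "RPO at risk"`.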
### 5.3 DR plan, step by step
```
1. DETECT
- Automatic alert OR user report
- Confirm the scope (what is down? what is lost?)
2. TRIAGE (15 min)
- Severity (CRITICAL / WARNING)
- Notify Yan + Ludo if CRITICAL
- Announce in #ops + status banner if user-facing
3. MITIGATE (depending on scenario)
- Restore backup
- Failover
- Hotfix
- Rollback
4. RESTORE
- Verify data integrity (rollups, FKs, row counts)
- Smoke tests
- "Back online" notification
5. POST-MORTEM (within 7 days)
- Timeline
- Root cause
- Action items
- Add to the runbook if the pattern recurs
```
## 6. Runbooks
One runbook per incident type. Standardized format:
```
# Runbook : <INCIDENT_TYPE>
## Symptoms
- ...
## Diagnosis
1. Check ...
2. Check ...
## Resolution
1. Step
2. Step
## Future prevention
- ...
## Rollback / escalation
- ...
```
### 6.1 Phase 1 runbooks (to be created)
| Runbook | Priority |
|---------|----------|
| `runbook-docmost-down.md` | High |
| `runbook-baserow-down.md` | High |
| `runbook-disk-full.md` | High |
| `runbook-postgres-corrupted.md` | High |
| `runbook-restore-from-backup.md` | High |
| `runbook-rotate-secrets.md` | Medium |
| `runbook-bump-docmost-version.md` | Medium |
| `runbook-bump-baserow-version.md` | Medium |
| `runbook-add-new-user.md` | Low |
| `runbook-renewal-tls.md` | Low (automatic via Traefik) |
To be stored in `docs/runbooks/` or directly in Outline for quick access during an incident.
## 7. Maintenance
### 7.1 Dependency bumps
| Type | Frequency | Process |
|------|-----------|---------|
| Auto via Dependabot (security) | Weekly | Auto-PR + CI + merge if green |
| Auto via Dependabot (minor/patch) | Weekly | Auto-PR + human review |
| Major bumps | Manual | Dedicated PR + E2E tests + business decision |
| Docmost upstream | Manual decision (tested on staging) | PR changing the image tag + E2E test |
| Baserow upstream | same | same |
| Postgres major | Once a year at most, planned | Backup + migration + restore + verification |
### 7.2 OS patches
| Type | Frequency |
|------|-----------|
| Debian security patches | Automatic via `unattended-upgrades` |
| Major Debian release | Every 2-3 years, planned |
| Reboot after kernel patch | Monthly at most, during the maintenance window |
### 7.3 Maintenance window
Announce 48h in advance if downtime > 5 min:
- Email to all Acadenice users
- Docmost / Baserow banner
- Slack #internal
Preferred slot: **Sunday 06:00-08:00 UTC** (usage is expected to be close to zero).
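A planned reboot or upgrade script could refuse to run outside the preferred slot. The guard below is a sketch with a hypothetical function name; it assumes GNU `date` and takes a Unix timestamp so that it can be checked against any moment, not only the current time.

```bash
#!/usr/bin/env bash
# Sketch: guard for the Sunday 06:00-08:00 UTC maintenance window (hypothetical name).
# Requires GNU date for -d and the %-H format.

# Succeed when the given Unix timestamp falls inside Sunday 06:00-08:00 UTC.
in_maintenance_window() {
  local ts=$1 dow hour
  dow=$(date -u -d "@$ts" +%u)     # 7 = Sunday
  hour=$(date -u -d "@$ts" +%-H)   # hour without zero padding
  [ "$dow" = "7" ] && [ "$hour" -ge 6 ] && [ "$hour" -lt 8 ]
}
```

Usage: `in_maintenance_window "$(date +%s)" || { echo "outside maintenance window"; exit 1; }`.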
## 8. Capacity planning
### 8.1 Indicators to monitor
- Active users (monthly)
- Baserow row volume (per table)
- Docmost document volume
- Upload storage
- Average CPU/RAM over 7 days
### 8.2 Upsizing triggers
| Indicator | Threshold | Action |
|-----------|-------|--------|
| Average CPU > 60% over 1 week | Trigger | Upsize VPS (4 → 8 vCPU) |
| Average RAM > 75% over 1 week | Trigger | Upsize RAM (8 → 16 GB) |
| Disk > 70% | Trigger | Upsize storage OR clean up old backups |
| Peak concurrent users > 50 | Trigger | Consider 2 replicas + load balancer |
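The weekly-average triggers reduce to a mean over collected samples compared with a threshold. A minimal sketch, assuming the samples (for example one CPU percentage per line, exported from the monitoring tool) are fed on stdin; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: evaluate an "average over 1 week" trigger (hypothetical names).

# Mean of numeric samples read from stdin, one per line, printed with one decimal.
# Prints nothing when there are no samples.
mean() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# Succeed when the mean of the samples on stdin is strictly above the threshold.
mean_exceeds() {
  local threshold=$1 avg
  avg=$(mean)
  [ -n "$avg" ] && awk -v a="$avg" -v t="$threshold" 'BEGIN { exit !(a > t) }'
}
```

Usage: `mean_exceeds 60 < cpu-samples.txt && echo "consider upsizing the VPS"`, where `cpu-samples.txt` is a hypothetical export of one week of CPU readings.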
### 8.3 Quarterly review
Every 3 months, Corentin reviews:
- Infra costs
- Fit of the current specs
- Expected growth for the next quarter
- Upsize/downsize/migrate decision
## 9. Incident response
### 9.1 Severities (reminder)
- **SEV1**: Complete service outage (CRITICAL)
- **SEV2**: Major degradation (WARNING)
- **SEV3**: Isolated bug, workaround available (INFO)
### 9.2 Comm template
During an incident:
```
[SEV1] formation-hub - Service degraded
Symptom: <what>
Started: <when>
Investigation: <where we are>
ETA: <estimate restore>
Channel: #ops
```
Post an update every 30 min.
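Filling the template by hand every 30 minutes is error-prone under stress, so it can be rendered by a small helper. This is a sketch: `incident_update` is a hypothetical name, and posting the result to Slack or Teams is deliberately left out.

```bash
#!/usr/bin/env bash
# Sketch: render the incident update template (hypothetical name).

# Arguments: severity, symptom, start time, investigation status, ETA.
incident_update() {
  local sev=$1 symptom=$2 started=$3 investigation=$4 eta=$5
  cat <<EOF
[$sev] formation-hub - Service degraded
Symptom: $symptom
Started: $started
Investigation: $investigation
ETA: $eta
Channel: #ops
EOF
}
```

Usage: `incident_update SEV1 "wiki unreachable" "09:12 UTC" "checking docmost-db" "unknown"` prints the message ready to paste into #ops.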
### 9.3 Post-mortem template
`docs/post-mortems/YYYY-MM-DD-title.md`:
```markdown
# Post-mortem: <incident title>
## Timeline
- HH:MM detection
- HH:MM triage
- HH:MM mitigation start
- HH:MM service restored
- HH:MM root cause confirmed
## Impact
- Downtime duration: Xh
- Users affected: Y
- Data loss: yes/no; if yes, how much
## Root cause
<...>
## Why did our monitoring not alert earlier?
<...>
## Action items
- [ ] AI 1 : ... (owner @who, due date)
- [ ] AI 2 : ...
## Lessons learned
<...>
```
Post-mortems are **blameless**: focus on the system, not the person.
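To keep file names consistent under `docs/post-mortems/`, the path can be derived from the incident date and title. The helpers below are a sketch with hypothetical names; note that any character outside `a-z0-9`, accented letters included, is collapsed into a dash.

```bash
#!/usr/bin/env bash
# Sketch: build a post-mortem file path from a date and a title (hypothetical names).

# Turn a free-form title into a lowercase, dash-separated slug.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# Build the post-mortem path for a date (YYYY-MM-DD) and a title.
postmortem_path() {
  echo "docs/post-mortems/$1-$(slugify "$2").md"
}
```

Usage: `postmortem_path "$(date +%F)" "Disk full on prod"` gives the path where the template above can be copied.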
## 10. Daily / Weekly / Monthly tasks
### 10.1 Daily (5 min, morning)
```
[ ] Check uptime monitoring (green?)
[ ] Check container logs (no recurring errors?)
[ ] Verify the daily backup ran (status email or log)
[ ] Check Slack #ops (nothing urgent?)
```
### 10.2 Weekly (30 min, Monday morning)
```
[ ] Review Dependabot PRs
[ ] Check disk/CPU graphs (anomalies?)
[ ] Review GitHub ops/sec issues
[ ] Update CHANGELOG if releases shipped
[ ] Plan the next release if features are ready
```
### 10.3 Monthly (2h, 1st of the month)
```
[ ] Test backup restore (DR exercise)
[ ] Audit access list (who has access to what?)
[ ] Review security alerts (CVEs, audits)
[ ] Capacity planning review
[ ] Review infra costs (vs budget)
[ ] Update runbooks if new patterns emerged
[ ] Review monitoring: overly noisy alerts? missed detections?
```
## 11. On-call rotation (future)
For now: **Corentin = primary on-call**, Yan = backup.
If more technical admins are hired later:
- Weekly rotation Corentin / Yan / N
- Weekly handoff with recap
- On-call compensation (day off or bonus)
## 12. Business communication
Channels:
- **#ops** Slack/Teams: technical team
- **#internal**: all Acadenice employees
- **Email all**: major announcements (breaking releases, maintenance)
- **Docmost banner**: live downtime / maintenance info
## 13. Operations documentation
Everything must live in `docs/runbooks/` (or Outline `[INTERNE] Runbooks`):
- How to run a manual backup
- How to restore
- How to add a user
- How to rotate secrets
- How to bump a Docmost or Baserow version
- How to investigate an alert
- How to escalate an incident
## 14. Ops tools: recap
| Tool | Phase | Cost/month |
|-------|-------|-----------|
| UptimeRobot free | Phase 1+ | 0€ |
| Uptime Kuma self-hosted | Phase 2+ | 0€ |
| Prometheus + Grafana | Phase 3+ | ~5€ resources |
| Loki | Phase 3+ | ~5€ resources |
| Sentry | Phase 4+ | 0-25€ |
| pg_dump + tar + rclone | Phase 1+ | 0€ |
| OVH Object Storage / Backblaze | Phase 1+ | ~5-10€ |
| Slack / Teams webhook | Phase 1+ | 0€ (existing) |
## 15. Open questions
- [ ] Self-hosted Uptime Kuma vs SaaS UptimeRobot for Phase 1?
- [ ] Remote backup: OVH (French data sovereignty) vs Backblaze (cost)?
- [ ] On-call rotation and compensation to be defined if we hire
- [ ] Automated runbook execution (Rundeck?) or plain Markdown?
- [ ] Public status page (Statuspage.io / self-hosted) for transparency toward users?