Wiki/docs/18-plan-operations.md
Corentin JOGUET 668576cdc4 chore: initial commit — formation-hub conception phase
2026-05-07 12:16:19 +02:00

Operations plan (RUN)

Post-launch operations strategy: monitoring, alerting, backups, DR, incident response, runbooks. Audience: Corentin (ops owner), Yan (backup), future freelancer.

1. Overview: RUN responsibilities

flowchart TB
    subgraph "Daily"
        D1[Check uptime monitoring]
        D2[Check error logs]
        D3[Review daily backups]
    end
    subgraph "Weekly"
        W1[Audit Dependabot bumps]
        W2[Check disk/CPU capacity]
        W3[Review ops issues / PRs]
    end
    subgraph "Monthly"
        M1[Test backup restore]
        M2[Review security alerts]
        M3[Audit access list]
        M4[Capacity planning review]
    end
    subgraph "On Incident"
        I1[Detect / Page]
        I2[Triage]
        I3[Mitigate / Restore]
        I4[Post-mortem]
    end

2. Monitoring

2.1 Monitoring stack (Phase 1 minimal → Phase 3 complete)

| Phase | Tool | Role | Cost |
|---|---|---|---|
| Phase 1 | UptimeRobot (free) | HTTP health check every 5 min on wiki + baserow | €0 |
| Phase 2 | + Uptime Kuma (self-hosted) | Finer granularity, custom dashboards | €0 (on the prod VPS or a dedicated VPS) |
| Phase 3 | + Prometheus + Grafana | System + app metrics, fine-grained alerting | ~€5/month (extra resources) |
| Phase 3 | + Loki | Centralized container logs | ~€5/month |
| Phase 4 | + Sentry (self-hosted or SaaS) | App error tracking, stack traces | €0-25/month |

2.2 Monitored endpoints (Phase 1)

| Endpoint | Frequency | Target SLA |
|---|---|---|
| https://wiki.acadenice.fr (HTTP 200) | 5 min | uptime >= 99% |
| https://baserow.acadenice.fr/api/_health/ | 5 min | uptime >= 99% |
| https://bridge.acadenice.fr/api/health (Phase 2+) | 5 min | uptime >= 99% |
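
As a rough illustration of what such a check does, here is a minimal sketch of an HTTP probe using curl. The function names (`is_healthy_status`, `http_status`, `probe`) and the 10-second timeout are assumptions made for this example; only the URLs come from the table above. It is not the content of the repository's healthcheck script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the URLs in the usage comment come from section 2.2.

# Success only for the status the table treats as healthy (HTTP 200).
is_healthy_status() {
  [ "$1" = "200" ]
}

# Print the HTTP status code for a URL; curl prints 000 when the request fails or times out.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# Probe one endpoint, print a one-line result, and fail when it is unhealthy.
probe() {
  local code
  code=$(http_status "$1")
  if is_healthy_status "$code"; then
    echo "OK   $1 ($code)"
  else
    echo "FAIL $1 ($code)"
    return 1
  fi
}

# Usage:
#   probe https://wiki.acadenice.fr
#   probe https://baserow.acadenice.fr/api/_health/
```

Keeping the status test separate from the network call makes the decision rule easy to check without reaching the real services.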

2.3 Key metrics (Phase 3+)

System:

  • CPU usage (alert when > 80% sustained for 5 min)
  • Memory (alert when > 85%)
  • Disk (alert when > 80%)
  • Network in/out

Application:

  • p95 latency per endpoint (bridge)
  • HTTP 5xx error rate (alert when > 1%)
  • Throughput (requests/sec)
  • Redis queue depth (Baserow Celery)
  • Active Postgres connections (alert when > 80% of pool size)

Business (custom):

  • Time entries logged per day (sentinel: a sudden drop suggests a bug in time entry)
  • Assignments created per week
  • Projects in progress
  • Trainer capacity exceeded (alert if > 0)
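
To make the 5xx threshold concrete, the sketch below computes an error rate from two counters and compares it with the 1% limit. The function names and the idea of feeding it raw request counts are assumptions for illustration; in Phase 3 this would more likely be a Prometheus alert rule.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the "HTTP 5xx error rate > 1%" rule.

# error_rate_pct ERRORS TOTAL -> percentage with two decimals (0.00 when TOTAL is 0)
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN { if (t == 0) printf "0.00"; else printf "%.2f", (e / t) * 100 }'
}

# exceeds_threshold VALUE LIMIT -> success when VALUE > LIMIT (numeric comparison)
exceeds_threshold() {
  awk -v v="$1" -v l="$2" 'BEGIN { exit !(v > l) }'
}

# Usage: 3 errors out of 200 requests is 1.50%, above the 1% limit.
#   rate=$(error_rate_pct 3 200)
#   exceeds_threshold "$rate" 1 && echo "ALERT: 5xx rate ${rate}%"
```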

3. Alerting

3.1 Channels

| Channel | Severity | Target |
|---|---|---|
| Email Corentin + Yan | All levels | corentin@acadenice.fr, yan@acadenice.fr |
| Slack/Teams #ops | warning + critical | Internal channel |
| SMS (Twilio or OVH) | critical only | Corentin (primary on-call) |

3.2 Severities

| Level | Definition | Expected response |
|---|---|---|
| CRITICAL | Service down / data loss in progress | < 15 min |
| WARNING | Performance degradation / capacity close to the limit | < 4 business hours |
| INFO | Audits, releases, successful backups | weekly review |

3.3 Initial alerts (Phase 1)

[CRITICAL] HTTP 5xx > 5% over 5 min                 → page Corentin
[CRITICAL] Service down (uptime check fails 3x)     → page Corentin + Yan
[CRITICAL] Disk > 95%                               → page
[WARNING]  CPU > 80% sustained 10 min               → email
[WARNING]  Memory > 85%                             → email
[WARNING]  Trainer capacity exceeded                → email academic admin
[INFO]     Daily backup ran (success/fail)          → log + email on failure
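
A minimal sketch of how the disk rule could be evaluated from a cron job follows. It combines the CRITICAL threshold above (95%) with the 80% disk alert from section 2.3, treated here as a WARNING; that mapping, like the function names, is an assumption of this example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; thresholds taken from sections 2.3 (80%) and 3.3 (95%).

# disk_severity PCT -> CRITICAL (> 95), WARNING (> 80) or OK
disk_severity() {
  if [ "$1" -gt 95 ]; then
    echo CRITICAL
  elif [ "$1" -gt 80 ]; then
    echo WARNING
  else
    echo OK
  fi
}

# disk_used_pct [PATH] -> used percentage of the filesystem holding PATH, as an integer
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage:
#   disk_severity "$(disk_used_pct /opt/formation-hub)"
```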

4. Backups: 3-2-1 strategy

3 copies of the data, on 2 different media, 1 of them offsite.

4.1 Backup targets

| What | Frequency | Tool | Local | Remote |
|---|---|---|---|---|
| Postgres docmost | Daily 03:00 | pg_dump.gz | /opt/formation-hub/backups/local/ | S3-compatible (OVH/Backblaze) |
| Postgres baserow (embedded) | Daily 03:00 | pg_dump.gz | same | same |
| Docmost files (uploads) | Daily 03:00 | tar.gz | same | same |
| Baserow data dir | Daily 03:00 | tar.gz | same | same |
| .env.prod (encrypted) | On change | gpg + push to vault | (none) | Out-of-band vault |

4.2 Retention

| Type | Local | Remote |
|---|---|---|
| Daily | 30 days rolling | 90 days rolling |
| Weekly (Friday) | 12 weeks | 12 months |
| Monthly (1st) | 12 months | 5 years |
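
One way to apply this retention table is to tag each backup with a tier on the day it is taken. The sketch below does that for the local column. The function names are hypothetical, and two choices are assumptions of the example rather than decisions from the plan: monthly takes precedence when the 1st falls on a Friday, and 12 months is approximated as 365 days.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mapping a calendar day to a retention tier (section 4.2, local column).

# backup_tier DAY_OF_MONTH DAY_OF_WEEK -> monthly | weekly | daily
# DAY_OF_MONTH is zero-padded (date +%d); DAY_OF_WEEK is 1..7 with 5 = Friday (date +%u).
backup_tier() {
  if [ "$1" = "01" ]; then
    echo monthly            # assumption: monthly wins over weekly
  elif [ "$2" = "5" ]; then
    echo weekly
  else
    echo daily
  fi
}

# local_retention_days TIER -> number of days to keep the local copy
local_retention_days() {
  case "$1" in
    monthly) echo 365 ;;    # 12 months, approximated
    weekly)  echo 84 ;;     # 12 weeks
    *)       echo 30 ;;     # daily, 30 days rolling
  esac
}

# Usage:
#   tier=$(backup_tier "$(date +%d)" "$(date +%u)")
#   echo "keep for $(local_retention_days "$tier") days"
```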

4.3 Backup scripts

scripts/backup.sh:

#!/usr/bin/env bash
# Daily backup: dump both databases and both data directories locally, then copy them offsite.
set -euo pipefail
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/opt/formation-hub/backups/local
mkdir -p "$BACKUP_DIR"

cd /opt/formation-hub

# Docmost Postgres (-T disables the pseudo-TTY so the dump can be piped)
docker compose -f compose.yml -f compose.prod.yml exec -T docmost-db \
  pg_dump -U docmost docmost | gzip > "$BACKUP_DIR/docmost-db-$DATE.sql.gz"

# Baserow Postgres (embedded: exec inside the baserow container)
docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  pg_dumpall -U postgres | gzip > "$BACKUP_DIR/baserow-db-$DATE.sql.gz"

# Files: stream a tar archive out of each container
docker compose -f compose.yml -f compose.prod.yml exec -T docmost \
  tar czf - /app/data/storage > "$BACKUP_DIR/docmost-files-$DATE.tar.gz"

docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  tar czf - /baserow/data > "$BACKUP_DIR/baserow-data-$DATE.tar.gz"

# Remote sync via rclone (configured separately); only this run's files are copied
rclone copy "$BACKUP_DIR/" s3:acadenice-formation-hub-backup/ --include "*-$DATE.*"

# Local retention: delete files older than 30 days.
# Note: this also removes the weekly and monthly copies that section 4.2 keeps locally for longer.
find "$BACKUP_DIR" -type f -mtime +30 -delete

/etc/cron.d/formation-hub-backup:

0 3 * * * corentin /opt/formation-hub/scripts/backup.sh >> /var/log/formation-hub-backup.log 2>&1

4.4 Monthly restore test

scripts/restore-test.sh runs on the 1st of the month in an isolated environment:

  1. Provision an ephemeral test VPS
  2. Restore the most recent backup
  3. Run the smoke tests
  4. Verify integrity (checksum, row counts)
  5. On failure: CRITICAL alert + log
  6. Destroy the test VPS
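
The checksum part of step 4 could look like the sketch below: one helper records a SHA-256 digest next to each archive at backup time, another verifies the archive before a restore. The function names and the `.sha256` sidecar file are assumptions of this example, and row-count checks are left out because they depend on the schema.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the integrity check; not the content of scripts/restore-test.sh.

# write_checksum FILE -> creates FILE.sha256 next to the archive
write_checksum() {
  sha256sum "$1" > "$1.sha256"
}

# verify_backup FILE -> success when the gzip stream is intact and the digest matches
verify_backup() {
  gzip -t "$1" 2>/dev/null && sha256sum -c "$1.sha256" >/dev/null 2>&1
}

# Usage:
#   write_checksum "$BACKUP_DIR/docmost-db-$DATE.sql.gz"      # at backup time
#   verify_backup  "$BACKUP_DIR/docmost-db-$DATE.sql.gz" || echo "CRITICAL: corrupt backup"
```

`gzip -t` catches truncated archives even when no digest was recorded, while the digest catches changes made after the backup was written.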

5. Disaster recovery

5.1 DR scenarios

| Scenario | Probability | Impact | Plan |
|---|---|---|---|
| VPS down (provider issue) | Low | Service down 0-4h | Wait for the provider OR fail over manually to a backup VPS |
| Postgres corruption | Low | Data loss < 24h | Restore from the daily backup |
| Full compromise (rootkit) | Very low | Data theft | Wipe + reinstall + restore data + full audit + GDPR breach notification |
| Provider discontinues the service | Very low | Service migrated | Migrate to another provider; up to 1 week of downtime is acceptable |
| Human error (rm -rf) | Medium | Variable | Daily backup + soft delete in DB |

5.2 RTO / RPO targets (recap from the requirements specification)

  • RTO (Recovery Time Objective): 4h max
  • RPO (Recovery Point Objective): 24h max (daily backup)
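
The RPO only holds if a fresh backup actually exists, so it can be worth checking the age of the newest backup file against the 24h limit. The sketch below is one way to do it. The function names are hypothetical, and `newest_backup_epoch` assumes GNU find (`-printf`), which is what Debian ships.

```shell
#!/usr/bin/env bash
# Hypothetical helpers checking the newest backup against the 24h RPO.

RPO_SECONDS=$((24 * 3600))

# backup_within_rpo BACKUP_EPOCH NOW_EPOCH -> success when the backup is at most 24h old
backup_within_rpo() {
  [ $(( $2 - $1 )) -le "$RPO_SECONDS" ]
}

# newest_backup_epoch DIR -> mtime (epoch seconds) of the most recent file in DIR (GNU find)
newest_backup_epoch() {
  find "$1" -type f -printf '%T@\n' | sort -n | tail -n 1 | cut -d. -f1
}

# Usage:
#   newest=$(newest_backup_epoch /opt/formation-hub/backups/local)
#   backup_within_rpo "$newest" "$(date +%s)" || echo "CRITICAL: newest backup is older than the RPO"
```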

5.3 DR plan, step by step

1. DETECT
   - Automatic alert OR user report
   - Confirm the scope (what is down? what is lost?)

2. TRIAGE (15 min)
   - Severity (CRITICAL / WARNING)
   - Notify Yan + Ludo if CRITICAL
   - Announce in #ops + status banner if user-facing

3. MITIGATE (depending on the scenario)
   - Restore backup
   - Failover
   - Hotfix
   - Rollback

4. RESTORE
   - Verify data integrity (rollups, FKs, row counts)
   - Smoke tests
   - "Back online" notification

5. POST-MORTEM (within 7 days)
   - Timeline
   - Root cause
   - Action items
   - Add to the runbook if the pattern is recurrent

6. Runbooks

One document per incident type, in a standardized format:

# Runbook : <INCIDENT_TYPE>

## Symptoms
- ...

## Diagnosis
1. Check ...
2. Check ...

## Resolution
1. Step
2. Step

## Future prevention
- ...

## Rollback / escalation
- ...

6.1 Phase 1 runbooks (to be written)

| Runbook | Priority |
|---|---|
| runbook-docmost-down.md | High |
| runbook-baserow-down.md | High |
| runbook-disk-full.md | High |
| runbook-postgres-corrupted.md | High |
| runbook-restore-from-backup.md | High |
| runbook-rotate-secrets.md | Medium |
| runbook-bump-docmost-version.md | Medium |
| runbook-bump-baserow-version.md | Medium |
| runbook-add-new-user.md | Low |
| runbook-renewal-tls.md | Low (automatic via Traefik) |

Store them in docs/runbooks/ or directly in Outline for quick access during an incident.
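
Since every runbook follows the same skeleton, a small scaffolding helper can create the files listed above. This is only a sketch: the `new_runbook` function is hypothetical, and it writes the standard sections with English headings into docs/runbooks/ by default.

```shell
#!/usr/bin/env bash
# Hypothetical scaffolding helper for the runbook template of section 6.

# new_runbook NAME [DIR] -> writes DIR/runbook-NAME.md and prints its path; fails if it exists
new_runbook() {
  local name="$1" dir="${2:-docs/runbooks}"
  local file="$dir/runbook-$name.md"
  mkdir -p "$dir"
  if [ -e "$file" ]; then
    echo "exists: $file" >&2
    return 1
  fi
  cat > "$file" <<EOF
# Runbook : $name

## Symptoms
- ...

## Diagnosis
1. Check ...

## Resolution
1. Step

## Future prevention
- ...

## Rollback / escalation
- ...
EOF
  echo "$file"
}

# Usage:
#   new_runbook disk-full
```

Refusing to overwrite an existing file keeps a hand-edited runbook from being replaced by the empty skeleton.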

7. Maintenance

7.1 Dependency bumps

| Type | Frequency | Process |
|---|---|---|
| Automatic via Dependabot (security) | Weekly | Auto-PR + CI + merge if green |
| Automatic via Dependabot (minor/patch) | Weekly | Auto-PR + human review |
| Major bumps | Manual | Dedicated PR + E2E tests + business decision |
| Docmost upstream | Manual decision (tested on staging) | PR changing the image tag + E2E test |
| Baserow upstream | same | same |
| Postgres major | Once a year at most, planned | Backup + migration + restore + verification |

7.2 OS patches

| Type | Frequency |
|---|---|
| Debian security patches | Automatic via unattended-upgrades |
| Major Debian release | Every 2-3 years, planned |
| Reboot after a kernel patch | Monthly at most, in a maintenance window |

7.3 Maintenance window

Announce 48h in advance if downtime > 5 min:

  • Email to all Acadenice users
  • Docmost / Baserow banner
  • Slack #internal

Preferred slot: Sunday 06:00-08:00 UTC (usage most likely zero).
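
A maintenance script can refuse to run outside this slot. The guard below is a minimal sketch of that idea; the function name is hypothetical and it only encodes the preferred Sunday 06:00-08:00 UTC window.

```shell
#!/usr/bin/env bash
# Hypothetical guard for the preferred maintenance slot (Sunday 06:00-08:00 UTC).

# in_maintenance_window DAY_OF_WEEK HOUR -> success inside the window
# DAY_OF_WEEK: 1..7 with 7 = Sunday (date -u +%u); HOUR: zero-padded (date -u +%H).
in_maintenance_window() {
  case "$1:$2" in
    7:06|7:07) return 0 ;;
    *)         return 1 ;;
  esac
}

# Usage:
#   in_maintenance_window "$(date -u +%u)" "$(date -u +%H)" || { echo "outside window"; exit 1; }
```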

8. Capacity planning

8.1 Indicators to watch

  • Active users (monthly)
  • Baserow row volume (per table)
  • Docmost document volume
  • Upload storage
  • Average CPU/RAM over 7 days

8.2 Upsizing triggers

| Indicator | Threshold | Action |
|---|---|---|
| Average CPU > 60% over 1 week | Trigger | Upsize VPS (4 → 8 vCPU) |
| Average RAM > 75% over 1 week | Trigger | Upsize RAM (8 → 16 GB) |
| Disk > 70% | Trigger | Upsize storage OR clean old backups |
| Peak concurrent users > 50 | Trigger | Consider 2 replicas + load balancer |
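
The triggers above are simple enough to evaluate mechanically during the capacity review. The sketch below takes the four indicators as integers and prints the actions that apply; the function name and the integer inputs are assumptions, and collecting the weekly averages is left to the monitoring stack.

```shell
#!/usr/bin/env bash
# Hypothetical evaluation of the section 8.2 triggers.

# upsize_actions CPU_PCT RAM_PCT DISK_PCT PEAK_USERS -> one line per triggered action
# CPU_PCT and RAM_PCT are 1-week averages; all arguments are integers.
upsize_actions() {
  if [ "$1" -gt 60 ]; then echo "Upsize VPS (4 -> 8 vCPU)"; fi
  if [ "$2" -gt 75 ]; then echo "Upsize RAM (8 -> 16 GB)"; fi
  if [ "$3" -gt 70 ]; then echo "Upsize storage or clean old backups"; fi
  if [ "$4" -gt 50 ]; then echo "Consider 2 replicas + load balancer"; fi
  return 0
}

# Usage:
#   upsize_actions 65 40 72 12
```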

8.3 Quarterly review

Every 3 months, Corentin reviews:

  • Infrastructure costs
  • Whether the specs still fit
  • Expected growth for the next quarter
  • Upsize / downsize / migrate decision

9. Incident response

9.1 Severities (recap)

  • SEV1: Complete service outage (CRITICAL)
  • SEV2: Major degradation (WARNING)
  • SEV3: Isolated bug, workaround available (INFO)

9.2 Comm template

During an incident:

[SEV1] formation-hub - Service degraded
Symptom: <what>
Started: <when>
Investigation: <where we are>
ETA: <estimate restore>
Channel: #ops
Post an update every 30 min.

9.3 Post-mortem template

docs/post-mortems/YYYY-MM-DD-title.md:

# Post-mortem : <incident title>

## Timeline
- HH:MM detection
- HH:MM triage
- HH:MM mitigation start
- HH:MM service restored
- HH:MM root cause confirmed

## Impact
- Downtime duration: Xh
- Users affected: Y
- Data loss: yes/no; if yes, how much

## Root cause
<...>

## Why did our monitoring not alert sooner?
<...>

## Action items
- [ ] AI 1 : ... (owner @who, due date)
- [ ] AI 2 : ...

## Lessons learned
<...>

Post-mortems are blameless: focus on the system, not the person.

10. Daily / Weekly / Monthly tasks

10.1 Daily (5 min, morning)

[ ] Check uptime monitoring (all green?)
[ ] Check container logs (no recurring errors?)
[ ] Check that the daily backup ran (status email or log)
[ ] Check Slack #ops (nothing urgent?)

10.2 Weekly (30 min, Monday morning)

[ ] Review Dependabot PRs
[ ] Check disk/CPU graphs (anomalies?)
[ ] Review GitHub ops/sec issues
[ ] Update CHANGELOG if releases went out
[ ] Plan the next release if features are ready

10.3 Monthly (2h, 1st of the month)

[ ] Test backup restore (DR exercise)
[ ] Audit access list (who has access to what?)
[ ] Review security alerts (CVEs, audits)
[ ] Capacity planning review
[ ] Review infrastructure costs (vs budget)
[ ] Update runbooks if new patterns emerged
[ ] Review monitoring: alerts too noisy? incidents missed?

11. On-call rotation (future)

For now: Corentin = primary on-call, Yan = backup.

If more technical admins are hired later:

  • Weekly rotation Corentin / Yan / N
  • Weekly handoff with a recap
  • On-call compensation (day off or bonus)

12. Business communication

Channels:

  • #ops Slack/Teams: technical team
  • #internal: all Acadenice employees
  • Email all: major announcements (breaking releases, maintenance)
  • Docmost banner: live downtime / maintenance info

13. Operations documentation

Everything must live in docs/runbooks/ (or Outline [INTERNE] Runbooks):

  • How to run a manual backup
  • How to restore
  • How to add a user
  • How to rotate secrets
  • How to bump a Docmost or Baserow version
  • How to investigate an alert
  • How to escalate an incident

14. Ops tools: recap

| Tool | Phase | Cost/month |
|---|---|---|
| UptimeRobot free | Phase 1+ | €0 |
| Uptime Kuma (self-hosted) | Phase 2+ | €0 |
| Prometheus + Grafana | Phase 3+ | ~€5 resources |
| Loki | Phase 3+ | ~€5 resources |
| Sentry | Phase 4+ | €0-25 |
| pg_dump + tar + rclone | Phase 1+ | €0 |
| OVH Object Storage / Backblaze | Phase 1+ | ~€5-10 |
| Slack / Teams webhook | Phase 1+ | €0 (existing) |

15. Open questions

  • Self-hosted Uptime Kuma vs SaaS UptimeRobot for Phase 1?
  • Remote backup: OVH (French data sovereignty) vs Backblaze (cost)?
  • On-call rotation and compensation to be defined if someone is hired
  • Automated runbook execution (Rundeck?) or plain markdown?
  • Public status page (Statuspage.io / self-hosted) for transparency toward users?