# Operations plan (RUN)

> Post-launch operations strategy: monitoring, alerting, backups, DR, incident response, runbooks.
> Audience: Corentin (ops owner), Yan (backup), future freelancer.

## 1. Overview: RUN responsibilities

```mermaid
flowchart TB
    subgraph "Daily"
        D1[Check uptime monitoring]
        D2[Check error logs]
        D3[Review daily backups]
    end
    subgraph "Weekly"
        W1[Audit Dependabot bumps]
        W2[Check disk/CPU capacity]
        W3[Review issues / PR ops]
    end
    subgraph "Monthly"
        M1[Test backup restore]
        M2[Review security alerts]
        M3[Audit access list]
        M4[Capacity planning review]
    end
    subgraph "On Incident"
        I1[Detect / Page]
        I2[Triage]
        I3[Mitigate / Restore]
        I4[Post-mortem]
    end
```

## 2. Monitoring

### 2.1 Monitoring stack (Phase 1 minimal → Phase 3 complete)

| Phase | Tool | Role | Cost |
|-------|------|------|------|
| **Phase 1** | UptimeRobot (free) | HTTP healthcheck every 5 min on wiki + baserow | €0 |
| **Phase 2** | + Uptime Kuma (self-hosted) | Finer granularity, custom dashboards | €0 (on the prod VPS or a dedicated VPS) |
| **Phase 3** | + Prometheus + Grafana | System + app metrics, fine-grained alerting | ~€5/month (extra resources) |
| **Phase 3** | + Loki | Centralized container logs | ~€5/month |
| **Phase 4** | + Sentry (self-hosted or SaaS) | Application error tracking, stack traces | €0-25/month |

### 2.2 Monitored endpoints (Phase 1)

| Endpoint | Frequency | Target SLA |
|----------|-----------|------------|
| `https://wiki.acadenice.fr` (HTTP 200) | 5 min | uptime >= 99% |
| `https://baserow.acadenice.fr/api/_health/` | 5 min | uptime >= 99% |
| `https://bridge.acadenice.fr/api/health` (Phase 2+) | 5 min | uptime >= 99% |

### 2.3 Key metrics (Phase 3+)

System:
- CPU usage (alert > 80% sustained 5 min)
- Memory (alert > 85%)
- Disk (alert > 80%)
- Network in/out

Application:
- p95 latency per endpoint (bridge)
- HTTP 5xx error rate (alert > 1%)
- Throughput in requests/sec
- Redis queue depth (Baserow celery)
- Active Postgres connections (alert > 80% of pool size)

Business (custom):
- Hours entries per day (sentinel: a sudden drop suggests a data-entry bug)
- Assignments created per week
- Projects in progress
- Trainers over capacity (alert if > 0)

## 3. Alerting

### 3.1 Channels

| Channel | Severity | Target |
|---------|----------|--------|
| Email Corentin + Yan | All levels | corentin@acadenice.fr, yan@acadenice.fr |
| Slack/Teams #ops | warning + critical | Internal channel |
| SMS (Twilio or OVH) | critical only | Corentin (primary on-call) |

### 3.2 Severities

| Level | Definition | Expected response |
|-------|------------|-------------------|
| **CRITICAL** | Service down / data loss in progress | < 15 min |
| **WARNING** | Performance degradation / capacity close to the limit | < 4 business hours |
| **INFO** | Audit, releases, successful backups | weekly review |

### 3.3 Initial alerts (Phase 1)

```
[CRITICAL] HTTP 5xx > 5% over 5 min → page Corentin
[CRITICAL] Service down (uptime check fails 3x) → page Corentin + Yan
[CRITICAL] Disk > 95% → page
[WARNING]  CPU > 80% sustained 10 min → email
[WARNING]  Memory > 85% → email
[WARNING]  Trainer capacity exceeded → email pedagogical admin
[INFO]     Daily backup ran (success/fail) → log + email on failure
```

## 4. Backups: 3-2-1 strategy

**3** copies of the data, on **2** different media, **1** of them offsite.
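
As a rough illustration of the 3-2-1 rule, the sketch below checks a backup inventory against the three thresholds. The `check_321` function name and the `medium:site` argument format are assumptions made for this example; they are not part of the project scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Each argument describes one copy of the data as "medium:site",
# where site is either "onsite" or "offsite".
check_321() {
  local copies=$# media offsite=0 entry

  # Count distinct media (the part before the colon).
  media=$(printf '%s\n' "$@" | awk -F: 'NF && !seen[$1]++ { n++ } END { print n + 0 }')

  # Count copies stored offsite (the part after the colon).
  for entry in "$@"; do
    if [[ ${entry#*:} == offsite ]]; then
      offsite=$((offsite + 1))
    fi
  done

  # 3 copies, 2 distinct media, 1 offsite.
  if (( copies >= 3 && media >= 2 && offsite >= 1 )); then
    echo "ok"
  else
    echo "fail: copies=$copies media=$media offsite=$offsite"
    return 1
  fi
}
```

For instance, `check_321 vps-disk:onsite vps-disk:onsite s3:offsite` (live data, local dump, remote copy) prints `ok`, while dropping the `s3:offsite` entry makes the function return a non-zero status.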

### 4.1 Backup targets

| What | Frequency | Tool | Local | Remote |
|------|-----------|------|-------|--------|
| Postgres docmost | Daily 03:00 | `pg_dump` + `gzip` | `/opt/formation-hub/backups/local/` | S3-compatible (OVH/Backblaze) |
| Postgres baserow (embedded) | Daily 03:00 | `pg_dumpall` + `gzip` | same | same |
| Docmost files (uploads) | Daily 03:00 | `tar.gz` | same | same |
| Baserow data dir | Daily 03:00 | `tar.gz` | same | same |
| `.env.prod` (encrypted) | On change | gpg + push to vault | (none) | Out-of-band vault |

### 4.2 Retention

| Type | Local | Remote |
|------|-------|--------|
| Daily | 30 days rolling | 90 days rolling |
| Weekly (Friday) | 12 weeks | 12 months |
| Monthly (1st) | 12 months | 5 years |

### 4.3 Backup scripts

`scripts/backup.sh`:

```bash
#!/usr/bin/env bash
# Daily backup: dump both Postgres databases, archive the data directories,
# push the new archives to remote storage, then prune old local copies.
set -euo pipefail

DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/opt/formation-hub/backups/local
mkdir -p "$BACKUP_DIR"

cd /opt/formation-hub

# Postgres docmost
docker compose -f compose.yml -f compose.prod.yml exec -T docmost-db \
  pg_dump -U docmost docmost | gzip > "$BACKUP_DIR/docmost-db-$DATE.sql.gz"

# Postgres baserow (embedded: exec inside the baserow container)
docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  pg_dumpall -U postgres | gzip > "$BACKUP_DIR/baserow-db-$DATE.sql.gz"

# Files
docker compose -f compose.yml -f compose.prod.yml exec -T docmost \
  tar czf - /app/data/storage > "$BACKUP_DIR/docmost-files-$DATE.tar.gz"

docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  tar czf - /baserow/data > "$BACKUP_DIR/baserow-data-$DATE.tar.gz"

# Remote sync via rclone (configured separately); only this run's files
rclone copy "$BACKUP_DIR/" s3:acadenice-formation-hub-backup/ --include "*-$DATE.*"

# Local retention (deletes files older than 30 days).
# NOTE: this prunes every local file past 30 days; the weekly and monthly
# local tiers from section 4.2 are not implemented here.
find "$BACKUP_DIR" -type f -mtime +30 -delete
```

`/etc/cron.d/formation-hub-backup`:

```
0 3 * * * corentin /opt/formation-hub/scripts/backup.sh >> /var/log/formation-hub-backup.log 2>&1
```

### 4.4 Monthly restore test

`scripts/restore-test.sh` runs on the 1st of each month in an isolated environment:

1. Provision an ephemeral test VPS
2. Restore the most recent backup
3. Run smoke tests
4. Verify integrity (checksum, row counts)
5. On failure: CRITICAL alert + log
6. Destroy the test VPS

## 5. Disaster recovery

### 5.1 DR scenarios

| Scenario | Likelihood | Impact | Plan |
|----------|------------|--------|------|
| VPS down (provider issue) | Low | Service down 0-4h | Wait for the provider OR fail over manually to the backup VPS |
| Postgres corruption | Low | Data loss < 24h | Restore from the daily backup |
| Full compromise (rootkit) | Very low | Data theft | Wipe + reinstall + restore data + full audit + GDPR notification |
| Provider discontinues the service | Very low | Service must be migrated | Migrate to another provider; up to 1 week of downtime is acceptable |
| Human error (rm -rf) | Medium | Variable | Daily backup + soft delete in DB |

### 5.2 RTO / RPO targets (reminder from the requirements spec, CDC)

- **RTO** (Recovery Time Objective): 4h max
- **RPO** (Recovery Point Objective): 24h max (daily backup)

### 5.3 DR plan, step by step

```
1. DETECT
   - Automatic alert OR user report
   - Confirm the scope (who is down? what is lost?)

2. TRIAGE (15 min)
   - Severity (CRITICAL / WARNING)
   - Notify Yan + Ludo if CRITICAL
   - Announce in #ops + status banner if user-facing

3. MITIGATE (depending on scenario)
   - Restore backup
   - Failover
   - Hotfix
   - Rollback

4. RESTORE
   - Verify data integrity (rollups, FK, row counts)
   - Smoke tests
   - "Back online" notification

5. POST-MORTEM (within 7 days)
   - Timeline
   - Root cause
   - Action items
   - Add to the runbook if the pattern recurs
```

## 6. Runbooks

One document per incident type, in a standardized format:

```
# Runbook :

## Symptoms
- ...

## Diagnosis
1. Check ...
2. Check ...

## Resolution
1. Step
2. Step

## Future prevention
- ...

## Rollback / escalation
- ...
```

### 6.1 Phase 1 runbooks (to be written)

| Runbook | Priority |
|---------|----------|
| `runbook-docmost-down.md` | High |
| `runbook-baserow-down.md` | High |
| `runbook-disk-full.md` | High |
| `runbook-postgres-corrupted.md` | High |
| `runbook-restore-from-backup.md` | High |
| `runbook-rotate-secrets.md` | Medium |
| `runbook-bump-docmost-version.md` | Medium |
| `runbook-bump-baserow-version.md` | Medium |
| `runbook-add-new-user.md` | Low |
| `runbook-renewal-tls.md` | Low (automatic via Traefik) |

Store them in `docs/runbooks/` or directly in Outline for quick access during an incident.

## 7. Maintenance

### 7.1 Dependency bumps

| Type | Frequency | Process |
|------|-----------|---------|
| Automatic via Dependabot (security) | Weekly | Auto-PR + CI + merge if green |
| Automatic via Dependabot (minor/patch) | Weekly | Auto-PR + human review |
| Major bumps | Manual | Dedicated PR + E2E tests + business decision |
| Docmost upstream | Manual decision (tested on staging) | PR changing the image tag + E2E test |
| Baserow upstream | same | same |
| Postgres major | At most yearly, planned | Backup + migration + restore + verification |

### 7.2 OS patches

| Type | Frequency |
|------|-----------|
| Debian security patches | Automatic via `unattended-upgrades` |
| Major Debian release | Every 2-3 years, planned |
| Reboot after a kernel patch | At most monthly, during a maintenance window |

### 7.3 Maintenance window

Communicate 48h in advance if downtime > 5 min:
- Email to all Acadenice users
- Docmost / Baserow banner
- Slack #internal

Preferred slot: **Sunday 06:00-08:00 UTC** (usage is likely zero).

## 8. Capacity planning

### 8.1 Indicators to watch

- Active users (monthly)
- Baserow row volume (per table)
- Docmost document volume
- Uploads storage
- Average CPU/RAM over 7 days

### 8.2 Upsizing triggers

| Indicator | Threshold | Action |
|-----------|-----------|--------|
| Average CPU over 1 week | > 60% | Upsize VPS (4 → 8 vCPU) |
| Average RAM over 1 week | > 75% | Upsize RAM (8 → 16 GB) |
| Disk usage | > 70% | Upsize storage OR clean old backups |
| Peak concurrent users | > 50 | Consider 2 replicas + load balancer |

### 8.3 Quarterly review

Every 3 months, Corentin reviews:
- Infrastructure costs
- Whether the specs still fit
- Expected growth over the next quarter
- Upsize/downsize/migrate decision

## 9. Incident response

### 9.1 Severities (reminder)

- **SEV1**: Complete service outage (CRITICAL)
- **SEV2**: Major degradation (WARNING)
- **SEV3**: Isolated bug, workaround available (INFO)

### 9.2 Comm template

During an incident:

```
[SEV1] formation-hub - Service degraded
Symptom:
Started:
Investigation:
ETA:
Channel: #ops
```

Post an update every 30 min.

### 9.3 Post-mortem template

`docs/post-mortems/YYYY-MM-DD-titre.md`:

```markdown
# Post-mortem :

## Timeline
- HH:MM detection
- HH:MM triage
- HH:MM mitigation start
- HH:MM service restored
- HH:MM root cause confirmed

## Impact
- Downtime duration: Xh
- Users impacted: Y
- Data loss: yes/no; if yes, how much

## Root cause
<...>

## Why didn't our monitoring alert sooner?
<...>

## Action items
- [ ] AI 1 : ... (owner @who, due date)
- [ ] AI 2 : ...

## Lessons learned
<...>
```

Post-mortems are **blameless**: focus on the system, not the person.

## 10. Daily / Weekly / Monthly tasks

### 10.1 Daily (5 min, morning)

```
[ ] Check uptime monitoring (all green?)
[ ] Check container logs (no recurring errors?)
[ ] Check that the daily backup ran (status email or log)
[ ] Check Slack #ops (nothing urgent?)
```

### 10.2 Weekly (30 min, Monday morning)

```
[ ] Review Dependabot PRs
[ ] Check disk/CPU graphs (anomalies?)
[ ] Review GitHub ops/sec issues
[ ] Update CHANGELOG if releases went out
[ ] Plan the next release if features are ready
```

### 10.3 Monthly (2h, 1st of the month)

```
[ ] Test backup restore (DR exercise)
[ ] Audit access list (who has access to what?)
[ ] Review security alerts (CVE, audits)
[ ] Capacity planning review
[ ] Review infrastructure costs (vs budget)
[ ] Update runbooks if new patterns appeared
[ ] Review monitoring: noisy alerts? missed detections?
```

## 11. On-call rotation (future)

For now: **Corentin = primary on-call**, Yan = backup.

If more technical admins are hired later:
- Weekly rotation Corentin / Yan / N
- Weekly handoff with recap
- On-call compensation (day off or bonus)

## 12. Business communication

Channels:
- **#ops** Slack/Teams: technical team
- **#internal**: all Acadenice employees
- **Email all**: major announcements (breaking releases, maintenance)
- **Docmost banner**: live downtime / maintenance information

## 13. Operations documentation

Everything must live in `docs/runbooks/` (or Outline `[INTERNE] Runbooks`):
- How to run a manual backup
- How to restore
- How to add a user
- How to rotate secrets
- How to bump a Docmost or Baserow version
- How to investigate an alert
- How to escalate an incident

## 14. Ops tools: recap

| Tool | Phase | Cost/month |
|------|-------|------------|
| UptimeRobot free | Phase 1+ | €0 |
| Uptime Kuma (self-hosted) | Phase 2+ | €0 |
| Prometheus + Grafana | Phase 3+ | ~€5 resources |
| Loki | Phase 3+ | ~€5 resources |
| Sentry | Phase 4+ | €0-25 |
| pg_dump + tar + rclone | Phase 1+ | €0 |
| OVH Object Storage / Backblaze | Phase 1+ | ~€5-10 |
| Slack / Teams webhook | Phase 1+ | €0 (existing) |

## 15. Open questions

- [ ] Self-hosted Uptime Kuma vs SaaS UptimeRobot for Phase 1?
- [ ] Remote backup: OVH (French data sovereignty) vs Backblaze (cost)?
- [ ] On-call rotation and compensation, to be defined if someone is hired
- [ ] Automated runbook execution (Rundeck?) or plain markdown?
- [ ] Public status page (Statuspage.io / self-hosted) for transparency toward users?