Design complete (Phase 0) for formation-hub Acadenice:
- 19 docs: Merise Agile + UML + GitOps + plans (tests/deploy/ops/api); see docs/00-readme.md for the full index
- Docker Compose stack (Docmost + Baserow + Postgres + Redis + MinIO local FS): compose.yml + compose.staging.yml + compose.prod.yml
- CI/CD GitHub Actions skeleton (ci, deploy-staging, deploy-prod)
- Bridge service skeleton (Hono + TS + Biome + Vitest + zod + pino)
- GitHub templates: PR + 3 issue types + CODEOWNERS + dependabot.yml
- Ops scripts: healthcheck, daily backup, post-deploy smoke test
- LICENSE AGPL-3.0 + SECURITY.md + CONTRIBUTING.md + CHANGELOG.md
- drawIO infra architecture diagram (XML importable into diagrams.net)
Structural decisions recorded:
- Scope CFA + Agency with a multi-role pivot entity PERSONNE (ADR-001)
- Composite stack Docmost AGPL + Baserow MIT + custom bridge (ADR-001)
- Path B: quasi-unified UX via custom Tiptap node-views (ADR-002)
- Monorepo, trunk-based development (ADR-003)
- Separate Postgres for Docmost/Baserow (ADR-004)
- Bridge stack Node 22 + Hono (ADR-005)
- New repo preferred over a Docmost fork
- Prod-like from day 1 (not an MVP)
Operations plan (RUN)
Post-launch operations strategy: monitoring, alerting, backups, DR, incident response, runbooks. Audience: Corentin (ops owner), Yan (backup), future freelancer.
1. Overview: RUN responsibilities
flowchart TB
    subgraph "Daily"
        D1[Check uptime monitoring]
        D2[Check error logs]
        D3[Review daily backups]
    end
    subgraph "Weekly"
        W1[Audit dependabot bumps]
        W2[Check disk/CPU capacity]
        W3[Review issues / PR ops]
    end
    subgraph "Monthly"
        M1[Test backup restore]
        M2[Review security alerts]
        M3[Audit access list]
        M4[Capacity planning review]
    end
    subgraph "On Incident"
        I1[Detect / Page]
        I2[Triage]
        I3[Mitigate / Restore]
        I4[Post-mortem]
    end
2. Monitoring
2.1 Monitoring stack (Phase 1 minimal → Phase 3 complete)
| Phase | Tool | Role | Cost |
|---|---|---|---|
| Phase 1 | UptimeRobot (free) | HTTP healthcheck every 5 min on wiki + baserow | 0€ |
| Phase 2 | + Uptime Kuma self-hosted | Finer granularity, custom dashboards | 0€ (on the prod VPS or a dedicated VPS) |
| Phase 3 | + Prometheus + Grafana | System + app metrics, fine-grained alerting | ~5€/month (extra resources) |
| Phase 3 | + Loki | Centralized container logs | ~5€/month |
| Phase 4 | + Sentry self-hosted or SaaS | App error tracking, stack traces | 0€-25€/month |
2.2 Monitored endpoints (Phase 1)
| Endpoint | Frequency | Target SLA |
|---|---|---|
| https://wiki.acadenice.fr (HTTP 200) | 5 min | uptime >= 99% |
| https://baserow.acadenice.fr/api/_health/ | 5 min | uptime >= 99% |
| https://bridge.acadenice.fr/api/health (Phase 2+) | 5 min | uptime >= 99% |
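As an illustration of what a check against the endpoints above could look like when run by hand or from a script, here is a minimal sketch. The function names (`status_label`, `probe`, `check_endpoint`) are hypothetical and not part of the repository; treating only HTTP 200 as healthy is an assumption taken from the first row of the table.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper names, not the project's healthcheck script.
set -eu

# Map an HTTP status code to UP or DOWN. Only 200 counts as healthy here.
status_label() {
  if [ "$1" = "200" ]; then echo "UP"; else echo "DOWN"; fi
}

# Fetch only the status code. curl prints 000 when the connection fails,
# and "|| true" keeps the script alive so the failure can be reported.
probe() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1" || true
}

# Print one status line for an endpoint, e.g. "UP <url> (200)".
check_endpoint() {
  code=$(probe "$1")
  echo "$(status_label "$code") $1 ($code)"
}
```

Calling `check_endpoint https://wiki.acadenice.fr` would print a single UP or DOWN line, which is easy to grep from a cron log.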
2.3 Key metrics (Phase 3+)
System:
- CPU usage (alert > 80% sustained 5 min)
- Memory (alert > 85%)
- Disk (alert > 80%)
- Network in/out
Application:
- p95 latency per endpoint (bridge)
- HTTP 5xx error rate (alert > 1%)
- Throughput in requests/sec
- Redis queue depth (Baserow Celery)
- Active Postgres connections (alert > 80% of pool size)
Business (custom):
- Number of hour entries per day (sentinel: a sudden drop suggests a bug in time entry)
- Number of assignments created per week
- Number of projects in progress
- Trainers over capacity (alert if > 0)
3. Alerting
3.1 Channels
| Channel | Severity | Target |
|---|---|---|
| Email Corentin + Yan | All levels | corentin@acadenice.fr, yan@acadenice.fr |
| Slack/Teams #ops | warning + critical | Internal channel |
| SMS (Twilio or OVH) | critical only | Corentin (primary on-call) |
3.2 Severities
| Level | Definition | Expected response |
|---|---|---|
| CRITICAL | Service down / data loss in progress | < 15 min |
| WARNING | Performance degradation / capacity close to limit | < 4 business hours |
| INFO | Audit, releases, backups OK | weekly review |
3.3 Initial alerts (Phase 1)
[CRITICAL] HTTP 5xx > 5% over 5 min → page Corentin
[CRITICAL] Service down (uptime check fails 3x) → page Corentin + Yan
[CRITICAL] Disk > 95% → page
[WARNING] CPU > 80% sustained 10 min → email
[WARNING] Memory > 85% → email
[WARNING] Trainer capacity exceeded → email pedagogical admin
[INFO] Daily backup executed (success/fail) → log + email on failure
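To make the threshold rules above concrete, the following sketch evaluates a single sample against them. It is an assumption-laden illustration: the metric names (`http_5xx_pct`, `disk_pct`, `cpu_pct`, `mem_pct`) and both function names are hypothetical, values are whole percentages, and the "sustained N min" condition is not modelled here (a real alerting tool would handle that window).

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical metric and function names.
set -eu

# severity_for <metric> <integer value>: prints CRITICAL, WARNING or OK.
severity_for() {
  metric=$1
  value=$2
  case "$metric" in
    http_5xx_pct) limit=5;  level=CRITICAL ;;
    disk_pct)     limit=95; level=CRITICAL ;;
    cpu_pct)      limit=80; level=WARNING ;;
    mem_pct)      limit=85; level=WARNING ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  if [ "$value" -gt "$limit" ]; then echo "$level"; else echo "OK"; fi
}

# uptime_state <result>...: prints DOWN once the most recent checks
# include three consecutive "fail" results, UP otherwise.
uptime_state() {
  fails=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then fails=$((fails + 1)); else fails=0; fi
  done
  if [ "$fails" -ge 3 ]; then echo "DOWN"; else echo "UP"; fi
}
```

For example, `severity_for disk_pct 96` prints CRITICAL, and `uptime_state ok fail fail fail` prints DOWN, which is the point at which the rules say to page.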
4. Backups: 3-2-1 strategy
3 copies of the data, on 2 different media, 1 of them offsite.
4.1 Backup targets
| What | Frequency | Tool | Local | Remote |
|---|---|---|---|---|
| Postgres docmost | Daily 03:00 | pg_dump + gzip | /opt/formation-hub/backups/local/ | S3-compatible (OVH/Backblaze) |
| Postgres baserow embedded | Daily 03:00 | pg_dumpall + gzip | same | same |
| Docmost files (uploads) | Daily 03:00 | tar.gz | same | same |
| Baserow data dir | Daily 03:00 | tar.gz | same | same |
| .env.prod (encrypted) | On change | gpg + push to vault | (none) | Out-of-band vault |
4.2 Retention
| Type | Local | Remote |
|---|---|---|
| Daily | 30 days rolling | 90 days rolling |
| Weekly (Friday) | 12 weeks | 12 months |
| Monthly (1st) | 12 months | 5 years |
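One way to apply the table above is to classify each backup by its date before deciding how long to keep it. The sketch below is an assumption, not the project's retention logic: the function names are hypothetical, it relies on GNU `date -d`, and it lets the monthly rule win when the 1st falls on a Friday.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, requires GNU date.
set -eu

# retention_class <YYYY-MM-DD>: prints monthly, weekly or daily.
retention_class() {
  day=$(date -d "$1" +%d)      # day of month, zero padded
  weekday=$(date -d "$1" +%u)  # 1 = Monday ... 7 = Sunday
  if [ "$day" = "01" ]; then
    echo monthly
  elif [ "$weekday" = "5" ]; then
    echo weekly
  else
    echo daily
  fi
}

# local_retention_days <class>: local retention from the table, in days.
local_retention_days() {
  case "$1" in
    daily)   echo 30 ;;
    weekly)  echo 84 ;;   # 12 weeks
    monthly) echo 365 ;;  # 12 months, approximated as 365 days
    *) return 2 ;;
  esac
}
```

A pruning job could then compare a file's age against `local_retention_days "$(retention_class "$date")"` instead of applying a single cutoff to every file.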
4.3 Backup scripts
scripts/backup.sh:
#!/usr/bin/env bash
# Daily backup: dump both databases, archive files, sync offsite, prune local copies.
set -euo pipefail
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/opt/formation-hub/backups/local
# Compose invocation shared by every step below
COMPOSE=(docker compose -f compose.yml -f compose.prod.yml)
mkdir -p "$BACKUP_DIR"
cd /opt/formation-hub
# Postgres docmost
"${COMPOSE[@]}" exec -T docmost-db \
  pg_dump -U docmost docmost | gzip > "$BACKUP_DIR/docmost-db-$DATE.sql.gz"
# Postgres baserow (embedded: exec inside the baserow container)
"${COMPOSE[@]}" exec -T baserow \
  pg_dumpall -U postgres | gzip > "$BACKUP_DIR/baserow-db-$DATE.sql.gz"
# Files
"${COMPOSE[@]}" exec -T docmost \
  tar czf - /app/data/storage > "$BACKUP_DIR/docmost-files-$DATE.tar.gz"
"${COMPOSE[@]}" exec -T baserow \
  tar czf - /baserow/data > "$BACKUP_DIR/baserow-data-$DATE.tar.gz"
# Offsite sync via rclone (configured separately)
rclone copy "$BACKUP_DIR/" s3:acadenice-formation-hub-backup/ --include "*-$DATE.*"
# Local retention (delete files older than 30 days).
# Note: this prunes every file older than 30 days, so the longer weekly/monthly
# local retention of section 4.2 is not implemented here.
find "$BACKUP_DIR" -type f -mtime +30 -delete
/etc/cron.d/formation-hub-backup:
0 3 * * * corentin /opt/formation-hub/scripts/backup.sh >> /var/log/formation-hub-backup.log 2>&1
4.4 Monthly restore test
scripts/restore-test.sh runs on the 1st of each month in an isolated environment:
- Provisions an ephemeral test VPS
- Restores the most recent backup
- Runs smoke tests
- Verifies integrity (checksum, row counts)
- On failure: CRITICAL alert + log
- Destroys the test VPS
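The integrity step in the list above could start with a cheap archive check before any restore is attempted. This is a minimal sketch under stated assumptions: `verify_backup` is a hypothetical name, the expected SHA-256 is assumed to have been recorded when the backup was taken (the backup script shown in this plan does not do that), and `sha256sum` from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper, assumes a checksum recorded at backup time.
set -eu

# verify_backup <file.gz> <expected sha256>: prints OK or a FAIL reason.
verify_backup() {
  file=$1
  expected=$2
  # A missing or zero-byte file usually means the dump step failed.
  if [ ! -s "$file" ]; then echo "FAIL empty-or-missing"; return 0; fi
  # gzip -t decompresses to nowhere and fails on truncated or corrupt data.
  if ! gzip -t "$file" 2>/dev/null; then echo "FAIL corrupt-gzip"; return 0; fi
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  if [ "$actual" != "$expected" ]; then echo "FAIL checksum-mismatch"; return 0; fi
  echo "OK"
}
```

Any output other than OK would be the cue to raise the CRITICAL alert described above. Row-count checks still need a restored database and are out of scope for this sketch.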
5. Disaster recovery
5.1 DR scenarios
| Scenario | Probability | Impact | Plan |
|---|---|---|---|
| VPS down (provider issue) | Low | Service down 0-4h | Wait for provider OR manual failover to backup VPS |
| Postgres corruption | Low | Data loss < 24h | Restore from daily backup |
| Full compromise (rootkit) | Very low | Data theft | Wipe + reinstall + restore data + full audit + GDPR notification |
| Provider discontinues service | Very low | Service migrated | Migrate to another provider, up to 1 week of downtime acceptable |
| Human error (rm -rf) | Medium | Variable | Daily backup + soft delete in DB |
5.2 RTO / RPO targets (reminder from the CDC)
- RTO (Recovery Time Objective): 4h max
- RPO (Recovery Point Objective): 24h max (daily backup)
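The RPO target above can be checked mechanically by comparing the age of the newest backup with the allowed window. A minimal sketch, assuming a hypothetical `rpo_status` helper that takes Unix timestamps so the arithmetic stays independent of how the timestamps are obtained:

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper name.
set -eu

# rpo_status <newest backup epoch> <now epoch> [max hours, default 24]
# Prints OK while the newest backup is within the window, BREACH otherwise.
rpo_status() {
  newest=$1
  now=$2
  max_hours=${3:-24}
  # Compare in seconds so that 24h01m is not rounded down to 24h.
  if [ $((now - newest)) -gt $((max_hours * 3600)) ]; then
    echo "BREACH"
  else
    echo "OK"
  fi
}
```

On a GNU system the inputs could come from `stat -c %Y <newest backup file>` and `date +%s`; both are assumptions about the host, not something this plan specifies.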
5.3 DR plan, step by step
1. DETECT
- Automatic alert OR user report
- Confirm the scope (who is down? what is lost?)
2. TRIAGE (15 min)
- Severity (CRITICAL / WARNING)
- Notify Yan + Ludo if CRITICAL
- Announce on the #ops channel + status banner if user-facing
3. MITIGATE (depending on scenario)
- Restore backup
- Failover
- Hotfix
- Rollback
4. RESTORE
- Verify data integrity (rollups, FK, row counts)
- Smoke tests
- "Back online" notification
5. POST-MORTEM (within 7 days)
- Timeline
- Root cause
- Action items
- Add to the runbook if the pattern is recurrent
6. Runbooks
Documentation per incident type. Standardized format:
# Runbook: <INCIDENT_TYPE>
## Symptoms
- ...
## Diagnosis
1. Check ...
2. Check ...
## Resolution
1. Step
2. Step
## Future prevention
- ...
## Rollback / escalation
- ...
6.1 Phase 1 runbooks (to be written)
| Runbook | Priority |
|---|---|
| runbook-docmost-down.md | High |
| runbook-baserow-down.md | High |
| runbook-disk-full.md | High |
| runbook-postgres-corrupted.md | High |
| runbook-restore-from-backup.md | High |
| runbook-rotate-secrets.md | Medium |
| runbook-bump-docmost-version.md | Medium |
| runbook-bump-baserow-version.md | Medium |
| runbook-add-new-user.md | Low |
| runbook-renewal-tls.md | Low (automatic via Traefik) |
Store them in docs/runbooks/ or directly on Outline for quick access during an incident.
7. Maintenance
7.1 Dependency bumps
| Type | Frequency | Process |
|---|---|---|
| Automatic via Dependabot (security) | Weekly | Auto-PR + CI + merge if green |
| Automatic via Dependabot (minor/patch) | Weekly | Auto-PR + human review |
| Major bumps | Manual | Dedicated PR + E2E tests + business decision |
| Docmost upstream | Manual decision (tested on staging) | PR changing the image tag + E2E test |
| Baserow upstream | same | same |
| Postgres major | Once a year at most, planned | Backup + migration + restore + verification |
7.2 OS patches
| Type | Frequency |
|---|---|
| Debian security patches | Automatic via unattended-upgrades |
| Major Debian release | Every 2-3 years, planned |
| Reboot after kernel patch | Once a month at most, in a maintenance window |
7.3 Maintenance window
Give 48h notice for any downtime > 5 min:
- Email to all Acadenice users
- Docmost / Baserow banner
- Slack #internal
Preferred slot: Sunday 06:00-08:00 UTC (usage is most likely zero).
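The 48h notice rule above is simple date arithmetic. As a sketch, assuming GNU `date` and a hypothetical `notice_deadline` helper, the latest moment to send the announcement for a given window start can be computed like this:

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper, requires GNU date.
set -eu

# notice_deadline <"YYYY-MM-DD HH:MM" in UTC> [lead hours, default 48]
# Prints the latest announcement time as an ISO-like UTC timestamp.
notice_deadline() {
  # -u makes date parse and print in UTC; working in epoch seconds avoids
  # the ambiguous parsing of relative offsets next to a time of day.
  start=$(date -u -d "$1" +%s)
  lead_hours=${2:-48}
  date -u -d "@$((start - lead_hours * 3600))" +%Y-%m-%dT%H:%MZ
}
```

For a window starting Sunday 2025-06-08 at 06:00 UTC, `notice_deadline "2025-06-08 06:00"` prints `2025-06-06T06:00Z`.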
8. Capacity planning
8.1 Indicators to watch
- Number of active users (monthly)
- Baserow row volume (per table)
- Docmost document volume
- Upload storage
- Average CPU/RAM over 7 days
8.2 Upsizing triggers
| Indicator | Threshold | Action |
|---|---|---|
| Average CPU over 1 week | > 60% | Upsize VPS (4 → 8 vCPU) |
| Average RAM over 1 week | > 75% | Upsize RAM (8 → 16 GB) |
| Disk usage | > 70% | Upsize storage OR clean old backups |
| Peak concurrent users | > 50 | Consider 2 replicas + load balancer |
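The triggers in the table above lend themselves to a small check that can run during the weekly or quarterly review. This is only a sketch: `upsize_actions` and the action labels it prints are hypothetical, and the inputs are assumed to be whole numbers already averaged or peaked over the relevant period.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper and action labels.
set -eu

# upsize_actions <avg cpu %> <avg ram %> <disk %> <peak concurrent users>
# Prints one line per trigger that fired, or "none".
upsize_actions() {
  fired=0
  if [ "$1" -gt 60 ]; then echo "upsize-vcpu"; fired=1; fi
  if [ "$2" -gt 75 ]; then echo "upsize-ram"; fired=1; fi
  if [ "$3" -gt 70 ]; then echo "upsize-storage-or-clean-backups"; fired=1; fi
  if [ "$4" -gt 50 ]; then echo "consider-replicas"; fired=1; fi
  if [ "$fired" -eq 0 ]; then echo "none"; fi
}
```

The output is a list of candidate actions to discuss, not an automatic resize; the decision itself stays with the reviewer.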
8.3 Quarterly review
Every 3 months, Corentin reviews:
- Infra costs
- Fit of the current specs
- Expected growth for the next quarter
- Upsize/downsize/migrate decision
9. Incident response
9.1 Severities (reminder)
- SEV1: Complete service outage (CRITICAL)
- SEV2: Major degradation (WARNING)
- SEV3: Isolated bug, workaround possible (INFO)
9.2 Comm template
During an incident:
[SEV1] formation-hub - Service degraded
Symptom: <what>
Started: <when>
Investigation: <where we are>
ETA: <estimate restore>
Channel: #ops
Post an update every 30 min.
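Filling the template by hand under pressure is error-prone, so a tiny formatter can help keep updates consistent. A minimal sketch, with `incident_update` as a hypothetical helper name; it only renders text and leaves posting to Slack/Teams out of scope:

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper that renders the comm template.
set -eu

# incident_update <sev> <symptom> <started> <investigation> <eta>
incident_update() {
  printf '[%s] formation-hub - Service degraded\n' "$1"
  printf 'Symptom: %s\n' "$2"
  printf 'Started: %s\n' "$3"
  printf 'Investigation: %s\n' "$4"
  printf 'ETA: %s\n' "$5"
  printf 'Channel: #ops\n'
}
```

For example, `incident_update SEV1 "wiki returns 502" "09:12 UTC" "checking reverse proxy" "10:00 UTC"` prints the six template lines ready to paste into the channel.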
9.3 Post-mortem template
docs/post-mortems/YYYY-MM-DD-title.md:
# Post-mortem: <incident title>
## Timeline
- HH:MM detection
- HH:MM triage
- HH:MM mitigation start
- HH:MM service restored
- HH:MM root cause confirmed
## Impact
- Downtime duration: Xh
- Users affected: Y
- Data loss: yes/no, and if yes, how much
## Root cause
<...>
## Why did our monitoring not alert earlier?
<...>
## Action items
- [ ] AI 1: ... (owner @who, due date)
- [ ] AI 2: ...
## Lessons learned
<...>
Post-mortems are blameless: focus on the system, not the person.
10. Daily / Weekly / Monthly tasks
10.1 Daily (5 min, morning)
[ ] Check uptime monitoring (green?)
[ ] Check container logs (no recurring error?)
[ ] Check that the daily backup ran (status email or log)
[ ] Check Slack #ops (nothing urgent?)
10.2 Weekly (30 min, Monday morning)
[ ] Review Dependabot PRs
[ ] Check disk/CPU graphs (anomalies?)
[ ] Review GitHub ops/sec issues
[ ] Update CHANGELOG if releases went out
[ ] Plan the next release if features are ready
10.3 Monthly (2h, 1st of the month)
[ ] Test backup restore (DR exercise)
[ ] Audit access list (who has access to what?)
[ ] Review security alerts (CVE, audits)
[ ] Capacity planning review
[ ] Review infra costs (vs budget)
[ ] Update runbooks if new patterns emerged
[ ] Review monitoring: alerts that are too noisy? incidents that went undetected?
11. On-call rotation (future)
For now: Corentin = primary on-call, Yan = backup.
If more technical admins are hired later:
- Weekly rotation Corentin / Yan / N
- Weekly handoff with a recap
- On-call compensation (day off or bonus)
12. Business communication
Channels:
- #ops Slack/Teams: technical team
- #internal: all Acadenice employees
- Email to all: major announcements (breaking releases, maintenance)
- Docmost banner: live downtime / maintenance info
13. Operations documentation
Everything must live in docs/runbooks/ (or Outline [INTERNE] Runbooks):
- How to run a manual backup
- How to restore
- How to add a user
- How to rotate secrets
- How to bump a Docmost or Baserow version
- How to investigate an alert
- How to escalate an incident
14. Ops tools: recap
| Tool | Phase | Cost/month |
|---|---|---|
| UptimeRobot free | Phase 1+ | 0€ |
| Uptime Kuma self-hosted | Phase 2+ | 0€ |
| Prometheus + Grafana | Phase 3+ | ~5€ resources |
| Loki | Phase 3+ | ~5€ resources |
| Sentry | Phase 4+ | 0-25€ |
| pg_dump + tar + rclone | Phase 1+ | 0€ |
| OVH Object Storage / Backblaze | Phase 1+ | ~5-10€ |
| Slack / Teams webhook | Phase 1+ | 0€ (existing) |
15. Open questions
- Self-hosted Uptime Kuma vs SaaS UptimeRobot for Phase 1?
- Remote backup: OVH (French data sovereignty) vs Backblaze (cost)?
- On-call rotation and compensation to be defined if someone is hired
- Automated runbook execution (Rundeck?) or plain markdown?
- Public status page (Statuspage.io / self-hosted) for transparency toward users?