# Operations plan (RUN)

> Post-launch operations strategy: monitoring, alerting, backups, DR, incident response, runbooks.
> Audience: Corentin (ops owner), Yan (backup), future freelancer.

## 1. Overview: RUN responsibilities

```mermaid
flowchart TB
    subgraph "Daily"
        D1[Check uptime monitoring]
        D2[Check error logs]
        D3[Review daily backups]
    end
    subgraph "Weekly"
        W1[Audit dependabot bumps]
        W2[Check disk/CPU capacity]
        W3[Review issues / PR ops]
    end
    subgraph "Monthly"
        M1[Test backup restore]
        M2[Review security alerts]
        M3[Audit access list]
        M4[Capacity planning review]
    end
    subgraph "On Incident"
        I1[Detect / Page]
        I2[Triage]
        I3[Mitigate / Restore]
        I4[Post-mortem]
    end
```

## 2. Monitoring

### 2.1 Monitoring stack (Phase 1 minimal → Phase 3 complete)

| Phase | Tool | Role | Cost |
|-------|------|------|------|
| **Phase 1** | UptimeRobot (free) | HTTP healthcheck every 5 min on wiki + baserow | 0€ |
| **Phase 2** | + Uptime Kuma self-hosted | Finer granularity, custom dashboards | 0€ (on the prod VPS or a dedicated VPS) |
| **Phase 3** | + Prometheus + Grafana | System + app metrics, fine-grained alerting | ~5€/month (extra resources) |
| **Phase 3** | + Loki | Centralized container logs | ~5€/month |
| **Phase 4** | + Sentry self-hosted or SaaS | App error tracking, stack traces | 0€-25€/month |

### 2.2 Monitored endpoints (Phase 1)

| Endpoint | Frequency | Target SLA |
|----------|-----------|------------|
| `https://wiki.acadenice.fr` (HTTP 200) | 5 min | uptime >= 99% |
| `https://baserow.acadenice.fr/api/_health/` | 5 min | uptime >= 99% |
| `https://bridge.acadenice.fr/api/health` (Phase 2+) | 5 min | uptime >= 99% |

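The same Phase 1 checks can be reproduced by hand, for example during triage when the hosted monitor itself is in doubt. The following is a minimal sketch, assuming `curl` is installed; the function names `probe`, `is_healthy` and `check_all` are illustrative and are not taken from the repository's own healthcheck script.

```bash
#!/usr/bin/env bash
# Sketch of a manual healthcheck for the Phase 1 endpoints (illustrative names).

# Phase 1 endpoints from the table above (the bridge endpoint is Phase 2+).
ENDPOINTS=(
  "https://wiki.acadenice.fr"
  "https://baserow.acadenice.fr/api/_health/"
)

# Print the HTTP status code for a URL, or 000 when no response was received.
probe() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1" || true
}

# Succeed only when the status code is exactly 200.
is_healthy() {
  [ "$1" = "200" ]
}

# Probe every endpoint; return non-zero if at least one is unhealthy.
check_all() {
  local failed=0 url code
  for url in "${ENDPOINTS[@]}"; do
    code=$(probe "$url")
    if is_healthy "$code"; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all` prints one line per endpoint and returns non-zero on any failure, so it can be chained from cron or a CI job.
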
### 2.3 Key metrics (Phase 3+)

System:
- CPU usage (alert > 80% sustained for 5 min)
- Memory (alert > 85%)
- Disk (alert > 80%)
- Network in/out

Application:
- p95 latency per endpoint (bridge)
- HTTP 5xx error rate (alert > 1%)
- Throughput (requests/sec)
- Redis queue depth (Baserow Celery)
- Active Postgres connections (alert > 80% of pool size)

Business (custom):
- Hours entries per day (sentinel: a sudden drop suggests a bug in hours entry)
- Assignments created per week
- Projects in progress
- Trainer capacity exceeded (alert if > 0)

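Before Prometheus is in place, the system thresholds above can be spot-checked from a shell. This is a minimal sketch, assuming a Linux host with GNU `df` and `/proc/meminfo`; the helper names `exceeds`, `disk_pct` and `mem_pct` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: compare a measured percentage against an alert threshold (illustrative names).

# Return 0 (alert) when value is strictly greater than limit. Integers only.
exceeds() {
  local value=$1 limit=$2
  [ "$value" -gt "$limit" ]
}

# Usage of the filesystem holding a path, as an integer percentage (GNU df assumed).
disk_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Used memory as an integer percentage, computed from /proc/meminfo (Linux only).
mem_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d", (t-a)*100/t}' /proc/meminfo
}
```

For example, `exceeds "$(disk_pct /)" 80 && echo "WARNING disk"` applies the disk threshold from the list, and `exceeds "$(mem_pct)" 85` applies the memory one.
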
## 3. Alerting

### 3.1 Channels

| Channel | Severity | Target |
|---------|----------|--------|
| Email Corentin + Yan | All levels | corentin@acadenice.fr, yan@acadenice.fr |
| Slack/Teams #ops | warning + critical | Internal channel |
| SMS (Twilio or OVH) | critical only | Corentin (primary on-call) |

### 3.2 Severities

| Level | Definition | Expected response |
|-------|------------|-------------------|
| **CRITICAL** | Service down / data loss in progress | < 15 min |
| **WARNING** | Performance degradation / capacity close to the limit | < 4 business hours |
| **INFO** | Audit, releases, backups OK | weekly review |

### 3.3 Initial alerts (Phase 1)

```
[CRITICAL] HTTP 5xx > 5% over 5 min → page Corentin
[CRITICAL] Service down (uptime check fails 3x) → page Corentin + Yan
[CRITICAL] Disk > 95% → page
[WARNING] CPU > 80% sustained for 10 min → email
[WARNING] Memory > 85% → email
[WARNING] Trainer capacity exceeded → email to the pedagogical admin
[INFO] Daily backup executed (success/fail) → log + email on failure
```

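The routing in section 3.1 and the "fails 3x" rule can be expressed as two small functions. This is a sketch only: the channel keywords (`email`, `slack`, `sms`) and the function names are assumptions chosen for illustration, not an existing alerting configuration.

```bash
#!/usr/bin/env bash
# Sketch of severity routing and the 3-consecutive-failures rule (illustrative names).

# Print the channels for a severity, following the table in section 3.1.
channels_for() {
  case "$1" in
    CRITICAL) echo "email slack sms" ;;
    WARNING)  echo "email slack" ;;
    INFO)     echo "email" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Arguments: history of check results, oldest first, each "ok" or "fail".
# Succeed when the 3 most recent checks all failed.
down_3x() {
  local -a h=("$@")
  local n=${#h[@]}
  [ "$n" -ge 3 ] \
    && [ "${h[n-1]}" = "fail" ] \
    && [ "${h[n-2]}" = "fail" ] \
    && [ "${h[n-3]}" = "fail" ]
}
```

Requiring three consecutive failures before paging keeps a single timed-out probe from waking the on-call, at the cost of roughly 15 minutes of detection delay with a 5-minute check interval.
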
## 4. Backups: 3-2-1 strategy

**3** copies of the data, on **2** different media, with **1** offsite.

### 4.1 Backup targets

| What | Frequency | Tool | Local | Remote |
|------|-----------|------|-------|--------|
| Postgres docmost | Daily 03:00 | `pg_dump` + gzip | `/opt/formation-hub/backups/local/` | S3-compatible (OVH/Backblaze) |
| Postgres baserow embedded | Daily 03:00 | `pg_dumpall` + gzip | same | same |
| Docmost files (uploads) | Daily 03:00 | `tar.gz` | same | same |
| Baserow data dir | Daily 03:00 | `tar.gz` | same | same |
| `.env.prod` (encrypted) | On change | gpg + push to vault | (none) | Out-of-band vault |

### 4.2 Retention

| Type | Local | Remote |
|------|-------|--------|
| Daily | 30 days rolling | 90 days rolling |
| Weekly (Friday) | 12 weeks | 12 months |
| Monthly (1st) | 12 months | 5 years |

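A retention job needs to know which tier a given backup belongs to. The sketch below classifies a backup by its date and returns the local retention from the table. It assumes GNU `date`, converts 12 weeks to 84 days and 12 months to 365 days, and lets the monthly tier win when the 1st falls on a Friday; that precedence is an assumption, since the table does not specify it.

```bash
#!/usr/bin/env bash
# Sketch: map a backup date to its retention tier (illustrative names, GNU date assumed).

# Print "monthly", "weekly" or "daily" for a date given as YYYY-MM-DD.
backup_class() {
  local day dow
  day=$(date -d "$1" +%d)
  dow=$(date -d "$1" +%u)   # 1 = Monday ... 7 = Sunday
  if [ "$day" = "01" ]; then
    echo monthly
  elif [ "$dow" = "5" ]; then
    echo weekly
  else
    echo daily
  fi
}

# Local retention in days for a tier.
local_retention_days() {
  case "$1" in
    daily)   echo 30 ;;
    weekly)  echo 84 ;;
    monthly) echo 365 ;;
    *)       return 1 ;;
  esac
}
```

For example, `local_retention_days "$(backup_class 2025-01-03)"` prints `84`, because 3 January 2025 is a Friday.
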
### 4.3 Backup scripts

`scripts/backup.sh`:
```bash
#!/usr/bin/env bash
set -euo pipefail
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/opt/formation-hub/backups/local
mkdir -p "$BACKUP_DIR"

cd /opt/formation-hub

# Postgres docmost
docker compose -f compose.yml -f compose.prod.yml exec -T docmost-db \
  pg_dump -U docmost docmost | gzip > "$BACKUP_DIR/docmost-db-$DATE.sql.gz"

# Postgres baserow (embedded: exec inside the baserow container)
docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  pg_dumpall -U postgres | gzip > "$BACKUP_DIR/baserow-db-$DATE.sql.gz"

# Files
docker compose -f compose.yml -f compose.prod.yml exec -T docmost \
  tar czf - /app/data/storage > "$BACKUP_DIR/docmost-files-$DATE.tar.gz"

docker compose -f compose.yml -f compose.prod.yml exec -T baserow \
  tar czf - /baserow/data > "$BACKUP_DIR/baserow-data-$DATE.tar.gz"

# Remote sync via rclone (configured separately)
rclone copy "$BACKUP_DIR/" s3:acadenice-formation-hub-backup/ --include "*-$DATE.*"

# Local retention (delete files older than 30 days).
# Note: this applies the daily rule to every file; the weekly and monthly
# local retention from section 4.2 is not handled here.
find "$BACKUP_DIR" -type f -mtime +30 -delete
```

`/etc/cron.d/formation-hub-backup`:
```
0 3 * * * corentin /opt/formation-hub/scripts/backup.sh >> /var/log/formation-hub-backup.log 2>&1
```

### 4.4 Monthly restore test

`scripts/restore-test.sh` runs on the 1st of the month in an isolated environment:
1. Provision an ephemeral test VPS
2. Restore the most recent backup
3. Run smoke tests
4. Verify integrity (checksum, row counts)
5. On failure: CRITICAL alert + log
6. Destroy the test VPS

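The checksum part of step 4 could look like the sketch below. It assumes that a SHA-256 digest is recorded when the backup is taken, which the backup script in section 4.3 does not do today, and that `gzip` and `sha256sum` are available. The function name `verify_backup` is hypothetical and does not describe the actual contents of `scripts/restore-test.sh`.

```bash
#!/usr/bin/env bash
# Sketch: integrity check for one compressed backup file (illustrative name).

# Arguments: path to a .gz backup, expected SHA-256 hex digest.
# Succeed when the file is a valid gzip stream and its digest matches.
verify_backup() {
  local file=$1 expected=$2 actual
  gzip -t "$file" 2>/dev/null || return 1
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}
```

`gzip -t` catches truncated uploads before the more expensive restore starts, while the digest comparison detects silent corruption between the local and remote copies. Row counts still have to be compared after the restore itself.
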
## 5. Disaster recovery

### 5.1 DR scenarios

| Scenario | Probability | Impact | Plan |
|----------|-------------|--------|------|
| VPS down (provider issue) | Low | Service down 0-4h | Wait for the provider OR manual failover to a backup VPS |
| Postgres corruption | Low | Data loss < 24h | Restore from the daily backup |
| Full compromise (rootkit) | Very low | Data theft | Wipe + reinstall + restore data + full audit + GDPR (RGPD) declaration |
| Provider discontinues the service | Very low | Service migrated | Migrate to another provider, up to 1 week of downtime acceptable |
| Human error (rm -rf) | Medium | Variable | Daily backup + soft delete in DB |

### 5.2 RTO / RPO targets (reminder from the requirements spec, CDC)

- **RTO** (Recovery Time Objective): 4h max
- **RPO** (Recovery Point Objective): 24h max (daily backup)

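The RPO only holds if a recent backup actually exists, so it is worth checking the age of the newest backup file. This is a minimal sketch, assuming GNU `find`; the names `within_rpo` and `newest_mtime` are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: check that the newest backup is within the 24h RPO (illustrative names).

RPO_SECONDS=$((24 * 3600))

# Arguments: backup time and current time, both in epoch seconds.
# Succeed when the backup is at most RPO_SECONDS old.
within_rpo() {
  local backup_epoch=$1 now_epoch=$2
  [ $((now_epoch - backup_epoch)) -le "$RPO_SECONDS" ]
}

# Print the modification time (epoch seconds) of the newest file in a directory.
newest_mtime() {
  find "$1" -type f -printf '%T@\n' | sort -n | tail -n 1 | cut -d. -f1
}
```

A daily check could then run `within_rpo "$(newest_mtime /opt/formation-hub/backups/local)" "$(date +%s)"` and raise a CRITICAL alert when it fails.
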
### 5.3 DR plan: step by step

```
1. DETECT
   - Automatic alert OR user report
   - Confirm the scope (who is down? what is lost?)

2. TRIAGE (15 min)
   - Severity (CRITICAL / WARNING)
   - Notify Yan + Ludo if CRITICAL
   - Announce in #ops + status banner if user-facing

3. MITIGATE (depending on the scenario)
   - Restore backup
   - Failover
   - Hotfix
   - Rollback

4. RESTORE
   - Verify data integrity (rollups, FK, row counts)
   - Smoke tests
   - "Back online" notification

5. POST-MORTEM (within 7 days)
   - Timeline
   - Root cause
   - Action items
   - Add to the runbook if the pattern recurs
```

## 6. Runbooks

Documentation per incident type. Standardized format:

```
# Runbook: <INCIDENT_TYPE>

## Symptoms
- ...

## Diagnosis
1. Check ...
2. Check ...

## Resolution
1. Step
2. Step

## Future prevention
- ...

## Rollback / escalation
- ...
```

### 6.1 Phase 1 runbooks (to be written)

| Runbook | Priority |
|---------|----------|
| `runbook-docmost-down.md` | High |
| `runbook-baserow-down.md` | High |
| `runbook-disk-full.md` | High |
| `runbook-postgres-corrupted.md` | High |
| `runbook-restore-from-backup.md` | High |
| `runbook-rotate-secrets.md` | Medium |
| `runbook-bump-docmost-version.md` | Medium |
| `runbook-bump-baserow-version.md` | Medium |
| `runbook-add-new-user.md` | Low |
| `runbook-renewal-tls.md` | Low (automatic via Traefik) |

Store them in `docs/runbooks/` or directly in Outline for quick access during an incident.

## 7. Maintenance

### 7.1 Dependency bumps

| Type | Frequency | Process |
|------|-----------|---------|
| Automatic via Dependabot (security) | Weekly | Auto-PR + CI + merge if green |
| Automatic via Dependabot (minor/patch) | Weekly | Auto-PR + human review |
| Major bumps | Manual | Dedicated PR + E2E tests + business decision |
| Docmost upstream | Manual decision (tested on staging) | PR changing the image tag + E2E test |
| Baserow upstream | same | same |
| Postgres major | Once a year at most, planned | Backup + migration + restore + verification |

### 7.2 OS patches

| Type | Frequency |
|------|-----------|
| Debian security patches | Automatic via `unattended-upgrades` |
| Major Debian release | Every 2-3 years, planned |
| Reboot after kernel patch | Monthly at most, during a maintenance window |

### 7.3 Maintenance window

Communicate 48h in advance if downtime > 5 min:
- Email to all Acadenice users
- Docmost / Baserow banner
- Slack #internal

Preferred slot: **Sunday 06:00-08:00 UTC** (usage most likely zero).

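The 48-hour rule translates into a simple deadline computation. The sketch below assumes GNU `date`; `notice_required` and `announce_deadline` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: maintenance notice rule (illustrative names, GNU date assumed).

# Argument: planned downtime in minutes.
# Succeed when the 48-hour notice is required (downtime longer than 5 minutes).
notice_required() {
  [ "$1" -gt 5 ]
}

# Argument: window start as "YYYY-MM-DD HH:MM", interpreted as UTC.
# Print the latest time at which the announcement must go out.
announce_deadline() {
  local start
  start=$(date -u -d "$1" +%s)
  date -u -d "@$((start - 48 * 3600))" '+%Y-%m-%d %H:%M UTC'
}
```

For a window starting on Sunday 2025-06-01 at 06:00 UTC, `announce_deadline '2025-06-01 06:00'` prints `2025-05-30 06:00 UTC`. Going through epoch seconds avoids ambiguous relative-date strings that mix a time zone with an offset.
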
## 8. Capacity planning

### 8.1 Indicators to watch

- Active users (monthly)
- Baserow row volume (per table)
- Docmost document volume
- Upload storage
- Average CPU/RAM over 7 days

### 8.2 Upsizing triggers

| Indicator | Threshold | Action |
|-----------|-----------|--------|
| Average CPU over 1 week | > 60% | Upsize VPS (4 → 8 vCPU) |
| Average RAM over 1 week | > 75% | Upsize RAM (8 → 16 GB) |
| Disk usage | > 70% | Upsize storage OR clean old backups |
| Peak concurrent users | > 50 | Consider 2 replicas + load balancer |

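The trigger table can be evaluated mechanically during the capacity review. This is a minimal sketch with a hypothetical function name; the inputs are assumed to be integers collected elsewhere, for instance read from Grafana.

```bash
#!/usr/bin/env bash
# Sketch: evaluate the upsizing triggers from the table (illustrative name).

# Arguments: average CPU % (1 week), average RAM % (1 week), disk %, peak concurrent users.
# Print one line per triggered action; print nothing when no threshold is exceeded.
upsize_actions() {
  local cpu=$1 ram=$2 disk=$3 users=$4
  if [ "$cpu" -gt 60 ]; then echo "upsize VPS (4 -> 8 vCPU)"; fi
  if [ "$ram" -gt 75 ]; then echo "upsize RAM (8 -> 16 GB)"; fi
  if [ "$disk" -gt 70 ]; then echo "upsize storage or clean old backups"; fi
  if [ "$users" -gt 50 ]; then echo "consider 2 replicas + load balancer"; fi
  return 0
}
```

For example, `upsize_actions 65 40 72 12` prints the CPU and storage actions only.
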
### 8.3 Quarterly review

Every 3 months, Corentin reviews:
- Infrastructure costs
- Whether the specs still fit the load
- Expected growth for the next quarter
- Upsize/downsize/migrate decision

## 9. Incident response

### 9.1 Severities (reminder)

- **SEV1**: Complete service outage (CRITICAL)
- **SEV2**: Major degradation (WARNING)
- **SEV3**: Isolated bug, workaround possible (INFO)

### 9.2 Communication template

During an incident:
```
[SEV1] formation-hub - Service degraded
Symptom: <what>
Started: <when>
Investigation: <where we are>
ETA: <estimated restore>
Channel: #ops
```

Update every 30 min.

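Filling the template by hand under pressure invites omissions, so a small helper can render it from arguments. This is a sketch; `incident_update` is a hypothetical name and the output simply mirrors the template above.

```bash
#!/usr/bin/env bash
# Sketch: render the incident update template (illustrative name).

# Arguments: severity, symptom, start time, investigation status, ETA.
incident_update() {
  local sev=$1 symptom=$2 started=$3 status=$4 eta=$5
  cat <<EOF
[$sev] formation-hub - Service degraded
Symptom: $symptom
Started: $started
Investigation: $status
ETA: $eta
Channel: #ops
EOF
}
```

For example, `incident_update SEV1 "wiki returns 502" "09:12 UTC" "restarting docmost" "09:45 UTC"` prints a message ready to paste into #ops.
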
### 9.3 Post-mortem template

`docs/post-mortems/YYYY-MM-DD-titre.md`:

```markdown
# Post-mortem: <incident title>

## Timeline
- HH:MM detection
- HH:MM triage
- HH:MM mitigation start
- HH:MM service restored
- HH:MM root cause confirmed

## Impact
- Downtime duration: Xh
- Users affected: Y
- Data loss: yes/no, if yes: how much

## Root cause
<...>

## Why did our monitoring not alert earlier?
<...>

## Action items
- [ ] AI 1: ... (owner @who, due date)
- [ ] AI 2: ...

## Lessons learned
<...>
```

Post-mortems are **blameless**: focus on the system, not the person.

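The file naming convention above can be generated from the incident date and title. The sketch below uses the hypothetical helpers `slugify` and `postmortem_path`; the slug rules (lowercase, non-alphanumeric runs collapsed to a hyphen) are an assumption, since the document only gives the `YYYY-MM-DD-titre.md` pattern.

```bash
#!/usr/bin/env bash
# Sketch: build the post-mortem file path from a date and a title (illustrative names).

# Lowercase the title and replace every run of non-alphanumeric characters with "-".
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Arguments: date as YYYY-MM-DD, incident title.
postmortem_path() {
  local date=$1 title=$2
  printf 'docs/post-mortems/%s-%s.md\n' "$date" "$(slugify "$title")"
}
```

For example, `postmortem_path 2025-03-14 'Baserow down: disk full'` prints `docs/post-mortems/2025-03-14-baserow-down-disk-full.md`.
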
## 10. Daily / Weekly / Monthly tasks

### 10.1 Daily (5 min, morning)

```
[ ] Check uptime monitoring (green?)
[ ] Check container logs (no recurring errors?)
[ ] Check that the daily backup ran (status email or log)
[ ] Check Slack #ops (nothing urgent?)
```

### 10.2 Weekly (30 min, Monday morning)

```
[ ] Review Dependabot PRs
[ ] Check disk/CPU graphs (anomalies?)
[ ] Review GitHub ops/sec issues
[ ] Update CHANGELOG if releases went out
[ ] Plan the next release if features are ready
```

### 10.3 Monthly (2h, 1st of the month)

```
[ ] Test backup restore (DR exercise)
[ ] Audit access list (who has access to what?)
[ ] Review security alerts (CVE, audits)
[ ] Capacity planning review
[ ] Review infrastructure costs (vs budget)
[ ] Update runbooks if new patterns emerged
[ ] Review monitoring: alerts too noisy? incidents missed?
```

## 11. On-call rotation (future)

For now: **Corentin = primary on-call**, Yan = backup.

If more technical admins are hired later:
- Weekly rotation Corentin / Yan / N
- Weekly handoff with a recap
- On-call compensation (day off or bonus)

## 12. Business communication

Channels:
- **#ops** Slack/Teams: technical team
- **#internal**: all Acadenice employees
- **Email all**: major announcements (breaking releases, maintenance)
- **Docmost banner**: live downtime / maintenance info

## 13. Operations documentation

Everything must live in `docs/runbooks/` (or Outline `[INTERNE] Runbooks`):
- How to run a manual backup
- How to restore
- How to add a user
- How to rotate secrets
- How to bump a Docmost or Baserow version
- How to investigate an alert
- How to escalate an incident

## 14. Ops tools: recap

| Tool | Phase | Cost/month |
|------|-------|------------|
| UptimeRobot free | Phase 1+ | 0€ |
| Uptime Kuma self-hosted | Phase 2+ | 0€ |
| Prometheus + Grafana | Phase 3+ | ~5€ resources |
| Loki | Phase 3+ | ~5€ resources |
| Sentry | Phase 4+ | 0-25€ |
| pg_dump + tar + rclone | Phase 1+ | 0€ |
| OVH Object Storage / Backblaze | Phase 1+ | ~5-10€ |
| Slack / Teams webhook | Phase 1+ | 0€ (existing) |

## 15. Open questions

- [ ] Self-hosted Uptime Kuma vs SaaS UptimeRobot for Phase 1?
- [ ] Remote backup: OVH (French data sovereignty) vs Backblaze (cost)?
- [ ] On-call rotation and compensation to be defined if someone is hired
- [ ] Automated runbook execution (Rundeck?) or plain Markdown?
- [ ] Public status page (Statuspage.io / self-hosted) for transparency towards users?