# AI Model Deployment: Kubernetes, Docker, and MLOps Strategies
AI model deployment is the process of moving trained models into a production environment in a way that is reliable, scalable, and maintainable. This guide walks through modern deployment strategies.
## Deployment Patterns
### 1. Batch Inference
Data is processed in batches by scheduled jobs:
Data Lake → Batch Job → Model Inference → Results Storage
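The batch flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production job: `batched`, `run_batch_job`, and the stand-in `length_model` are hypothetical names, and a real job would read from and write to actual storage instead of in-memory lists.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_batch_job(
    records: Iterable[dict],
    predict_batch: Callable[[List[dict]], list],
    batch_size: int = 256,
) -> List[dict]:
    """Score every record batch by batch and attach the prediction to it."""
    results = []
    for batch in batched(records, batch_size):
        predictions = predict_batch(batch)
        if len(predictions) != len(batch):
            raise ValueError("model returned a different number of predictions")
        for record, prediction in zip(batch, predictions):
            results.append({**record, "prediction": prediction})
    return results


# Hypothetical stand-in for a real model: "predicts" the text length.
def length_model(batch: List[dict]) -> List[int]:
    return [len(record["text"]) for record in batch]


rows = [{"id": i, "text": "x" * i} for i in range(5)]
scored = run_batch_job(rows, length_model, batch_size=2)
print(scored[3])
```

Processing in fixed-size batches keeps memory bounded and lets the model use vectorized inference, which is usually the main reason to prefer batch jobs when latency is not critical.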
### 2. Real-Time Inference
Instant predictions served through an API:
Request → API Gateway → Model Server → Response
### 3. Streaming Inference
Continuous processing of data streams:
Kafka Stream → Stream Processor → Model → Output Stream
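To make the streaming pattern concrete, here is a small sketch of a stream processor written as a Python generator. It assumes events shaped like `{"key": ..., "value": ...}` and keeps a sliding window per key as model input; the event shape, the function names, and the stand-in `mean_model` are illustrative assumptions. A plain iterable stands in for the Kafka consumer.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Iterator, List


def stream_inference(
    events: Iterable[dict],
    score: Callable[[List[float]], float],
    window_size: int = 3,
) -> Iterator[dict]:
    """Emit one scored record per incoming event, using a per-key sliding window."""
    windows: Dict[str, Deque[float]] = {}
    for event in events:
        # deque(maxlen=...) drops the oldest value automatically
        window = windows.setdefault(event["key"], deque(maxlen=window_size))
        window.append(event["value"])
        yield {"key": event["key"], "score": score(list(window))}


# Hypothetical stand-in for a real model: the mean of the window.
def mean_model(features: List[float]) -> float:
    return sum(features) / len(features)


incoming = [
    {"key": "a", "value": 1.0},
    {"key": "a", "value": 3.0},
    {"key": "b", "value": 10.0},
    {"key": "a", "value": 5.0},
    {"key": "a", "value": 7.0},
]
outputs = list(stream_inference(incoming, mean_model))
print([item["score"] for item in outputs])  # [1.0, 2.0, 10.0, 3.0, 5.0]
```

Unlike the batch pattern, each event produces an output as soon as it arrives, and the processor carries state (the windows) between events.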
### 4. Edge Deployment
Inference runs on the device itself:
Mobile/IoT Device → Optimized Model → Local Inference
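The "Optimized Model" step is not detailed here. One widely used optimization for edge targets is weight quantization, so the sketch below shows symmetric int8 quantization in plain Python as an illustration only. It is an assumption about what the optimization could look like, not a description of a specific toolchain; real projects would normally rely on their framework's quantization tooling.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step per weight.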
## Containerizing Models with Docker

### Basic Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# System dependencies (curl is required by the HEALTHCHECK below;
# the slim image does not ship it)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model and code
COPY model/ ./model/
COPY src/ ./src/

# Port
EXPOSE 8000

# Healthcheck
HEALTHCHECK \
    CMD curl -f http://localhost:8000/health || exit 1

# Start command
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Multi-Stage Build

```dockerfile
# Build stage
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Production stage
FROM python:3.11-slim
WORKDIR /app
# Copy the prebuilt wheels out of the builder stage
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*
COPY . .
CMD ["python", "main.py"]
```
### GPU Support

```dockerfile
# Image names must be lowercase; nvidia/cuda tags carry the full patch version
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Python installation
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# PyTorch GPU
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121

COPY . /app
WORKDIR /app
CMD ["python3", "inference.py"]
```
## Model Serving Frameworks

### FastAPI Server

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch

app = FastAPI()

# Populated once at startup, then shared by all requests
model = None

# on_event still works, but newer FastAPI releases recommend lifespan handlers
@app.on_event("startup")
async def load_model():
    global model
    # weights_only=False is needed to unpickle a full model object on
    # PyTorch >= 2.6; only load files you trust.
    model = torch.load("model.pt", weights_only=False)
    model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    prediction: str
    confidence: float

# Plain `def` so FastAPI runs the blocking model call in its thread pool
# instead of stalling the event loop.
@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Assumes the saved model accepts raw text and returns a dict
    # with "label" and "score" keys.
    with torch.no_grad():
        output = model(request.text)

    return PredictionResponse(
        prediction=output["label"],
        confidence=float(output["score"]),
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
```
### TorchServe

```bash
# Create model archive (written into the model store used below)
mkdir -p model_store
torch-model-archiver \
  --model-name mymodel \
  --version 1.0 \
  --model-file model.py \
  --serialized-file model.pt \
  --handler handler.py \
  --export-path model_store

# Start serving
torchserve --start \
  --model-store model_store \
  --models mymodel=mymodel.mar
```
### Triton Inference Server

```text
# config.pbtxt
name: "text_classifier"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    # With max_batch_size > 0 the batch dimension is implicit, so these dims
    # describe one item only. Use [ 2 ] if the model emits a single pair of
    # scores per input.
    dims: [ -1, 2 ]
  }
]
instance_group [
  { count: 2, kind: KIND_GPU }
]
```

## Kubernetes Deployment

### Basic Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        image: myregistry/model-server:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            # The server must implement this endpoint; the FastAPI example
            # above only exposes /health.
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```
### GPU Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-model-server
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value used here is an example.
  selector:
    matchLabels:
      app: gpu-model-server
  template:
    metadata:
      labels:
        app: gpu-model-server
    spec:
      containers:
      - name: model
        image: myregistry/gpu-model:v1.0
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```
### Horizontal Pod Autoscaler

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Custom metric: needs a metrics adapter (for example Prometheus Adapter)
  # that exposes requests_per_second through the custom metrics API.
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```
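To see how the autoscaler turns these targets into a replica count, the sketch below reproduces the scaling rule documented for the HPA: `desired = ceil(current_replicas * current_value / target_value)`, skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. When several metrics are configured, the largest result wins. The function and its parameter names are illustrative; the real controller also handles missing metrics, unready pods, and stabilization windows, which are left out here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Replica count suggested by a single metric, clamped to the HPA bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Values from the manifest above: CPU target 70 %, 100 requests/s per pod.
by_cpu = desired_replicas(3, 90, 70, min_replicas=2, max_replicas=10)
by_rps = desired_replicas(3, 250, 100, min_replicas=2, max_replicas=10)
print(by_cpu, by_rps, max(by_cpu, by_rps))  # 4 8 8
```

With three pods at 90 % CPU and 250 requests per second each, the request-rate metric asks for more replicas than CPU does, so the autoscaler would scale to eight pods.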
### Service & Ingress

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-ingress
  annotations:
    # ingress-nginx rate limiting: requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  rules:
  - host: model.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: model-service
            port:
              number: 80
```
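## MLOps Pipeline

### CI/CD Pipeline

```yaml
# .github/workflows/mlops.yml
name: MLOps Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt pytest
      - name: Run tests
        run: pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so the code is checked out again
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train model
        run: python train.py
      - name: Evaluate model
        run: python evaluate.py
      - name: Register model
        if: success()
        run: python register_model.py

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Registry login and cluster credentials are omitted; configure them
      # (for example from repository secrets) before these steps.
      - name: Build image
        run: docker build -t myregistry/model:${{ github.sha }} .
      - name: Push to registry
        run: docker push myregistry/model:${{ github.sha }}
      - name: Deploy to K8s
        run: kubectl set image deployment/model model=myregistry/model:${{ github.sha }}
```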
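The pipeline relies on the evaluation step failing when the model is not good enough, since a non-zero exit code stops the job before registration and deployment. The contents of `evaluate.py` are not shown in this guide, so the following is only a sketch of what such a quality gate might look like; the threshold values and function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical minimum scores a model must reach before it is registered.
THRESHOLDS: Dict[str, float] = {"accuracy": 0.90, "f1": 0.90}


def gate_failures(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def main(metrics: Dict[str, float]) -> int:
    """Return the process exit code: 0 if the gate passes, 1 otherwise."""
    failures = gate_failures(metrics, THRESHOLDS)
    for failure in failures:
        print("GATE FAILED", failure)
    return 1 if failures else 0


exit_code = main({"accuracy": 0.95, "f1": 0.93})
print(exit_code)  # 0
```

In a real script the result would be passed to `sys.exit(main(metrics))`, with `metrics` computed on a held-out evaluation set.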
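### Model Registry

```python
import mlflow
import mlflow.pytorch
import torch

# Stand-in for the trained torch.nn.Module produced by your training code
model = torch.nn.Linear(8, 2)

# Log the run and register the model under a name
with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    mlflow.pytorch.log_model(model, "model", registered_model_name="text-classifier")

# Load the version in the Production stage (a version must first be moved
# there). Recent MLflow releases deprecate stages in favor of aliases.
model_uri = "models:/text-classifier/Production"
model = mlflow.pytorch.load_model(model_uri)
```

## Canary Deployment

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-service
spec:
  hosts:
  - model-service
  http:
  - route:
    # model-service-v1 and model-service-v2 are separate Kubernetes Services
    - destination:
        host: model-service-v1
      weight: 90
    - destination:
        host: model-service-v2
      weight: 10
```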
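The effect of the 90/10 weights can be illustrated with a short simulation. The sketch below assigns each request to a destination by hashing a request ID into a weighted bucket. This is a simplification chosen to make the result reproducible and is not how Istio itself selects a destination; the request IDs and function name are made up for the example.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

# Same hosts and weights as the VirtualService above.
ROUTES: List[Tuple[str, int]] = [("model-service-v1", 90), ("model-service-v2", 10)]


def pick_destination(request_id: str, routes: List[Tuple[str, int]] = ROUTES) -> str:
    """Map a request ID to a destination in proportion to the route weights."""
    total = sum(weight for _, weight in routes)
    if total <= 0:
        raise ValueError("route weights must sum to a positive number")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total  # stable value in [0, total)
    cumulative = 0
    for host, weight in routes:
        cumulative += weight
        if bucket < cumulative:
            return host
    raise AssertionError("unreachable: bucket is always below the total weight")


counts = Counter(pick_destination(f"req-{i}") for i in range(10_000))
share_v2 = counts["model-service-v2"] / 10_000
print(counts, share_v2)
```

Roughly one request in ten reaches the new version, so a regression in v2 affects only a small share of traffic while its metrics are compared against v1.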
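## Monitoring

### Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter('predictions_total', 'Total predictions', ['model', 'status'])
LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

# Expose /metrics for Prometheus to scrape (port 8001 is an example value)
start_http_server(8001)

@LATENCY.time()
def predict(input_data):
    # `model` is the loaded model object from the serving code
    try:
        result = model(input_data)
    except Exception:
        PREDICTIONS.labels(model='v1', status='error').inc()
        raise
    PREDICTIONS.labels(model='v1', status='success').inc()
    return result
```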
### Grafana Dashboard
Key metrics to monitor:
- Request rate (RPS)
- Latency (p50, p95, p99)
- Error rate
- GPU utilization
- Memory usage
- Model drift indicators
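Two of these metrics are easy to misread, so here is a small sketch of how they can be computed. `percentile` uses the nearest-rank definition, and `psi` computes the Population Stability Index, one common drift indicator that compares how a feature or score is distributed in production against a baseline. The choice of PSI, the bin count, and the sample data are assumptions made for illustration; the guide does not prescribe a specific drift measure.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant baselines

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[max(0, min(bins - 1, index))] += 1  # clamp outliers to edge bins
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(count / len(values), eps) for count in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))

baseline = [i / 100 for i in range(100)]
shifted = [value + 0.5 for value in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift worth investigating, but the right threshold depends on the feature and the model.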
## Conclusion
Modern MLOps practices make AI model deployment reliable and scalable. Docker, Kubernetes, and CI/CD pipelines are its core building blocks.
At Veni AI, we provide AI deployment solutions for businesses. Contact us about your projects.
