Field	Value	Source
Canonical Path	/blog/ai-model-deployment-kubernetes-docker-mlops	Veni AI Blog
Primary Category	MLOps	Post Metadata
Author	Veni AI Technical Team	Post Metadata

Déploiement de modèles IA : Kubernetes, Docker et stratégies MLOps

Le déploiement de modèles IA est le processus consistant à déplacer les modèles développés vers un environnement de production de manière fiable, évolutive et maintenable. Dans ce guide, nous examinons les stratégies de déploiement modernes.

Modèles de déploiement

1. Inférence par lot

Traitement de données par lot, tâches planifiées :

Data Lake → Batch Job → Model Inference → Results Storage

2. Inférence en temps réel

Prédictions instantanées basées sur une API :

Request → API Gateway → Model Server → Response

3. Inférence en streaming

Traitement continu de flux de données :

Kafka Stream → Stream Processor → Model → Output Stream

4. Déploiement en périphérie

Inférence sur l’appareil :

Mobile/IoT Device → Optimized Model → Local Inference

Containerisation de modèles avec Docker

Dockerfile de base

1FROM python:3.11-slim
2
3WORKDIR /app
4
5# System dependencies
6RUN apt-get update && apt-get install -y \
7    libgomp1 \
8    && rm -rf /var/lib/apt/lists/*
9
10# Python dependencies
11COPY requirements.txt .
12RUN pip install --no-cache-dir -r requirements.txt
13
14# Model and code
15COPY model/ ./model/
16COPY src/ ./src/
17
18# Port
19EXPOSE 8000
20
21# Healthcheck
22HEALTHCHECK --interval=30s --timeout=10s \
23    CMD curl -f http://localhost:8000/health || exit 1
24
25# Start command
26CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Build multi-étages

1# Build stage
2FROM python:3.11 AS builder
3WORKDIR /app
4COPY requirements.txt .
5RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
6
7# Production stage
8FROM python:3.11-slim
9WORKDIR /app
10COPY --from=builder /wheels /wheels
11RUN pip install --no-cache-dir /wheels/*
12COPY . .
13CMD ["python", "main.py"]

Support GPU

1FROM NVIDIA/cuda:12.1-runtime-ubuntu22.04
2
3ENV PYTHONDONTWRITEBYTECODE=1
4ENV PYTHONUNBUFFERED=1
5
6# Python installation
7RUN apt-get update && apt-get install -y python3 python3-pip
8
9# PyTorch GPU
10RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu121
11
12COPY . /app
13WORKDIR /app
14CMD ["python3", "inference.py"]

Frameworks de service de modèles

Serveur FastAPI

1from fastapi import FastAPI, HTTPException
2from pydantic import BaseModel
3import torch
4
5app = FastAPI()
6
7# Load model (startup)
8model = None
9
10@app.on_event("startup")
11async def load_model():
12    global model
13    model = torch.load("model.pt")
14    model.eval()
15
16class PredictionRequest(BaseModel):
17    text: str
18
19class PredictionResponse(BaseModel):
20    prediction: str
21    confidence: float
22
23@app.post("/predict", response_model=PredictionResponse)
24async def predict(request: PredictionRequest):
25    if model is None:
26        raise HTTPException(500, "Model not loaded")
27    
28    with torch.no_grad():
29        output = model(request.text)
30    
31    return PredictionResponse(
32        prediction=output["label"],
33        confidence=output["score"]
34    )
35
36@app.get("/health")
37async def health():
38    return {"status": "healthy", "model_loaded": model is not None}

TorchServe

1# Create model archive
2torch-model-archiver \
3    --model-name mymodel \
4    --version 1.0 \
5    --model-file model.py \
6    --serialized-file model.pt \
7    --handler handler.py
8
9# Start serving
10torchserve --start \
11    --model-store model_store \
12    --models mymodel=mymodel.mar

Triton Inference Server

1# config.pbtxt
2name: "text_classifier"
3platform: "pytorch_libtorch"
4max_batch_size: 32
5input [
6  {
7    name: "INPUT__0"
8    data_type: TYPE_INT64
9    dims: [ -1 ]
10  }
11]
12output [
13  {
14    name: "OUTPUT__0"
15    data_type: TYPE_FP32
16    dims: [ -1, 2 ]
17  }
18]
19instance_group [
20  { count: 2, kind: KIND_GPU }
21]
22## Déploiement Kubernetes
23
24### Déploiement basique
25
26```yaml
27apiVersion: apps/v1
28kind: Deployment
29metadata:
30  name: model-server
31spec:
32  replicas: 3
33  selector:
34    matchLabels:
35      app: model-server
36  template:
37    metadata:
38      labels:
39        app: model-server
40    spec:
41      containers:
42      - name: model-server
43        image: myregistry/model-server:v1.0
44        ports:
45        - containerPort: 8000
46        resources:
47          requests:
48            memory: "2Gi"
49            cpu: "1"
50          limits:
51            memory: "4Gi"
52            cpu: "2"
53        livenessProbe:
54          httpGet:
55            path: /health
56            port: 8000
57          initialDelaySeconds: 30
58          periodSeconds: 10
59        readinessProbe:
60          httpGet:
61            path: /ready
62            port: 8000
63          initialDelaySeconds: 5
64          periodSeconds: 5

Déploiement GPU

1apiVersion: apps/v1
2kind: Deployment
3metadata:
4  name: gpu-model-server
5spec:
6  replicas: 2
7  template:
8    spec:
9      containers:
10      - name: model
11        image: myregistry/gpu-model:v1.0
12        resources:
13          limits:
14            NVIDIA.com/gpu: 1
15      nodeSelector:
16        accelerator: NVIDIA-tesla-t4
17      tolerations:
18      - key: "NVIDIA.com/gpu"
19        operator: "Exists"
20        effect: "NoSchedule"

Horizontal Pod Autoscaler

1apiVersion: autoscaling/v2
2kind: HorizontalPodAutoscaler
3metadata:
4  name: model-server-hpa
5spec:
6  scaleTargetRef:
7    apiVersion: apps/v1
8    kind: Deployment
9    name: model-server
10  minReplicas: 2
11  maxReplicas: 10
12  metrics:
13  - type: Resource
14    resource:
15      name: cpu
16      target:
17        type: Utilization
18        averageUtilization: 70
19  - type: Pods
20    pods:
21      metric:
22        name: requests_per_second
23      target:
24        type: AverageValue
25        averageValue: 100

Service & Ingress

1apiVersion: v1
2kind: Service
3metadata:
4  name: model-service
5spec:
6  selector:
7    app: model-server
8  ports:
9  - port: 80
10    targetPort: 8000
11  type: ClusterIP
12---
13apiVersion: networking.k8s.io/v1
14kind: Ingress
15metadata:
16  name: model-ingress
17  annotations:
18    nginx.ingress.kubernetes.io/rate-limit: "100"
19spec:
20  rules:
21  - host: model.example.com
22    http:
23      paths:
24      - path: /
25        pathType: Prefix
26        backend:
27          service:
28            name: model-service
29            port:
30              number: 80

Pipeline MLOps

Pipeline CI/CD

1# .github/workflows/mlops.yml
2name: MLOps Pipeline
3
4on:
5  push:
6    branches: [main]
7
8jobs:
9  test:
10    runs-on: ubuntu-latest
11    steps:
12      - uses: actions/checkout@v3
13      - name: Run tests
14        run: pytest tests/
15
16  train:
17    needs: test
18    runs-on: ubuntu-latest
19    steps:
20      - name: Train model
21        run: python train.py
22      - name: Evaluate model
23        run: python evaluate.py
24      - name: Register model
25        if: success()
26        run: python register_model.py
27
28  deploy:
29    needs: train
30    runs-on: ubuntu-latest
31    steps:
32      - name: Build image
33        run: docker build -t model:${{ github.sha }} .
34      - name: Push to registry
35        run: docker push myregistry/model:${{ github.sha }}
36      - name: Deploy to K8s
37        run: kubectl set image deployment/model model=myregistry/model:${{ github.sha }}

Registre de modèles

1import mlflow
2
3# Registering model
4with mlflow.start_run():
5    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
6    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
7    mlflow.pytorch.log_model(model, "model")
8    
9# Loading model
10model_uri = "models:/text-classifier/production"
11model = mlflow.pytorch.load_model(model_uri)
12## Déploiement Canary
13
14```yaml
15apiVersion: networking.istio.io/v1alpha3
16kind: VirtualService
17metadata:
18  name: model-service
19spec:
20  hosts:
21  - model-service
22  http:
23  - route:
24    - destination:
25        host: model-service-v1
26      weight: 90
27    - destination:
28        host: model-service-v2
29      weight: 10

Monitoring

Métriques Prometheus

1from prometheus_client import Counter, Histogram, start_http_server
2
3PREDICTIONS = Counter('predictions_total', 'Total predictions', ['model', 'status'])
4LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')
5
6@LATENCY.time()
7def predict(input_data):
8    result = model(input_data)
9    PREDICTIONS.labels(model='v1', status='success').inc()
10    return result

Tableau de bord Grafana

Principales métriques à surveiller :

Taux de requêtes (RPS)
Latence (p50, p95, p99)
Taux d’erreur
Utilisation du GPU
Utilisation de la mémoire
Indicateurs de dérive du modèle

Conclusion

Le déploiement de modèles d’IA peut être rendu fiable et scalable grâce aux pratiques modernes de MLOps. Docker, Kubernetes et les pipelines CI/CD en sont les composants fondamentaux.

Chez Veni AI, nous proposons des solutions de déploiement d’IA pour les entreprises. Contactez-nous pour vos projets.

Déploiement de modèles d'IA : Kubernetes, Docker et stratégies MLOps

Reference Overview