Veni AI
MLOps

AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

Comprehensive technical guide for deploying AI models to production, containerization, Kubernetes orchestration, and MLOps pipelines.

Veni AI Technical TeamJanuary 8, 20255 min read

Reference Overview

FieldValueSource
Canonical Path/blog/ai-model-deployment-kubernetes-docker-mlopsVeni AI Blog
Primary CategoryMLOpsPost Metadata
AuthorVeni AI Technical TeamPost Metadata
AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

AI model deployment is the process of moving developed models to a production environment in a reliable, scalable, and maintainable way. In this guide, we examine modern deployment strategies.

Deployment Patterns

1. Batch Inference

Batch data processing, scheduled jobs:

Data Lake → Batch Job → Model Inference → Results Storage

2. Real-time Inference

Instant API-based predictions:

Request → API Gateway → Model Server → Response

3. Streaming Inference

Continuous data stream processing:

Kafka Stream → Stream Processor → Model → Output Stream

4. Edge Deployment

Inference on device:

Mobile/IoT Device → Optimized Model → Local Inference

Model Containerization with Docker

Basic Dockerfile

1FROM python:3.11-slim 2 3WORKDIR /app 4 5# System dependencies 6RUN apt-get update && apt-get install -y \ 7 libgomp1 \ 8 && rm -rf /var/lib/apt/lists/* 9 10# Python dependencies 11COPY requirements.txt . 12RUN pip install --no-cache-dir -r requirements.txt 13 14# Model and code 15COPY model/ ./model/ 16COPY src/ ./src/ 17 18# Port 19EXPOSE 8000 20 21# Healthcheck 22HEALTHCHECK --interval=30s --timeout=10s \ 23 CMD curl -f http://localhost:8000/health || exit 1 24 25# Start command 26CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Multi-stage Build

1# Build stage 2FROM python:3.11 AS builder 3WORKDIR /app 4COPY requirements.txt . 5RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt 6 7# Production stage 8FROM python:3.11-slim 9WORKDIR /app 10COPY --from=builder /wheels /wheels 11RUN pip install --no-cache-dir /wheels/* 12COPY . . 13CMD ["python", "main.py"]

GPU Support

1FROM NVIDIA/cuda:12.1-runtime-ubuntu22.04 2 3ENV PYTHONDONTWRITEBYTECODE=1 4ENV PYTHONUNBUFFERED=1 5 6# Python installation 7RUN apt-get update && apt-get install -y python3 python3-pip 8 9# PyTorch GPU 10RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu121 11 12COPY . /app 13WORKDIR /app 14CMD ["python3", "inference.py"]

Model Serving Frameworks

FastAPI Server

1from fastapi import FastAPI, HTTPException 2from pydantic import BaseModel 3import torch 4 5app = FastAPI() 6 7# Load model (startup) 8model = None 9 10@app.on_event("startup") 11async def load_model(): 12 global model 13 model = torch.load("model.pt") 14 model.eval() 15 16class PredictionRequest(BaseModel): 17 text: str 18 19class PredictionResponse(BaseModel): 20 prediction: str 21 confidence: float 22 23@app.post("/predict", response_model=PredictionResponse) 24async def predict(request: PredictionRequest): 25 if model is None: 26 raise HTTPException(500, "Model not loaded") 27 28 with torch.no_grad(): 29 output = model(request.text) 30 31 return PredictionResponse( 32 prediction=output["label"], 33 confidence=output["score"] 34 ) 35 36@app.get("/health") 37async def health(): 38 return {"status": "healthy", "model_loaded": model is not None}

TorchServe

1# Create model archive 2torch-model-archiver \ 3 --model-name mymodel \ 4 --version 1.0 \ 5 --model-file model.py \ 6 --serialized-file model.pt \ 7 --handler handler.py 8 9# Start serving 10torchserve --start \ 11 --model-store model_store \ 12 --models mymodel=mymodel.mar

Triton Inference Server

1# config.pbtxt 2name: "text_classifier" 3platform: "pytorch_libtorch" 4max_batch_size: 32 5input [ 6 { 7 name: "INPUT__0" 8 data_type: TYPE_INT64 9 dims: [ -1 ] 10 } 11] 12output [ 13 { 14 name: "OUTPUT__0" 15 data_type: TYPE_FP32 16 dims: [ -1, 2 ] 17 } 18] 19instance_group [ 20 { count: 2, kind: KIND_GPU } 21]

Kubernetes Deployment

Basic Deployment

1apiVersion: apps/v1 2kind: Deployment 3metadata: 4 name: model-server 5spec: 6 replicas: 3 7 selector: 8 matchLabels: 9 app: model-server 10 template: 11 metadata: 12 labels: 13 app: model-server 14 spec: 15 containers: 16 - name: model-server 17 image: myregistry/model-server:v1.0 18 ports: 19 - containerPort: 8000 20 resources: 21 requests: 22 memory: "2Gi" 23 cpu: "1" 24 limits: 25 memory: "4Gi" 26 cpu: "2" 27 livenessProbe: 28 httpGet: 29 path: /health 30 port: 8000 31 initialDelaySeconds: 30 32 periodSeconds: 10 33 readinessProbe: 34 httpGet: 35 path: /ready 36 port: 8000 37 initialDelaySeconds: 5 38 periodSeconds: 5

GPU Deployment

1apiVersion: apps/v1 2kind: Deployment 3metadata: 4 name: gpu-model-server 5spec: 6 replicas: 2 7 template: 8 spec: 9 containers: 10 - name: model 11 image: myregistry/gpu-model:v1.0 12 resources: 13 limits: 14 NVIDIA.com/gpu: 1 15 nodeSelector: 16 accelerator: NVIDIA-tesla-t4 17 tolerations: 18 - key: "NVIDIA.com/gpu" 19 operator: "Exists" 20 effect: "NoSchedule"

Horizontal Pod Autoscaler

1apiVersion: autoscaling/v2 2kind: HorizontalPodAutoscaler 3metadata: 4 name: model-server-hpa 5spec: 6 scaleTargetRef: 7 apiVersion: apps/v1 8 kind: Deployment 9 name: model-server 10 minReplicas: 2 11 maxReplicas: 10 12 metrics: 13 - type: Resource 14 resource: 15 name: cpu 16 target: 17 type: Utilization 18 averageUtilization: 70 19 - type: Pods 20 pods: 21 metric: 22 name: requests_per_second 23 target: 24 type: AverageValue 25 averageValue: 100

Service & Ingress

1apiVersion: v1 2kind: Service 3metadata: 4 name: model-service 5spec: 6 selector: 7 app: model-server 8 ports: 9 - port: 80 10 targetPort: 8000 11 type: ClusterIP 12--- 13apiVersion: networking.k8s.io/v1 14kind: Ingress 15metadata: 16 name: model-ingress 17 annotations: 18 nginx.ingress.kubernetes.io/rate-limit: "100" 19spec: 20 rules: 21 - host: model.example.com 22 http: 23 paths: 24 - path: / 25 pathType: Prefix 26 backend: 27 service: 28 name: model-service 29 port: 30 number: 80

MLOps Pipeline

CI/CD Pipeline

1# .github/workflows/mlops.yml 2name: MLOps Pipeline 3 4on: 5 push: 6 branches: [main] 7 8jobs: 9 test: 10 runs-on: ubuntu-latest 11 steps: 12 - uses: actions/checkout@v3 13 - name: Run tests 14 run: pytest tests/ 15 16 train: 17 needs: test 18 runs-on: ubuntu-latest 19 steps: 20 - name: Train model 21 run: python train.py 22 - name: Evaluate model 23 run: python evaluate.py 24 - name: Register model 25 if: success() 26 run: python register_model.py 27 28 deploy: 29 needs: train 30 runs-on: ubuntu-latest 31 steps: 32 - name: Build image 33 run: docker build -t model:${{ github.sha }} . 34 - name: Push to registry 35 run: docker push myregistry/model:${{ github.sha }} 36 - name: Deploy to K8s 37 run: kubectl set image deployment/model model=myregistry/model:${{ github.sha }}

Model Registry

1import mlflow 2 3# Registering model 4with mlflow.start_run(): 5 mlflow.log_params({"learning_rate": 0.001, "epochs": 10}) 6 mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93}) 7 mlflow.pytorch.log_model(model, "model") 8 9# Loading model 10model_uri = "models:/text-classifier/production" 11model = mlflow.pytorch.load_model(model_uri)

Canary Deployment

1apiVersion: networking.istio.io/v1alpha3 2kind: VirtualService 3metadata: 4 name: model-service 5spec: 6 hosts: 7 - model-service 8 http: 9 - route: 10 - destination: 11 host: model-service-v1 12 weight: 90 13 - destination: 14 host: model-service-v2 15 weight: 10

Monitoring

Prometheus Metrics

1from prometheus_client import Counter, Histogram, start_http_server 2 3PREDICTIONS = Counter('predictions_total', 'Total predictions', ['model', 'status']) 4LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency') 5 6@LATENCY.time() 7def predict(input_data): 8 result = model(input_data) 9 PREDICTIONS.labels(model='v1', status='success').inc() 10 return result

Grafana Dashboard

Key metrics to monitor:

  • Request rate (RPS)
  • Latency (p50, p95, p99)
  • Error rate
  • GPU utilization
  • Memory usage
  • Model drift indicators

Conclusion

AI model deployment can be made reliable and scalable with modern MLOps practices. Docker, Kubernetes, and CI/CD pipelines are the fundamental components of this process.

At Veni AI, we offer enterprise AI deployment solutions. Contact us for your projects.

İlgili Makaleler