Field	Value	Source
Canonical Path	/blog/ai-model-deployment-kubernetes-docker-mlops	Veni AI Blog
Primary Category	MLOps	Post Metadata
Author	Veni AI Technical Team	Post Metadata

AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

AI model deployment is the process of moving developed models to a production environment in a reliable, scalable, and maintainable way. In this guide, we examine modern deployment strategies.

Deployment Patterns

1. Batch Inference

Scheduled jobs score large volumes of stored data in one pass:

Data Lake → Batch Job → Model Inference → Results Storage
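
As a rough illustration of the batch flow above, the sketch below reads records in fixed-size chunks, scores each chunk, and collects the results. `predict_batch` is a hypothetical stand-in for a real model, and plain Python lists stand in for the data lake and the results storage.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


def predict_batch(texts: list[str]) -> list[dict]:
    """Hypothetical model: labels a text 'long' if it has more than 3 words."""
    return [
        {"text": t, "label": "long" if len(t.split()) > 3 else "short"}
        for t in texts
    ]


def run_batch_job(records: Iterable[str], chunk_size: int = 2) -> list[dict]:
    """Score all records chunk by chunk and return the collected results."""
    results: list[dict] = []
    for chunk in chunked(records, chunk_size):
        results.extend(predict_batch(chunk))
    return results
```

Chunking keeps memory use bounded, and it lets the model process several records per call instead of one at a time.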

2. Real-time Inference

Predictions served on demand through an API, one request at a time:

Request → API Gateway → Model Server → Response

3. Streaming Inference

Events scored continuously as they arrive on a stream:

Kafka Stream → Stream Processor → Model → Output Stream
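
A minimal sketch of the streaming flow above, assuming events arrive as dictionaries with a numeric `value` field. `threshold_model` is a hypothetical placeholder, and any iterable can stand in for the Kafka topic; in production it would be a consumer that blocks until the next message arrives.

```python
from typing import Callable, Iterable, Iterator


def stream_inference(
    events: Iterable[dict],
    model: Callable[[float], str],
) -> Iterator[dict]:
    """Score events one at a time as they arrive and yield enriched events."""
    for event in events:
        yield {**event, "prediction": model(event["value"])}


def threshold_model(value: float) -> str:
    """Hypothetical model: flags values above 0.8 as anomalies."""
    return "anomaly" if value > 0.8 else "normal"
```

Because `stream_inference` is a generator, each event is scored and emitted before the next one is read, which is what separates this pattern from batch inference.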

4. Edge Deployment

Inference runs on the device itself, without a round trip to a server:

Mobile/IoT Device → Optimized Model → Local Inference
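
The flow above does not say how the model is optimized. One common option is weight quantization, and the sketch below shows the idea with symmetric int8 quantization of a flat list of weights. It is an assumption-laden toy in pure Python; real toolchains quantize per tensor or per channel and handle activations as well.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]
```

Each weight then needs one byte instead of four, and the rounding error per weight is at most half the scale factor.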

Model Containerization with Docker

Basic Dockerfile

FROM python:3.11-slim

WORKDIR /app

# System dependencies (curl is required by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies (copied first so this layer is cached across code changes)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model and code
COPY model/ ./model/
COPY src/ ./src/

# Port the API listens on
EXPOSE 8000

# Healthcheck against the /health endpoint
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Start command
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Multi-stage Build

# Build stage
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Production stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*
COPY . .
CMD ["python", "main.py"]

GPU Support

# Image names must be lowercase; CUDA 12.x images are tagged with the full patch version
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Python installation
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# PyTorch GPU (CUDA 12.1 wheels)
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121

WORKDIR /app
COPY . /app
CMD ["python3", "inference.py"]

Model Serving Frameworks

FastAPI Server

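from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch

app = FastAPI()

# Populated once at startup so every request reuses the same model
model = None

@app.on_event("startup")
async def load_model():
    global model
    # Loads a fully pickled model; weights_only=False is needed on PyTorch >= 2.6
    # and should only be used with checkpoints you trust
    model = torch.load("model.pt", weights_only=False)
    model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    prediction: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # Plain def: FastAPI runs it in a thread pool, so inference does not block the event loop
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    with torch.no_grad():
        # Assumes the model accepts raw text and returns {"label": ..., "score": ...}
        output = model(request.text)

    return PredictionResponse(
        prediction=output["label"],
        confidence=output["score"]
    )

@app.get("/health")
async def health():
    # Liveness: the process is up and answering requests
    return {"status": "healthy", "model_loaded": model is not None}

@app.get("/ready")
async def ready():
    # Readiness: report unavailable until the model is loaded
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ready"}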

TorchServe

# Create the model archive inside the model store directory
mkdir -p model_store
torch-model-archiver \
    --model-name mymodel \
    --version 1.0 \
    --model-file model.py \
    --serialized-file model.pt \
    --handler handler.py \
    --export-path model_store

# Start serving
torchserve --start \
    --model-store model_store \
    --models mymodel=mymodel.mar

Triton Inference Server

# config.pbtxt
name: "text_classifier"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    # The batch dimension is implicit when max_batch_size > 0: two class scores per item
    dims: [ 2 ]
  }
]
instance_group [
  { count: 2, kind: KIND_GPU }
]

Kubernetes Deployment

Basic Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        image: myregistry/model-server:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
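
The manifest above probes two different paths. The sketch below, built only on the standard library, illustrates the distinction: `/health` answers as soon as the process is up, while `/ready` returns 503 until a model has been loaded. `model_ready` and `make_server` are illustrative names, not part of any framework.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once the (hypothetical) model has finished loading.
model_ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health":
            # Liveness: the process is up and able to answer HTTP requests.
            self._reply(200, {"status": "healthy"})
        elif self.path == "/ready":
            # Readiness: only accept traffic once the model is loaded.
            ready = model_ready.is_set()
            self._reply(200 if ready else 503, {"ready": ready})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # Keep probe traffic out of the logs.


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Create the probe server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), ProbeHandler)
```

A failing liveness probe makes Kubernetes restart the container, whereas a failing readiness probe only removes the pod from the Service endpoints. Slow model loading should therefore fail `/ready`, not `/health`.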

GPU Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-model-server
  template:
    metadata:
      labels:
        app: gpu-model-server
    spec:
      containers:
      - name: model
        image: myregistry/gpu-model:v1.0
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Pods metrics need a custom metrics adapter (for example, Prometheus Adapter)
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
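
To make the scaling behavior concrete, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count scaled by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance (0.1 by default). With several metrics, the largest proposal wins. The function names are illustrative, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count proposed by a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # Close enough to target: do not scale.
    return math.ceil(current_replicas * ratio)


def hpa_decision(
    current_replicas: int,
    metrics: list[tuple[float, float]],
    min_replicas: int = 2,
    max_replicas: int = 10,
) -> int:
    """Combine (current, target) metric pairs and clamp to the HPA bounds."""
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))
```

For example, with 3 replicas at 90% CPU against a 70% target and 250 requests per second per pod against a target of 100, the CPU metric proposes 4 replicas and the request metric proposes 8, so the autoscaler would move toward 8.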

Service & Ingress

apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-ingress
  annotations:
    # ingress-nginx rate limit: requests per second accepted from each client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  ingressClassName: nginx
  rules:
  - host: model.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: model-service
            port:
              number: 80

MLOps Pipeline

CI/CD Pipeline

# .github/workflows/mlops.yml
name: MLOps Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt pytest
      - name: Run tests
        run: pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train model
        run: python train.py
      - name: Evaluate model
        run: python evaluate.py
      - name: Register model
        if: success()
        run: python register_model.py

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Registry login and cluster credentials are assumed to be configured beforehand
      - name: Build image
        run: docker build -t myregistry/model-server:${{ github.sha }} .
      - name: Push to registry
        run: docker push myregistry/model-server:${{ github.sha }}
      - name: Deploy to K8s
        run: kubectl set image deployment/model-server model-server=myregistry/model-server:${{ github.sha }}
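
The pipeline only registers a model when the evaluation step succeeds, so that step has to exit with a non-zero code when the model is not good enough. The original does not show `evaluate.py`; the sketch below is one hypothetical way to implement such a gate, with metric names and minimum values chosen purely for illustration.

```python
import sys


def failed_checks(
    metrics: dict[str, float], minimums: dict[str, float]
) -> list[str]:
    """Return a description of every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def gate(metrics: dict[str, float], minimums: dict[str, float]) -> int:
    """Print failures to stderr and return a process exit code (0 = pass)."""
    failures = failed_checks(metrics, minimums)
    for failure in failures:
        print(f"FAILED {failure}", file=sys.stderr)
    return 1 if failures else 0
```

An evaluation script could end with `sys.exit(gate(metrics, {"accuracy": 0.90, "f1": 0.90}))`, which makes the CI step fail and skips registration whenever a threshold is missed.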

Model Registry

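import mlflow

# Registering model (`model` is the trained torch.nn.Module)
with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    # registered_model_name creates a new version in the registry
    mlflow.pytorch.log_model(
        model, "model", registered_model_name="text-classifier"
    )

# Loading model (requires a version promoted to the Production stage)
model_uri = "models:/text-classifier/Production"
model = mlflow.pytorch.load_model(model_uri)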

Canary Deployment

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-service
spec:
  hosts:
  - model-service
  http:
  - route:
    - destination:
        host: model-service-v1
      weight: 90
    - destination:
        host: model-service-v2
      weight: 10
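
The weights above mean that each request is routed independently, with roughly a 10% chance of reaching the new version. The simulation below illustrates that behavior; it is a sketch of weighted random selection in general, not of Istio's internal load balancer, and the seed is fixed only to make runs repeatable.

```python
import random
from collections import Counter


def pick_backend(weights: dict[str, int], rng: random.Random) -> str:
    """Choose one backend with probability proportional to its weight."""
    backends = list(weights)
    return rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]


def simulate(weights: dict[str, int], requests: int, seed: int = 0) -> Counter:
    """Route `requests` simulated requests and count hits per backend."""
    rng = random.Random(seed)
    return Counter(pick_backend(weights, rng) for _ in range(requests))
```

Calling `simulate({"model-service-v1": 90, "model-service-v2": 10}, 10_000)` should send about one request in ten to v2. A canary rollout then raises the v2 weight step by step while error rate and latency stay within bounds.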

Monitoring

Prometheus Metrics

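from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter('predictions_total', 'Total predictions', ['model', 'status'])
LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

# `model` is the loaded model object from the serving code
@LATENCY.time()
def predict(input_data):
    try:
        result = model(input_data)
    except Exception:
        PREDICTIONS.labels(model='v1', status='error').inc()
        raise
    PREDICTIONS.labels(model='v1', status='success').inc()
    return result

# Expose /metrics for Prometheus to scrape (port 8001 is an arbitrary choice)
start_http_server(8001)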

Grafana Dashboard

Key metrics to monitor:

Request rate (RPS)
Latency (p50, p95, p99)
Error rate
GPU utilization
Memory usage
Model drift indicators
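
Two of the metrics above can be sketched in a few lines. The first function computes nearest-rank latency percentiles. The second computes the Population Stability Index (PSI), one commonly used drift indicator, over equal-width bins derived from a reference sample. The original does not name a specific drift metric, so PSI, the bin count, and the `eps` floor are all assumptions here.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # Integer form of ceil(p * n / 100)
    return ordered[rank - 1]


def population_stability_index(
    expected: list[float],
    actual: list[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample (expected) and a live sample (actual)."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins
    if width == 0:
        width = 1.0  # All reference values identical: avoid division by zero.

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # Clamp out-of-range values.
        # Floor each share at eps so the logarithm below is always defined.
        return [max(count / len(values), eps) for count in counts]

    expected_share = proportions(expected)
    actual_share = proportions(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_share, actual_share)
    )
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though suitable thresholds depend on the feature and the use case.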

Conclusion

AI model deployment can be made reliable and scalable with modern MLOps practices. Docker, Kubernetes, and CI/CD pipelines are the fundamental components of this process.

At Veni AI, we offer enterprise AI deployment solutions. Contact us for your projects.
