
AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

A comprehensive technical guide to production deployment of AI models, containerization, Kubernetes orchestration, and MLOps pipelines.

Veni AI Technical Team · January 8, 2025 · 5 min read

AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

AI model deployment is the process of moving a trained model into a production environment in a reliable, scalable, and maintainable way. This guide walks through modern deployment strategies.

Deployment Patterns

1. Batch Inference

Processing accumulated data in scheduled jobs:

Data Lake → Batch Job → Model Inference → Results Storage

2. Real-Time Inference

Immediate API-based predictions:

Request → API Gateway → Model Server → Response

3. Streaming Inference

Continuous processing of data streams:

Kafka Stream → Stream Processor → Model → Output Stream

4. Edge Deployment

On-device inference:

Mobile/IoT Device → Optimized Model → Local Inference
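The batch pattern above can be sketched as a scheduled job that reads accumulated records, runs the model over them in fixed-size chunks, and collects the results for storage. A minimal, framework-free sketch — `toy_model` is a hypothetical stand-in for a real inference call:

```python
from typing import Callable, Iterable, List

def batch_inference(records: Iterable[str],
                    model: Callable[[List[str]], List[float]],
                    batch_size: int = 32) -> List[float]:
    """Run the model over records in fixed-size chunks,
    the way a scheduled batch job would."""
    records = list(records)
    results: List[float] = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        results.extend(model(chunk))  # one inference call per chunk
    return results

# Stand-in model: "scores" each record by its length.
toy_model = lambda chunk: [float(len(r)) for r in chunk]
scores = batch_inference(["a", "bb", "ccc"], toy_model, batch_size=2)
# scores == [1.0, 2.0, 3.0]
```

In production the chunked loop stays the same; only the input source (data lake query), the model call, and the result sink change.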

Model Containerization with Docker

Basic Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# System dependencies (curl is needed for the healthcheck below)
RUN apt-get update && apt-get install -y \
    curl \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model and code
COPY model/ ./model/
COPY src/ ./src/

# Port
EXPOSE 8000

# Healthcheck
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Start command
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Multi-Stage Build

```dockerfile
# Build stage
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Production stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*
COPY . .
CMD ["python", "main.py"]
```

GPU Support

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Python installation
RUN apt-get update && apt-get install -y python3 python3-pip

# PyTorch GPU
RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu121

COPY . /app
WORKDIR /app
CMD ["python3", "inference.py"]
```

Model Serving Frameworks

FastAPI Server

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch

app = FastAPI()

# Load model (startup)
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = torch.load("model.pt")
    model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    prediction: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    if model is None:
        raise HTTPException(500, "Model not loaded")

    with torch.no_grad():
        output = model(request.text)

    return PredictionResponse(
        prediction=output["label"],
        confidence=output["score"]
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
```

TorchServe

```bash
# Create model archive
torch-model-archiver \
    --model-name mymodel \
    --version 1.0 \
    --model-file model.py \
    --serialized-file model.pt \
    --handler handler.py

# Start serving
torchserve --start \
    --model-store model_store \
    --models mymodel=mymodel.mar
```

Triton Inference Server

```text
# config.pbtxt
name: "text_classifier"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ -1, 2 ]
  }
]
instance_group [
  { count: 2, kind: KIND_GPU }
]
```

Kubernetes Deployment

Basic Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        image: myregistry/model-server:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```

GPU Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-model-server
  template:
    metadata:
      labels:
        app: gpu-model-server
    spec:
      containers:
      - name: model
        image: myregistry/gpu-model:v1.0
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```

Horizontal Pod Autoscaler

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 100
```
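The HPA sizes the Deployment from the ratio of observed to target metric values: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured min/max bounds. A small sketch of that calculation (a simplification — the real controller also applies stabilization windows and tolerances):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """HPA scaling rule: scale by the metric ratio, then clamp."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# CPU at 140% against the 70% target doubles the replica count: 3 -> 6.
print(desired_replicas(3, current_metric=140, target_metric=70))
```

With the manifest above (min 2, max 10, CPU target 70%), sustained CPU at twice the target doubles the pod count until the ceiling is hit.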

Service & Ingress

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-ingress
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  rules:
  - host: model.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: model-service
            port:
              number: 80
```
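The rate-limit annotation above caps how many requests per second each client may send. The idea behind such limiters can be illustrated with a token bucket — this is a conceptual sketch, not the nginx implementation:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)  # 100 req/s, burst of 10
```

Each request consumes one token; tokens refill at the configured rate, so sustained traffic is throttled while short bursts pass through.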

MLOps Pipeline

CI/CD Pipeline

```yaml
# .github/workflows/mlops.yml
name: MLOps Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Train model
        run: python train.py
      - name: Evaluate model
        run: python evaluate.py
      - name: Register model
        if: success()
        run: python register_model.py

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - name: Build image
        run: docker build -t myregistry/model:${{ github.sha }} .
      - name: Push to registry
        run: docker push myregistry/model:${{ github.sha }}
      - name: Deploy to K8s
        run: kubectl set image deployment/model model=myregistry/model:${{ github.sha }}
```

Model Registry

```python
import mlflow

# Registering model
with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    mlflow.pytorch.log_model(model, "model")

# Loading model
model_uri = "models:/text-classifier/production"
model = mlflow.pytorch.load_model(model_uri)
```

Canary Deployment

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-service
spec:
  hosts:
  - model-service
  http:
  - route:
    - destination:
        host: model-service-v1
      weight: 90
    - destination:
        host: model-service-v2
      weight: 10
```
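The 90/10 canary split amounts to a weighted random choice over destinations on every request. This sketch mimics that behavior in plain Python (the host names follow the VirtualService; the mechanics are illustrative, not Istio's actual proxy code):

```python
import random
from collections import Counter

# Mirrors the VirtualService: 90% stable, 10% canary.
ROUTES = [("model-service-v1", 90), ("model-service-v2", 10)]

def pick_destination(routes=ROUTES, rng=random) -> str:
    """Weighted route selection, applied independently per request."""
    hosts, weights = zip(*routes)
    return rng.choices(hosts, weights=weights, k=1)[0]

# Over many requests, roughly 10% of traffic reaches the canary.
counts = Counter(pick_destination() for _ in range(10_000))
```

If the canary's error rate and latency stay healthy, the weights are shifted gradually (e.g. 75/25, 50/50) until v2 takes all traffic.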

Monitoring

Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter('predictions_total', 'Total predictions', ['model', 'status'])
LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

@LATENCY.time()
def predict(input_data):
    result = model(input_data)
    PREDICTIONS.labels(model='v1', status='success').inc()
    return result
```

Grafana Dashboards

Key metrics to monitor:

  • Request rate (RPS)
  • Latency (p50, p95, p99)
  • Error rate
  • GPU utilization
  • Memory usage
  • Model drift indicators
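The drift indicators in the list above are typically computed by comparing the distribution of recent inputs or predictions against a reference window. One common choice is the Population Stability Index (PSI); a minimal sketch over pre-binned proportions (the thresholds in the docstring are the usual rule of thumb, not a universal standard):

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index over two binned distributions
    (each a list of proportions summing to 1). Rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # reference window
shifted  = [0.10, 0.20, 0.30, 0.40]  # recent window
print(round(psi(baseline, baseline), 6))  # 0.0 — identical distributions
```

Exporting the PSI as a Prometheus gauge lets the same Grafana dashboards alert on drift alongside latency and error rate.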

Conclusion

Applying modern MLOps practices makes AI model deployment reliable and scalable. Docker, Kubernetes, and CI/CD pipelines are the core building blocks of this process.

Veni AI provides enterprise AI deployment solutions. Feel free to contact us with any project inquiries.
