
AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

A comprehensive technical guide to production deployment of AI models, containerization, Kubernetes orchestration, and MLOps pipelines.

Veni AI Technical Team · January 8, 2025 · 5 min read

AI Model Deployment: Kubernetes, Docker, and MLOps Strategies

AI model deployment is the process of moving a trained model into a production environment in a reliable, scalable, and maintainable way. This guide walks through modern deployment strategies.

Deployment Patterns

1. Batch Inference

Processing accumulated data in scheduled jobs:

Data Lake → Batch Job → Model Inference → Results Storage

2. Real-Time Inference

Immediate API-based predictions:

Request → API Gateway → Model Server → Response

3. Streaming Inference

Continuous processing of data streams:

Kafka Stream → Stream Processor → Model → Output Stream

4. Edge Deployment

On-device inference:

Mobile/IoT Device → Optimized Model → Local Inference
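The batch pattern above can be sketched as a scheduled job that reads accumulated records, runs the model over them in fixed-size chunks, and collects the results for storage. A minimal, framework-free sketch — `toy_model` is a hypothetical stand-in for a real inference call:

```python
from typing import Callable, Iterable, List

def batch_inference(records: Iterable[str],
                    model: Callable[[List[str]], List[float]],
                    batch_size: int = 32) -> List[float]:
    """Run the model over records in fixed-size chunks,
    the way a scheduled batch job would."""
    records = list(records)
    results: List[float] = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        results.extend(model(chunk))  # one inference call per chunk
    return results

# Stand-in model: "scores" each record by its length.
toy_model = lambda chunk: [float(len(r)) for r in chunk]
scores = batch_inference(["a", "bb", "ccc"], toy_model, batch_size=2)
# scores == [1.0, 2.0, 3.0]
```

In production the chunked loop stays the same; only the input source (data lake query), the model call, and the result sink change.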

Model Containerization with Docker

Basic Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# System dependencies (curl is needed for the healthcheck below)
RUN apt-get update && apt-get install -y \
    curl \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model and code
COPY model/ ./model/
COPY src/ ./src/

# Port
EXPOSE 8000

# Healthcheck
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Start command
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Multi-Stage Build

```dockerfile
# Build stage
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Production stage
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*
COPY . .
CMD ["python", "main.py"]
```

GPU Support

```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Python installation
RUN apt-get update && apt-get install -y python3 python3-pip

# PyTorch GPU
RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu121

COPY . /app
WORKDIR /app
CMD ["python3", "inference.py"]
```

Model Serving Frameworks

FastAPI Server

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch

app = FastAPI()

# Load model (startup)
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = torch.load("model.pt")
    model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    prediction: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    if model is None:
        raise HTTPException(500, "Model not loaded")

    with torch.no_grad():
        output = model(request.text)

    return PredictionResponse(
        prediction=output["label"],
        confidence=output["score"]
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
```

TorchServe

```bash
# Create model archive
torch-model-archiver \
    --model-name mymodel \
    --version 1.0 \
    --model-file model.py \
    --serialized-file model.pt \
    --handler handler.py

# Start serving
torchserve --start \
    --model-store model_store \
    --models mymodel=mymodel.mar
```

Triton Inference Server

```text
# config.pbtxt
name: "text_classifier"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ -1, 2 ]
  }
]
instance_group [
  { count: 2, kind: KIND_GPU }
]
```

Kubernetes Deployment

Basic Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        image: myregistry/model-server:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```

GPU Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-model-server
  template:
    metadata:
      labels:
        app: gpu-model-server
    spec:
      containers:
      - name: model
        image: myregistry/gpu-model:v1.0
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```

Horizontal Pod Autoscaler

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 100
```
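The HPA sizes the Deployment from the ratio of observed to target metric values: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured min/max bounds. A small sketch of that calculation (a simplification — the real controller also applies stabilization windows and tolerances):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """HPA scaling rule: scale by the metric ratio, then clamp."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# CPU at 140% against the 70% target doubles the replica count: 3 -> 6.
print(desired_replicas(3, current_metric=140, target_metric=70))
```

With the manifest above (min 2, max 10, CPU target 70%), sustained CPU at twice the target doubles the pod count until the ceiling is hit.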

Service & Ingress

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-server
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-ingress
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  rules:
  - host: model.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: model-service
            port:
              number: 80
```
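The rate-limit annotation above caps how many requests per second each client may send. The idea behind such limiters can be illustrated with a token bucket — this is a conceptual sketch, not the nginx implementation:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)  # 100 req/s, burst of 10
```

Each request consumes one token; tokens refill at the configured rate, so sustained traffic is throttled while short bursts pass through.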

MLOps Pipeline

CI/CD Pipeline

```yaml
# .github/workflows/mlops.yml
name: MLOps Pipeline

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: pytest tests/

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Train model
        run: python train.py
      - name: Evaluate model
        run: python evaluate.py
      - name: Register model
        if: success()
        run: python register_model.py

  deploy:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - name: Build image
        run: docker build -t myregistry/model:${{ github.sha }} .
      - name: Push to registry
        run: docker push myregistry/model:${{ github.sha }}
      - name: Deploy to K8s
        run: kubectl set image deployment/model model=myregistry/model:${{ github.sha }}
```

Model Registry

```python
import mlflow

# Registering model
with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.93})
    mlflow.pytorch.log_model(model, "model")

# Loading model
model_uri = "models:/text-classifier/production"
model = mlflow.pytorch.load_model(model_uri)
```

Canary Deployment

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-service
spec:
  hosts:
  - model-service
  http:
  - route:
    - destination:
        host: model-service-v1
      weight: 90
    - destination:
        host: model-service-v2
      weight: 10
```
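The 90/10 canary split amounts to a weighted random choice over destinations on every request. This sketch mimics that behavior in plain Python (the host names follow the VirtualService; the mechanics are illustrative, not Istio's actual proxy code):

```python
import random
from collections import Counter

# Mirrors the VirtualService: 90% stable, 10% canary.
ROUTES = [("model-service-v1", 90), ("model-service-v2", 10)]

def pick_destination(routes=ROUTES, rng=random) -> str:
    """Weighted route selection, applied independently per request."""
    hosts, weights = zip(*routes)
    return rng.choices(hosts, weights=weights, k=1)[0]

# Over many requests, roughly 10% of traffic reaches the canary.
counts = Counter(pick_destination() for _ in range(10_000))
```

If the canary's error rate and latency stay healthy, the weights are shifted gradually (e.g. 75/25, 50/50) until v2 takes all traffic.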

Monitoring

Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter('predictions_total', 'Total predictions', ['model', 'status'])
LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

@LATENCY.time()
def predict(input_data):
    result = model(input_data)
    PREDICTIONS.labels(model='v1', status='success').inc()
    return result
```

Grafana Dashboards

Key metrics to monitor:

  • Request rate (RPS)
  • Latency (p50, p95, p99)
  • Error rate
  • GPU utilization
  • Memory usage
  • Model drift indicators
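The drift indicators in the list above are typically computed by comparing the distribution of recent inputs or predictions against a reference window. One common choice is the Population Stability Index (PSI); a minimal sketch over pre-binned proportions (the thresholds in the docstring are the usual rule of thumb, not a universal standard):

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index over two binned distributions
    (each a list of proportions summing to 1). Rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # reference window
shifted  = [0.10, 0.20, 0.30, 0.40]  # recent window
print(round(psi(baseline, baseline), 6))  # 0.0 — identical distributions
```

Exporting the PSI as a Prometheus gauge lets the same Grafana dashboards alert on drift alongside latency and error rate.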

Conclusion

Applying modern MLOps practices makes AI model deployment reliable and scalable. Docker, Kubernetes, and CI/CD pipelines are the core building blocks of this process.

Veni AI provides enterprise AI deployment solutions. Feel free to contact us with any project inquiries.
