Field	Value	Source
Canonical Path	/blog/transformer-mimarisi-attention-mekanizmasi-teknik-analiz	Veni AI Blog
Primary Category	Deep Learning	Post Metadata
Author	Veni AI Technical Team	Post Metadata

Transformer Architecture and Attention Mechanism: Technical Analysis

The Transformer architecture, introduced in the 2017 Google paper "Attention Is All You Need", is the backbone of modern artificial intelligence. Major language models such as GPT, Claude, and Gemini are all built on it.

Before Transformers: RNN and LSTM Limitations

Before the Transformer, NLP tasks relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.

RNN/LSTM Problems:

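Tokens must be processed sequentially, so computation cannot be parallelized across time steps.
Vanishing and exploding gradients on long sequences.
Difficulty learning long-range dependencies.
Very long training times.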

Self-Attention Mechanism

Self-attention is a mechanism in which each element of a sequence computes its relationship to every other element of the same sequence.

Mathematical Formulation

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Parameters:

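Q (Query): the query vectors.
K (Key): the key vectors that queries are matched against.
V (Value): the vectors that carry the actual information.
d_k: the dimension of the key vectors.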

Step-by-Step Calculation

Projection: Input → Q, K, V matrices

Q = X × W_Q
K = X × W_K
V = X × W_V

Attention Scores: dot product of Q and K

scores = Q × K^T

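Scaling: divide by √d_k for gradient stability (large dot products would otherwise push softmax into regions with very small gradients)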

scaled_scores = scores / √d_k

Softmax: convert each row of scores into a probability distribution

attention_weights = softmax(scaled_scores)

Weighted Sum: multiply the weights by the values

output = attention_weights × V
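The five steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions rather than a production implementation: the function name `scaled_dot_product_attention`, the tiny shapes, and the random weights are all chosen only for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    # Step 1: project the input into queries, keys, and values.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Step 2: raw attention scores, one row per query token.
    scores = Q @ K.T
    # Step 3: scale by sqrt(d_k).
    d_k = K.shape[-1]
    scaled_scores = scores / np.sqrt(d_k)
    # Step 4: turn each row into a probability distribution.
    attention_weights = softmax(scaled_scores, axis=-1)
    # Step 5: weighted sum of the value vectors.
    output = attention_weights @ V
    return output, attention_weights

# Illustrative sizes and random weights (assumptions, not from any real model).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, attention_weights = scaled_dot_product_attention(X, W_Q, W_K, W_V)
print(output.shape)
print(attention_weights.sum(axis=-1))
```

Each row of `attention_weights` sums to 1, so every output vector is a convex combination of the value vectors.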

Multi-Head Attention

Instead of a single attention head, several attention heads run in parallel:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) × W_O

where head_i = Attention(QW_Q^i, KW_K^i, VW_V^i)
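A rough NumPy sketch of this formula for the self-attention case (Q = K = V = X) follows. It assumes the per-head matrices W_Q^i, W_K^i, W_V^i are stored as contiguous column slices of single d_model × d_model matrices, which is a common layout but only an assumption here; the function name and sizes are likewise illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    # Scaled dot-product attention, computed for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # (heads, seq, d_head)
    # Concat(head_1, ..., head_h), then the output projection W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

# Illustrative sizes and random weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)
```

Each head works in a d_model / h dimensional subspace, so the total cost stays close to that of one full-width head.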

Advantages of Multi-Head Attention

Each head can learn in a different representation subspace.
Different types of contextual relationships are captured.
Richer feature extraction.

Typical Configurations:

GPT-3: 96 attention heads, d_model = 12288.
GPT-4: estimated at roughly 120 or more heads (the architecture has not been publicly disclosed).

Positional Encoding

Because Transformers process all tokens in parallel, positional information is added to the input to preserve the order of the sequence.

Sinusoidal Positional Encoding

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
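The two formulas translate almost directly into NumPy. The sketch below assumes an even `d_model`; the function name and the sizes in the usage lines are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even so sine/cosine columns pair up.
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the "2i" values
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)  # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

The resulting matrix is added to the token embeddings, one row per position.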

Rotary Positional Embedding (RoPE)

A more advanced scheme used in modern models:

f(x, pos) = x × e^(i × pos × θ)

RoPE Advantages:

Encodes relative position information naturally.
Better extrapolation to long sequences.
Used in the GPT-NeoX, LLaMA, and Mistral models.
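To see why a rotation encodes relative position, the sketch below treats one pair of features as a single complex number and applies the rotation formula above. It is a toy, single-frequency illustration: `theta = 0.1` and the example vectors are arbitrary assumptions, whereas real implementations rotate many feature pairs, each with its own frequency.

```python
import cmath

def rope_rotate(pair, pos, theta):
    # Treat a 2-D feature pair (a, b) as a + bi and rotate it by pos * theta.
    return pair * cmath.exp(1j * pos * theta)

theta = 0.1             # assumed rotation frequency for this single pair
q = complex(1.0, 2.0)   # toy query feature pair
k = complex(0.5, -1.0)  # toy key feature pair

def score(m, n):
    # The real part of q_m * conj(k_n) is the dot product of the two
    # rotated 2-D vectors, i.e. their attention score contribution.
    return (rope_rotate(q, m, theta) * rope_rotate(k, n, theta).conjugate()).real

# Both pairs of positions are 2 apart, so the scores agree.
print(score(5, 3), score(12, 10))
```

The score depends only on the offset m - n, not on the absolute positions, which is the sense in which RoPE encodes relative position.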

Feed-Forward Network

An MLP layer follows every attention layer:

FFN(x) = GELU(xW_1 + b_1)W_2 + b_2

Typical Dimensions:

d_model = 4096.
d_ff = 4 × d_model = 16384.
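A small NumPy sketch of this layer, together with the parameter count implied by the dimensions above. It uses the widely used tanh approximation of GELU instead of the exact erf form; the tiny sizes, random weights, and helper names are illustrative assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact form uses erf).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W_1, b_1, W_2, b_2):
    # FFN(x) = GELU(x W_1 + b_1) W_2 + b_2, applied to each position independently.
    return gelu(x @ W_1 + b_1) @ W_2 + b_2

def ffn_param_count(d_model, d_ff):
    # Two weight matrices plus two bias vectors.
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

# Tiny illustrative sizes so the example runs instantly.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W_1, b_1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
x = rng.normal(size=(3, d_model))
y = feed_forward(x, W_1, b_1, W_2, b_2)
print(y.shape)

# Parameters in one FFN layer at d_model = 4096, d_ff = 16384.
print(ffn_param_count(4096, 16384))  # 134238208
```

At these dimensions a single feed-forward layer holds about 134 million parameters.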

Activation Functions

ReLU: classic and simple.
GELU: preferred in GPT-family models.
SwiGLU: used in the LLaMA and PaLM models.

Layer Normalization

Critical for training stability.

Pre-LN vs Post-LN

Post-LN (Original):

x = LayerNorm(x + Attention(x))

Pre-LN (Modern):

x = x + Attention(LayerNorm(x))

Pre-LN trains more stably and has become the de facto industry standard.
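The difference between the two orderings is easiest to see in code. In the sketch below, Post-LN normalizes after the residual addition, while Pre-LN normalizes only the sublayer input and leaves the residual path untouched. The `layer_norm` here omits the learnable gain and bias, and `sublayer` is a hypothetical stand-in (a single matrix multiply) for attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learnable gain and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN ordering: normalize the sublayer input, keep the residual path as-is.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sublayer = lambda h: h @ W  # hypothetical stand-in for Attention(.)
x = rng.normal(size=(4, 8))

post = post_ln_block(x, sublayer)
pre = pre_ln_block(x, sublayer)
print(post.mean(axis=-1))  # close to 0: the output itself is normalized
print(np.allclose(pre - sublayer(layer_norm(x)), x))  # identity path preserved
```

In the Pre-LN form the input `x` reaches the output through an unmodified identity path, which is the property usually credited for its more stable gradients in deep stacks.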

Encoder vs Decoder Architectures

Encoder-Only (BERT-style)

Bidirectional attention.
Used for classification, NER, and semantic similarity.
Masked Language Modeling.

Decoder-Only (GPT-style)

Causal/autoregressive attention.
Used for text generation and dialogue.
Next token prediction.

Encoder-Decoder (T5, BART)

Sequence-to-sequence tasks.
Translation and summarization.

Causal Masking

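Decoder models mask out future tokens:

import torch
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # True marks future positions
masked_scores = scores.masked_fill(mask, float("-inf"))  # fill directly; multiplying a 0/1 mask by -inf yields NaN at the zeros

This ensures that, during generation, each position attends only to itself and earlier tokens.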
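A self-contained NumPy sketch of causal masking followed by softmax is shown below. The all-zero `scores` matrix is an artificial input chosen so the effect of the mask is easy to read; the function name is illustrative.

```python
import numpy as np

def causal_attention_weights(scores):
    seq_len = scores.shape[-1]
    # True strictly above the diagonal, i.e. at future positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # np.where sets future positions to -inf without the 0 * -inf = NaN pitfall.
    masked_scores = np.where(future, -np.inf, scores)
    shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) = 0, so future tokens get zero weight
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # artificial scores: every token equally relevant
weights = causal_attention_weights(scores)
print(weights)
```

With equal scores, row i spreads its weight uniformly over positions 0 through i and assigns exactly zero to every later position.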

KV-Cache Optimization

To avoid redundant recomputation during inference:

Step 1: Calculate K_1, V_1 → save to cache
Step 2: Calculate K_2, V_2 → K = [K_1, K_2], V = [V_1, V_2]
Step n: Calculate only for the new token, retrieve old values from cache

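Compute savings: the attention cost per generation step drops from O(n²) to O(n), in exchange for cache memory that grows linearly with sequence length.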
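The caching steps above can be sketched for a single attention head as follows. The `KVCache` class and `decode_step` function are hypothetical names invented for this example, and the weights are random; the point is only that projecting the new token alone and reusing cached keys and values gives the same result as full causal attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Hypothetical minimal cache holding one key and one value vector per past token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def stacked(self):
        return np.stack(self.keys), np.stack(self.values)

def decode_step(x_new, W_Q, W_K, W_V, cache):
    # Only the new token is projected; older K and V rows come from the cache.
    q = x_new @ W_Q
    cache.append(x_new @ W_K, x_new @ W_V)
    K, V = cache.stacked()
    weights = softmax(K @ q / np.sqrt(q.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))  # embeddings of 5 tokens, fed one at a time
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

cache = KVCache()
cached_outputs = np.stack([decode_step(x, W_Q, W_K, W_V, cache) for x in X])
print(cached_outputs.shape, len(cache.keys))
```

No causal mask is needed inside `decode_step`, because the cache only ever contains the current and earlier tokens.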

Flash Attention

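A memory-efficient attention implementation:

Problems with standard attention:

O(n²) memory usage.
HBM (High Bandwidth Memory) bottleneck.

How Flash Attention addresses them:

Tiling: attention is split into blocks.
Online Softmax: softmax is computed incrementally with running statistics.
I/O Aware: optimized for the GPU memory hierarchy.

Result: a 2-4x speedup, with attention memory that grows linearly rather than quadratically with sequence length.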
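The numerical idea behind tiling and online softmax can be shown for a single query in NumPy. This sketch only illustrates the arithmetic: the real Flash Attention is a fused GPU kernel, and the function name, block size, and random inputs here are assumptions made for the example.

```python
import numpy as np

def attention_row_tiled(q, K, V, block_size):
    m = -np.inf                   # running maximum of the scores seen so far
    l = 0.0                       # running softmax denominator
    acc = np.zeros(V.shape[-1])   # running un-normalized weighted sum of values
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        s = K_blk @ q / np.sqrt(q.shape[-1])
        m_new = max(m, s.max())
        # Rescale the earlier partial results to the new maximum.
        correction = np.exp(m - m_new)
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V_blk
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
n, d = 10, 4
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

tiled = attention_row_tiled(q, K, V, block_size=3)

# Reference: ordinary softmax attention over all keys at once.
s = K @ q / np.sqrt(d)
w = np.exp(s - s.max())
reference = (w / w.sum()) @ V
print(np.allclose(tiled, reference))
```

Because only one block of scores exists at a time, the full n × n score matrix never has to be materialized.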

Sparse Attention Variants

Techniques that reduce attention complexity for long contexts:

Local Attention

Attends only to nearby tokens.

Dilated Attention

Applies attention at fixed intervals.

Longformer Pattern

Combines local and global attention.
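The three patterns can be expressed as boolean masks in which `True` means "query i may attend to key j". This is a simplified sketch: the function names, the symmetric (non-causal) windows, and the exact definition of the dilated pattern are assumptions made for illustration, not the definitions used by any particular model.

```python
import numpy as np

def local_mask(seq_len, window):
    idx = np.arange(seq_len)
    # Token i attends to token j when |i - j| <= window.
    return np.abs(idx[:, None] - idx[None, :]) <= window

def dilated_mask(seq_len, window, dilation):
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :])
    # Attend to every `dilation`-th token, up to `window` steps away.
    return (dist % dilation == 0) & (dist <= window * dilation)

def longformer_style_mask(seq_len, window, global_tokens):
    mask = local_mask(seq_len, window)
    mask[global_tokens, :] = True  # global tokens attend to every position
    mask[:, global_tokens] = True  # every position attends to global tokens
    return mask

print(local_mask(6, 1).astype(int))
print(dilated_mask(8, 2, 2)[0].astype(int))
print(longformer_style_mask(6, 1, [0]).astype(int))
```

For a fixed window, the number of allowed pairs grows linearly with sequence length instead of quadratically.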

Modern Transformer Variants

Model	Feature	Context Length
GPT-4	MoE, long context	128K
Claude 3	Constitutional AI	200K
Gemini 1.5	Sparse MoE	1M
Mistral	Sliding window	32K

Conclusion

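The Transformer architecture is the foundation of modern AI. Its self-attention mechanism, its capacity for parallel processing, and its ability to learn long-range dependencies are what make it so widely regarded as revolutionary.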

Veni AI puts transformer-based models to effective use in its enterprise solutions. If you need technical consulting, feel free to contact us.


Reference Overview