
Multimodal AI Systems: Image, Text, and Audio Analysis

A comprehensive technical guide to the architecture of multimodal AI systems, vision-language models, audio processing, and multimodal fusion.

Veni AI Technical Team · January 9, 2025 · 5 min read

Multimodal AI refers to artificial intelligence systems capable of understanding and processing multiple data types (text, image, audio, video). Models such as GPT-4V, Gemini, and Claude 3 have broken new ground in this field.

Multimodal AI Fundamentals

Modality Types

  1. Text: Natural language, code, structured data
  2. Vision: Photos, diagrams, screenshots
  3. Audio: Speech, music, environmental sounds
  4. Video: Moving images combined with audio

Why Multimodal?

  • Human communication is inherently multimodal
  • A single modality misses contextual information
  • Richer meaning can be extracted from combined signals
  • Better fit for real-world applications

Vision-Language Models

Architectural Approaches

1. Contrastive Learning (CLIP-style)

```
Image Encoder → Image Embedding
Text Encoder  → Text Embedding
Contrastive Loss: Match(image, text)
```
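The contrastive objective above can be sketched in a few lines of NumPy. This is a minimal illustration of the symmetric InfoNCE loss used by CLIP-style models, not a training-ready implementation; the function names and the temperature value are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # diagonal entries are the correct pairs

    def xent(l):
        # Cross-entropy of each row against its diagonal label
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image→text and text→image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Matched pairs (identical embeddings) should yield a much lower loss than deliberately mismatched ones, which is exactly what the training signal exploits.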

2. Generative (GPT-4V style)

Image → Vision Encoder → Visual Tokens
Visual Tokens + Text Tokens → LLM → Response

3. Cross-Attention Fusion

Image Features ←Cross-Attention→ Text Features

Vision Encoder Types

| Encoder  | Architecture | Resolution | Feature      |
|----------|--------------|------------|--------------|
| ViT      | Transformer  | 224–1024   | Patch-based  |
| CLIP ViT | Transformer  | 336        | Contrastive  |
| SigLIP   | Transformer  | 384        | Sigmoid loss |
| ConvNeXt | CNN          | Flexible   | Efficient    |

Image Tokenization

Patch Embedding:

224×224 image → 14×14 patch grid → 196 visual tokens
Each patch: 16×16 pixels → Linear projection → Embedding
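The reshape behind patch tokenization can be sketched directly in NumPy; this toy `patchify` (an illustrative name, not a library function) shows how a 224×224 RGB image becomes 196 flattened patch vectors ready for the linear projection.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid_h, grid_w = h // patch, w // patch
    patches = (image
               .reshape(grid_h, patch, grid_w, patch, c)
               .transpose(0, 2, 1, 3, 4)   # group pixels by patch
               .reshape(grid_h * grid_w, patch * patch * c))
    return patches  # each row is later mapped by a learned linear projection

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```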

Variable Resolution:

AnyRes approach:
1. Divide the image into tiles
2. Encode each tile separately
3. Add a global thumbnail
4. Concatenate all tokens
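The tiling steps above can be sketched as follows. This is a simplified illustration, assuming 512×512 tiles and a stride-based thumbnail; production AnyRes implementations pad, resample, and pick grid shapes more carefully.

```python
import numpy as np

def anyres_tiles(image, tile=512, thumb=512):
    """Split a large image into fixed-size tiles plus a global thumbnail."""
    h, w, _ = image.shape
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            tiles.append(image[y:y + tile, x:x + tile])
    # Global thumbnail via naive stride-based downsampling (illustration only)
    thumbnail = image[::max(1, h // thumb), ::max(1, w // thumb)]
    return tiles, thumbnail

img = np.zeros((1024, 2048, 3))
tiles, thumbnail = anyres_tiles(img)
print(len(tiles))  # 8 tiles of 512×512
```

Each tile and the thumbnail would then be encoded separately and the resulting visual tokens concatenated.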

Multimodal LLM Implementation

GPT-4V Usage

```python
from openai import OpenAI
import base64

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('image.jpg')}",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)
```

Claude 3 Vision

```python
from anthropic import Anthropic
import base64

client = Anthropic()

with open("image.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ]
)
```

Audio Processing

Speech-to-Text (STT)

Whisper Model:

```python
from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en"
    )

print(transcript.text)
```

Text-to-Speech (TTS)

```python
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Hello, I am an AI assistant."
)

response.stream_to_file("output.mp3")
```

Real-time Audio Pipeline

```
Microphone → VAD → Chunking → STT → LLM → TTS → Speaker
             (Voice Activity Detection)
```
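The VAD stage in the pipeline above can be sketched with a simple per-frame energy threshold. This is a toy illustration on a synthetic signal; real systems use trained detectors such as Silero VAD or WebRTC VAD, and the frame length and threshold here are arbitrary.

```python
import numpy as np

def energy_vad(samples, frame_len=400, threshold=0.01):
    """Mark each frame as speech/silence by mean energy (a toy VAD)."""
    n = len(samples) // frame_len
    frames = samples[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # True = speech frame

# Synthetic signal: silence, then a loud tone, then silence
sig = np.concatenate([
    np.zeros(4000),
    0.5 * np.sin(np.linspace(0, 100, 4000)),
    np.zeros(4000),
])
print(energy_vad(sig).astype(int))
```

In the full pipeline, consecutive speech frames would be grouped into chunks before being sent to the STT model.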

Video Understanding

Frame Sampling Strategies

1. Uniform Sampling:

```python
import cv2
import numpy as np

def uniform_sample(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(frame)

    cap.release()
    return frames
```

2. Keyframe Extraction:

```python
def extract_keyframes(video_path, threshold=30):
    # Find keyframes via scene-change detection
    pass
```
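One simple scene-change heuristic keeps a frame whenever it differs enough from the last kept frame. The sketch below is an illustration on synthetic NumPy "frames" rather than real cv2 captures, and the mean-absolute-difference threshold is arbitrary.

```python
import numpy as np

def keyframes_by_diff(frames, threshold=30):
    """Keep frame indices whose mean absolute pixel difference from the
    previously kept frame exceeds the threshold."""
    keep = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(int) - frames[keep[-1]].astype(int)).mean()
        if diff > threshold:
            keep.append(i)
    return keep

# Three synthetic "scenes": dark, bright, mid-gray (two frames each)
frames = [np.full((4, 4), v, dtype=np.uint8) for v in (10, 10, 200, 200, 90, 90)]
print(keyframes_by_diff(frames))  # [0, 2, 4]
```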

Video-LLM Pipeline

```
Video ─┬→ Frame Sampling → Per-frame Encoding → Temporal Aggregation ─┬→ LLM
       └→ Audio Extraction → STT → Text ──────────────────────────────┘
```

Modality Fusion

Early Fusion

Combining modalities at model input:

[CLS] [IMG_1] ... [IMG_N] [SEP] [TXT_1] ... [TXT_M] [SEP]

Late Fusion

Processing each modality separately and combining results:

```
Image → Image Model → Image Features ─┐
                                      ├→ Fusion Layer → Output
Text → Text Model → Text Features ────┘
```
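A minimal late-fusion sketch in NumPy: features from two frozen encoders are concatenated and passed through a single linear fusion layer. The feature dimensions and the weight matrix are arbitrary placeholders (random here, learned in practice).

```python
import numpy as np

rng = np.random.default_rng(42)

image_features = rng.normal(size=(1, 512))  # output of a vision model
text_features = rng.normal(size=(1, 256))   # output of a text model

# Fusion layer: concatenate, then project (weights would be learned)
fused = np.concatenate([image_features, text_features], axis=-1)  # (1, 768)
W = rng.normal(size=(768, 128))
output = fused @ W  # (1, 128) joint representation
print(output.shape)
```

The appeal of late fusion is that each encoder can be trained, cached, and scaled independently; only the small fusion layer sees both modalities.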

Cross-Modal Attention

Attention between modalities:

```
Q = Text Features
K, V = Image Features
Cross_Attention(Q, K, V) = softmax(QK^T/√d)V
```
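The formula above translates directly into NumPy. This single-head sketch (no learned projections, illustrative shapes) lets 5 text tokens attend over 196 visual tokens:

```python
import numpy as np

def cross_attention(Q, K, V):
    """softmax(QK^T/√d)V — queries from one modality, keys/values from another."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 64))     # 5 text tokens (queries)
image = rng.normal(size=(196, 64))  # 196 visual tokens (keys/values)
out = cross_attention(text, image, image)
print(out.shape)  # (5, 64)
```

Each output row is a weighted mixture of visual tokens, so the text representation is enriched with image context.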

OCR and Document Understanding

Document AI Pipeline

```python
def process_document(image_path):
    # Placeholder helpers below; plug in your own layout/OCR/LLM models
    image = load_image(image_path)

    # 1. Layout Detection
    layout = detect_layout(image)  # headings, paragraphs, tables

    # 2. OCR
    text_regions = ocr_extract(image)

    # 3. Structure Understanding
    structured_doc = parse_structure(layout, text_regions)

    # 4. LLM Analysis
    analysis = llm_analyze(structured_doc)

    return analysis
```

Table Extraction

```python
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": table_image_url}},
            {"type": "text", "text": "Extract this table in JSON format"}
        ]
    }]
)
```

Enterprise Multimodal Applications

1. Document Processing

  • Invoice/receipt OCR
  • Contract analysis
  • Form data extraction

2. Visual Search

  • Search from product image
  • Similar image finding
  • Visual Q&A

3. Content Moderation

  • Inappropriate image detection
  • Brand logo check
  • Text + image consistency

4. Customer Support

  • Screenshot analysis
  • Visual troubleshooting
  • Voice support

Performance Optimization

Image Preprocessing

```python
import io
from PIL import Image

def optimize_image(image_path, max_size=1024, quality=85):
    img = Image.open(image_path)

    # Resize
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(d * ratio) for d in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Compress
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)

    return buffer.getvalue()
```

Batch Processing

```python
import asyncio

async def batch_image_analysis(images, batch_size=5):
    results = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size]
        tasks = [analyze_image(img) for img in batch]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
    return results
```

Cost Management

Token Calculation (Vision)

```
GPT-4V Token Cost:
- Low detail: 85 tokens/image
- High detail: 85 + 170 × tile_count

Example (2048×1024, high detail):
Tiles: ceil(2048/512) × ceil(1024/512) = 4 × 2 = 8
Tokens: 85 + 170 × 8 = 1445 tokens
```
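The simplified formula above can be wrapped in a small estimator. Note this is a sketch of the formula as stated here; OpenAI's actual accounting also rescales images before counting tiles, so treat the result as an estimate.

```python
import math

def vision_tokens(width, height, detail="high", tile=512):
    """Estimate image token usage with the simplified tile formula."""
    if detail == "low":
        return 85
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return 85 + 170 * tiles

print(vision_tokens(2048, 1024, "high"))  # 1445
print(vision_tokens(2048, 1024, "low"))   # 85
```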

Optimization Strategies

  1. Adjust the detail level: do not use "high" unless necessary
  2. Reduce image size: fewer pixels means fewer tokens
  3. Cache results: do not re-analyze the same image
  4. Batch operations: reduce the number of API calls

Conclusion

Multimodal AI is the approach that brings artificial intelligence closest to human-like understanding. Combining image, text, and audio modalities makes it possible to build more powerful and useful AI applications.

At Veni AI, we develop multimodal AI solutions. Contact us for your projects.
