
Multimodal AI Systems: Image, Text, and Audio Analysis

A comprehensive technical guide to the architecture of multimodal AI systems, vision-language models, audio processing, and multimodal fusion.

Veni AI Technical Team · January 9, 2025 · 5 min read

Multimodal AI refers to artificial intelligence systems capable of understanding and processing multiple data types (text, image, audio, video). Models such as GPT-4V, Gemini, and Claude 3 have broken new ground in this field.

Multimodal AI Fundamentals

Modality Types

  1. Text: Natural language, code, structured data
  2. Vision: Photos, diagrams, screenshots
  3. Audio: Speech, music, environmental sounds
  4. Video: Moving images combined with audio

Why Multimodal?

  • Human communication is inherently multimodal
  • A single modality misses contextual information
  • Richer meaning can be extracted from combined signals
  • Better fit for real-world applications

Vision-Language Models

Architectural Approaches

1. Contrastive Learning (CLIP-style)

```
Image Encoder → Image Embedding
Text Encoder  → Text Embedding
Contrastive Loss: Match(image, text)
```
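The contrastive objective above can be sketched in a few lines of NumPy. This is a minimal illustration of the symmetric InfoNCE loss used by CLIP-style models, not a training-ready implementation; the function names and the temperature value are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # diagonal entries are the correct pairs

    def xent(l):
        # Cross-entropy of each row against its diagonal label
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image→text and text→image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Matched pairs (identical embeddings) should yield a much lower loss than deliberately mismatched ones, which is exactly what the training signal exploits.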

2. Generative (GPT-4V style)

Image → Vision Encoder → Visual Tokens
Visual Tokens + Text Tokens → LLM → Response

3. Cross-Attention Fusion

Image Features ←Cross-Attention→ Text Features

Vision Encoder Types

| Encoder  | Architecture | Resolution | Feature      |
|----------|--------------|------------|--------------|
| ViT      | Transformer  | 224–1024   | Patch-based  |
| CLIP ViT | Transformer  | 336        | Contrastive  |
| SigLIP   | Transformer  | 384        | Sigmoid loss |
| ConvNeXt | CNN          | Flexible   | Efficient    |

Image Tokenization

Patch Embedding:

224×224 image → 14×14 patch grid → 196 visual tokens
Each patch: 16×16 pixels → Linear projection → Embedding
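The reshape behind patch tokenization can be sketched directly in NumPy; this toy `patchify` (an illustrative name, not a library function) shows how a 224×224 RGB image becomes 196 flattened patch vectors ready for the linear projection.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid_h, grid_w = h // patch, w // patch
    patches = (image
               .reshape(grid_h, patch, grid_w, patch, c)
               .transpose(0, 2, 1, 3, 4)   # group pixels by patch
               .reshape(grid_h * grid_w, patch * patch * c))
    return patches  # each row is later mapped by a learned linear projection

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```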

Variable Resolution:

AnyRes approach:
1. Divide the image into tiles
2. Encode each tile separately
3. Add a global thumbnail
4. Concatenate all tokens
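The tiling steps above can be sketched as follows. This is a simplified illustration, assuming 512×512 tiles and a stride-based thumbnail; production AnyRes implementations pad, resample, and pick grid shapes more carefully.

```python
import numpy as np

def anyres_tiles(image, tile=512, thumb=512):
    """Split a large image into fixed-size tiles plus a global thumbnail."""
    h, w, _ = image.shape
    tiles = []
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            tiles.append(image[y:y + tile, x:x + tile])
    # Global thumbnail via naive stride-based downsampling (illustration only)
    thumbnail = image[::max(1, h // thumb), ::max(1, w // thumb)]
    return tiles, thumbnail

img = np.zeros((1024, 2048, 3))
tiles, thumbnail = anyres_tiles(img)
print(len(tiles))  # 8 tiles of 512×512
```

Each tile and the thumbnail would then be encoded separately and the resulting visual tokens concatenated.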

Multimodal LLM Implementation

GPT-4V Usage

```python
from openai import OpenAI
import base64

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('image.jpg')}",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)
```

Claude 3 Vision

```python
from anthropic import Anthropic
import base64

client = Anthropic()

with open("image.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ]
)
```

Audio Processing

Speech-to-Text (STT)

Whisper Model:

```python
from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en"
    )

print(transcript.text)
```

Text-to-Speech (TTS)

```python
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Hello, I am an AI assistant."
)

response.stream_to_file("output.mp3")
```

Real-time Audio Pipeline

```
Microphone → VAD → Chunking → STT → LLM → TTS → Speaker
             (Voice Activity Detection)
```
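The VAD stage in the pipeline above can be sketched with a simple per-frame energy threshold. This is a toy illustration on a synthetic signal; real systems use trained detectors such as Silero VAD or WebRTC VAD, and the frame length and threshold here are arbitrary.

```python
import numpy as np

def energy_vad(samples, frame_len=400, threshold=0.01):
    """Mark each frame as speech/silence by mean energy (a toy VAD)."""
    n = len(samples) // frame_len
    frames = samples[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # True = speech frame

# Synthetic signal: silence, then a loud tone, then silence
sig = np.concatenate([
    np.zeros(4000),
    0.5 * np.sin(np.linspace(0, 100, 4000)),
    np.zeros(4000),
])
print(energy_vad(sig).astype(int))
```

In the full pipeline, consecutive speech frames would be grouped into chunks before being sent to the STT model.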

Video Understanding

Frame Sampling Strategies

1. Uniform Sampling:

```python
import cv2
import numpy as np

def uniform_sample(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(frame)

    cap.release()
    return frames
```

2. Keyframe Extraction:

```python
def extract_keyframes(video_path, threshold=30):
    # Find keyframes via scene-change detection
    pass
```
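One simple scene-change heuristic keeps a frame whenever it differs enough from the last kept frame. The sketch below is an illustration on synthetic NumPy "frames" rather than real cv2 captures, and the mean-absolute-difference threshold is arbitrary.

```python
import numpy as np

def keyframes_by_diff(frames, threshold=30):
    """Keep frame indices whose mean absolute pixel difference from the
    previously kept frame exceeds the threshold."""
    keep = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(int) - frames[keep[-1]].astype(int)).mean()
        if diff > threshold:
            keep.append(i)
    return keep

# Three synthetic "scenes": dark, bright, mid-gray (two frames each)
frames = [np.full((4, 4), v, dtype=np.uint8) for v in (10, 10, 200, 200, 90, 90)]
print(keyframes_by_diff(frames))  # [0, 2, 4]
```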

Video-LLM Pipeline

```
Video ─┬→ Frame Sampling → Per-frame Encoding → Temporal Aggregation ─┬→ LLM
       └→ Audio Extraction → STT → Text ──────────────────────────────┘
```

Modality Fusion

Early Fusion

Combining modalities at model input:

[CLS] [IMG_1] ... [IMG_N] [SEP] [TXT_1] ... [TXT_M] [SEP]

Late Fusion

Processing each modality separately and combining results:

```
Image → Image Model → Image Features ─┐
                                      ├→ Fusion Layer → Output
Text → Text Model → Text Features ────┘
```
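A minimal late-fusion sketch in NumPy: features from two frozen encoders are concatenated and passed through a single linear fusion layer. The feature dimensions and the weight matrix are arbitrary placeholders (random here, learned in practice).

```python
import numpy as np

rng = np.random.default_rng(42)

image_features = rng.normal(size=(1, 512))  # output of a vision model
text_features = rng.normal(size=(1, 256))   # output of a text model

# Fusion layer: concatenate, then project (weights would be learned)
fused = np.concatenate([image_features, text_features], axis=-1)  # (1, 768)
W = rng.normal(size=(768, 128))
output = fused @ W  # (1, 128) joint representation
print(output.shape)
```

The appeal of late fusion is that each encoder can be trained, cached, and scaled independently; only the small fusion layer sees both modalities.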

Cross-Modal Attention

Attention between modalities:

```
Q = Text Features
K, V = Image Features
Cross_Attention(Q, K, V) = softmax(QK^T/√d)V
```
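The formula above translates directly into NumPy. This single-head sketch (no learned projections, illustrative shapes) lets 5 text tokens attend over 196 visual tokens:

```python
import numpy as np

def cross_attention(Q, K, V):
    """softmax(QK^T/√d)V — queries from one modality, keys/values from another."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 64))     # 5 text tokens (queries)
image = rng.normal(size=(196, 64))  # 196 visual tokens (keys/values)
out = cross_attention(text, image, image)
print(out.shape)  # (5, 64)
```

Each output row is a weighted mixture of visual tokens, so the text representation is enriched with image context.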

OCR and Document Understanding

Document AI Pipeline

```python
def process_document(image_path):
    # Placeholder helpers below; plug in your own layout/OCR/LLM models
    image = load_image(image_path)

    # 1. Layout Detection
    layout = detect_layout(image)  # headings, paragraphs, tables

    # 2. OCR
    text_regions = ocr_extract(image)

    # 3. Structure Understanding
    structured_doc = parse_structure(layout, text_regions)

    # 4. LLM Analysis
    analysis = llm_analyze(structured_doc)

    return analysis
```

Table Extraction

```python
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": table_image_url}},
            {"type": "text", "text": "Extract this table in JSON format"}
        ]
    }]
)
```

Enterprise Multimodal Applications

1. Document Processing

  • Invoice/receipt OCR
  • Contract analysis
  • Form data extraction

2. Visual Search

  • Search from product image
  • Similar image finding
  • Visual Q&A

3. Content Moderation

  • Inappropriate image detection
  • Brand logo check
  • Text + image consistency

4. Customer Support

  • Screenshot analysis
  • Visual troubleshooting
  • Voice support

Performance Optimization

Image Preprocessing

```python
import io
from PIL import Image

def optimize_image(image_path, max_size=1024, quality=85):
    img = Image.open(image_path)

    # Resize
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(d * ratio) for d in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Compress
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)

    return buffer.getvalue()
```

Batch Processing

```python
import asyncio

async def batch_image_analysis(images, batch_size=5):
    results = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size]
        tasks = [analyze_image(img) for img in batch]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
    return results
```

Cost Management

Token Calculation (Vision)

```
GPT-4V Token Cost:
- Low detail: 85 tokens/image
- High detail: 85 + 170 × tile_count

Example (2048×1024, high detail):
Tiles: ceil(2048/512) × ceil(1024/512) = 4 × 2 = 8
Tokens: 85 + 170 × 8 = 1445 tokens
```
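The simplified formula above can be wrapped in a small estimator. Note this is a sketch of the formula as stated here; OpenAI's actual accounting also rescales images before counting tiles, so treat the result as an estimate.

```python
import math

def vision_tokens(width, height, detail="high", tile=512):
    """Estimate image token usage with the simplified tile formula."""
    if detail == "low":
        return 85
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return 85 + 170 * tiles

print(vision_tokens(2048, 1024, "high"))  # 1445
print(vision_tokens(2048, 1024, "low"))   # 85
```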

Optimization Strategies

  1. Adjust the detail level: do not use "high" unless necessary
  2. Reduce image size: fewer pixels means fewer tokens
  3. Cache results: do not re-analyze the same image
  4. Batch operations: reduce the number of API calls

Conclusion

Multimodal AI is the approach that brings artificial intelligence closest to human-like understanding. Combining image, text, and audio modalities makes it possible to build more powerful and useful AI applications.

At Veni AI, we develop multimodal AI solutions. Contact us for your projects.
