Multimodal AI Systems: Image, Text, and Audio Analysis
Multimodal AI refers to artificial intelligence systems capable of understanding and processing multiple data types (text, images, audio, video). Models like GPT-4V, Gemini, and Claude 3 have broken new ground in this field.
Multimodal AI Fundamentals
Modality Types
- Text: Natural language, code, structured data
- Vision: Photos, diagrams, screenshots
- Audio: Speech, music, environmental sounds
- Video: Moving images combined with audio
Why Multimodal?
- Human communication is inherently multimodal
- A single modality misses contextual information
- Richer meaning can be extracted from combined signals
- Better fit for real-world applications
Vision-Language Models
Architectural Approaches
1. Contrastive Learning (CLIP style)
```
Image Encoder → Image Embedding
Text Encoder  → Text Embedding
Contrastive Loss: Match(image, text)
```
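As an illustrative sketch of this objective (assuming PyTorch and a batch of pre-computed, paired embeddings), a CLIP-style symmetric contrastive loss can be written as:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: (batch, batch)
    logits = image_emb @ text_emb.T / temperature

    # Matching image/text pairs sit on the diagonal
    targets = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy over both matching directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```

The loss pulls each image toward its own caption on the diagonal and pushes it away from every other caption in the batch.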
2. Generative (GPT-4V style)
```
Image → Vision Encoder → Visual Tokens
Visual Tokens + Text Tokens → LLM → Response
```
3. Cross-Attention Fusion
Image Features ←Cross-Attention→ Text Features
Vision Encoder Types
| Encoder | Architecture | Input resolution (px) | Key characteristic |
|---|---|---|---|
| ViT | Transformer | 224–1024 | Patch-based |
| CLIP ViT | Transformer | 336 | Contrastive |
| SigLIP | Transformer | 384 | Sigmoid loss |
| ConvNeXt | CNN | Flexible | Efficient |
Image Tokenization
Patch Embedding:
```
224×224 image → 14×14 patch grid → 196 visual tokens
Each patch: 16×16 pixels → Linear projection → Embedding
```
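A standard ViT-style implementation of this projection (PyTorch assumed; a strided convolution replaces explicit patch slicing plus a linear layer) might look like:

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into 16x16 patches and projects each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # Strided conv == slice patches + linear projection in one op
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) visual tokens
```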
Variable Resolution:
AnyRes approach (sketched below):
1. Divide the image into tiles
2. Encode each tile separately
3. Add a global thumbnail
4. Concatenate all tokens
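A minimal sketch of the tiling step (Pillow assumed; `anyres_tiles`, the tile size, and the thumbnail policy are illustrative, since each model family defines its own):

```python
from PIL import Image

def anyres_tiles(image_path, tile_size=336):
    """Hypothetical helper: split an image into tiles plus a global thumbnail."""
    img = Image.open(image_path)
    cols = -(-img.width // tile_size)   # ceiling division
    rows = -(-img.height // tile_size)

    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_size, r * tile_size,
                   min((c + 1) * tile_size, img.width),
                   min((r + 1) * tile_size, img.height))
            tiles.append(img.crop(box))

    # The global thumbnail preserves overall layout context
    thumbnail = img.resize((tile_size, tile_size))
    return tiles + [thumbnail]
```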
Multimodal LLM Implementation
GPT-4V Usage
```python
from openai import OpenAI
import base64

client = OpenAI()

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        # Media type must match the actual file format
                        "url": f"data:image/webp;base64,{encode_image('image.webp')}",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)
```
Claude 3 Vision
```python
from anthropic import Anthropic
import base64

client = Anthropic()

with open("image.webp", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        # Media type must match the actual file format
                        "media_type": "image/webp",
                        "data": image_data
                    }
                },
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ]
)
```
Audio Processing
Speech-to-Text (STT)
Whisper Model:
```python
from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en"
    )

print(transcript.text)
```
Text-to-Speech (TTS)
```python
# Reuses the OpenAI client from the STT example above
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Hello, I am an AI assistant."
)

response.stream_to_file("output.mp3")
```
Real-time Audio Pipeline
```
Microphone → VAD → Chunking → STT → LLM → TTS → Speaker
              ↓
       Voice Activity
          Detection
```
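As an illustrative sketch of the VAD and chunking stages (NumPy assumed; `frames` are arrays of audio samples normalized to [-1, 1]), a naive energy-based detector looks like this; production systems typically use trained models such as Silero VAD instead:

```python
import numpy as np

def is_speech(frame, threshold=0.01):
    """Treat a frame as speech if its RMS energy exceeds a threshold."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > threshold

def chunk_speech(frames, threshold=0.01):
    """Group consecutive speech frames into chunks for the STT step."""
    chunks, current = [], []
    for frame in frames:
        if is_speech(frame, threshold):
            current.append(frame)
        elif current:
            chunks.append(np.concatenate(current))
            current = []
    if current:
        chunks.append(np.concatenate(current))
    return chunks
```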
Video Understanding
Frame Sampling Strategies
1. Uniform Sampling:
```python
import cv2
import numpy as np

def uniform_sample(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frames.append(frame)

    cap.release()
    return frames
```
2. Keyframe Extraction:
```python
def extract_keyframes(video_path, threshold=30):
    # Find keyframes via scene-change detection (see the sketch below)
    pass
```
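One minimal way to fill in that stub (mean absolute frame difference with OpenCV; dedicated tools such as PySceneDetect are more robust):

```python
import cv2
import numpy as np

def extract_keyframes(video_path, threshold=30):
    """Naive scene-change detection: emit a frame whenever the mean
    absolute pixel difference from the previous frame exceeds threshold."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > threshold:
            keyframes.append(frame)
        prev_gray = gray

    cap.release()
    return keyframes
```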
Video-LLM Pipeline
```
Video → Frame Sampling → Per-frame Encoding → Temporal Aggregation → LLM
  ↓
Audio Extraction → STT → Text
```
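Without a dedicated video model, this pipeline can be approximated by sending several sampled frames to a vision LLM in a single request. The sketch below assumes the `uniform_sample` helper above and the OpenAI client from earlier examples:

```python
import base64
import cv2
from openai import OpenAI

client = OpenAI()

def describe_video(video_path, question="Summarize this video"):
    frames = uniform_sample(video_path, num_frames=8)

    content = [{"type": "text", "text": question}]
    for frame in frames:
        ok, jpeg = cv2.imencode(".jpg", frame)  # encode each frame as JPEG
        if not ok:
            continue
        b64 = base64.b64encode(jpeg.tobytes()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                          "detail": "low"}  # low detail keeps token cost down
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content
```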
Modality Fusion
Early Fusion
Combining modalities at model input:
[CLS] [IMG_1] ... [IMG_N] [SEP] [TXT_1] ... [TXT_M] [SEP]
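In embedding space this amounts to concatenating token embeddings along the sequence dimension (PyTorch sketch; the special tokens are learned embeddings):

```python
import torch

def early_fusion(visual_tokens, text_tokens, cls_token, sep_token):
    """Concatenate visual and text token embeddings into one sequence.
    visual_tokens, text_tokens: (batch, seq_len, embed_dim)
    cls_token, sep_token:       (batch, 1, embed_dim)"""
    return torch.cat(
        [cls_token, visual_tokens, sep_token, text_tokens, sep_token], dim=1
    )
```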
Late Fusion
Processing each modality separately and combining results:
```
Image → Image Model → Image Features ─┐
                                      ├→ Fusion Layer → Output
Text  → Text Model  → Text Features ──┘
```
Cross-Modal Attention
Attention between modalities:
```
Q = Text Features
K, V = Image Features
Cross_Attention(Q, K, V) = softmax(QK^T / √d) V
```
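A direct PyTorch translation of this formula (illustrative; real implementations add learned Q/K/V projections and multiple attention heads):

```python
import torch.nn.functional as F

def cross_attention(text_features, image_features, d_k):
    """Text queries attend over image keys/values.
    text_features:  (batch, text_len, d_k)
    image_features: (batch, img_len, d_k)"""
    q = text_features                      # queries come from the text side
    k = v = image_features                 # keys/values come from the image side
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)    # each text token's attention over patches
    return weights @ v                     # (batch, text_len, d_k)
```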
OCR and Document Understanding
Document AI Pipeline
```python
def process_document(image_path):
    # detect_layout, ocr_extract, parse_structure, and llm_analyze
    # are placeholder stages, not a specific library's API
    image = load_image(image_path)  # placeholder loader

    # 1. Layout Detection: headings, paragraphs, tables
    layout = detect_layout(image)

    # 2. OCR: extract text regions
    text_regions = ocr_extract(image)

    # 3. Structure Understanding
    structured_doc = parse_structure(layout, text_regions)

    # 4. LLM Analysis
    analysis = llm_analyze(structured_doc)

    return analysis
```
Table Extraction
```python
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": table_image_url}},
            {"type": "text", "text": "Extract this table in JSON format"}
        ]
    }]
)
```
Enterprise Multimodal Applications
1. Document Processing
- Invoice/receipt OCR
- Contract analysis
- Form data extraction
2. Visual Search
- Search by product image
- Finding similar images
- Visual Q&A
3. Content Moderation
- Inappropriate image detection
- Brand logo check
- Text + image consistency
4. Customer Support
- Screenshot analysis
- Visual troubleshooting
- Voice support
Performance Optimization
Image Preprocessing
```python
from PIL import Image
import io

def optimize_image(image_path, max_size=1024, quality=85):
    img = Image.open(image_path)

    # Resize so the longest side is at most max_size
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(d * ratio) for d in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # JPEG has no alpha channel, so convert before saving
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Compress in memory
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)

    return buffer.getvalue()
```
Batch Processing
```python
import asyncio

async def batch_image_analysis(images, batch_size=5):
    # analyze_image is a placeholder for an async per-image API call
    results = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size]
        tasks = [analyze_image(img) for img in batch]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
    return results
```
Cost Management
Token Calculation (Vision)
```
GPT-4V token cost:
- Low detail: 85 tokens/image
- High detail: 85 + 170 × tile_count

Example (2048×1024, high detail):
Tiles: ceil(2048/512) × ceil(1024/512) = 4 × 2 = 8
Tokens: 85 + 170 × 8 = 1445 tokens
```
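A small helper implementing the simplified formula above (illustrative; the real API also rescales oversized images before tiling, so treat this as an upper-bound estimate):

```python
import math

def vision_token_cost(width, height, detail="high",
                      base=85, per_tile=170, tile_size=512):
    """Estimate vision token cost from the simplified tile formula."""
    if detail == "low":
        return base
    tiles = math.ceil(width / tile_size) * math.ceil(height / tile_size)
    return base + per_tile * tiles

print(vision_token_cost(2048, 1024))  # 85 + 170 * 8 = 1445
```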
Optimization Strategies
- Adjust detail level: Do not use "high" unless necessary
- Reduce image size: Reduces token count
- Caching: Do not re-analyze the same image (see the sketch after this list)
- Batch operations: Reduce API call count
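A minimal sketch of the caching idea (content-hash keyed; `analyze_fn` stands in for whatever per-image analysis call you use):

```python
import hashlib

_analysis_cache = {}

def analyze_image_cached(image_bytes, analyze_fn):
    """Reuse any previous analysis result for byte-identical images."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _analysis_cache:
        _analysis_cache[key] = analyze_fn(image_bytes)
    return _analysis_cache[key]
```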
Conclusion
Multimodal AI is the closest artificial intelligence has come to human-like understanding. Combining image, text, and audio modalities makes it possible to build more powerful and useful AI applications.
At Veni AI, we develop multimodal AI solutions. Contact us for your projects.
