Vision Framework Computer Vision

Guides you through implementing computer vision: subject segmentation, hand/body pose detection, person detection, text recognition (OCR), barcode/QR scanning, document scanning, and combining Vision APIs to solve complex problems.

Overview

The Vision framework provides computer vision capabilities for:

  • Subject segmentation - Isolate foreground objects from backgrounds
  • Hand pose detection - 21 landmarks per hand for gesture recognition
  • Body pose detection - 19 joints (2D) or 17 joints (3D) for fitness/action classification
  • Person segmentation - Separate masks for up to 4 people
  • Face detection - Bounding boxes and detailed landmarks
  • Text recognition - Fast or accurate OCR with language support
  • Barcode/QR detection - 20+ symbologies, expanded across request revisions
  • Document scanning - Edge detection, perspective correction, structured extraction (iOS 26+)
  • Live scanning - DataScannerViewController for real-time text/barcode (iOS 16+)

When to Use This Skill

Use when you need to:

  • ☑ Isolate subjects from backgrounds (subject lifting)
  • ☑ Detect and track hand poses for gestures
  • ☑ Detect and track body poses for fitness/action classification
  • ☑ Segment multiple people separately
  • ☑ Exclude hands from object bounding boxes (combining APIs)
  • ☑ Choose between VisionKit and the Vision framework
  • ☑ Combine Vision with Core Image for compositing
  • ☑ Recognize text in images (OCR)
  • ☑ Scan barcodes and QR codes
  • ☑ Scan documents with perspective correction
  • ☑ Build live camera scanning (DataScannerViewController)

Key Decision Trees

API Selection

What do you need to do?

Isolate subject(s) from background?
├─ Need system UI → VisionKit (ImageAnalysisInteraction)
├─ Need custom pipeline/HDR → Vision (VNGenerateForegroundInstanceMaskRequest)
└─ Need to EXCLUDE hands → Combine subject mask + hand pose

Segment people?
├─ All people in one mask → VNGeneratePersonSegmentationRequest
└─ Separate mask per person → VNGeneratePersonInstanceMaskRequest (up to 4; see the sketch after these trees)

Detect hand pose/gestures?
└─ 21 hand landmarks → VNDetectHumanHandPoseRequest

Detect body pose?
├─ 2D normalized landmarks → VNDetectHumanBodyPoseRequest
├─ 3D real-world coordinates → VNDetectHumanBodyPose3DRequest
└─ Action classification → Body pose + CreateML model

Recognize text?
├─ Static image → VNRecognizeTextRequest (fast or accurate)
├─ Live camera → DataScannerViewController (iOS 16+)
└─ Need custom words → VNRecognizeTextRequest.customWords

Detect barcodes?
├─ Static image → VNDetectBarcodesRequest
├─ Live camera → DataScannerViewController (iOS 16+)
└─ Need specific symbologies → Set .symbologies property

Scan documents?
├─ Need system UI → VNDocumentCameraViewController (iOS 13+)
├─ Need structured data (iOS 26+) → RecognizeDocumentsRequest
└─ Programmatic edges → VNDetectDocumentSegmentationRequest
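
For the separate-mask-per-person branch, a minimal sketch (iOS 17+); `image` as a CGImage and the compositing step are assumptions:

```swift
import Vision

// One mask per person, up to four people (iOS 17+).
let request = VNGeneratePersonInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

if let observation = request.results?.first {
    for instance in observation.allInstances {
        // A full-resolution mask covering just this one person.
        let mask = try observation.generateScaledMaskForImage(
            forInstances: IndexSet(integer: instance),
            from: handler
        )
        // `mask` is a CVPixelBuffer, ready for Core Image compositing.
    }
}
```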

Common Use Cases

Isolate Object While Excluding Hand

A common request: getting a bounding box around an object held in a hand, without including the hand itself.

Problem: VNGenerateForegroundInstanceMaskRequest is class-agnostic and treats hand+object as one subject.

Solution: Combine subject mask with hand pose detection to create exclusion mask.

See the full skill for implementation details.
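
Below is a minimal sketch of the combination, under assumptions not in the original: a single hand, a 0.3 confidence cutoff, 0.03 padding around the hand, and a single-channel Float32 mask (verify the pixel format at runtime):

```swift
import Vision
import CoreVideo

// Sketch: subject mask + hand pose → object box that excludes the hand.
func objectBoxExcludingHand(in image: CGImage) throws -> CGRect? {
    let subjectRequest = VNGenerateForegroundInstanceMaskRequest()
    let handRequest = VNDetectHumanHandPoseRequest()
    handRequest.maximumHandCount = 1

    let handler = VNImageRequestHandler(cgImage: image)
    try handler.perform([subjectRequest, handRequest])

    guard let subject = subjectRequest.results?.first else { return nil }

    // Class-agnostic mask that covers hand + object together.
    let mask = try subject.generateScaledMaskForImage(
        forInstances: subject.allInstances, from: handler)

    // Normalized rect around confident hand landmarks (lower-left origin).
    var handRect: CGRect?
    if let hand = handRequest.results?.first {
        for point in try hand.recognizedPoints(.all).values where point.confidence > 0.3 {
            let p = CGRect(origin: point.location, size: .zero)
            handRect = handRect?.union(p) ?? p
        }
        handRect = handRect?.insetBy(dx: -0.03, dy: -0.03)  // pad around the hand
    }

    // Accumulate a box over mask pixels that fall outside the hand rect.
    CVPixelBufferLockBaseAddress(mask, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(mask, .readOnly) }
    let width = CVPixelBufferGetWidth(mask)
    let height = CVPixelBufferGetHeight(mask)
    let rowBytes = CVPixelBufferGetBytesPerRow(mask)
    guard let base = CVPixelBufferGetBaseAddress(mask) else { return nil }

    var box: CGRect?
    for y in 0..<height {
        let row = (base + y * rowBytes).assumingMemoryBound(to: Float32.self)
        for x in 0..<width where row[x] > 0.5 {
            let nx = CGFloat(x) / CGFloat(width)
            let ny = 1 - CGFloat(y) / CGFloat(height)  // flip to lower-left origin
            if let hr = handRect, hr.contains(CGPoint(x: nx, y: ny)) { continue }
            let p = CGRect(x: nx, y: ny, width: 0, height: 0)
            box = box?.union(p) ?? p
        }
    }
    return box  // normalized, lower-left origin; convert before UIKit drawing
}
```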

VisionKit Simple Subject Lifting

Add system-like subject lifting UI with just a few lines:

```swift
import VisionKit

let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)
```
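
On its own the interaction has nothing to lift: it also needs an ImageAnalysis produced by ImageAnalyzer. A sketch, assuming an async context, a UIImage named `image`, and the `.visualLookUp` configuration choice:

```swift
// Run the analyzer, then hand its result to the interaction.
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.visualLookUp])
interaction.analysis = try await analyzer.analyze(image, configuration: configuration)
```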

Hand Gesture Recognition

Detect pinch gestures for custom camera controls:

```swift
import Vision

// Detect hands, then measure thumb–index distance for a pinch.
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1  // Track one hand; cheaper

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

if let observation = request.results?.first {
    let thumbTip = try observation.recognizedPoint(.thumbTip)
    let indexTip = try observation.recognizedPoint(.indexTip)

    // Locations are normalized, so the threshold is resolution-independent.
    let distance = hypot(
        thumbTip.location.x - indexTip.location.x,
        thumbTip.location.y - indexTip.location.y
    )
    let isPinching = distance < 0.05  // Tune for your use case
}
```

Text Recognition (OCR)

Recognize text in images with fast or accurate modes:

```swift
import Vision

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate  // or .fast for lower latency
request.recognitionLanguages = ["en-US"]
request.usesLanguageCorrection = true

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

// Take the top candidate from each detected text region.
let observations = request.results ?? []
let text = observations.compactMap { $0.topCandidates(1).first?.string }
```
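
Each observation's boundingBox is normalized (0–1) with a lower-left origin. A sketch of mapping it into pixel space with Vision's conversion helper (`image` is the same CGImage as above):

```swift
// VNImageRectForNormalizedRect maps a normalized, lower-left-origin
// rect into pixel coordinates; flip Y afterwards if you draw in
// UIKit's top-left-origin space.
if let box = observations.first?.boundingBox {
    let pixelRect = VNImageRectForNormalizedRect(box, image.width, image.height)
    print(pixelRect)
}
```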

Live Camera Scanning (DataScannerViewController)

Scan barcodes and text in real-time with iOS 16+ VisionKit:

```swift
import VisionKit

// Not every device supports data scanning; check before presenting.
guard DataScannerViewController.isSupported,
      DataScannerViewController.isAvailable else { return }

let scanner = DataScannerViewController(
    recognizedDataTypes: [
        .barcode(symbologies: [.qr, .ean13]),
        .text(textContentType: .URL)
    ],
    qualityLevel: .balanced
)

scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}
```
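
A minimal delegate sketch for tap handling; `ScannerHostViewController` is a hypothetical presenting controller:

```swift
extension ScannerHostViewController: DataScannerViewControllerDelegate {
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didTapOn item: RecognizedItem) {
        switch item {
        case .text(let text):
            print("Tapped text: \(text.transcript)")
        case .barcode(let barcode):
            print("Tapped barcode: \(barcode.payloadStringValue ?? "unknown")")
        @unknown default:
            break
        }
    }
}
```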

Common Pitfalls

  • ❌ Processing on main thread (blocks UI)
  • ❌ Ignoring confidence scores (low confidence = unreliable)
  • ❌ Forgetting to convert coordinates (lower-left vs top-left origin)
  • ❌ Setting maximumHandCount too high (performance impact)
  • ❌ Using ARKit when Vision suffices (offline processing)
  • ❌ Using .fast text recognition when accuracy matters
  • ❌ Not checking DataScannerViewController.isSupported before using
  • ❌ Processing every video frame (skip frames, as sketched below)
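
A sketch addressing the threading and frame-skipping pitfalls together, assuming AVFoundation capture, an every-3rd-frame cadence, and a hand-pose request (all illustrative):

```swift
import AVFoundation
import Vision

final class FrameProcessor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let handPoseRequest = VNDetectHumanHandPoseRequest()
    private var frameIndex = 0

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        frameIndex += 1
        // Skip frames: Vision rarely needs every frame at 30–60 fps.
        guard frameIndex % 3 == 0,
              let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // This callback already runs on the queue passed to
        // setSampleBufferDelegate(_:queue:), so Vision work stays off main.
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
        try? handler.perform([handPoseRequest])
        // Consume handPoseRequest.results here; dispatch UI updates to main.
    }
}
```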

Platform Support

| API | Minimum Version |
| --- | --- |
| Subject segmentation (instance masks) | iOS 17+ |
| VisionKit subject lifting | iOS 16+ |
| Hand pose | iOS 14+ |
| Body pose (2D) | iOS 14+ |
| Body pose (3D) | iOS 17+ |
| Person instance segmentation | iOS 17+ |
| Text recognition (VNRecognizeTextRequest) | iOS 13+ |
| Barcode detection (VNDetectBarcodesRequest) | iOS 11+ |
| DataScannerViewController | iOS 16+ |
| Document camera (VNDocumentCameraViewController) | iOS 13+ |
| RecognizeDocumentsRequest (structured) | iOS 26+ |


Released under the MIT License