Pipeline

From upload to treatment prompt

Ten steps, four of them MadClip components, all running on a shared encoder.

Image Input → Quality Gate

Rejects underexposed, cropped, or wrong-orientation images before they reach the model.
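
A minimal sketch of such a gate, assuming a grayscale image scaled to [0, 1]. The function name, the thresholds, and the use of a minimum side length as a stand-in for "cropped" are illustrative assumptions, not the product's actual checks.

```python
import numpy as np

def quality_gate(image, min_mean=0.15, min_side=256, expect_landscape=True):
    """Return (ok, reason) for a 2D grayscale image with values in [0, 1].

    All thresholds are hypothetical placeholders.
    """
    h, w = image.shape
    if image.mean() < min_mean:          # too dark overall
        return False, "underexposed"
    if min(h, w) < min_side:             # crude proxy for a cropped image
        return False, "cropped"
    if expect_landscape and h > w:       # e.g. a panoramic stored in portrait
        return False, "wrong-orientation"
    return True, "ok"
```

Only images that return `(True, "ok")` would be passed on to the router.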

Modality Router (MMKD-CLIP zero-shot)

Classifies bitewing / panoramic / periapical / full-mouth series (FMS) / CBCT without a dedicated classifier.
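
Zero-shot routing in the CLIP style usually means comparing one image embedding against one text embedding per class and picking the closest. The sketch below assumes those embeddings already exist; `route_modality`, the temperature value, and the embedding inputs are hypothetical and do not reflect the MMKD-CLIP API.

```python
import numpy as np

MODALITIES = ["bitewing", "panoramic", "periapical", "FMS", "CBCT"]

def route_modality(image_emb, text_embs, temperature=0.01):
    """Pick the modality whose text embedding is closest to the image embedding.

    image_emb: (dim,) vector; text_embs: (5, dim), one row per modality prompt.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature       # scaled cosine similarities
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return MODALITIES[int(np.argmax(probs))], probs
```

Because the classes are defined by text prompts, adding a modality would mean adding a prompt rather than retraining a classifier.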

Shared DentVFM-2D Encoder

ViT-L/14, DINOv2-pretrained on 1.6M dental images — one backbone for every 2D modality.

C1: Cross-slice Attention

CBCT and full-mouth series only: attends across neighboring slices instead of treating each frame in isolation.
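
The idea can be sketched as scaled dot-product attention restricted to a window of nearby slices. This is a simplified illustration without learned projections; the function name and the `radius` parameter are assumptions.

```python
import numpy as np

def cross_slice_attention(feats, radius=2):
    """feats: (num_slices, dim) float array, one feature vector per slice.

    Each slice attends only to slices at most `radius` positions away.
    Returns (attended_feats, attention_weights).
    """
    s, d = feats.shape
    scores = feats @ feats.T / np.sqrt(d)
    idx = np.arange(s)
    too_far = np.abs(idx[:, None] - idx[None, :]) > radius
    scores = np.where(too_far, -np.inf, scores)   # mask distant slices
    scores -= scores.max(axis=1, keepdims=True)   # stable softmax per row
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats, weights
```

Each output row is a weighted mix of its neighbors' features, so a finding visible on adjacent slices can reinforce the slice in between.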

Modality-specific Detection Head

Each modality branches into its own fine-tuned head after the shared encoder.

C2: Consistency Filter

Cross-checks detections against neighboring views before they count as a finding.
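
One simple way to picture this check: a detection survives only if the same tooth and label also appear in an adjacent view. The data layout and the `min_support` rule below are assumptions for illustration; the actual filter may match detections differently.

```python
def consistency_filter(views, min_support=2):
    """views: list of per-view detection lists, ordered so neighbors are adjacent.

    Each detection is a (fdi, label, score) tuple. A detection is kept when it
    is seen in at least `min_support` views, counting its own view and the
    views immediately before and after it.
    """
    kept = []
    for i, dets in enumerate(views):
        neighbors = views[max(0, i - 1):i] + views[i + 1:i + 2]
        for fdi, label, score in dets:
            support = 1 + sum(
                any(f == fdi and l == label for f, l, _ in n) for n in neighbors
            )
            if support >= min_support:
                kept.append((i, fdi, label, score))
    return kept
```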

NMS + FDI Assignment

De-duplicates overlapping boxes and maps every detection to an FDI tooth number.
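
For reference, greedy non-maximum suppression followed by a nearest-tooth lookup can be sketched as follows. Boxes are `(x1, y1, x2, y2)`; the IoU threshold and the `tooth_centers` map (assumed to come from a separate tooth-localization step) are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thresh=0.5):
    """dets: list of (box, score). Keeps the highest-scoring box per overlap group."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept

def assign_fdi(box, tooth_centers):
    """tooth_centers: {fdi_number: (cx, cy)}. Returns the nearest tooth's FDI number."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return min(
        tooth_centers,
        key=lambda t: (tooth_centers[t][0] - cx) ** 2 + (tooth_centers[t][1] - cy) ** 2,
    )
```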

C3: Per-tooth Clustering

Groups multiple findings on one tooth into a single compound clinical pattern.
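
At its simplest, this grouping is a group-by on the FDI number. The sketch below assumes findings arrive as `(fdi, label)` pairs; the labels are examples only.

```python
from collections import defaultdict

def cluster_by_tooth(findings):
    """findings: iterable of (fdi, label). Returns {fdi: sorted tuple of labels}."""
    per_tooth = defaultdict(set)
    for fdi, label in findings:
        per_tooth[fdi].add(label)
    return {fdi: tuple(sorted(labels)) for fdi, labels in per_tooth.items()}
```

A tooth with two findings ends up with one compound pattern such as `("caries", "periapical lesion")`, which downstream steps can treat as a single unit.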

Overlay Renderer

Draws the color-coded annotation layer dentists actually see on the image.

C4: Treatment Prompt

Tier 1 uses templates for common patterns; Tier 2 uses DentVLM for ranked, reasoned options.
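
The two-tier dispatch can be sketched as a template lookup with a model fallback. The template text, the function name, and the `tier2` callable standing in for DentVLM are all placeholders, not the real prompts or interface.

```python
# Hypothetical Tier 1 table: compound pattern -> template string.
TIER1_TEMPLATES = {
    ("caries",): "Tooth {fdi}: caries detected. (placeholder template text)",
}

def treatment_prompt(fdi, pattern, tier2=None):
    """Return (tier, text). `tier2` is a stand-in callable for the Tier 2 model."""
    template = TIER1_TEMPLATES.get(tuple(sorted(pattern)))
    if template is not None:
        return 1, template.format(fdi=fdi)
    if tier2 is None:
        raise ValueError("no Tier 1 template and no Tier 2 model supplied")
    return 2, tier2(fdi, pattern)
```

Common patterns never reach the model under this arrangement; only patterns missing from the table fall through to Tier 2.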

MadClip deep-dive

What each component actually does

No math — just the problem it solves, how it solves it, and where it runs.

Training strategy

1 labeled CBCT = 500 free 2D training images

Joint 2D + 3D training means every expensive 3D annotation pays for itself many times over.

Phase 1

2D backbone pretraining

DentVFM-2D learns general dental representations across 1.6M images from public + licensed datasets.

Phase 2

DRR generation

Every labeled CBCT volume is projected into synthetic 2D digitally reconstructed radiographs (DRRs) at multiple angles.
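
A heavily simplified sketch of the projection step: rotate the volume about its vertical axis and integrate along one direction. It assumes a parallel-beam geometry and nearest-neighbor sampling, both simplifications; the function name and angle choices are illustrative, and how the 500 images per volume are actually produced is not specified here.

```python
import numpy as np

def drr(volume, angle_deg):
    """Parallel-beam projection of a (z, y, x) float volume.

    Rotates the volume about the z axis by `angle_deg` (nearest-neighbor
    sampling), then sums along x. Returns a (z, y) image.
    """
    z, y, x = volume.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (y - 1) / 2, (x - 1) / 2
    yy, xx = np.meshgrid(np.arange(y), np.arange(x), indexing="ij")
    # For each output voxel, find the source voxel by inverse rotation.
    src_y = np.cos(theta) * (yy - cy) - np.sin(theta) * (xx - cx) + cy
    src_x = np.sin(theta) * (yy - cy) + np.cos(theta) * (xx - cx) + cx
    iy, ix = np.rint(src_y).astype(int), np.rint(src_x).astype(int)
    inside = (iy >= 0) & (iy < y) & (ix >= 0) & (ix < x)
    rotated = np.zeros_like(volume)
    rotated[:, inside] = volume[:, iy[inside], ix[inside]]
    return rotated.sum(axis=2)
```

Calling `drr` over a sweep of angles, for example `[drr(vol, a) for a in np.linspace(0, 180, 8, endpoint=False)]`, yields several 2D views from one labeled volume. Reusing the 3D labels would additionally require projecting them through the same geometry.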

Phase 3

Joint 2D + 3D fine-tune

Real 2D radiographs and DRR-derived 2D images train the same encoder side by side.

Phase 4

Modality head specialization

Detection heads are fine-tuned per modality on top of the shared, now richly pretrained backbone.