From upload to treatment prompt
Ten steps, four of them MadClip components (C1 to C4), all built around one shared encoder.
Image Input → Quality Gate
Rejects underexposed, cropped, or wrong-orientation images before they reach the model.
Modality Router (MMKD-CLIP zero-shot)
Classifies bitewing / panoramic / periapical / FMS / CBCT without a dedicated classifier.
Shared DentVFM-2D Encoder
ViT-L/14, DINOv2-pretrained on 1.6M dental images — one backbone for every 2D modality.
C1: Cross-slice Attention
CBCT and full-mouth series only — attends across neighboring slices, not a single frame.
Modality-specific Detection Head
Each modality branches into its own fine-tuned head after the shared encoder.
C2: Consistency Filter
Cross-checks detections against neighboring views before they count as a finding.
NMS + FDI Assignment
De-duplicates overlapping boxes and maps every detection to an FDI tooth number.
C3: Per-tooth Clustering
Groups multiple findings on one tooth into a single compound clinical pattern.
Overlay Renderer
Draws the color-coded annotation layer dentists actually see on the image.
C4: Treatment Prompt
Tier 1 fills templates for common patterns; Tier 2 calls DentVLM for ranked, reasoned options.
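To make the "NMS + FDI Assignment" and "C3: Per-tooth Clustering" steps concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the MadClip implementation: the detection dictionary layout, the 0.5 IoU threshold, the tooth-box lookup, and the function names `nms`, `assign_fdi`, and `cluster_per_tooth` are all hypothetical. It uses plain greedy non-maximum suppression and assigns each detection to the tooth box it overlaps most.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already-kept box of the same label too strongly."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(det["label"] != k["label"]
               or iou(det["box"], k["box"]) < iou_threshold
               for k in kept):
            kept.append(det)
    return kept


def assign_fdi(det, teeth):
    """Return the FDI number of the tooth box with the highest IoU
    (assumed rule), or None if the detection touches no tooth."""
    best = max(teeth, key=lambda fdi: iou(det["box"], teeth[fdi]))
    return best if iou(det["box"], teeth[best]) > 0 else None


def cluster_per_tooth(detections, teeth):
    """De-duplicate detections, then group the surviving labels by tooth."""
    groups = defaultdict(list)
    for det in nms(detections):
        fdi = assign_fdi(det, teeth)
        if fdi is not None:
            groups[fdi].append(det["label"])
    return {fdi: sorted(labels) for fdi, labels in groups.items()}


# Hypothetical tooth boxes keyed by FDI number, and raw detections.
teeth = {16: (0, 0, 100, 100), 17: (100, 0, 200, 100)}
detections = [
    {"label": "caries", "box": (10, 10, 40, 40), "score": 0.9},
    {"label": "caries", "box": (12, 12, 42, 42), "score": 0.6},  # duplicate
    {"label": "periapical_lesion", "box": (30, 60, 70, 95), "score": 0.8},
    {"label": "caries", "box": (120, 20, 150, 50), "score": 0.7},
]
patterns = cluster_per_tooth(detections, teeth)
print(patterns)
```

In this toy input the two near-identical caries boxes on tooth 16 collapse into one, so tooth 16 ends up with a two-finding pattern and tooth 17 with a single finding. A per-tooth dictionary like this is the kind of structure a Tier 1 template lookup could key on.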
What each component actually does
No math — just the problem it solves, how it solves it, and where it runs.
1 labeled CBCT = 500 free 2D training images
Joint 2D + 3D training means every expensive 3D annotation pays for itself many times over.
2D backbone pretraining
DentVFM-2D learns general dental representations across 1.6M images from public + licensed datasets.
DRR generation
Every labeled CBCT volume is projected into synthetic 2D digitally reconstructed radiographs (DRRs) at multiple angles.
Joint 2D + 3D fine-tune
Real 2D radiographs and DRR-derived 2D images train the same encoder side by side.
Modality head specialization
Detection heads fine-tune per modality on top of the shared, now richly pretrained backbone.
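The DRR generation step can be pictured with a small Python sketch. It assumes a simple parallel-beam geometry, nearest-neighbour sampling, and rotation about the vertical (z) axis only. The function names `drr` and `drr_views` are hypothetical, and the output is a raw line integral rather than a calibrated X-ray intensity. A production renderer would more likely use cone-beam geometry and interpolation, so treat this as an illustration of the idea, not the pipeline's renderer.

```python
import math


def drr(volume, angle_deg):
    """Parallel-beam projection of a volume indexed volume[z][y][x].

    Rays are rotated by angle_deg about the z axis; each output pixel [z][u]
    is the sum of the voxels the ray passes through (nearest-neighbour).
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    n = max(nx, ny)                        # detector width and ray length
    cx, cy = (nx - 1) / 2.0, (ny - 1) / 2.0
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    offsets = [i - (n - 1) / 2.0 for i in range(n)]
    image = []
    for z in range(nz):
        row = []
        for u in offsets:                  # position along the detector
            total = 0.0
            for t in offsets:              # position along the ray
                x = round(cx + u * c - t * s)
                y = round(cy + u * s + t * c)
                if 0 <= x < nx and 0 <= y < ny:
                    total += volume[z][y][x]
            row.append(total)
        image.append(row)
    return image


def drr_views(volume, angles_deg):
    """Render one synthetic 2D image per requested angle."""
    return {angle: drr(volume, angle) for angle in angles_deg}


# Toy volume: a single 3x3 axial slice.
volume = [[[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]]
views = drr_views(volume, [0, 90])
print(views[0])   # rays run along y, so each pixel is a column sum
print(views[90])  # rays run along x, so each pixel is a row sum
```

Running the same projection over a 3D label mask would carry the annotation into each synthetic view, which is one plausible way a single labeled volume yields many labeled 2D images.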