0.1 TL;DR
When a never-before-seen disease appears in a hospital, you might have only a few days to build a classifier, with as few as 23 labeled examples to work with. Standard supervised training needs thousands of labeled examples to generalize. Self-supervised learning (SSL) offers a fix: pretrain a backbone on a large pool of unlabeled data first so it already understands the domain. Then, when you hand it 23 labeled examples, it only needs to learn the new pathological pattern, not medical anatomy from scratch.
I tested four SSL methods on 112k unlabeled NIH chest X-rays, then evaluated each backbone on 10 rare diseases at shot counts from 1 to 50 labeled examples per class, using both linear probing and gradient finetuning. Everything was trained on a single L4 GPU funded by Google Cloud free trial credits.
Adapting four different SSL methods to this use case meant hitting a different failure mode each time. MoCo v2's representations collapsed on homogeneous X-ray data and needed a variance regularization term to recover. BarlowTwins hit OOM errors at every viable batch size with ResNet50 and had to drop to ResNet18. SparK's per-stage sparse simulation was simplified to input-level masking. DINO flatlined from epoch 0 and never produced meaningful representations.
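The variance regularization that rescued MoCo v2 is covered in detail in Part 3. As a preview, a VICReg-style variance term can be sketched in a few lines of PyTorch; the values below are illustrative, not the exact hyperparameters used in this series:

```python
import torch
import torch.nn.functional as F

def variance_regularization(z, gamma=1.0, eps=1e-4):
    """VICReg-style variance term: penalize embedding dimensions whose
    batch standard deviation falls below a target gamma, which pushes
    back against representations collapsing to a constant vector.
    z: (batch, dim) embeddings. gamma and eps are illustrative values."""
    std = torch.sqrt(z.var(dim=0) + eps)   # per-dimension std over the batch
    return F.relu(gamma - std).mean()      # hinge: only penalize std < gamma

# Healthy embeddings incur a small penalty; collapsed ones a large one.
healthy = torch.randn(256, 128)
collapsed = torch.zeros(256, 128)
print(variance_regularization(healthy), variance_regularization(collapsed))
```

In practice the term is added to the contrastive loss with a small weight, so the encoder is nudged to keep every embedding dimension spread out across the batch.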
On the linear probe, SSL pretraining on domain-specific chest X-rays consistently outperforms ImageNet features at low shot counts. Under gradient finetuning, the gap narrows as more labeled data lets ImageNet features adapt through gradient descent, but SSL pretraining still comes out ahead. MoCo v2 is the strongest overall, with BarlowTwins close behind despite using a 4x smaller backbone. SparK's generative pretraining produces weaker linear-probe representations but recovers much of the gap under gradient finetuning.
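For readers unfamiliar with the two evaluation protocols: a linear probe trains only a classifier head on frozen backbone features, so it measures the raw quality of the pretrained representations, while gradient finetuning lets the backbone adapt too. A minimal PyTorch sketch of the difference (names and learning rates are hypothetical, not the exact setup from Part 5):

```python
import torch
import torch.nn as nn

def build_evaluator(backbone: nn.Module, feat_dim: int, n_classes: int,
                    finetune: bool):
    """Illustrative sketch of the two few-shot evaluation protocols.
    - Linear probe (finetune=False): backbone frozen, only the head trains.
    - Gradient finetune (finetune=True): backbone weights also get gradients."""
    for p in backbone.parameters():
        p.requires_grad = finetune  # frozen for the probe, trainable otherwise
    head = nn.Linear(feat_dim, n_classes)
    params = list(head.parameters()) + (
        list(backbone.parameters()) if finetune else []
    )
    optimizer = torch.optim.AdamW(params, lr=1e-4 if finetune else 1e-3)
    return nn.Sequential(backbone, head), optimizer
```

The probe is cheap and isolates representation quality; finetuning usually wins at higher shot counts but can overfit badly at 1-5 shots.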
0.2 Series Overview
Part 1: Motivation for this Guide. What Do You Do When You Only Have 23 Labeled X-Rays?
Part 2: Pretraining Setup for Robustness and Invariance. Optimizing X-Ray Training Per Dollar: RAM Caches, ResNets, and Augmentations.
Part 3: How SSL Works: MoCo + BarlowTwins (And How to Fix Collapse). Contrastive Learning is Lazy. Why MoCo Collapsed and How VICReg Fixed It.
Part 4: How SSL Works: DINO + SparK (And Dealing with Plateauing). Stop Blindly Applying ImageNet SSL to Medical Data. What DINO and SparK Taught Me.
Part 5: Probing and Finetuning. 340GB of Manual Downloads, Disease Selection, Domain Gaps, and Probing vs Finetuning with Grad-CAM Considerations.
Part 6: Analysis + Future Directions. MoCo v2 vs. BarlowTwins vs. SparK. Results on Medical SSL (With Grad-CAMs).