Part 2 of 6 in the SCARCE-CXR series
2.1 Lanczos Resizing
The first practical problem with NIH ChestX-ray14 was
storage, and it came with a secondary problem that's easy to
miss. NIH images are natively 1024×1024. PadChest images
come from a Philips Digital Diagnost system at 3000×2992
pixels, about 8.6× the pixel area. If you train on NIH and
evaluate on PadChest without normalizing resolution, the
model sees a different effective field-of-view and feature
scale at test time than it saw during pretraining. The fix
for both problems is the same: resize everything to 256×256
grayscale before any training or evaluation runs.
For the resize itself I chose Lanczos resampling. Interstitial
patterns and granulomas are high-frequency
features: fine lines, small dots, subtle texture differences
between tissue types. The usual default, bilinear
interpolation, takes a linear weighted average of neighboring
pixels, which smooths away exactly that kind of detail. That's
the wrong trade when the pathology you care about is a few
pixels across. Lanczos costs more CPU, but this step runs
once offline before training even starts, so the cost doesn't
matter.
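To see why the two filters behave differently, it helps to compare their kernels. The sketch below is illustrative only: it evaluates the textbook Lanczos window with a = 3 (the support Pillow's LANCZOS filter uses) next to the triangle kernel behind bilinear interpolation. The function names are made up for this example and are not part of any library.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def lanczos_kernel(x: float, a: int = 3) -> float:
    """Lanczos window: sinc(x) * sinc(x / a) inside |x| < a, zero outside."""
    if abs(x) >= a:
        return 0.0
    return sinc(x) * sinc(x / a)


def triangle_kernel(x: float) -> float:
    """Triangle (tent) kernel used by bilinear interpolation."""
    return max(0.0, 1.0 - abs(x))


if __name__ == "__main__":
    # Print both kernels at a few offsets (in units of source pixels)
    for x in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
        print(f"x={x:3.1f}  lanczos={lanczos_kernel(x):+.4f}  "
              f"triangle={triangle_kernel(x):+.4f}")
```

The Lanczos weights go negative between offsets 1 and 2. Those negative lobes act as a mild sharpening term and are what helps the filter keep edges crisp. The triangle kernel is never negative, so it can only smooth.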
There's also an ordering trap. PadChest is 16-bit, and a
given scan might only use values 4,200 to 18,500 out of a
possible 65,535. Call PIL's convert("L") first and
it just divides everything by 256. That 14,300-value range becomes
~56 distinct gray levels before normalization ever runs. You can
stretch 56 levels back to 256, but you can't recover what you
threw away. Normalize on the raw values first, then convert.
# data/download.py: run once offline, before any training
import numpy as np
from PIL import Image

img = Image.open(path)  # don't convert yet; keep the raw bit depth
arr = np.array(img, dtype=np.float32)
lo, hi = float(arr.min()), float(arr.max())
if hi > lo:  # guard against dividing by zero on a constant image
    arr = (arr - lo) / (hi - lo) * 255  # stretch the used range onto 0-255 before quantizing
else:
    arr = np.zeros_like(arr)  # constant image: nothing to stretch
img_8bit = Image.fromarray(arr.astype(np.uint8))  # a 2-D uint8 array becomes mode "L"
img_8bit.resize((256, 256), Image.LANCZOS).save(out_path)
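The ordering trap is easy to check with plain integers. This is a small sketch, assuming the 8-bit conversion behaves as described above (integer division by 256) and using the hypothetical 4,200 to 18,500 scan range from the text. It counts how many distinct gray levels survive each order of operations.

```python
# Hypothetical scan that only uses values 4,200..18,500 of the 16-bit range
lo, hi = 4_200, 18_500
raw_values = range(lo, hi + 1)

# Order A: drop to 8 bits first (integer-divide by 256), normalize later
convert_first = {v // 256 for v in raw_values}

# Order B: stretch the used range onto 0..255 first, then quantize
normalize_first = {int((v - lo) / (hi - lo) * 255) for v in raw_values}

print(len(convert_first))    # distinct gray levels that survive order A
print(len(normalize_first))  # distinct gray levels that survive order B
```

Order A leaves 57 distinct levels here (in line with the ~56 estimate), while order B uses all 256.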
The storage numbers: raw PNGs are 22GB for NIH and 170GB for
PadChest. Pre-resizing to 256×256 got NIH down to 3.1GB (7×
reduction) and PadChest down to 856MB (~200× reduction). The
resize happens once on local hardware; the compressed
archives transfer to the cloud VM in under two minutes.
2.2 Optimizing Training Per Dollar
The bigger problem was throughput. I set up 14 dataloader
workers on the 16-vCPU L4 instance, reserving 1 vCPU for the
main training process and 1 for the OS. On the first
training run the W&B dashboard showed GPU utilization stuck
at ~60%. There was a visible stall between batches while the
dataloader caught up. All 14 workers were reading from the
same persistent disk and decompressing PNGs, and those reads
contend for one disk's bandwidth regardless of how many
workers you throw at it.
The fix was RAM caching. Load all 112,000 images into CPU
RAM at startup as a single uint8 tensor, share it across all
workers via share_memory_():
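# From data/dataloader.py
import numpy as np
import torch

# Inside the Dataset's __init__: decode every image once, up front
arrays = [_load_gray256(p) for p in image_paths]  # each a (256, 256) uint8 array
stacked = np.stack(arrays, axis=0)  # shape (N, 256, 256), dtype uint8
self._cache = torch.from_numpy(stacked).share_memory_()
# Workers index self._cache[idx], so there is no disk IO or per-sample pickling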
The share_memory_() call is doing the heavy lifting.
Without it, each worker process risks ending up with its own copy
of the tensor, multiplying RAM usage by up to the worker count.
With it, all 14 workers read from the same physical memory pages.
112k images at 256×256 uint8 is ~7.3GB uncompressed (the 3.1GB
figure above is the compressed PNG size on disk), which still fits
comfortably in the 64GB RAM on the L4 instance. GPU utilization
went from ~60% to ~95%. That's more training per dollar than any
other change.
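A back-of-the-envelope check of the cache footprint, using the approximate figures quoted above (112,000 images, 14 workers, 64GB of RAM). Note that the decoded uint8 tensor is larger than the PNG archive on disk, because PNG compression no longer applies once the images are in RAM. The constant names are just for this sketch.

```python
NUM_IMAGES = 112_000      # approximate NIH image count quoted above
HEIGHT = WIDTH = 256      # stored resolution
BYTES_PER_PIXEL = 1       # uint8 grayscale
NUM_WORKERS = 14
RAM_GB = 64

# Size of one decoded copy of the whole dataset
cache_bytes = NUM_IMAGES * HEIGHT * WIDTH * BYTES_PER_PIXEL
cache_gb = cache_bytes / 1e9

# Worst case if every worker held a private copy instead of sharing one
private_copies_gb = cache_gb * NUM_WORKERS

print(f"shared cache: {cache_gb:.1f} GB")
print(f"{NUM_WORKERS} private copies: {private_copies_gb:.1f} GB "
      f"(RAM available: {RAM_GB} GB)")
```

One shared copy fits with plenty of headroom. Fourteen private copies would not, which is why sharing the tensor matters more than the raw size of the cache.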
Lesson: Fix IO before anything else.
2.3 ResNet50 Backbone
ResNet50 is a residual network (Res for residual, Net for
neural network) with 50 layers. It consists of a convolutional
stem, four stages of residual bottleneck blocks, and global
average pooling. Each bottleneck block takes input x
and has some desired output H(x). But instead of learning H(x)
directly, the block adds a skip connection that passes x straight
through and only learns the small change F(x) = H(x) − x. The
output becomes F(x) + x = H(x). Without the skip connection, the
network must rebuild H(x) from x entirely, which is harder to
optimize and less stable in deep models. In chest X-rays, most
structure is consistent across images, so H(x) stays close to
x and F(x) stays near 0.
Bottleneck residual block
flowchart LR
x["x"] --> c1["1×1 conv\nreduce channels"]
c1 --> c2["3×3 conv\nspatial"]
c2 --> c3["1×1 conv\nexpand channels"]
c3 -->|"F(x)"| add["⊕"]
x -->|"skip connection"| add
add -->|"F(x) + x"| relu["ReLU"]
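The residual idea can be sketched in a few lines of plain Python. This is a toy, not the real block: it swaps the three convolutions for two small dense layers and skips batch norm, and all function names are made up for illustration. What it does show is the effect of the skip path when F(x) is exactly zero.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x) with the toy residual F(x) = w2 @ ReLU(w1 @ x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + a for f, a in zip(fx, x)])


def plain_block(x, w1, w2):
    """Same layers without the skip connection: y = ReLU(w2 @ ReLU(w1 @ x))."""
    return relu(linear(w2, relu(linear(w1, x))))


x = [0.5, 1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_block(x, zeros, zeros))  # all-zero F(x): the input passes through
print(plain_block(x, zeros, zeros))     # no skip path: the signal is wiped out
```

With all-zero weights the residual block returns its (non-negative) input unchanged, while the plain block outputs zeros. The identity mapping comes for free with the skip connection and has to be learned without it.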
Learning near-zero residuals is easier than having the conv
layers reproduce the identity from scratch, and the skip
path gives gradients a direct route to earlier layers. Four
stages stack these blocks (3 / 4 / 6 / 3 blocks at 256 / 512
/ 1024 / 2048 channels). Global average pooling collapses
the final 7×7 spatial map into a 2048-dimensional vector,
the representation every downstream task reads. The
projection heads and loss functions used during SSL are
discarded afterward; only this 2048-dimensional vector
matters.
ResNet50 full architecture
flowchart LR
inp["input\n224×224"] --> stem["stem\n7×7 conv /2\nmax pool /2\n56×56×64"]
stem --> s1["stage 1\n×3 blocks\n56×56×256"]
s1 --> s2["stage 2\n×4 blocks\n28×28×512"]
s2 --> s3["stage 3\n×6 blocks\n14×14×1024"]
s3 --> s4["stage 4\n×3 blocks\n7×7×2048"]
s4 --> gap["global\navg pool\n2048-d"]
gap --> out["SSL head\n(discarded)"]
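The shapes in the diagram follow from the strides alone. Here is a small sketch that reproduces them, assuming the standard ResNet50 layout (stride-2 stem conv, stride-2 max pool, stride 2 at the start of stages 2 to 4) and an input size divisible by 32. The function name is hypothetical.

```python
def resnet50_shapes(input_size=224):
    """Return (name, spatial_size, channels) after each part of a standard ResNet50."""
    shapes = []
    size = input_size // 2   # stem: 7x7 conv, stride 2
    size = size // 2         # stem: max pool, stride 2
    shapes.append(("stem", size, 64))
    # (number of blocks, output channels, stride of the stage's first block)
    stages = [(3, 256, 1), (4, 512, 2), (6, 1024, 2), (3, 2048, 2)]
    for i, (blocks, channels, stride) in enumerate(stages, start=1):
        size //= stride
        shapes.append((f"stage {i} (x{blocks} blocks)", size, channels))
    shapes.append(("global avg pool", 1, 2048))
    return shapes


for name, size, channels in resnet50_shapes(224):
    print(f"{name:24s} {size}x{size}x{channels}")
```

Calling `resnet50_shapes(256)` shows what happens if the stored 256×256 images are fed in without cropping: the final map is 8×8 instead of 7×7, and global average pooling still yields a 2048-d vector.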
2.4 Why ResNet and Not Others
I used an off-the-shelf ResNet because this project is
measuring SSL pretraining, not backbone architecture. A
custom backbone adds another uncontrolled variable. Using
ResNet50 exactly as the SSL paper authors used it means any
performance difference between methods is attributable to
the SSL objective, not the network.
Why not UNet? UNet is also widely used in medical imaging,
but it is a segmentation architecture, useful when you want
pixel-wise labels. Classification needs a single fixed-size
embedding instead. SparK does
use a UNet-style decoder, but only during pretraining for the
reconstruction task. Once pretraining finishes, the decoder is
discarded. Only the ResNet50 encoder (2048-d output) is kept for
downstream evaluation.
Why not ViT? Vision Transformers need scale.
MAE, DINO, and the best ViT SSL methods are validated on ImageNet
(1.28M images) and larger. The NIH corpus is 112,000 images. At
our scale a ViT backbone trained from scratch would likely
generalize poorly: the attention maps would struggle to learn
meaningful structure and would tend to overfit to superficial
patterns. Besides, ResNet's convolutional inductive
biases (local connectivity, translation equivariance, shared weights
across spatial positions) let it extract useful representations
from ~100k images, where ViTs typically need far more data.
Why not EfficientNet? EfficientNet was tuned
on ImageNet around compound scaling, which couples depth, width,
and input resolution. Our setting is different. Chest
X-rays are largely homogeneous, so layers mostly refine existing
features rather than learn new ones. At our scale and with cross-hospital
transfer, ResNet's simplicity and stability matter more than
EfficientNet's extra scaling efficiency.
Backbone choice sets capacity; augmentations decide what the
model learns. The next step is choosing which variations to
ignore.
2.5 Augmentation Strategy
The augmentation pipeline defines what factors the backbone
considers invariant. Augmentations themselves show the
network which differences are meaningful and which are
noise. For example, randomly changing brightness teaches the
network to treat lighting differences as irrelevant.
| Augmentation | ImageNet default | This project | Why |
| --- | --- | --- | --- |
| Crop scale | [0.2, 1.0] | [0.08, 1.0] | Conservative crops miss small findings. A fibrotic band or calcified granuloma can occupy 2–5% of image area. |
| Brightness jitter | 0.4 | 0.8 | kVp (tube voltage) controls overall X-ray brightness. Different hospitals run different protocols. NIH vs PadChest may differ by ±20%. |
| Contrast jitter | 0.4 | 0.8 | X-ray detectors have different dynamic ranges. Window/level adjustments vary by institution. |
| Gaussian noise | None | std=0.1 | X-ray quanta follow a Poisson distribution. Low-dose protocols produce noisier images. |
| Saturation/Hue | Yes | Removed | Chest X-rays are grayscale. Hue and saturation are no-ops on a single-channel image converted to RGB. |
| Blur kernel | 3–5px | 9–23px | Detector defocus and patient motion produce blur at a much larger scale than natural image noise. |
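The crop-scale row is easier to judge with numbers. This rough sketch assumes square crops (RandomResizedCrop also samples the aspect ratio, by default between 3/4 and 4/3), a finding that covers 2% of the image area, and a finding that lies entirely inside the crop. The helper name is made up for illustration.

```python
import math

IMAGE_SIDE = 256     # stored image size
OUTPUT_SIDE = 224    # network input size after the crop is resized


def crop_stats(scale, finding_fraction):
    """For a square crop covering `scale` of the image area, return the crop side
    in pixels, the zoom factor when resized to OUTPUT_SIDE, and the share of the
    crop that a finding fills (assuming it lies entirely inside the crop)."""
    crop_side = math.sqrt(scale) * IMAGE_SIDE
    magnification = OUTPUT_SIDE / crop_side
    finding_share = min(1.0, finding_fraction / scale)
    return crop_side, magnification, finding_share


for scale in (0.2, 0.08):
    side, mag, share = crop_stats(scale, finding_fraction=0.02)
    print(f"min scale {scale}: crop side {side:.0f}px, {mag:.1f}x zoom, "
          f"finding fills {share:.0%} of crop")
```

At the smallest crop the [0.08, 1.0] range allows, a 2% finding fills about a quarter of the view, compared with about a tenth at a minimum scale of 0.2.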
# From ssl_methods/data/transforms.py
from torchvision import transforms

# AddGaussianNoise is a project-specific transform, not a torchvision class
ssl_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),  # crop to 224 from 256 storage; standard ResNet input size
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.8, contrast=0.8),  # simulates kVp variation and detector range
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # detector defocus / patient motion
    transforms.ToTensor(),
    AddGaussianNoise(std=0.1),  # Gaussian stand-in for quantum (Poisson) noise in low-dose protocols
    transforms.Normalize(mean=[0.518], std=[0.254]),  # NIH dataset statistics, single-channel
    # no Hue/Saturation: chest X-rays are grayscale
])
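The pipeline above uses AddGaussianNoise without showing it. The project's actual implementation isn't reproduced here; the following is a hypothetical minimal version, assuming it operates on the float tensor that ToTensor() returns and adds zero-mean noise with the given standard deviation.

```python
import torch


class AddGaussianNoise:
    """Add Gaussian noise to a tensor image (hypothetical re-implementation)."""

    def __init__(self, mean: float = 0.0, std: float = 0.1):
        self.mean = mean
        self.std = std

    def __call__(self, tensor: torch.Tensor) -> torch.Tensor:
        # randn_like draws noise with the same shape, dtype, and device as the input
        return tensor + torch.randn_like(tensor) * self.std + self.mean

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(mean={self.mean}, std={self.std})"


if __name__ == "__main__":
    torch.manual_seed(0)
    image = torch.full((1, 224, 224), 0.5)  # stand-in for a ToTensor() output in [0, 1]
    noisy = AddGaussianNoise(std=0.1)(image)
    print(noisy.shape, float(noisy.std()))
```

This version does not clamp the result, so pixel values can leave [0, 1] before normalization. Whether to clamp is a design choice the sketch leaves open.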
More importantly, augmentation is what drives cross-hospital
and cross-dataset generalization. By making the backbone
invariant to brightness, contrast, and noise variation
during pretraining on NIH, we are implicitly making it
invariant to equipment differences between NIH and PadChest.
The backbone has no way of knowing whether the brightness
variation in a training batch came from augmentation or from
a different scanner. That's the point.