Part 2 of 6 in the SCARCE-CXR series
2.1 Lanczos Resizing
The first practical problem with NIH ChestX-ray14 was storage, and it came with a secondary problem that's easy to miss. NIH images are natively 1024×1024. PadChest images come from a Philips Digital Diagnost system at 3000×2992 pixels, nearly 9× the pixel area. If you train on NIH and evaluate on PadChest without normalizing resolution, the model sees a different effective field-of-view and feature scale at test time than it saw during pretraining. The fix for both problems is the same: resize everything to 256×256 grayscale before any training or evaluation runs.
Of the standard resizing filters, I chose Lanczos. Interstitial patterns and granulomas are high-frequency features: fine lines, small dots, subtle texture differences between tissue types. The most common alternative, bilinear interpolation, is a weighted average of neighboring pixels, which smooths away exactly those features; that's the wrong trade-off when the pathology you care about is a few pixels across. Lanczos, a windowed-sinc filter, preserves far more of that detail. It costs more CPU, but this step runs once offline before training even starts, so the cost doesn't matter.
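The difference between the two filters is visible in their kernels. This is a numpy sketch (not from the project's code) of the 1-D Lanczos kernel with the usual a=3 window next to bilinear's triangle kernel: bilinear only ever averages (non-negative weights), while Lanczos has negative lobes that subtract neighboring values and so keep edges and fine lines from washing out.

```python
import numpy as np

def lanczos_kernel(x, a=3):
    """Windowed sinc: sinc(x) * sinc(x/a) for |x| < a, else 0."""
    x = np.asarray(x, dtype=np.float64)
    out = np.sinc(x) * np.sinc(x / a)
    out[np.abs(x) >= a] = 0.0
    return out

def bilinear_kernel(x):
    """Triangle (tent) kernel: 1 - |x| for |x| < 1, else 0."""
    x = np.asarray(x, dtype=np.float64)
    return np.clip(1.0 - np.abs(x), 0.0, None)

xs = np.linspace(-3, 3, 601)
lz = lanczos_kernel(xs)
bl = bilinear_kernel(xs)

print(lz.min() < 0)   # True: Lanczos has negative lobes (edge preservation)
print(bl.min() >= 0)  # True: bilinear can only blur, never sharpen
```

Those negative lobes are also why Lanczos can overshoot near hard edges (ringing), which is an acceptable trade for radiographs where texture is the signal.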
There's also an ordering trap. PadChest is 16-bit, and a given scan might only use values 4,200 to 18,500 out of a possible 65,535. Call PIL's convert("L") first and it just divides everything by 256. That 14,300-value range becomes ~56 distinct gray levels before normalization ever runs. You can stretch 56 levels back to 256, but you can't recover what you threw away. Normalize on the raw values first, then convert.
# data/download.py: run once offline, before any training
from PIL import Image
import numpy as np

img = Image.open(path)                  # don't convert yet; keep raw bit depth
arr = np.array(img, dtype=np.float32)
lo, hi = arr.min(), arr.max()
if hi > lo:                             # 16-bit scans rarely use the full 0-65535 range
    arr = (arr - lo) / (hi - lo) * 255  # normalize before quantizing to keep all 256 levels
img_8bit = Image.fromarray(arr.astype("uint8"), mode="L")
img_8bit.resize((256, 256), Image.LANCZOS).save(out_path)

The storage numbers: raw PNGs are 22GB for NIH and 170GB for PadChest. Pre-resizing to 256×256 got NIH down to 3.1GB (a 7× reduction) and PadChest down to 856MB (a ~200× reduction). The resize happens once on local hardware; the compressed archives transfer to the cloud VM in under two minutes.
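To make the ordering trap concrete, here's a small numpy simulation (not from the repo) of a 16-bit scan using only the 4,200–18,500 range described above. The "wrong" path mimics the divide-by-256 quantization the text attributes to converting first; the "right" path min-max normalizes on raw values before quantizing.

```python
import numpy as np

# Simulate a 16-bit scan that only uses values 4,200..18,500.
raw = np.arange(4200, 18501, dtype=np.uint16)

# Wrong order: quantize to 8-bit first (integer-divide by 256,
# the behavior described for converting before normalizing).
quantized_first = (raw // 256).astype(np.uint8)

# Right order: min-max normalize on the raw values, then quantize.
lo, hi = raw.min(), raw.max()
normalized_first = ((raw - lo) / (hi - lo) * 255).astype(np.uint8)

print(len(np.unique(quantized_first)))   # 57: only ~56 gray levels survive
print(len(np.unique(normalized_first)))  # 256: the full 8-bit range survives
```

Stretching those 57 levels back across 0–255 afterward would change the spacing of the gray values but never create new ones.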
2.2 Optimizing Training Per Dollar
The bigger problem was throughput. I set up 14 dataloader workers on the 16-vCPU L4 instance, reserving 1 vCPU for the main training process and 1 for the OS. On the first training run the W&B dashboard showed GPU utilization stuck at ~60%. There was a visible stall between batches while the dataloader caught up. All 14 workers were hitting persistent disk and decompressing PNGs, and that path is bottlenecked at the disk layer regardless of how many workers you throw at it.
The fix was RAM caching: load all 112,000 images into CPU RAM at startup as a single uint8 tensor and share it across all workers via share_memory_():
# From data/dataloader.py
arrays = [_load_gray256(p) for p in image_paths]
stacked = np.stack(arrays, axis=0)
self._cache = torch.from_numpy(stacked).share_memory_()
# Workers index self._cache[idx]: no disk IO, no per-batch pickling

The share_memory_() call is doing the heavy lifting.
Without it, each worker risks materializing its own copy of the tensor (the DataLoader forks worker processes), multiplying RAM usage by the worker count. With it, all 14 workers read from the same physical memory pages. 112k images at 256×256 uint8 is ~7.3GB, which fits comfortably in the 64GB of RAM on the L4 instance. GPU utilization went from ~60% to ~95%. That's more training per dollar than any other change.
Lesson: Fix IO before anything else.
2.3 ResNet50 Backbone
ResNet50 (Res for residual, Net for neural network, 50 layers) is four stages of residual bottleneck blocks followed by global average pooling. Each bottleneck block takes input x with some desired output H(x). Instead of learning H(x) directly, the block adds a skip connection that passes x straight through, so the conv layers only need to learn the small change F(x) = H(x) − x. The output becomes F(x) + x = H(x). Without the skip connection, the network must rebuild H(x) from x entirely, which is harder to optimize and less stable in deep models. In chest X-rays, most structure is consistent across images, so H(x) stays close to x and F(x) stays near 0.
flowchart LR
x["x"] --> c1["1×1 conv\nreduce channels"]
c1 --> c2["3×3 conv\nspatial"]
c2 --> c3["1×1 conv\nexpand channels"]
c3 -->|"F(x)"| add["⊕"]
x -->|"skip connection"| add
add -->|"F(x) + x"| relu["ReLU"]
Learning near-zero residuals is easier than having the conv layers reproduce the identity from scratch, and the skip path gives gradients a direct route to earlier layers. Four stages stack these blocks (3 / 4 / 6 / 3 blocks at 256 / 512 / 1024 / 2048 channels). Global average pooling collapses the final 7×7 spatial map into a 2048-dimensional vector, the representation every downstream task reads. The projection heads and loss functions used during SSL are discarded afterward; only this 2048-dimensional vector matters.
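The "identity comes for free" claim is easy to verify numerically. This toy numpy block (dense matrices instead of convs and batch norm; the skip logic is the same) shows that when the residual branch's weights are zero, the block reproduces its input exactly:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Toy residual block: out = relu(F(x) + x), with F(x) = W2 @ relu(W1 @ x).
    Real bottlenecks use 1x1/3x3/1x1 convs + batch norm, but the skip is identical."""
    return relu(W2 @ relu(W1 @ x) + x)

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=8))   # nonnegative activations, as after a prior ReLU
W1 = np.zeros((8, 8))            # a "learned" residual that stayed near zero
W2 = np.zeros((8, 8))

out = residual_block(x, W1, W2)
print(np.allclose(out, x))       # True: F(x) = 0 makes the block an identity map
```

A plain (non-residual) block with zero weights would instead output all zeros, which is exactly the "rebuild H(x) from scratch" burden the skip connection removes.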
flowchart LR
inp["input\n224×224"] --> stem["stem\n7×7 conv /2\nmax pool /2\n56×56×64"]
stem --> s1["stage 1\n×3 blocks\n56×56×256"]
s1 --> s2["stage 2\n×4 blocks\n28×28×512"]
s2 --> s3["stage 3\n×6 blocks\n14×14×1024"]
s3 --> s4["stage 4\n×3 blocks\n7×7×2048"]
s4 --> gap["global\navg pool\n2048-d"]
gap --> out["SSL head\n(discarded)"]
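The shapes in the flowchart can be sanity-checked with the standard conv output-size formula and ResNet50's usual strides (a quick arithmetic sketch, not project code):

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a conv or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=7, stride=2, pad=3)  # stem 7x7 conv /2 -> 112
size = conv_out(size, kernel=3, stride=2, pad=1)  # 3x3 max pool /2  -> 56

# Stage 1 keeps 56x56; stages 2-4 each halve the spatial map.
channels = [256, 512, 1024, 2048]
shapes = []
for i, ch in enumerate(channels):
    if i > 0:
        size //= 2
    shapes.append((size, size, ch))

print(shapes)  # [(56, 56, 256), (28, 28, 512), (14, 14, 1024), (7, 7, 2048)]
embedding_dim = shapes[-1][2]  # global average pooling over 7x7 -> 2048-d vector
```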
2.4 Why ResNet and Not Others
I used an off-the-shelf ResNet because this project is measuring SSL pretraining, not backbone architecture. A custom backbone adds another uncontrolled variable. Using ResNet50 exactly as the SSL paper authors used it means any performance difference between methods is attributable to the SSL objective, not the network.
Why not UNet? UNet is common in medical imaging, but it's a segmentation architecture, built for pixel-wise labels. We want a classification backbone that produces a single fixed-size embedding. SparK does use a UNet-style decoder, but only during pretraining for the reconstruction task; once pretraining finishes, the decoder is discarded and only the ResNet50 encoder (2048-d output) is kept for downstream evaluation.
Why not ViT? Vision Transformers need scale. MAE, DINO, and the best ViT SSL methods are validated on ImageNet (1.28M images) and larger. The NIH corpus is 112,000 images. At our scale a ViT backbone trained from scratch would struggle: the attention maps wouldn't learn meaningful structure and the model would latch onto superficial patterns instead. ResNet's convolutional inductive biases (local connectivity, translation equivariance, shared weights across spatial positions) let it extract useful representations from 100k images in a way ViTs can't without far more data.
Why not EfficientNet? EfficientNet was tuned on ImageNet and adds compound scaling and tighter coupling between depth, width, and resolution. Our setting is different. Chest X-rays are mostly homogeneous, so layers mostly refine existing features rather than learn new ones. At our scale and with cross-hospital transfer, ResNet's simplicity and stability matter more than the extra scaling efficiency of EfficientNet.
Backbone choice sets capacity; augmentations decide what the model learns. The next step is choosing which variations to ignore.
2.5 Augmentation Strategy
The augmentation pipeline defines which factors the backbone treats as invariant: augmentations show the network which differences are meaningful and which are noise. Randomly changing brightness, for example, teaches the network to treat lighting differences as irrelevant.
| Augmentation | ImageNet default | This project | Why |
|---|---|---|---|
| Crop scale | [0.2, 1.0] | [0.08, 1.0] | Aggressive crops force the model to recognize findings from small regions. A fibrotic band or calcified granuloma can occupy 2–5% of image area. |
| Brightness jitter | 0.4 | 0.8 | kVp (tube voltage) controls overall X-ray brightness. Different hospitals run different protocols. NIH vs PadChest may differ by ±20%. |
| Contrast jitter | 0.4 | 0.8 | X-ray detectors have different dynamic ranges. Window/level adjustments vary by institution. |
| Gaussian noise | None | std=0.1 | X-ray quanta follow a Poisson distribution. Low-dose protocols produce noisier images. |
| Saturation/Hue | Yes | Removed | Chest X-rays are grayscale. Hue and saturation are no-ops on a single-channel image converted to RGB. |
| Blur kernel | 3–5px | 9–23px | Detector defocus and patient motion produce blur at a much larger scale than natural image noise. |
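The "hue and saturation are no-ops" row is checkable with the stdlib colorsys module (a standalone sketch, not the project's pipeline): a grayscale pixel replicated into RGB has zero saturation, so rotating its hue and converting back returns the identical pixel.

```python
import colorsys

def shift_hue(r, g, b, delta):
    """Rotate a pixel's hue by delta (fraction of the color wheel) via an HSV round-trip."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return colorsys.hsv_to_rgb((h + delta) % 1.0, s, v)

gray = (0.5, 0.5, 0.5)                     # grayscale replicated into RGB: r == g == b
shifted = shift_hue(*gray, delta=0.3)
print(shifted)                             # (0.5, 0.5, 0.5): hue jitter changed nothing
```

With saturation pinned at zero there is no chroma for hue to rotate, which is why those two jitters were dropped from the pipeline entirely rather than merely reduced.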
# From ssl_methods/data/transforms.py
ssl_transform = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.08, 1.0)), # crop to 224 from 256 storage; standard ResNet input size
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(brightness=0.8, contrast=0.8), # simulates kVp variation & detector range
transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)), # detector defocus / patient motion
transforms.ToTensor(),
AddGaussianNoise(std=0.1), # X-ray quanta: Poisson noise in low-dose protocols
transforms.Normalize(mean=[0.518], std=[0.254]), # NIH dataset statistics, single-channel
# no Hue/Saturation: chest X-rays are grayscale
])

More importantly, augmentation is what buys cross-hospital and cross-dataset generalization. By making the backbone invariant to brightness, contrast, and noise variation during pretraining on NIH, we implicitly make it invariant to equipment differences between NIH and PadChest. The backbone has no way of knowing whether the brightness variation in a training batch came from augmentation or from a different scanner. That's the point.