How to Reduce Cold Start for Stable Diffusion 3.5

Outerport can load Stable Diffusion 3.5 fast by hiding latency through interlaced memory reads.

November 7, 2024

Stable Diffusion 3.5 from Stability AI lets you generate images at even better quality, but with a 16.5 GB file size, cold start times are not great. Loading a file of this size on AWS EC2 machines can take close to a minute.

This post is similar in flavor to our blog post on loading FLUX models, but we've since implemented improvements that make the initial load from disk even faster by using buffered, multithreaded reads from disk, which we describe in more detail below.

In summary: Outerport can load Stable Diffusion 3.5 models in 1.5 seconds for cache-to-GPU, and 5.4 seconds for disk-to-GPU (and is bottlenecked almost entirely by the disk read speed).

How does Outerport work?

From the user's perspective, it's a single line of code:

outerport.load("sd3.5_large.safetensors")

See our other blog post for the details of what this call does, but in short, it communicates with a Rust daemon that moves the model weights for you very quickly and keeps them persistent and cached.

When the model is already cached, this call can load a Stable Diffusion 3.5 model in 1.5 seconds.

The catch is that the model needs to be cached, meaning you either start Outerport with the Stable Diffusion 3.5 model pre-loaded, or you have already called this once and are now loading the model from the daemon a second time.
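To get a feel for the two paths, here is a minimal timing sketch. Only outerport.load itself is real API from the snippet above; the timing scaffolding and printed labels are our own illustration.

import time
import outerport

# First call: the daemon has to read the weights from disk (a cold load).
start = time.time()
outerport.load("sd3.5_large.safetensors")
print(f"cold load: {time.time() - start:.1f} s")

# Second call: the daemon already holds the weights in its cache, so this
# should come back in roughly 1.5 seconds for Stable Diffusion 3.5.
start = time.time()
outerport.load("sd3.5_large.safetensors")
print(f"cached load: {time.time() - start:.1f} s")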

In our previous implementation, loading Stable Diffusion 3.5 from disk to GPU took around 7 seconds. We've pushed an update that loads from disk to GPU in 5.5 seconds (a ~30% improvement), which is also the same speed as loading from disk to RAM alone.

How does this work?

In most implementations of model loading, the transfer from RAM to GPU only starts after the transfer from disk to RAM has finished.

However, most model formats store models as named tensors, along with the memory pages that correspond to those named tensors. Once a named tensor is fully loaded from disk to RAM, its transfer to the GPU can start immediately while the process keeps loading the remaining memory pages from the file. We can hide latency by interlacing the disk-to-RAM transfer with the RAM-to-GPU transfer.

An implementation looks like two thread pools that communicate with each other through a channel queue. One pool of threads loads memory pages from the Safetensors file in parallel, marks named tensors as complete, and pushes them onto the queue. Another thread takes completed named tensors off the queue and starts the GPU transfer for each one.
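As a rough illustration of this pattern, here is a minimal Python sketch. Outerport's actual implementation lives in the Rust daemon; the safetensors and torch calls, the worker count, and the file path below are our own illustrative choices, not Outerport's code.

import queue
import threading
from concurrent.futures import ThreadPoolExecutor

import torch
from safetensors import safe_open

PATH = "sd3.5_large.safetensors"
DONE = object()  # sentinel, one per producer thread

def read_tensors(names, q):
    # Producer: read each named tensor from disk into RAM, then push it onto the queue.
    with safe_open(PATH, framework="pt", device="cpu") as f:
        for name in names:
            q.put((name, f.get_tensor(name)))
    q.put(DONE)

def copy_to_gpu(q, out, n_producers):
    # Consumer: as soon as a named tensor is complete in RAM, start its GPU copy,
    # overlapping with the remaining disk reads. Note that non_blocking only truly
    # overlaps when the source is pinned memory; the buffered version further down
    # uses a pinned staging buffer for this reason.
    remaining = n_producers
    while remaining:
        item = q.get()
        if item is DONE:
            remaining -= 1
            continue
        name, cpu_tensor = item
        out[name] = cpu_tensor.to("cuda", non_blocking=True)

def load_interlaced(n_workers=4):
    with safe_open(PATH, framework="pt", device="cpu") as f:
        names = list(f.keys())
    q, out = queue.Queue(), {}
    consumer = threading.Thread(target=copy_to_gpu, args=(q, out, n_workers))
    consumer.start()
    with ThreadPoolExecutor(n_workers) as pool:
        for i in range(n_workers):
            pool.submit(read_tensors, names[i::n_workers], q)
    consumer.join()
    torch.cuda.synchronize()  # wait for the asynchronous copies to finish
    return out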

This naive implementation, however, would actually be very slow, since each named tensor can be quite small, and opening a CUDA context and launching a kernel for each named tensor creates a large amount of overhead (to be specific, in our experiments the naive version was 3-4x slower than optimal). So another technique is needed: buffering, so that a sufficient number of tensors are marked complete before a batched write to the GPU happens.

We experimented with different "page sizes" for this batched write and found that 64 MB pages give optimal performance.
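Sketching just the consumer side of the earlier example with this buffering added (again an illustrative Python version rather than Outerport's Rust implementation; only the 64 MB figure comes from the text above, and DONE and the producer threads are reused from the previous sketch):

PAGE_SIZE = 64 * 1024 * 1024  # ~64 MB batches performed best in our experiments

def flush_batch(batch, out):
    # Pack the buffered tensors into one pinned staging buffer, issue a single
    # host-to-device copy, then carve the GPU buffer back into named tensors.
    total = sum(t.numel() * t.element_size() for _, t in batch)
    staging = torch.empty(total, dtype=torch.uint8, pin_memory=True)
    offset, layout = 0, []
    for name, t in batch:
        nbytes = t.numel() * t.element_size()
        staging[offset:offset + nbytes] = t.contiguous().reshape(-1).view(torch.uint8)
        layout.append((name, offset, nbytes, t.dtype, t.shape))
        offset += nbytes
    gpu = staging.to("cuda", non_blocking=True)
    for name, off, nbytes, dtype, shape in layout:
        out[name] = gpu[off:off + nbytes].view(dtype).view(shape)

def copy_to_gpu_buffered(q, out, n_producers):
    # Consumer: buffer completed tensors until roughly one "page" (64 MB) has
    # accumulated, then do one batched write instead of many tiny copies.
    remaining, batch, batch_bytes = n_producers, [], 0
    while remaining:
        item = q.get()
        if item is DONE:
            remaining -= 1
            continue
        name, cpu_tensor = item
        batch.append((name, cpu_tensor))
        batch_bytes += cpu_tensor.numel() * cpu_tensor.element_size()
        if batch_bytes >= PAGE_SIZE:
            flush_batch(batch, out)
            batch, batch_bytes = [], 0
    if batch:  # flush whatever is left once all producers are done
        flush_batch(batch, out)

Swapping copy_to_gpu_buffered in for copy_to_gpu in the previous sketch keeps the same producer threads while amortizing the per-tensor transfer overhead.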

The effect of this is that the "disk to RAM" step becomes the bottleneck entirely, because the RAM-to-GPU transfer is comparatively fast and runs in parallel with the "disk to RAM" step.

Why does this matter?

Cold start times are annoying: even if you optimize cross-process and across-time-slice model loading with our daemon cache, auto-scaling GPU clusters still have to suffer the boot-up time to preload models into the daemon cache. We are interested in developing technology that squeezes every bit of performance out of model serving, starting with these cold start times!

How do we get in on this?

If you're interested in using this technology, please reach out! We can also help with benchmarking your current infrastructure or provide hands-on services to help make your image generation pipelines faster.

If you want to know all the nitty-gritty details of how we made it so fast (even compared to alternatives like memory file systems), also feel free to reach out via the form or email.

Follow us on X or on LinkedIn to stay in touch with more to come!



© 2024 Genban, Inc.
