Thoughts on Model Training Infrastructure

Some notes from recent experiences building model training at scale.

Xander Dunn, 31 January 2024

These thoughts run across the stack, from things only infrastructure engineers need to worry about, to things the experimenters need to worry about. This isn't intended to be a comprehensive playbook. These are some of the things we've thought about and implemented recently in our training infrastructure. I'm sure there are better ways to do some of these things, so don't hesitate to send me your thoughts!

Consider Your Scale

If you've only got 8 machines, then the below will not be very useful to you. You'll want the logging and that's about it. If you've got 10,000 machines, then the below is not enough. For example, Google's training of Gemini Ultra across multiple clusters and multiple data centers would require additional tricks. DiLoCo and DiPaCo may be some of these tricks. This is a scale I have not yet worked on. We're in the middle: our machines number in the hundreds, and our largest single cluster has 8,000 devices.

Cluster Management


Datadog is very good at hiding the daily spend limits. To set one: Logs → Configuration → Indexes → main → Edit → Set Daily Quota.
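If you'd rather not click through the UI, the same quota can probably be set through the Logs Indexes API. The following is a minimal sketch, not something we run in production. It assumes the v1 "update an index" endpoint, the `daily_limit` field, and the default US API host; check all three against the current API docs for your site. The helper name `build_quota_request` and its parameters are made up for illustration. It only builds the request, so you can inspect it before sending.

```python
import json
import urllib.request

# Assumption: the default US site. Other Datadog sites use a different host.
DATADOG_API = "https://api.datadoghq.com"


def build_quota_request(index_name, filter_query, daily_limit, api_key, app_key):
    """Build (but do not send) a PUT request that sets an index's daily quota.

    Hypothetical helper for illustration; the field names follow my reading
    of the Logs Indexes API and should be verified before use.
    """
    body = {
        # The update call expects the index filter to be supplied as well,
        # so pass the index's existing filter query here.
        "filter": {"query": filter_query},
        # Maximum number of log events indexed per day.
        "daily_limit": daily_limit,
    }
    return urllib.request.Request(
        url=f"{DATADOG_API}/api/v1/logs/config/indexes/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
    )
```

To apply it, pass the result to `urllib.request.urlopen`. As far as I can tell, the update replaces the whole index configuration rather than patching it, so fetch the current config first and resend any exclusion filters and retention settings you want to keep.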

Experiment Layer

Uh ohhhh
Example of our execution profiling across pipeline parallelism.