The Smallest Loss Compute Can Buy
With Gaurav Sood and Chris Alexiuk

The most expensive part of model training today is GPU time. Given that, it is useful to ask how best to spend the compute budget. More formally, the optimization problem is: minimize test loss given a FLOPs budget. To achieve the smallest loss, there are many levers we can pull, including:

- Amount of data.
- Number of parameters. Given a particular amount of compute, there is an implicit trade-off between this and the previous point (see the sketch at the end of this section).
- Optimization hyperparameters, e.g., learning rate, learning rate schedule, batch size, optimizer, etc.
- Model architecture:
  - Width-to-depth ratio.
  - Deeper aspects of model architecture, e.g., RETRO, MoE models like Switch Transformers, MoE with expert choice, etc.
- Precision in which the parameters and hyperparameters are stored.
- Data quality. As some of the recent work shows, data quality matters a lot.

We could reformulate the optimization problem to make it more general. For instance, rather than using FLOPs or GPU time, we may want to use dollars. This opens up opportunities to think about how to purchase GPU time most cheaply, e.g., using spot GPUs.

We can abstract the optimization problem out further. If we knew the ROI of the prediction task, we could ask what the profit-maximizing loss is given a constraint on latency. Inference ROI is roughly a function of accuracy (or another performance metric of choice) and the compute cost of inference. ...
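To make the data-vs-parameters trade-off concrete, here is a minimal sketch, assuming training compute is approximated as C ≈ 6·N·D (N parameters, D tokens) and that test loss follows a Chinchilla-style form L(N, D) = E + A/N^α + B/D^β. The budget, constants, and exponents below are illustrative placeholders, not fitted values from any particular run.

```python
# Minimal sketch: under a fixed FLOPs budget, sweeping model size trades
# parameters against training tokens; the assumed loss curve then has a
# compute-optimal point. All constants here are illustrative assumptions.
import numpy as np

C = 1e21                        # assumed FLOPs budget
E, A, B = 1.69, 406.4, 410.7    # illustrative loss-curve constants
alpha, beta = 0.34, 0.28        # illustrative exponents

N = np.logspace(7, 11, 400)     # candidate parameter counts
D = C / (6.0 * N)               # tokens implied by C ~= 6 * N * D
loss = E + A / N**alpha + B / D**beta

best = np.argmin(loss)
print(f"Compute-optimal under these assumptions: "
      f"N ~= {N[best]:.2e} params, D ~= {D[best]:.2e} tokens, "
      f"loss ~= {loss[best]:.3f}")
```

The point of the sketch is only that, once compute is fixed, model size and data size are not independent choices: making the model larger forces training on fewer tokens, and the loss-minimizing split depends on the shape of the loss curve.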