Building Together, Separately: Challenges of Software Development

With Gaurav Sood. Microservices, Macroproblems: A single page on DoorDash can make upward of 1000 gRPC calls (see the interview). For many engineers, upward of a thousand network calls nicely illustrate the chaos and inefficiency unleashed by microservices. Engineers implicitly diff the 1000+ gRPC calls against the orders of magnitude fewer calls that would be made by a system designed by an architect looking at the problem afresh today. More than a thousand gRPC calls also seem like a perfect recipe for blowing up latency. There are more items in the debit column. Microservices can also increase the costs of monitoring, debugging, and deployment (and hence cause greater downtime and worse performance). ...

December 25, 2024 · Atul Dhingra

The Smallest Loss Compute Can Buy

With Gaurav Sood and Chris Alexiuk. The most expensive portion of model training today is GPU time. Given that, it is useful to ask how best to spend the compute budget. More formally, the optimization problem is: minimize test loss given a FLOPs budget. To achieve the smallest loss, there are many levers we can pull, including: (1) the amount of data; (2) the number of parameters (for a given amount of compute, there is an implicit trade-off between this and the amount of data); (3) optimization hyperparameters, for example, learning rate, learning rate schedule, batch size, and optimizer; (4) model architecture, which covers the width-to-depth ratio as well as deeper aspects of the architecture, for example, RETRO and MoE models like switch transformers and MoE with expert choice; (5) the precision in which the parameters and hyperparameters are stored; and (6) data quality, which, as some of the recent work shows, matters a lot. We could reformulate the optimization problem to make it more general. For instance, rather than use FLOPs or GPU time, we may want to use dollars. This opens up opportunities to think about how to purchase GPU time most cheaply, e.g., using spot GPUs. We can abstract the optimization problem further. If we knew the ROI of the prediction task, we could ask what the profit-maximizing loss is given a constraint on latency. Inference ROI is roughly a function of accuracy (or another performance metric of choice) and the compute cost of inference. ...
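The trade-off between data and parameters under a fixed FLOPs budget can be made concrete with a small sketch. Everything specific here is an assumption that does not come from the post: the parametric loss form L(N, D) = E + A / N^alpha + B / D^beta, the placeholder constants (illustrative, not fitted to any model), the common approximation that training a dense transformer costs about 6 * N * D FLOPs, and the hypothetical helper names `loss` and `best_allocation`.

```python
# Illustrative placeholder constants for an assumed parametric loss
# L(N, D) = E + A / N**ALPHA + B / D**BETA. These are NOT fitted values.
E, A, B = 1.7, 400.0, 400.0
ALPHA, BETA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Assumed test loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def best_allocation(flops_budget: float, n_grid: int = 200):
    """Grid-search the parameter count N under a fixed FLOPs budget.

    Assumes training cost C ~= 6 * N * D, so choosing N fixes the
    number of tokens D = C / (6 * N).
    Returns (loss, n_params, n_tokens) for the best grid point.
    """
    best = None
    for i in range(n_grid):
        # Log-spaced candidates for N between 1e6 and 1e12 parameters.
        n_params = 10 ** (6 + 6 * i / (n_grid - 1))
        n_tokens = flops_budget / (6 * n_params)
        if n_tokens < 1:
            continue  # budget too small for this model size
        candidate = (loss(n_params, n_tokens), n_params, n_tokens)
        if best is None or candidate < best:
            best = candidate
    return best


if __name__ == "__main__":
    for budget in (1e21, 1e23):
        best_loss, n_params, n_tokens = best_allocation(budget)
        print(f"C={budget:.0e}: N={n_params:.2e}, D={n_tokens:.2e}, "
              f"loss={best_loss:.3f}")
```

Under these assumptions, a larger budget moves the best grid point toward both more parameters and more tokens. Swapping the budget from FLOPs to dollars, as the excerpt suggests, would only change how `flops_budget` is computed.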

August 15, 2023 · Atul Dhingra

ML (O)Ops: What Data to Collect? (Part 3)

With Gaurav Sood. The first part of the series, “Improving and Deploying On-Device Models With Confidence,” is posted here. The second part, “Keeping Track of Changes,” is posted here. Introduction: In the previous parts of this series, we discussed deployment challenges and change tracking in ML systems. Now, let’s focus on a crucial aspect of MLOps: what data to collect for effective monitoring and improvement of ML systems. Types of Data to Collect: 1. Model Performance Metrics: accuracy, precision, recall; latency and throughput; resource utilization; error rates and types. 2. Input Data Statistics: distribution of features; data quality metrics; missing value patterns; data drift indicators. 3. System Health Metrics: CPU/GPU utilization; memory usage; network bandwidth; storage metrics. 4. Operational Metrics: request volumes; queue lengths; cache hit rates; error logs. Data Collection Strategies: 1. Real-time Monitoring: stream processing for immediate insights; live dashboards; alerting systems; anomaly detection. 2. Batch Processing: daily aggregations; trend analysis; performance reports; resource optimization. 3. Sampling Techniques: random sampling; stratified sampling; error-focused sampling; edge case collection. Best Practices: Data Storage ...

June 16, 2021 · Atul Dhingra

ML (O)Ops! Keeping Track of Changes (Part 2)

With Gaurav Sood. The first part of the series, “Improving and Deploying On-Device Models With Confidence,” is posted here. Tracking Changes in ML Systems: ML engineers spend a lot of time keeping track of changes. The changes can be to the model architecture, hyperparameters, training data, or the deployment infrastructure. Each of these changes can affect the model’s performance in ways that are hard to predict. And when something goes wrong, it can be hard to figure out which change caused the problem. ...

March 22, 2021 · Atul Dhingra

ML (O)Ops! Improving and Deploying On-Device Models With Confidence (Part 1)

With Gaurav Sood. It is well known that ML engineers today spend most of their time on things that have little to do with machine learning. They spend time on technically unsophisticated but important work like deploying models and keeping track of experiments: in short, operations. Atul and I dive into the reasons behind the status quo and propose solutions, starting with issues to do with on-device deployments. ...

February 21, 2021 · Atul Dhingra