ML (O)Ops: What Data to Collect? (Part 3)
With Gaurav Sood

The first part of the series, “Improving and Deploying On-Device Models With Confidence,” is posted here. The second part, “Keeping Track of Changes,” is posted here.

Introduction

In the previous parts of this series, we discussed deployment challenges and change tracking in ML systems. Now, let’s focus on a crucial aspect of MLOps: what data to collect for effective monitoring and improvement of ML systems.

Types of Data to Collect

1. Model Performance Metrics
- Accuracy, precision, recall
- Latency and throughput
- Resource utilization
- Error rates and types

2. Input Data Statistics
- Distribution of features
- Data quality metrics
- Missing value patterns
- Data drift indicators

3. System Health Metrics
- CPU/GPU utilization
- Memory usage
- Network bandwidth
- Storage metrics

4. Operational Metrics
- Request volumes
- Queue lengths
- Cache hit rates
- Error logs

Data Collection Strategies

1. Real-time Monitoring
- Stream processing for immediate insights
- Live dashboards
- Alerting systems
- Anomaly detection

2. Batch Processing
- Daily aggregations
- Trend analysis
- Performance reports
- Resource optimization

3. Sampling Techniques
- Random sampling
- Stratified sampling
- Error-focused sampling
- Edge case collection

Best Practices

Data Storage
...
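To make two of the input data statistics listed earlier concrete (missing value patterns and data drift indicators), here is a minimal sketch in Python using only the standard library. It is an illustration, not the method used in this series: the function names `missing_rate` and `psi`, the choice of the Population Stability Index as the drift indicator, the equal-width binning, and the smoothing constant `eps` are all assumptions made for the example.

```python
import math
from collections import Counter


def missing_rate(values):
    """Fraction of entries that are None (hypothetical helper)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def psi(reference, live, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Assumptions of this sketch: bins are equal-width over the range of
    the reference sample, live values outside that range are clamped
    into the first or last bin, and eps avoids log(0) for empty bins.
    """
    ref = [v for v in reference if v is not None]
    cur = [v for v in live if v is not None]
    if not ref or not cur:
        raise ValueError("both samples need at least one non-missing value")

    lo, hi = min(ref), max(ref)
    # Fall back to a width of 1.0 when the reference sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(sample):
        counts = Counter(
            min(max(int((v - lo) / width), 0), n_bins - 1) for v in sample
        )
        return [counts.get(b, 0) / len(sample) for b in range(n_bins)]

    p = bin_fractions(ref)
    q = bin_fractions(cur)
    # Each term is non-negative, so PSI grows as the distributions diverge.
    return sum(
        (qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q)
    )


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]
    production = [v + 0.5 for v in training]  # simulated shift
    print("missing rate:", missing_rate([1.0, None, 3.0, None]))
    print("PSI (no shift):", psi(training, training))
    print("PSI (shifted):", psi(training, production))
```

A statistic like this could be computed per feature in a daily batch job and logged alongside the model performance metrics, so that a drop in accuracy can be checked against a change in the inputs. Any alerting threshold on the PSI value is something to tune per feature rather than a fixed rule.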