Hexbyte Hacker News Computers
As Machine Learning (ML) is becoming an important part of every industry, the demand for Machine Learning Engineers (MLE) has grown dramatically. MLEs combine machine learning skills with software engineering knowhow to find high-performing models for a given application and handle the implementation challenges that come up — from building out training infrastructure to preparing models for deployment. New online resources have sprouted in parallel to train engineers to build ML models and solve the various software challenges encountered. However, one of the most common hurdles with new ML teams is maintaining the same level of forward progress that engineers are accustomed to with traditional software engineering.
The most pressing reason for this challenge is that the process of developing new ML models is highly uncertain at the outset. After all, it is difficult to know how well a model will perform by the end of a given training run, let alone what performance could be achieved with extensive tuning or different modeling assumptions.
Many types of professionals face similar situations: software and business developers, startups looking for product-market fit, or pilots maneuvering with limited information. Each of these professions has adapted a common framework to help their teams work productively through the uncertainty: agile/scrum for software development, “lean” startups, and the US Air Force’s OODA loop. MLEs can follow a similar framework to cope with uncertainty and deliver great products quickly.
The ML Engineering Loop
In this article, we’ll describe our conception of the “OODA Loop” of ML: the ML Engineering Loop, where ML Engineers iteratively
- Select an approach
to rapidly and efficiently discover the best models and adapt to the unknown. In addition, we will give concrete tips for each of these phases, as well as to optimize the process as a whole.
Success for an ML team often means delivering a highly performing model within given constraints — for example, one that achieves high prediction accuracy, while subject to constraints on memory usage, inference time, and fairness. Performance is defined by whichever metric is most relevant to the success of your end product, whether that be accuracy, speed, diversity of outputs, etc. For simplicity, we’ve elected to minimize “error rate” as our performance metric below.
When you are just starting to scope out a new project, you should accurately define success criteria, which you will then translate to model metrics. In product terms, what level of performance would a service need to be useful? For example, if we are recommending 5 articles to individual users on a news platform, how many of them do we need to be relevant, and how will we define relevance? Given this performance criterion and the data you have, what would be the simplest model you could build?
In product terms, what level of performance would a service need to be useful?
The purpose of the ML Engineering Loop is to put a rote mental framework around the development process, simplifying the decision making process to focus exclusively on the most important next steps. As practitioners progress in experience, the process becomes second nature and growing expertise enables rapid shifts between analysis and implementation without hesitation. That said, this framework is still immensely valuable for even the most experienced engineers when uncertainty increases — for example, when a model unexpectedly fails to meet requirements, when the teams’ goals are suddenly altered (e.g., the test set is changed to reflect changes in product needs), or as team progress stalls just short of the goal.
To bootstrap the loop described below, you should start with a minimal implementation that has very little uncertainty involved. Usually we want to “get a number” as quickly as possible — — to build up enough of the system so that we can evaluate its performance and begin iterating. This typically means:
- Setting up training, development and testing datasets, and
- Getting a simple model working.
For instance, if we’re building a tree detector to survey tree populations in an area, we might use an off-the-shelf training set from a similar Kaggle competition, and a hand-collected set of photos from the target area for development and test sets. We could then run logistic regression on the raw pixels, or run a pre-trained network (like ResNet) on the training images. The goal here is not to solve the project in one go, but to start our iteration cycle. Below are a few tips to help you do that.
On a good test set:
- Since the team is aiming to perform well on the test set, the test set is effectively a description of the team’s goal. Therefore, the test set should reflect the needs of the product or business. For example if you are building an app to detect skin conditions from selfies, feel free to train on any set of images, but make sure that your test set contains images that are as poorly lit and of poor quality as some selfies can be.
- Changing the test set alters the team’s goal, so it is helpful to fix the test set early and modify it only to reflect changes in project, product or business goals.
- Aim to make the test and development sets large enough that your performance metric will be accurate enough to make good distinctions between models. If the sets are too small, you will end up making decisions based on noisy results.
- Similarly, you should try to curate any labels or annotations as much as practical for the development and test sets. A mislabeled test set is about the same as an incorrectly specified product requirement.
- It’s helpful to know how well humans perform on the test set, or how well existing / competing systems perform. These give you a bound on the optimal error rate, the best possible performance you could achieve.
- Reaching parity with human test performance is often a good long-term goal for many tasks. In any event, the ultimate goal is to bring test performance as close to our guess for optimal performance as possible.
Regarding the development and training set:
- The development set is the team’s proxy for test performance that they can use to tune hyper-parameters. As a result, it should be from the same distribution as the test set, but ideally from a disjoint group of users/inputs to avoid data leakage. A good way to ensure this is to first curate a large pool of samples, then shuffle and split them into development and test sets afterward.
- If you imagine that production data will be noisy, however, make sure that you account for that noise in your training set, by using data augmentation or degradation. You cannot expect a model trained exclusively on sharp images to generalize to blurry ones.
Once you have an initial prototype, you should check its performance on the training, development and test sets. This marks the end of your first (degenerate) trip around the loop. Take stock of the gap between the test performance and the performance required for a useful product. Now it’s time to start iterating!
Identify the performance bottleneck
The analysis phase is like medical diagnosis: you’re equipped with a set of diagnostics that you can perform and your goal is to come up with the most likely diagnosis for what limits the performance of your model. In practice, there might be many different overlapping issues responsible for the current results, but your objective is to find the most glaring issues first so that you can resolve them quickly. Don’t get bogged down trying to develop a complete understanding of every shortcoming — — aim instead to understand the biggest factors since many of the smaller issues will change or even disappear as you make improvements to your model.
Below, we list a common set of diagnostics that you will use frequently along with some of the diagnoses. There is at least some art to selecting which diagnostics to run, but as you work your way around the ML Engineering Loop, you will gradually gain intuition for what to try.
A good starting point for every analysis is to look at your training, development, and test performance. We suggest putting code to do this at the end of every experiment to habituate yourself to looking at these numbers every time. On average, we will have: training error <= development set error <= test set error (if