How an Internal Competition Boosted Our Machine Learning Skills

Fuel your AI innovation while having fun

Ayla Kangur
Towards Data Science


We’re big fans of open collaboration to learn, grow, and spark innovation, but sometimes you need to feel the heat of a competitor to push yourself forward. With our company’s internal AI competition, we did just that. We selected a relevant challenge, and our engineers competed against each other in small teams to develop a machine learning model that was both the best performing and the most efficient.

That’s right: We used a trade-off here because for real-world AI problems, you can’t always grab the biggest model with the highest performance. There are cost and time constraints to consider.

The competition was part of Slimmer AI’s beloved AI Fellowship program. This program carves out and structures dedicated R&D time in which our machine learning engineers collaborate, experiment, learn, and make progress on state-of-the-art techniques to solve big problems. The aim is both to strengthen their personal development and to advance our near-future company goals. We work in quarterly cycles, so we committed a full three-month cycle to this competition.

Ultimately, the format of the competition determines whether it helps the company and its people grow, or whether it is just a fun event to pass the time. To be intentional about your goals, you will need to find the ideal topic, create the right teams, and craft the perfect conditions.

In this article, I will share some of the decisions we made and how they held up in hindsight, so that you can roll out an even more successful contest within your own team.

Photo by Bao Truong on Unsplash

The rules

The perfect challenge

To be maximally effective, the competition had to both strengthen the knowledge we had gained over the past year and contain a novel element to push our expertise further. We chose a challenge within the Natural Language Processing (NLP) domain, since a considerable part of our innovation in the past year had taken place there. We had previously explored topics such as named entity recognition and few-shot learning, and believed it made sense to dive deeper into these concepts by combining them. Furthermore, it made strategic sense to become proficient in multilingual AI solutions, which one of our products would need in the not-so-distant future. We knew we had to tackle that as well.

The Hugging Face Hub, which contains thousands of datasets, was an invaluable help in finding the right challenge. It allowed us to discover the XTREME dataset, which is a multilingual multi-task NLP benchmark. It covers 40 typologically diverse languages and includes nine tasks that require reasoning about different levels of syntax and semantics. Out of the nine tasks, we selected four that were in line with our recent R&D work. Both the multitask and the multilingual aspects added a novel challenge for the team, thereby ensuring that the competition was a relevant learning experience.
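For readers who want to poke at the same benchmark: the XTREME subsets can be pulled straight from the Hub with the datasets library. The sketch below is purely illustrative rather than our exact setup, and the config names are our best recollection of how the subsets are spelled, so double-check the dataset card on the Hub.

```python
from datasets import load_dataset

# One subset per task; verify the exact config names against the dataset card.
xnli  = load_dataset("xtreme", "XNLI")            # sentence entailment
pawsx = load_dataset("xtreme", "PAWS-X.en")       # paraphrase detection (English subset)
udpos = load_dataset("xtreme", "udpos.English")   # part-of-speech tagging
panx  = load_dataset("xtreme", "PAN-X.en")        # named entity recognition

# Inspect which splits a subset provides and how large they are.
for split, ds in xnli.items():
    print(split, ds.num_rows)
```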

No opt-out

Just like the rest of the Fellowship program, participation in the competition was not optional. The team highly values continuous learning, so active participation in innovation activities is simply a part of the job. In return, half a day per week is reserved for Fellowship activities such as the competition.

There was another mandatory aspect of the competition that teams took great pride in: coming up with a great team name. We therefore had teams like Ditto (a “Transformer” Pokémon), Neurotic Networks, and Astrology (consisting of former astronomy graduates) trying to secure their first place.

Golden couples

We divided our team of machine learning engineers into pairs. We didn’t want people to work on the challenge alone because we wanted to foster collaboration and learning from each other’s strengths. However, we also didn’t want to make the teams any bigger and invite free riding.

Selecting the pairs was quite a deliberate process: we wanted to match seniors with juniors, have at least one person with relevant topic experience in each pair, and strengthen new bonds by pairing people who had not collaborated much in the previous year. And of course, the overall expertise of the teams needed to be balanced to make sure everyone had a fair shot. Yes, we were indeed tempted to write an optimization algorithm to sort this out, but we ended up doing a pretty good job by hand.

Efficiently good

Upfront we were very clear about the winning criteria. We averaged the scores of the four selected tasks to rank each team’s model in terms of performance.

However, teams were also scored on efficiency during the test phase. We intended to use the power metric computed by the Experiment Impact Tracker, but this didn’t provide the reliable results we were hoping for. We therefore switched to simply using the total duration of the test phase as an approximation of model efficiency. After all, when all models run on the same hardware, models that are less efficient are typically slower.

To arrive at a final winner, we created a re-ranking rule for the two teams with the most efficient models. The team with the most efficient model is pushed up two places in the overall performance ranking, and the second most efficient team is pushed up one place. This re-ranking only applies if the difference in average task score is less than 10 points on a scale of 100 and the difference in efficiency is greater than 5%. We didn’t want a super-fast but essentially random model to creep up the ranks.
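To make the rule concrete, here is a rough sketch of how it could be implemented. The team names, scores, and timings are made up for illustration, and the order in which the two efficiency bonuses are applied, plus which pairs of teams the thresholds are compared between, is our own reading of the rule rather than an official detail.

```python
def final_ranking(teams):
    """teams: list of dicts with 'name', 'score' (average task score, 0-100,
    higher is better) and 'seconds' (test-phase duration, lower is better)."""
    ranking = sorted(teams, key=lambda t: t["score"], reverse=True)

    # The most efficient team may move up two places, the second most efficient one place.
    by_speed = sorted(teams, key=lambda t: t["seconds"])
    bonuses = [(by_speed[0]["name"], 2), (by_speed[1]["name"], 1)]

    for name, places in bonuses:
        i = next(k for k, t in enumerate(ranking) if t["name"] == name)
        for _ in range(places):
            if i == 0:
                break
            ahead, mover = ranking[i - 1], ranking[i]
            close_enough = ahead["score"] - mover["score"] < 10                       # < 10 points on a 0-100 scale
            clearly_faster = (ahead["seconds"] - mover["seconds"]) / ahead["seconds"] > 0.05
            if close_enough and clearly_faster:
                ranking[i - 1], ranking[i] = mover, ahead
                i -= 1
            else:
                break
    return [t["name"] for t in ranking]


# Hypothetical numbers: a slow leader gets leapfrogged by two faster teams.
print(final_ranking([
    {"name": "Team A", "score": 78.0, "seconds": 5400},
    {"name": "Team B", "score": 74.0, "seconds": 1500},
    {"name": "Team C", "score": 72.0, "seconds": 1700},
]))
# -> ['Team B', 'Team C', 'Team A']
```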

Independent evaluation

While the training dataset was handed over to the competing teams, an evaluation test dataset and a final test dataset were held back. A non-participating engineer was assigned to evaluate the models on the held-out data. Teams had three opportunities to run a model on the evaluation test dataset, and after each attempt the intermediate leaderboard was updated. The final test dataset was used only once, at the end, to determine the final ranking.

Collaborative competition

Of course, teams needed to compete, but we also wanted people to rapidly learn and innovate based on each other’s findings. We held weekly stand-ups in which people shared interesting methods they had come across or tested. The exact implementations and parameters were kept secret by each team.

Additionally, all submissions were to be shared after the competition had ended. Although this was done to ensure more learning and promote clean code, it was also a fail-safe against cheating since the test set was freely available online.

A summary of the 10 steps to running a successful AI competition
Images courtesy of Slimmer AI

The competition

Each team had to submit one model that could handle 40 vastly different languages and four different NLP tasks:

  • Determine whether a premise sentence entails, contradicts, or is neutral towards a hypothesis sentence.
  • Determine whether two sentences are paraphrases of each other.
  • Perform part-of-speech tagging (e.g., identify nouns, verbs, pronouns, etc.).
  • Perform named entity recognition (e.g., identify named persons, organizations, locations, etc.).

Four hours a week for three months is not nearly enough time to train a winning model from scratch. Nor should you really have to, given the quality of pre-trained models available as open source. Teams experimented with finding and fine-tuning the best-suited pre-trained model, always remembering that unnecessary bulkiness would be penalized.

Techniques

Now this is where we get technical for those who are dying to get a glimpse of what our teams did. If this does not interest you, feel free to skip to the next section.

Most teams used an mBERT model as their base, not because it was the best, but because it’s an architecture we’re very familiar with. BERT models only accept input sequences of up to 512 tokens, which was a problem for some of the longer samples in the dataset. Teams experimented with various techniques to better serve longer input sequences, such as not cutting a sequence in the middle of a word or adding more context to follow-up sentences. They also experimented with the model’s architecture: improved results were obtained by, for example, adding a second layer to the classification head.
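As an illustration of the long-input problem, one common way to handle it with the Hugging Face tokenizer is to split the text into overlapping windows, so each follow-up chunk still carries some of the preceding context. This is a generic sketch, not the exact approach any team used; bert-base-multilingual-cased is the standard mBERT checkpoint, and the window and stride sizes are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

long_text = " ".join(["A sentence in a document that is far too long for BERT."] * 100)

# Split into 512-token windows that overlap by 64 tokens, so each follow-up
# chunk sees some of the context that came before it.
encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,
    return_overflowing_tokens=True,
)
print(len(encoded["input_ids"]))  # number of windows produced for this text
```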

The learning rate is critical during fine-tuning, and two tricks proved especially useful. First, some teams did not just train the head but also unfroze the encoder, using progressively lower learning rates for the lower layers of the model. Second, a learning rate schedule with warm-up led to better results. During warm-up, the learning rate first rises from a low value to the initial learning rate before decaying again to lower values.
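Below is a sketch of what those two tricks can look like with the transformers library. The peak learning rate, decay factor, and warm-up fraction are illustrative values, not the ones our teams settled on.

```python
import torch
from transformers import (AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3
)

# 1) Discriminative learning rates: the classification head trains fastest,
#    and each encoder layer below it gets a progressively lower rate.
#    (Embeddings and pooler are left out here to keep the sketch short.)
base_lr, decay = 2e-5, 0.9
param_groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for depth, layer in enumerate(reversed(model.bert.encoder.layer), start=1):
    param_groups.append({"params": layer.parameters(), "lr": base_lr * decay**depth})

optimizer = torch.optim.AdamW(param_groups)

# 2) Warm-up: the learning rate climbs to its peak over the first 10% of steps,
#    then decays linearly to zero.
total_steps = 10_000
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)
```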

One team successfully sped up their model and improved efficiency by not padding their samples. Instead, they attached all the samples back-to-back, using a separator token to mark the boundaries between them. Of course, in a production system this would only work if samples were also presented in batches. Other significant improvements came from enabling mixed precision training and parallel tokenization.
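A simplified illustration of the packing idea (our own toy version, not the team’s code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
samples = ["Wie spät ist es?", "¿Qué hora es?", "What time is it?"]

# Glue samples back-to-back with [SEP] tokens instead of padding each one,
# remembering where every sample starts.
packed, offsets = [tokenizer.cls_token_id], []
for text in samples:
    offsets.append(len(packed))
    packed += tokenizer.encode(text, add_special_tokens=False)
    packed.append(tokenizer.sep_token_id)

print(len(packed), offsets)  # one dense sequence, no padding tokens wasted
```

Mixed precision, by contrast, is usually closer to a one-line switch, for example via torch.cuda.amp in plain PyTorch or the precision option on PyTorch Lightning’s Trainer.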

We’re in favor of the data-centric AI movement that Andrew Ng is currently championing. At a certain point, bigger gains can be obtained by improving the data instead of further fine-tuning the model. To boost performance on uncommon languages, teams successfully experimented with oversampling and with adding translated samples from common languages. Instead of training on the languages one by one, sampling from multiple languages in a single batch enhanced the model’s ability to generalize.
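One simple way to get both effects, oversampling rare languages and mixing languages within each batch, is a weighted sampler. The sketch below weights samples by inverse language frequency, which is our own assumption rather than the teams’ exact recipe.

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

# Toy dataset of (text, language) pairs with one heavily under-represented language.
data = [("sample", "en")] * 900 + [("voorbeeld", "nl")] * 90 + [("voorbeeld", "af")] * 10

# Weight each sample by the inverse frequency of its language, so rare
# languages are oversampled and every batch mixes languages.
counts = Counter(lang for _, lang in data)
weights = [1.0 / counts[lang] for _, lang in data]

sampler = WeightedRandomSampler(weights, num_samples=len(data), replacement=True)
loader = DataLoader(data, batch_size=32, sampler=sampler)

texts, langs = next(iter(loader))
print(Counter(langs))  # languages now appear in roughly equal proportions per batch
```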

The winning team

In the end, the final results were very close. One team was bold enough to submit an InfoXLM architecture instead of mBERT. Using this beast of a model, they were propelled to first place in terms of performance. However, their model was so inefficient — more than three times slower than the second-best performing team’s — that they unfortunately lost their first place during the Great Re-rank (read back if you forgot how we evaluated the models). In fact, the model was so inefficient that even the third-best performing team moved up ahead of them. This meant that the best performing model ended up third on the final leaderboard, and the second-best performing team came out as the winner. What a ride!

The end results of our AI competition. Although the blue team Neurotic Networks scored highest on their final submission (line chart), their model was much slower and less efficient (bar chart), causing them to drop two places due to penalties. The pink runner-up team Garbage Goober went home with first prize. Images courtesy of Slimmer AI.

Lessons learned

A survey at the end of the competition showed that:

  • 9 out of 10 engineers felt they had expanded both their knowledge of the topic and their skills in building deep learning models.
  • 8 out of 10 found the task interesting and rated the competition itself as a valuable to extremely valuable experience.
  • Everyone felt capable and supported enough to perform their own experiments.
  • A majority felt that there wasn’t enough time or computational capacity to perform all the experiments they wanted.

We’re a data-driven company, so naturally we like quantifiable survey results to track numerically that we’re heading in the right direction. However, we wouldn’t want to do without the deeper personal feedback we receive through retrospectives. Those are the insights that help us continuously improve.

So, what did we learn from the retro?

Start with a working version

A lot of time was spent on creating the data loaders and getting an end-to-end pipeline working, which left less room for actual ML innovation. Although some engineers found this experience very rewarding, we don’t feel the need to repeat it, and we will probably supply a baseline model and data loader for our next competition so people can jump straight into the good stuff.

Further refine the criteria

In hindsight, severely penalizing a model for being 5% less efficient than its runner-up seems a bit too harsh. We lucked out that the re-ranking rule worked well in our case, because the difference in efficiency between the top performing teams was so large. Next time, we’ll probably calibrate those thresholds against the performance of the baseline model we’ll supply, and check in with the whole team before carving the numbers in stone. We might also experiment with rewarding models that require less training.

GPUs for everyone

We still have users to serve besides the competition, so the time constraint will remain. However, to overcome the resource limitations, next time we will get a dedicated cloud GPU for each team instead of sharing our in-house GPUs. We will likely give each team the same resource budget to stimulate more efficient and greener training of models.

Introduce new tools

Although several deep learning libraries have been in our tech stack for years, the PyTorch Lightning library was new to many engineers, and the competition allowed the full team to gain experience with it. Many engineers enjoyed this hands-on experience, and now that everyone is familiar with the library, we can confidently add it to our tech stack. Next time, we’ll do a quick scan for useful libraries to incorporate in the competition.

Bring back the zen

People got so enthusiastic about the competition that it became quite stressful to juggle it alongside other responsibilities. We’re not in the business of producing stress, so that’s something we’ll aim to avoid next time. Bigger teams could help, but we all agree we want to avoid free riding, which can be demotivating for everyone. We’ll therefore probably reward intermediate submissions more during our next competition. We might even declare winners at each intermediate stage, have everyone openly share their code, and then let other teams improve on that for the next submission. Hopefully this will spread the workload evenly and keep the tension at a pleasant level.

Keep it academic

We discussed whether the somewhat artificial, rather academic nature of the dataset was a good idea. As an applied AI company, shouldn’t we be using real-world data? Overall, we concluded that the chosen dataset worked really well for our purpose. We work on real-world AI problems every day of the year. Focusing on a more academic task gave everyone room to concentrate on the innovative parts and improve technical skills, instead of trying to wrap their heads around an evolving problem statement with messy data.

A little less conversation?

We knew from the start that trying to foster a cooperative competition might be a bit silly. Trying to eat your cake and still have it. And indeed, although teams definitely shared progress and breakthroughs during the weekly stand-ups, they also mentioned they were understandably a bit reluctant to share their best ideas. Next time, we will probably use a Q&A format for the stand-ups, since everyone is more than willing to help others who are stuck. That, combined with code sharing after intermediate submissions, might still create an open climate without ruining the competitive element.

Conclusion

The competition was held to push our technical capabilities forward and to have fun along the way. Although next time we might change a few rules, we’re happy that both goals were evidently reached.

During the contest we gained novel hands-on experience with training multi-task models and preparing their data. Already in the next quarter, this experience proved useful when code from the competition turned out to be directly applicable to the development of a commercial product. Our experience with multilingual training is currently helping us design a novel AI application. Because of this practical experience, all engineers are more aware of the opportunities and challenges they can expect. And on a more technical level, our entire team gained hands-on experience with a new ML library that significantly speeds up development time for new products.

But the AI Fellowship and its competition are not only about learning and driving innovation. It’s also a place to connect and celebrate success. While the participants visibly enjoyed the contest, the rest of the company joined in during the weekly huddles in which the intermediate rankings were presented. The winning team was put in the spotlight during our company Christmas party and will find eternal glory on a wall tile with their name on it in our canteen’s Hall of Fame. In a year in which remote work pushed us physically further apart than we wanted, the competition proved a great outlet to have fun and bond over a common goal.

The competition winners have found eternal glory in our Hall of Fame. Images courtesy of Slimmer AI.

A special thanks to Sybren Jansen — our Head of AI Innovation — for meticulously designing and organizing this event! We might host another competition soon, but our pile of other fun ideas to push innovation keeps growing too. I guess it’s best to say… stay tuned!

If you want to know more about our work at Slimmer AI, feel free to reach out or head to our website to learn more.

If you’ve thought of other creative schemes to make continuous learning and innovation fun, please share the joy in a comment!
