Can Friendly Competition Lead to Better Models?
Here on Squarespace’s Strategy and Analytics team, we build models that predict customer lifetime value, forecast customer service demand, and even determine how much we should spend on those ubiquitous Squarespace ads you hear on your favorite podcast.
We subject our models to peer code review, but that process does not always address larger, system-level questions: how do we know whether a given model is the best we are capable of building as a team? How does it perform on the task relative to models built with alternative approaches?
During Squarespace’s most recent Hack Week, we experimented with a different approach to model building: an internal Kaggle competition. Kaggle is a platform for data science competitions, which follow a simple recipe: 1) define a prediction task, 2) provide training data to participants, and 3) score submissions on a held-out subset of the data and display the results on a leaderboard. Netflix is often credited with popularizing the use of data science competitions to solve business problems via the 2009 Netflix Prize, in which teams competed to build the best model for predicting movie ratings. However, the idea of a “common task framework” dates back to at least the 1980s, when DARPA challenged teams of researchers to produce the best possible rules for machine translation.
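Kaggle handled the scoring for us, but to make step 3 of the recipe concrete, here is a minimal sketch of how submissions could be scored against labels that participants never see and ranked into a leaderboard. The file layout, the column names (trial_id, predicted_probability, subscribed_within_28d), and the choice of log loss as the metric are illustrative assumptions, not details of our actual competition.

```python
# Minimal sketch of leaderboard scoring against a hidden label set.
# Column names and the log-loss metric are assumptions for illustration.
import pandas as pd
from sklearn.metrics import log_loss

def score_submission(submission_path: str, hidden_labels: pd.DataFrame) -> float:
    """Score one team's predictions against labels the participants never see."""
    submission = pd.read_csv(submission_path)  # expected columns: trial_id, predicted_probability
    merged = hidden_labels.merge(submission, on="trial_id", how="left")
    return log_loss(merged["subscribed_within_28d"], merged["predicted_probability"])

def build_leaderboard(submissions: dict[str, str], hidden_labels: pd.DataFrame) -> pd.DataFrame:
    """Rank teams by log loss on the hidden set; lower is better."""
    scores = {team: score_submission(path, hidden_labels) for team, path in submissions.items()}
    return pd.DataFrame(sorted(scores.items(), key=lambda kv: kv[1]), columns=["team", "log_loss"])
```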
For our internal competition, we wanted to predict subscription rates of customers who start a free trial on Squarespace. The dataset for this competition included anonymized information on customers’ marketing channels, geographic locations, product usage, previous trials, and, of course, whether or not the customers subscribed to Squarespace within 28 days of starting a trial. We used Kaggle’s InClass platform to host a private competition, encrypted unique identifiers, and uploaded no personally identifiable customer information to Kaggle’s servers.
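The anonymization step is worth making concrete. The sketch below shows the general idea: hash unique identifiers so no raw IDs leave our systems, and derive the binary 28-day conversion label from timestamps. The column names, the salted SHA-256 hash, and the salt handling are hypothetical stand-ins for illustration, not our actual pipeline.

```python
# Rough sketch of preparing competition data: hashed identifiers and a derived
# 28-day conversion label. Column names and hashing details are assumptions.
import hashlib
import pandas as pd

SALT = "store-this-secret-outside-the-dataset"

def anonymize_id(raw_id: str) -> str:
    """One-way, salted hash of a unique identifier."""
    return hashlib.sha256((SALT + raw_id).encode("utf-8")).hexdigest()

def prepare_competition_data(trials: pd.DataFrame) -> pd.DataFrame:
    """Assumes trial_start_date and subscription_date are parsed as datetimes."""
    out = trials.copy()
    out["trial_id"] = out["trial_id"].astype(str).map(anonymize_id)
    # Target: did the trial convert to a paid subscription within 28 days?
    days_to_subscribe = (out["subscription_date"] - out["trial_start_date"]).dt.days
    out["subscribed_within_28d"] = (days_to_subscribe <= 28).astype(int)
    return out.drop(columns=["subscription_date"])
```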
The competition succeeded in generating insights from the data. Teams took diverse approaches, experimenting with algorithms ranging from gradient-boosted decision trees to neural nets. Multiple teams independently found that training on a small subset of the data produced results similar to training on the full dataset, a finding that cut training time by a factor of ten. Another surprise was that seasonality mattered far less for trial conversion than other variables did: either the seasonal effects were not strong, or they were captured indirectly through other variables in the dataset.
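The subsampling result is straightforward to check for any model. The sketch below trains the same gradient-boosted classifier on 10% of the training data and on all of it, then compares validation AUC on a common held-out set. The target name, the scikit-learn estimator, the metric, and the assumption of numeric features are all illustrative choices rather than what any particular team submitted.

```python
# Sketch of the subsampling check: train on a fraction of the data vs. all of
# it and compare validation AUC. Estimator, metric, and columns are assumptions.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def compare_sample_sizes(df: pd.DataFrame, target: str = "subscribed_within_28d") -> None:
    X, y = df.drop(columns=[target]), df[target]  # assumes numeric feature columns
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    for frac in (0.1, 1.0):
        X_sub = X_train.sample(frac=frac, random_state=0)
        y_sub = y_train.loc[X_sub.index]
        model = HistGradientBoostingClassifier(random_state=0).fit(X_sub, y_sub)
        auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
        print(f"training fraction={frac:.0%}  validation AUC={auc:.4f}")
```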
The competition format does have downsides. First, it is not the most efficient use of resources: several teams spend time solving the same problem in parallel. Second, the focus on results above all else can push teams to neglect system design considerations, such as system-level dependencies and production runtimes.
But the competition was by no means a waste of effort. Participants gained familiarity with this critical dataset, and the friendly competition format encouraged teams to collaborate effectively and push their limits. All of the code produced for the competition was stored in a shared repository, so any individual or small team building a model for a business application would not have to start from scratch. The “common task framework” could be something we revisit in the future, especially in cases where model performance is more important than interpretability.
Want to join our team of passionate data scientists? Check out our open positions.