Reviewing 2017 and Previewing 2018 – Source www.kaggle.com
2017 was a huge year for Kaggle. Aside from the acquisition by Google, it was also the year our community expanded from a primary focus on machine learning competitions to a broader data science and machine learning platform. This year both our public Datasets platform and Kaggle Kernels grew ~3x, meaning we now also have a thriving data repository and code-sharing environment. Each of those products is on track to pass competitions on most activity metrics in early 2018.
To give the community more visibility into how Kaggle has changed, we have decided to share our major activity metrics along with some commentary on them. We're also giving some visibility into our 2018 plans.
Active users (unique annual, logged-in users) grew to 895K this year, up from 471K in 2016 (chart 1). This represents 90% growth in 2017, up from 71% growth in 2016.
While we are still most famous for machine learning competitions, both our public Datasets platform and Kaggle Kernels are on track to be larger drivers of activity on Kaggle in early 2018.
Chart 1: Active users
We launched 41 machine learning competitions this year, up from 33 last year. This included three competitions with more than $1MM in prize money:
- $1.5MM competition with TSA to identify threat objects from body scans
- $1.2MM competition with Zillow to improve the Zestimate home valuation algorithm
- $1MM competition with NIH and Booz Allen to diagnose lung cancer from CT scans
We have also invested in building closer ties to the research community, launching several important research competitions for NIPS and CVPR workshops. Highlights include a series of adversarial learning challenges and the YouTube 8M challenge. Kaggle is also now hosting ImageNet.
Kaggle InClass, which allows professors to host competitions for their students for free, became a completely self-service platform and grew strongly: 1,217 machine learning and statistics classes hosted Kaggle InClass competitions in 2017, up from 661 in 2016 (84% growth).
On the community side, 375K users downloaded competition datasets, up 62% YoY. And, 122K users submitted entries to our machine learning competitions, up 54% YoY.
Public Datasets Platform
Our public Datasets platform allows our community to share and collaborate on public datasets. 7,044 datasets were uploaded to the platform in 2017, up from 495 in 2016.
Downloaders of datasets on our public Datasets platform increased more than 3x this year, reaching 339K in 2017 up from 107K in 2016. This growth means the public Datasets platform is driving almost as many data downloads as our machine learning competitions (see chart 2). For context, we launched our public Datasets platform in 2016 and our competition platform in 2010.
Chart 2: downloaders of public Datasets vs competitions
Kaggle Kernels is currently used to share code and models on our competitions and public Datasets platform. In 2017, we had 113K users of Kaggle Kernels, up almost 3x from 39K in 2016. Kernel authoring is quickly becoming just as popular as making a competition submission (see chart 3).
Chart 3: kernel authors vs competition submitters
The most popular publicly shared kernels from this year were:
- A tutorial on pre-processing images for the 2017 Data Science Bowl to predict lung cancer from CT scans
- A tutorial on ensembling and stacking using Python
- A notebook exploring a house price dataset for a popular playground competition
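For readers unfamiliar with ensembling, here is a toy sketch of its simplest form, averaging two models' predicted probabilities, in the spirit of the tutorial kernels above. The numbers and model names are made up for illustration and are not taken from any actual kernel:

```python
# Toy ensembling sketch: blend two models' predicted probabilities by
# averaging, then threshold the blend at 0.5 to get class labels.
# (Hypothetical probabilities; stacking tutorials go further by training
# a meta-model on out-of-fold predictions instead of a simple average.)
model_a_probs = [0.9, 0.2, 0.6, 0.4]  # model A's predicted P(positive)
model_b_probs = [0.7, 0.1, 0.8, 0.3]  # model B's predicted P(positive)

# Element-wise mean of the two probability lists, rounded for display.
blend = [round((a + b) / 2, 2) for a, b in zip(model_a_probs, model_b_probs)]
labels = [int(p >= 0.5) for p in blend]

print(blend)   # [0.8, 0.15, 0.7, 0.35]
print(labels)  # [1, 0, 1, 0]
```

Averaging often beats either model alone when their errors are uncorrelated, which is the intuition the stacking tutorials build on.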
We launched the largest-ever survey of data scientists and machine learners. It had 16,716 respondents and resulted in 235 public kernels exploring the dataset. The best coverage of the survey was in the FT and The Verge.
Overall, we received a lot of press this year, including coverage of the acquisition (TechCrunch), profiles of several elite community members (Wired and Mashable), the NIPS adversarial learning challenge (MIT Tech Review), the TSA competition (NYTimes), and the Zillow competition (NYTimes).
It’s also worth highlighting the community activities that help strengthen Kaggle. We are aware of over 50 Kaggle meetup groups organized by Kaggle community members in cities ranging from Princeton to Paris. These meetups discuss our competitions and datasets. This year, some elite Kaggle members launched a Coursera course on how to win Kaggle competitions. And a group of community members set up a Kaggle Slack channel to discuss Kaggle competitions and datasets; it has over 3,300 members.
We started with machine learning competitions. We’ve now expanded to add a public Datasets platform and Kaggle Kernels. We eventually want to make Kaggle the place where Kagglers can do all of their data science and machine learning. In 2018, we are focused on improving all of our major products (competitions, the public Datasets platform and Kaggle Kernels) and adding new educational resources to our platform.
Competitions are currently in a strong position. However, it’s important that we are not complacent and that we continue to innovate. In 2018, we plan to start supporting new competition types to make sure we can support problems that are at the cutting edge of machine learning and AI. To do this, we aim to better support code-only competitions (where Kagglers upload code rather than solution files). This will allow us to host new competition types, including reinforcement learning competitions and competitions with compute restrictions.
Public Datasets Platform
In 2018, we hope to become as well known for our public Datasets platform as we are for our machine learning competitions. To do this, we need to keep growing the number of high-quality datasets on Kaggle, and we are planning a range of powerful new features to support that: services that let our community work with larger datasets through integrations with data warehouses like BigQuery, and functionality that allows Kagglers to stream in live datasets rather than only uploading static ones.
Kaggle Kernels is currently most useful for sharing models and analysis on our competitions and public Datasets platform. In 2018, we want to make Kaggle Kernels a strong standalone product. This includes enabling Kagglers to use Kaggle Kernels on their own private datasets, access GPUs, and build more complex pipelines.
Many users come to Kaggle to start their data science career and boost their learning. To better support this segment of our community, we’ve launched a platform of hands-on machine learning courses at https://www.kaggle.com/learn. We hope it will be the fastest path for users to start building highly accurate machine learning models and to gain the skills they need to land their first data science job.
Want to get involved?
We are hiring data scientists as we grow our competition team. You can learn more and apply at: https://www.kaggle.com/careers/datascientist.