This is my second podcast on SoundCloud. It covers one of the topics most frequently discussed by beginners and experts alike: To Kaggle or not to Kaggle?
I have attached the podcast to this blog post, along with the transcript and links to the resources mentioned in it.
Want to learn more? Visit www.ankitrathi.com
Listen to my podcast here:
Hello everyone, my name is Ankit Rathi & welcome to ‘Data Deft’, a data science podcast.
When I was introduced to data science in 2012, the second platform that helped me shape my skills was Kaggle; the first was (of course) Coursera’s Machine Learning course by Andrew Ng. At that time, Kaggle was only about competitions; other useful sections like Kernels, Datasets & Learn were not there yet.
I have participated in 6 competitions so far, learnt a lot and won medals in 3 of them (1 silver & 2 bronze). I am a Kaggle Expert, and my highest rank was around 1K out of 80K participants at the time. These days I don’t get much time to participate, but whenever I can, I look at the winning solutions of recently concluded competitions to keep myself updated.
In this post, I am going to share my views on what Kaggle competitions are good for & what they are not. In short, Kaggle is great in many respects, but it is not everything in data science. When you work on a real-world data science project, you face many more challenges, and a different set of skills is required.
Note: By Kaggle competition here, I mean any data science hackathon or competition you intend to compete in.
The major difference between Kaggle competitions & real-world data science projects is that Kaggle competitions are mostly supervised learning problems, while data science projects can be anything, supervised or unsupervised.
To make the difference clearer, I will elaborate on the gap for each step in the data science framework. Let’s take the CRISP-DM methodology as a baseline here:
1. Business Understanding
In Kaggle competitions, the business problem is already formulated for you, while in data science projects you have to identify and build the problem statement yourself. Most of the time, a stakeholder or customer doesn’t know what problems can be solved by data science; sometimes, even if they do, they have a vague requirement like ‘we need to increase our sales’, ‘we need to improve our operations’ or ‘we need to optimize our business decisions’. Data scientists need to sit with stakeholders to formulate the problem statement & translate it into a data science problem.
2. Data Understanding
Another point that differentiates Kaggle competitions from real-world problems is that in Kaggle competitions, the data you get is mostly processed & already divided into train & test sets. In data science projects, you have to identify what data qualifies for your problem statement. Most of the time, you also have to decide what qualifies as a feature and what a suitable target variable is. Sometimes it’s not straightforward to identify the target variable, and you define it by working with domain experts. You also have to define the methodology for splitting the data into train, validation & test sets.
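To make the point about defining your own split concrete, here is a minimal sketch in plain Python. The function names `split_dataset` and `split_by_time` and the 60/20/20 proportions are assumptions chosen for illustration, not something Kaggle or CRISP-DM prescribes. The first function shuffles before splitting; the second keeps time order, which is usually the safer choice when the model will be used to predict the future.

```python
import random


def split_dataset(rows, valid_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation and test lists."""
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_valid = round(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


def split_by_time(rows, key, valid_frac=0.2, test_frac=0.2):
    """Split in time order: oldest rows for training, newest rows for testing."""
    ordered = sorted(rows, key=key)
    n = len(ordered)
    n_test = round(n * test_frac)
    n_valid = round(n * valid_frac)
    train = ordered[:n - n_valid - n_test]
    valid = ordered[n - n_valid - n_test:n - n_test]
    test = ordered[n - n_test:]
    return train, valid, test


# Hypothetical data: 100 records, each a (day, value) pair.
records = [(day, day * 10) for day in range(100)]
train, valid, test = split_dataset(records)
t_train, t_valid, t_test = split_by_time(records, key=lambda r: r[0])
print(len(train), len(valid), len(test))
```

In a real project the choice between the two (or a grouped split by customer, store and so on) depends on how the model will be used, which is exactly the decision a competition makes for you.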
3. Data Preparation
This step is not very different between Kaggle competitions & real-world projects, but real-world data is more complex & dirty, so more cleaning & preparation is required. Overall, participating in Kaggle competitions will help you improve your data cleaning & data preparation skills.
4. Modeling
Again, this step is not very different between Kaggle competitions & data science projects. In fact, participating in Kaggle competitions is beneficial for this particular step, as you get to know which model works better for which kind of problem.
5. Evaluation
Another step where Kaggle competitions differ from real-world problems is that the evaluation metric is defined for you, while in data science projects you choose which evaluation metric is suitable for your project. But participating in Kaggle competitions will give you exposure to evaluation metrics and which metric to use where. You will also learn how not to overfit your model to the training data.
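As a small illustration of why the choice of metric matters, the sketch below compares accuracy with precision, recall and F1 on a made-up, imbalanced binary problem. The data and helper names are hypothetical and written in plain Python; in practice you would more likely use a library implementation.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0  # no actual positives


def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical imbalanced problem: 95 negatives, 5 positives (e.g. fraud cases).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))  # 0.95, looks good on paper
print("recall:", recall(y_true, y_pred))      # 0.0, it finds no positive case
print("f1:", f1(y_true, y_pred))
```

A model like this would score well if accuracy were the metric, yet it would be useless for a business that cares about catching the rare positive cases. That trade-off is what you have to reason about yourself in a real project.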
6. Deployment
In Kaggle competitions, you get a submission format in which you submit your predictions, while in data science projects you have to deploy the models in a live environment for the business to use. You also have to understand the customer’s tech ecosystem, how you will integrate your solution and how you will monitor the performance of your model.
Another major difference between Kaggle competitions and data science projects is that participants build far too many models and keep ensembling them to gain an advantage on the leaderboard. Ultimately, these complex models are not fit to be deployed in production.
But over the years, Kaggle has recognized that gap, and it now has other sections like Learn, Kernels & Datasets. Do check them out to improve your skills further.
Thanks for listening to the podcast. I look forward to your feedback.
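For readers who have not seen ensembling before, here is a minimal sketch of its simplest form: a weighted average of the predictions from several models. The function name `average_ensemble` and the numbers are made up for illustration; leaderboard ensembles are usually far more elaborate (stacking, many folds, many model families).

```python
def average_ensemble(predictions, weights=None):
    """Blend several models' prediction lists into one by weighted average.

    predictions: list of lists, one list of scores per model, same length each.
    weights: optional list with one non-negative weight per model.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Hypothetical probabilities from three different models for four rows.
model_a = [0.9, 0.2, 0.6, 0.1]
model_b = [0.8, 0.3, 0.4, 0.2]
model_c = [0.7, 0.1, 0.5, 0.3]

blended = average_ensemble([model_a, model_b, model_c])
weighted = average_ensemble([model_a, model_b, model_c], weights=[2, 1, 1])
print(blended)
print(weighted)
```

Even in this toy form, the production cost is visible: every model in the list has to be trained, deployed, served and monitored, often for a small gain in the metric.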
Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts, simply ‘Connect’ or ‘Follow’. Also, feel free to listen to me on SoundCloud.