Photo by Aryan Dhiman on Unsplash
In which I miss Kaggle: The Titanic Competition
Steps to getting started with a Kaggle competition
Table of contents
- What is the Titanic Kaggle Competition?
- Why am I competing in it?
- Steps to a Kaggle Competition
- Step one: read the competition description in the Overview tab
- Step two: find out how your submission will be evaluated
- Step three: create a new notebook!
- Step four: review the goal of the competition
- Step five: import data and necessary libraries
- Step six: do some Exploratory Data Analysis
- Step seven: split the data
- Step eight: encode the data
- Step nine: get a baseline
Okay, I'm really not great at sticking to a plan. I missed data sciencey stuff and drifted over to Kaggle in my web browser, just for a quick visit. Then I started to miss doing Kaggle competitions and figured I'd just see which ones were running. But I'd forgotten a lot about building models and using Kaggle, so I decided to work through the infamous Titanic competition, which I've never done.
What is the Titanic Kaggle Competition?
The goal of the Titanic competition is to predict whether each passenger survived or not, given a bunch of information about that passenger. That information makes up the features in the dataset, and the outcome (survived/didn't survive) is the label.
(For a quick intro or refresher on what features and labels are and why they're important, check out my blog article A Brief Summary of Supervised Learning)
Why am I competing in it?
My goal in submitting predictions to this competition is just to refresh some of my data science skills, like using pandas to read data from a csv file, using Python to divide a dataset into features and labels, and using machine learning techniques to make predictions on a dataset. Honestly, I miss the process, and felt this would be a low stakes way to get back into it.
Steps to a Kaggle Competition
Please follow along with my Kaggle notebook that accompanies this article, as it enriches the article and will help it make more sense.
Step one: read the competition description in the Overview tab
This is an essential step, because this portion of the competition gives you all the information you'll need: the problem statement (what you're trying to figure out), what the labels are, how your submission should look (usually a csv file of predictions for each item in the dataset), and how your submission will be evaluated. Which brings me to step two...
Step two: find out how your submission will be evaluated
You'll find this under the "Evaluation" section of the Overview tab. In the case of the Titanic competition, submissions are judged on how accurate they are. Accuracy is the percentage of passengers whose outcome you predict correctly. Some Kaggle competitions have fancier metrics that you might need to do a google search on to understand fully. The metric is a really important part of the process, because it tells you whether your predictions are good enough.
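To make the metric concrete, here's a tiny sketch (with made-up predictions, not real competition data) of how you could compute accuracy yourself with scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Made-up outcomes for five passengers: 1 = survived, 0 = didn't survive
y_true = [0, 1, 1, 0, 1]  # actual outcomes
y_pred = [0, 1, 0, 0, 1]  # our predictions

# Accuracy = fraction of passengers we predicted correctly
print(accuracy_score(y_true, y_pred))  # 0.8 -> we got 4 out of 5 right
```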
Step three: create a new notebook!
Under the Code tab of the competition, select the option "Create New Notebook" which will generate a new Kaggle notebook for you to write your code in, complete with the dataset for that competition already connected to it.
Step four: review the goal of the competition
Make sure you can answer the question, what am I trying to predict? If you're not sure, refer to step one. In the case of the Titanic competition we want to know if a given passenger survived or didn't survive. This will be represented as a numerical value of either 0 or 1. Since there are only two options for a given individual's label this makes it a binary classification problem.
Step five: import data and necessary libraries
You'll have pandas and numpy imported automatically when you start a new Kaggle notebook, but you might want some other libraries for data visualization, like matplotlib, and of course your machine learning library of choice for actually coding up your model.
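For example, a typical first cell might look something like this (the /kaggle/input/titanic paths match how Kaggle attaches the competition data, but it's worth double-checking them in your own notebook):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Kaggle attaches the competition files under /kaggle/input/titanic
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
```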
Step six: do some Exploratory Data Analysis
(For an overview of EDA, check out my article explaining what it is and why it's useful.) For example, you might want to (there's a quick code sketch after this list):
- See a sample of your data set and what kinds of features you have
- Find out if your data is balanced by checking the distribution (Our data is slightly unbalanced, because more people didn't survive (0) than did survive (1).)
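Here's a minimal sketch of those two checks, assuming the training data was read into a DataFrame called train_data:

```python
# Peek at the first few rows to see what kinds of features we have
train_data.head()

# Check the label distribution: 0 = didn't survive, 1 = survived
train_data["Survived"].value_counts(normalize=True)
```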
Step seven: split the data
Next you want to make sure that you split the data into features and labels. If we take a look at our train_data.columns, we'll get a list of our columns in the dataset - ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. In our case, anything that isn't "Survived" is a feature, while "Survived" is the label, since it's the thing we want to predict. We want to split the data into features and labels so that we can use it in our machine learning model.
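A minimal sketch of that split (the variable names features and labels are just my choice):

```python
# Everything except "Survived" is a feature; "Survived" is the label
features = train_data.drop(columns=["Survived"])
labels = train_data["Survived"]
```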
Step eight: encode the data
Our machine learning model won't understand the categorical data that we have unless we turn it into numbers. This process is called encoding. We are going to use pandas' categorical functionality to do this, which assigns a different integer code to each unique category (so, for example, each unique value in the Sex column gets its own number). Our labels are already numbers, so we don't need to change those.
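Here's one way that encoding might look, continuing from the features DataFrame above (a sketch, not the only way to do it):

```python
# Convert each string (object) column to pandas' category dtype,
# then replace the categories with their integer codes
for col in features.select_dtypes(include="object").columns:
    features[col] = features[col].astype("category").cat.codes
```

One thing to note: .cat.codes marks missing values with -1, which is handy here because a tree-based model can treat that as just another category.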
Step nine: get a baseline
You'll need to choose a model to go with for your baseline model. What is a baseline? It's just the first model you use to predict. It shouldn't be too fancy, because you want to start with something simple and then build on top of that only if necessary. Why bring out all the bells and whistles unless you really need to? If you're not sure what to use as a baseline model, try googling "baseline model for [type of problem]" - in this case, "baseline model for binary classification problem." Alternately, check out some of the public notebooks in the competition and see what other contestants used as their baselines.
I decided to go with a Random Forest Classifier. I split the data into X_train, y_train, X_test, and y_test so that part of my data could be used to train the model and the rest could be used to test it and see how well it did.
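Here's roughly what that looks like with scikit-learn, continuing from the features and labels above (the fillna value, test_size, and random_state are just illustrative choices):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Fill any remaining missing values (e.g. Age) so the classifier doesn't choke on NaN
features = features.fillna(-1)

# Hold out part of the training data to see how well the baseline does
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# .score() reports accuracy, which is exactly the competition's metric
print(model.score(X_test, y_test))
```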
That's a good amount of work for one day. I got a baseline model with 75% accuracy. I definitely think I can do better, so the next steps are to evaluate the model and try to improve it. But that's an article for another day. :)