Photo by Christopher Burns on Unsplash
In which I miss Kaggle part 2: More EDA and More Models
Continuing to refine my process for the Kaggle Titanic Competition
Welcome back to my little series on Kaggle competitions. This is a follow-up to my earlier article, In which I miss Kaggle: the Titanic Competition.
Please take a look at the Kaggle notebook that accompanies this post; the post will make more sense if you read the two side by side.
The goals this time through were to:
- Improve the baseline accuracy from my last post, which was about 75%.
- Do a little more EDA (Exploratory Data Analysis)
- Do some feature engineering
Questions I wanted to answer:
- Did wealthier passengers survive at a higher rate?
- Did women survive at a higher rate?
- Did children survive at a higher rate?
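One way to put numbers on these questions is a group-by on the survival column. The sketch below is not the code from my notebook: it uses a handful of made-up rows with the Titanic column names (Survived, Pclass, Sex, Age), and the age buckets and the helper name survival_rate_by are my own assumptions. With the real data you would load Kaggle's train.csv instead.

```python
import pandas as pd

# Toy rows shaped like the Titanic training data (made up for illustration;
# the real analysis would read Kaggle's train.csv instead).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0],
})


def survival_rate_by(df, column):
    """Mean of Survived within each value of `column`, i.e. the survival rate."""
    return df.groupby(column, observed=True)["Survived"].mean()


# Bucket ages so that "children" becomes a category we can group on.
# The bin edges are an arbitrary choice for this sketch.
train["AgeGroup"] = pd.cut(
    train["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

by_class = survival_rate_by(train, "Pclass")   # proxy for wealth
by_sex = survival_rate_by(train, "Sex")
by_age = survival_rate_by(train, "AgeGroup")   # rows with missing Age are skipped
```

Each result is a small Series indexed by group, which is also easy to hand to a bar plot if you prefer a visual answer.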
Feature engineering I wanted to carry out:
- Check whether having a cabin contributed to survival: turn Cabin into a boolean feature (has a cabin or does not have a cabin).
- Change the Name feature into a Title feature to make it more useful for a machine learning model, i.e. reduce the number of distinct values in the feature.
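Both features take only a line or two of pandas. This is a minimal sketch rather than the exact code from my notebook: the column names HasCabin and Title, the regular expression, and the "Rare" bucket for uncommon titles are assumptions, and the four rows are just examples in the format of the Kaggle data.

```python
import pandas as pd

# A few rows in the format of the Kaggle Titanic data, for illustration.
train = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Cabin": [None, "C85", None, None],
})

# HasCabin: 1 when a cabin is recorded, 0 when the Cabin value is missing.
train["HasCabin"] = train["Cabin"].notna().astype(int)

# Title: the text between the comma and the first period, e.g. "Mr" or "Miss".
train["Title"] = (
    train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Optionally fold uncommon titles into one bucket so the model sees
# only a handful of categories (the set below is an assumption).
common_titles = {"Mr", "Mrs", "Miss", "Master"}
train["Title"] = train["Title"].where(train["Title"].isin(common_titles), "Rare")
```

Hundreds of unique names collapse into a few titles this way, which a model can actually learn from.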
What if I get stuck?
It’s always great to look at other public Kaggle notebooks in the competition. Because the Titanic competition is so popular and has been running for such a long time, there’s a lot of excellent material to draw on. I checked out a few notebooks that helped me with feature engineering and reminded me how to plot data to get visual answers to some of the questions above. I have cited those notebooks in my own notebook that accompanies this post.
Answers to Questions:
It looked like being in a higher socioeconomic class helped with survival, as did being female or being a baby. Having a cabin, another feature related to wealth, also improved the chances of survival.
Models I tried:
We'll see which machine learning model performs best out of the following:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Gaussian Naive Bayes
- Support Vector Machine
The Random Forest classifier turned out to give the highest accuracy, which, if you'll recall, was the metric we were using to judge our work.
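A comparison like this can be sketched with scikit-learn as below. A few assumptions to flag: the data is a synthetic stand-in from make_classification rather than the Titanic features, the hyperparameters are mostly defaults, and I score with 5-fold cross-validation here, which may differ from how the notebook evaluates the models. Which model wins on this stand-in data says nothing about the Titanic result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Name of the model with the highest mean accuracy.
best_model = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Keeping the models in a dictionary makes it cheap to add another candidate later without touching the scoring loop.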
This is a short post, because most of the information is in the accompanying Kaggle notebook. The last thing to do is to make some predictions on the test data and submit them to the competition.
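For that last step, the competition expects a CSV with two columns, PassengerId and Survived. Here is a minimal sketch of producing one, assuming a Random Forest and already-encoded features. The feature names (IsFemale, HasCabin) and all the rows are made up for illustration; in practice train and test would come from Kaggle's train.csv and test.csv after the same preprocessing.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up, already-encoded rows standing in for the prepared Kaggle data.
train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "IsFemale": [0, 1, 1, 1, 0, 0],
    "HasCabin": [0, 1, 0, 1, 0, 0],
    "Survived": [0, 1, 1, 1, 0, 0],
})
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass":      [3, 3, 2],
    "IsFemale":    [0, 1, 0],
    "HasCabin":    [0, 0, 0],
})

features = ["Pclass", "IsFemale", "HasCabin"]

# Fit on the labelled training rows, then predict the unlabelled test rows.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["Survived"])

# One row per test passenger: the id and the predicted 0/1 label.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[features]),
})
submission.to_csv("submission.csv", index=False)
```

The resulting submission.csv is the file you upload on the competition page.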