Titanic: data analysis with Python Part 2/2

Romain RUMP
22. Feb. 2018
3 min read

This is part 2 of the data analysis on the Titanic dataset. In part 1, I used cluster analysis to find groups of people with similar features in the data and then compared the survival rate of each group. Although interesting, it didn't really answer the question "Would you have survived the Titanic?".

So now, I will try a different approach: the Support Vector Machine (SVM). Without becoming too technical, the SVM is a classifier that separates your data with a boundary. Depending on which side of the boundary a passenger lands, it returns either 1 (survived) or 0 (didn't survive).
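To make the idea of a boundary concrete, here is a minimal sketch using scikit-learn's `SVC`. The six points and their labels are invented for illustration and have nothing to do with the Titanic data; the choice of a linear kernel is also an assumption.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data, made up for illustration: two features per point, two classes.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],   # class 0
              [3.0, 3.0], [3.5, 2.5], [4.0, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear SVM: it places a straight boundary between the two groups.
clf = SVC(kernel="linear")
clf.fit(X, y)

# New points are labelled according to the side of the boundary they fall on.
new_points = np.array([[0.2, 0.3], [3.8, 3.1]])
predictions = clf.predict(new_points)
print(predictions)
```

The first new point sits inside the class 0 cluster and the second inside the class 1 cluster, so each receives the label of the side it lands on.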

I can pretty much reuse the beginning of the code from part 1: I import the data and format it into a work dataframe.
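As a rough sketch of what such a formatting step can look like, the snippet below turns text columns into the numeric codes listed later in this post (sex, embarked, cabin letter). The column names, the sample rows and the helper `format_passengers` are assumptions made for illustration, not the code from part 1.

```python
import pandas as pd

# Hypothetical raw rows; the column names and values are assumptions.
raw = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
    "age": [29.0, 22.0, 40.0],
    "embarked": ["S", "Q", "C"],
    "cabin": ["B5", None, "E12"],
})

SEX_CODES = {"female": 0, "male": 1}
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}  # Cherbourg, Queenstown, Southampton
CABIN_CODES = {letter: i for i, letter in enumerate("ABCDEFGH")}  # A -> 0 ... H -> 7
MISSING_CABIN = 8  # code used when no cabin is recorded


def format_passengers(df):
    """Return a copy of df with text columns replaced by numeric codes."""
    work = df.copy()
    work["sex"] = work["sex"].map(SEX_CODES)
    work["embarked"] = work["embarked"].map(EMBARKED_CODES)
    # Keep only the first letter of the cabin, then encode it.
    cabin_letter = work["cabin"].str[0]
    work["cabin"] = cabin_letter.map(CABIN_CODES).fillna(MISSING_CABIN).astype(int)
    return work


work_df = format_passengers(raw)
print(work_df)
```

Rows with a missing port of embarkation are not handled here; a real dataset would need them filled in or dropped first.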

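Now, instead of using clustering, I will introduce the SVM classifier. The data will be split into a training sample and a testing sample: 75% of the data is used to train the SVM and the remaining 25% is held back for testing. Once trained, the classifier has to predict the outcome on the test samples. The percentage of correct answers is then computed.

For a balanced yes/no problem, an accuracy around 50% would be no better than random guessing. Keep in mind that most passengers did not survive, so always answering "didn't survive" already scores above 50% and is the more honest baseline. The closer the accuracy gets to 100%, the more reliable the classifier's predictions.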
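The split, train and score steps can be sketched as follows. This is a minimal sketch, not the original listing: it uses synthetic data from `make_classification` as a stand-in for the work dataframe, and the default `SVC` settings and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the passenger features (X) and survival labels (y).
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=4, random_state=0)

# 75% of the rows go to training, the remaining 25% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)

clf = SVC()
clf.fit(X_train, y_train)

# score() returns the fraction of test samples predicted correctly.
accuracy = clf.score(X_test, y_test)
print("Accuracy of SVM on the test sample:", accuracy)
```

Because the split is random, rerunning with a different `random_state` gives a slightly different accuracy, which is why repeating the run a few times is a sensible check.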

Output:

Accuracy of SVM on the test sample: 0.8079268292682927

OK, about 80% accuracy. I repeated the run a few times and got similar results. See the graph below showing the learning curve and the spread of the scores depending on the training sample size.

So a training size of about 800 or more is good. This is roughly 60% of the dataset size.
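One way to produce the numbers behind such a learning curve is scikit-learn's `learning_curve`, which retrains the classifier on growing subsets and cross-validates each one. The sketch below again uses synthetic stand-in data; the five training sizes and the 5-fold cross-validation are assumptions, not the settings behind the graph.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the passenger features and survival labels.
X, y = make_classification(n_samples=600, n_features=6,
                           n_informative=4, random_state=0)

# Train on 20%, 40%, ... 100% of the available training folds.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(), X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# One row per training size, one column per cross-validation fold.
for size, scores in zip(train_sizes, test_scores):
    print(f"{size:4d} samples: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The mean gives the curve itself and the standard deviation gives the width of the band around it; a good training size is where the mean flattens out and the band narrows.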

The overall result is clearly better than random, yet not quite "reliable" either. Part of me is disappointed, while the other part is impressed that it is possible to predict with 80% accuracy whether someone would have survived based on a few parameters like age, sex, class, etc.

There are two more things that I want to do. First, I want to have a degree of confidence in the answer. Second, I want to test how this black box behaves, so I will submit a list of fictitious passengers and see the result of the prediction.

To compute the level of confidence and then run a prediction on a list of fictitious passengers, a little more code is needed.
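In scikit-learn, the two pieces are `predict` for the 0/1 answer and `decision_function` for the value used here as the degree of confidence. The following is a minimal sketch under stated assumptions: the training rows are invented, only three features are used, and the linear kernel is chosen for simplicity.

```python
import numpy as np
from sklearn.svm import SVC

# Invented training rows: [pclass, sex, age] with sex 0 = female, 1 = male.
X_train = np.array([
    [1, 0, 29], [1, 0, 45], [2, 0, 30], [3, 0, 20], [2, 0, 8],
    [1, 1, 40], [2, 1, 35], [3, 1, 25], [3, 1, 30], [3, 1, 50],
])
y_train = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # 1 = survived

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Fictitious passengers to probe the trained classifier with.
fictive = np.array([
    [1, 0, 30],   # 1st class woman, 30 years old
    [3, 1, 30],   # 3rd class man, 30 years old
])

predictions = clf.predict(fictive)        # 0 or 1 per passenger
scores = clf.decision_function(fictive)   # signed value per passenger

for row, pred, score in zip(fictive, predictions, scores):
    print(row, "prediction:", pred, "decision function:", round(float(score), 3))
```

For a two-class `SVC`, a positive decision function value corresponds to a prediction of 1 and a negative value to a prediction of 0, so the sign carries the answer and the magnitude carries the confidence.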

Output:

Table of fictitious passengers with survival prediction and decision function value

sex: 0 -> female; 1 -> male

embarked: 0 -> Cherbourg; 1 -> Queenstown; 2 -> Southampton

cabin letter: 0 -> A; 1 -> B; ... ; 7 -> H; 8 -> NaN

Interpretation of the result:

From the graph, we can deduce what the decision function value means in terms of degree of confidence. The decision function value is a signed measure of the distance to the hyperplane that separates the data into two groups: positive values fall on the "survived" side and negative values on the "didn't survive" side. The further the feature set is from the plane, the more likely it is that the prediction is correct. However, in a situation like the Titanic, there is no absolute. Even if you should have survived in theory, it doesn't mean you would have survived in reality. So here is how the decision function scores can be interpreted:

fairly confident: -0.85 or less and +1.1 or above

fair guess: between -0.85 and -0.2 or between +0.95 and +1.1

random guess: between -0.2 and 0.95
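These bands translate directly into a small helper. `confidence_label` is a hypothetical name, and assigning the exact boundary values (-0.85, -0.2, 0.95, 1.1) to the more confident band is an assumption, since the list above does not say which side they belong to.

```python
def confidence_label(score):
    """Map a decision function value to the confidence bands listed above."""
    if score <= -0.85 or score >= 1.1:
        return "fairly confident"
    if score <= -0.2 or score >= 0.95:
        return "fair guess"
    return "random guess"


# Example scores covering each band.
for value in (-1.0, -0.5, 0.3, 1.0, 1.2):
    print(value, "->", confidence_label(value))
```

Note that the bands are not symmetric around zero: a negative score becomes trustworthy much sooner than a positive one.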

From the list of fictitious passengers, it is possible to see how the black box behaves. Women seem to have very high chances of survival, and children as well. Men, on the other hand, are a very uncertain prediction. However, 3rd class male passengers had quite a high probability of not surviving, as opposed to 1st and 2nd class male passengers, whose chance of survival is just slightly above 50%. The rule of "women and children first" seems to have applied. The Titanic tragedy occurred at a time when honor was valued more than life itself.

Finally, it is noticeable that older people in general seem to have had poor chances of survival. Even first class women had only about a 50% chance. It is very likely, in my opinion, that the older people stayed behind to allow younger men, women and children to board a lifeboat.

Conclusion

With the Support Vector Machine, it was possible to answer the question in quite a satisfying manner. If the values of the different features such as passenger class, age, ticket fare and so on are known, it is possible to predict whether that individual would have survived, and even to know how certain this prediction is.

It is quite an interesting tool and despite it being a black box, it is possible to shine some light into its behavior to better understand the way it classifies the data and makes a decision.