1 Titanic: Machine Learning from Disaster
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
Building the Titanic cost about $7.5 million, and the ship sank after the collision. The Titanic dataset is a good starting point for beginners in data science.
The objective of this notebook is to illustrate the workflow of a typical predictive modeling problem: how we inspect features, how we add new ones, and how we apply some machine learning concepts.
1.1 Data overview
The data has been split into a training set (titanic_train) and a test set (titanic_test).
- The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class.
- The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them
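The age encoding described in the notes above can be checked programmatically. The following is a minimal sketch in plain JavaScript, independent of the notebook's analytics API. The helper names isInfantAge and isEstimatedAge and the sample values are hypothetical, chosen only for illustration, and the sketch assumes the xx.5 convention applies only to ages of 1 or more.

```javascript
// Hypothetical helpers based on the age notes above; not part of any library.

// Ages below 1 are recorded as fractions of a year.
function isInfantAge(age) {
  return typeof age === "number" && age > 0 && age < 1;
}

// Estimated ages are recorded as xx.5 (assumed here to apply to ages of 1 or more).
function isEstimatedAge(age) {
  return typeof age === "number" && age >= 1 && age % 1 === 0.5;
}

// Made-up sample values, including a missing age.
const sampleAges = [0.75, 28.5, 40, null];
const ageFlags = sampleAges.map(age => ({
  age,
  infant: isInfantAge(age),
  estimated: isEstimatedAge(age),
}));
console.log(ageFlags);
```

Flags like these could later serve as extra features or as a sanity check on the age column before any imputation.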
1.2 Machine Learning Procedure
To begin predicting survival on the Titanic, we first select the most relevant variables from the titanic_train dataset:
var rs = db.executeQuery("SELECT survived, pclass, name, sex, age, sibsp, parch, fare FROM titanic_train");
We compute the length of each passenger's name to check whether there is any relationship between name length and survival:
var analytics = rs.analytics();
analytics = analytics.apply(['name'], x => x ? parseFloat(x.length)*1.00 : 0);
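To make the name-length transform concrete, here is a minimal sketch in plain JavaScript that does not rely on the notebook's analytics API. The rows array, the nameLength helper, and the sample records are hypothetical and made up for illustration; they are not taken from the dataset.

```javascript
// Hypothetical sample rows; the values are made up for illustration.
const rows = [
  { name: "Doe, Mr. John", survived: 0 },
  { name: "Roe, Mrs. Jane Mary", survived: 1 },
  { name: null, survived: 0 },
];

// Mirror the transform above: the length of the name, or 0 when it is missing.
function nameLength(name) {
  return name ? name.length : 0;
}

// Replace each name with its length, leaving the other columns untouched.
const withNameLength = rows.map(row => ({ ...row, name: nameLength(row.name) }));
console.log(withNameLength.map(row => row.name));
```

Mapping missing names to 0 keeps the column numeric, at the cost of treating "no name recorded" as a very short name.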
Next, we clean the data by filling the null values in the age column with the mean age (age_mean):
var age_mean = analytics.mean('age');
console.log('age_mean = ' + age_mean);
analytics.fillNulls('age', age_mean);
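Mean imputation itself is simple enough to sketch in plain JavaScript. The mean and fillNulls functions below are hypothetical stand-ins written for illustration, not the notebook's analytics methods, and the sketch assumes the mean is computed over the non-missing values only. The sample ages are made up.

```javascript
// Made-up sample ages with missing entries.
const ages = [22, null, 38, undefined, 30];

// Average of the present values; missing entries are skipped (an assumption).
function mean(values) {
  const present = values.filter(v => v !== null && v !== undefined);
  if (present.length === 0) return NaN;
  return present.reduce((sum, v) => sum + v, 0) / present.length;
}

// Return a copy in which every missing entry is replaced by `replacement`.
function fillNulls(values, replacement) {
  return values.map(v => (v === null || v === undefined ? replacement : v));
}

const ageMean = mean(ages);
const filledAges = fillNulls(ages, ageMean);
console.log("ageMean = " + ageMean);
console.log(filledAges);
```

Filling with the mean leaves the column average unchanged but reduces its variance, so it is a reasonable first pass rather than the only option.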