1 Titanic: Machine Learning from Disaster

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

The Titanic cost about $7.5 million to build, and it sank after the collision. The Titanic dataset is a very good starting point for beginners in data science.

The objective of this notebook is to give an idea of the workflow in a predictive modeling problem: how to check features, how to add new features, and how to apply some machine learning concepts.

1.1 Data overview

The data has been split into a training set (titanic_train) and a test set (titanic_test).

  • The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class.
  • The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Variable | Definition                                 | Key
survival | Survival                                   | 0 = No, 1 = Yes
pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd
sex      | Sex                                        |
age      | Age in years                               |
sibsp    | # of siblings / spouses aboard the Titanic |
parch    | # of parents / children aboard the Titanic |
ticket   | Ticket number                              |
fare     | Passenger fare                             |
cabin    | Cabin number                               |
embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower


age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5


sibsp: The dataset defines family relations in this way:

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)


parch: The dataset defines family relations in this way:

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, so parch = 0 for them.
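
The coded values in the variable key can be turned into readable labels with a small lookup. The sketch below is plain JavaScript and does not use the notebook's analytics API. The names `describePassenger` and `isEstimatedAge` are hypothetical, and treating only ages of 1 or more as candidates for the xx.5 "estimated" form is an assumption, since ages below 1 are fractional for a different reason.

```javascript
// Lookup tables taken from the variable key and the pclass note.
const PCLASS_LABELS = { 1: 'Upper', 2: 'Middle', 3: 'Lower' };
const PORT_NAMES = { C: 'Cherbourg', Q: 'Queenstown', S: 'Southampton' };

// Assumption: an estimated age has the form xx.5 and is at least 1,
// because ages below 1 are recorded as fractions anyway.
function isEstimatedAge(age) {
  return typeof age === 'number' && age >= 1 && age % 1 === 0.5;
}

// Hypothetical helper: decode one raw record into readable fields.
function describePassenger(p) {
  return {
    survived: p.survived === 1,
    socioEconomicStatus: PCLASS_LABELS[p.pclass] ?? 'Unknown',
    port: PORT_NAMES[p.embarked] ?? 'Unknown',
    ageEstimated: isEstimatedAge(p.age)
  };
}
```

Unknown or missing codes fall back to 'Unknown' rather than throwing, which keeps the helper usable on rows with gaps.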

1.2 Machine Learning Procedure

To begin predicting Titanic survival, we first select the most important variables from our dataset:

[Figure: result of SELECT * FROM titanic_train]

var rs = db.executeQuery("SELECT survived, pclass, name, sex, age, sibsp, parch, fare FROM titanic_train");

We calculate the length of each passenger's name to see whether there is any relationship between name length and survival.

var analytics = rs.analytics();
// Replace each name with its length as a float; a missing name becomes 0
analytics = analytics.apply(['name'], x => x ? parseFloat(x.length) * 1.00 : 0);
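
To make the transformation concrete, here is a minimal sketch of the same idea in plain JavaScript, without the notebook's `analytics` object. The rows are made-up illustrative records, not taken from the dataset, and `nameLength` and `meanNameLengthBySurvival` are hypothetical helper names.

```javascript
// Same rule as the callback above: a missing name counts as length 0.
const nameLength = name => (name ? name.length : 0);

// Average name length for non-survivors (key 0) and survivors (key 1).
function meanNameLengthBySurvival(rows) {
  const totals = { 0: { sum: 0, n: 0 }, 1: { sum: 0, n: 0 } };
  for (const row of rows) {
    const group = totals[row.survived];
    if (!group) continue; // skip rows without a 0/1 outcome
    group.sum += nameLength(row.name);
    group.n += 1;
  }
  const mean = g => (g.n > 0 ? g.sum / g.n : null);
  return { 0: mean(totals[0]), 1: mean(totals[1]) };
}

// Hypothetical rows, for illustration only.
const sampleRows = [
  { survived: 0, name: 'Doe, Mr. John' },
  { survived: 1, name: 'Roe, Mrs. Jane Anne' },
  { survived: 1, name: null }
];
console.log(meanNameLengthBySurvival(sampleRows));
```

Comparing the two averages is only a first look at whether name length carries any signal; it says nothing about why such a relationship might exist.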

Next, we clean the data by filling the null values in the age column with the mean age, stored in the variable age_mean

var age_mean = analytics.mean('age');
console.log('age_mean = ' + age_mean);
analytics.fillNulls('age', age_mean);
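
The imputation step can be sketched in plain JavaScript as follows. This illustrates mean imputation in general, not the implementation behind `analytics.mean` or `analytics.fillNulls`, whose internals are not shown here. `meanOf` and `fillNullsWith` are hypothetical helpers, the sample ages are made up, and the choice to exclude nulls from the mean is an assumption made for illustration.

```javascript
// Mean of the non-null values in a column (assumption: nulls are skipped).
function meanOf(rows, column) {
  const values = rows
    .map(r => r[column])
    .filter(v => v !== null && v !== undefined);
  if (values.length === 0) return null;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Returns new rows in which null/undefined entries are replaced by `fill`.
function fillNullsWith(rows, column, fill) {
  return rows.map(r => (r[column] == null ? { ...r, [column]: fill } : r));
}

// Hypothetical ages, for illustration only.
const ages = [{ age: 20 }, { age: null }, { age: 40 }];
const ageMean = meanOf(ages, 'age');
const filled = fillNullsWith(ages, 'age', ageMean);
console.log(ageMean, filled);
```

Filling with the mean keeps every row available for training, at the cost of pulling the age distribution toward its center.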
