Kaggle Prediction Competition

Sowjanya Sadashiva
Feb 22, 2021


Competition: Titanic — Machine Learning from Disaster

Overview

This is a simple first machine learning competition on Kaggle for beginners. The task is to create a model from the given data that predicts which passengers survived the Titanic shipwreck. The competition provides three CSV files: a training set (train.csv), a test set (test.csv), and a sample submission (gender_submission.csv). Using the data provided in train.csv, we build a model to predict who among the passengers in test.csv survived the disaster.

The train.csv file has 891 entries.

The test.csv file has 418 entries.

Working with the dataset:

1. Clean the data: fill in missing values and check for duplicates.

2. Analyze the data and identify patterns.

3. Model the data using the identified patterns.

4. Predict survival for the test.csv passengers using the model.

5. Submit the predictions.

Importing Libraries

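A sketch of the imports used throughout this walkthrough (the original shows them in a screenshot, so the exact set is assumed from the code that follows):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns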

Reading the files.

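A sketch of the read step; Kaggle mounts this competition's files under /kaggle/input/titanic/, which is assumed here:

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')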

Description of the data.

Data types of the datasets.
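A sketch of how to inspect them (the article shows the output as a screenshot):

train_data.dtypes
test_data.dtypes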

The attributes in the dataset have different data types. To make processing easier, we pre-process the data to convert each attribute to an integer type.

Datasets with missing values or NaN.

Train data:
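Counting the missing entries per column (a sketch of the check behind the screenshot):

train_data.isnull().sum()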

In train.csv, 177 Age entries, 687 Cabin entries, and 2 Embarked entries are missing or NaN. This data needs to be pre-processed and normalized before we can train a model and predict survival on the test data.

Test data:
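The same check for the test set:

test_data.isnull().sum()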

Exploratory Data Analysis

It tests the data and generates hypotheses for further analysis, and it helps prepare the data for modeling. Data profiling is a necessary step that summarizes the datasets through descriptive statistics. It includes handling missing values, correcting data, and dropping unneeded fields.

Visual representation of the datasets

There are three different passenger classes in the datasets; below is the survival rate of the passengers according to Pclass.

We import seaborn as sns to plot the charts below.
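A sketch of the plotting calls (assumed; the article shows only the resulting charts):

sns.barplot(x='Pclass', y='Survived', data=train_data)
plt.show()

sns.barplot(x='Sex', y='Survived', data=train_data)
plt.show()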

Survival rate based on Passenger class

Survival rate based on Sex

Survival rate based on Age

Where 0 = age range 0–22, 1 = age range 23–29, 2 = age range 30–35, and 3 = age 35 and above.
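A minimal sketch of the age bucketing implied by this caption; the bin edges and the AgeGroup name are assumptions, and it works on a copy so the snippet stays standalone:

eda = train_data.copy()
eda['AgeGroup'] = pd.cut(eda['Age'], bins=[0, 22, 29, 35, np.inf], labels=[0, 1, 2, 3])
sns.barplot(x='AgeGroup', y='Survived', data=eda)
plt.show()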

Cleaning the data:

from sklearn.impute import SimpleImputer is used to fill in missing values.

Parameters of SimpleImputer:

missing_values: int, float, str, np.nan or None, default=np.nan

strategy: string, one of 'mean', 'median', 'most_frequent', or 'constant', default='mean'

fill_value: string or numerical value, default=None

Once the missing values are filled, the data needs to be normalized to a common scale.
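A sketch of the imputation step (the strategy choices are assumptions; the article shows this step as a screenshot):

from sklearn.impute import SimpleImputer

# Fill the 177 missing Age values with the mean age.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train_data['Age'] = imputer.fit_transform(train_data[['Age']]).ravel()
test_data['Age'] = imputer.transform(test_data[['Age']]).ravel()

# Embarked is categorical, so fill its 2 missing values with the most frequent port.
emb_imputer = SimpleImputer(strategy='most_frequent')
train_data['Embarked'] = emb_imputer.fit_transform(train_data[['Embarked']]).ravel()

# The test set also has one missing Fare (not discussed in the article); use the median.
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())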

For the Sex column:

After filling the missing values and mapping the attribute values to 0 and 1, the data type of Sex is int.
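A sketch of the mapping (which sex maps to 0 and which to 1 is assumed):

train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})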

Similarly, for Embarked:

Here the Embarked value is changed to 0, 1, or 2 based on the port of embarkation.
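A sketch (which port maps to which integer is assumed):

# S = Southampton, C = Cherbourg, Q = Queenstown
train_data['Embarked'] = train_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
test_data['Embarked'] = test_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})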

We can continue the same process for the other attributes; once we finish the normalization, the dataset looks like the following.

Above, all field values are 0, 1, 2, and so on, except for Ticket. In this dataset Ticket does not contribute much to the prediction of survival, so we can drop the Ticket, PassengerId, and Name fields from the datasets.
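A sketch of the drop step; the article does not say what happens to Cabin, so dropping it as well is an assumption here (687 of its values are missing):

test_ids = test_data['PassengerId']  # keep the IDs for the submission file
drop_cols = ['Ticket', 'PassengerId', 'Name', 'Cabin']
train_data = train_data.drop(columns=drop_cols)
test_data = test_data.drop(columns=drop_cols)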

When we finish processing the data, the data types of the attributes and the dataset itself look like the following.
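A sketch of the checks behind the two screenshots in the original:

train_data.dtypes  # every remaining column is now numeric
train_data.head()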

Once the data has been pre-processed and is ready for modeling, we can use the different data analysis models available to predict survival for the passengers in test.csv.

The data is assigned to variables, and these variables can be used in the different modeling algorithms.
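A sketch of the feature/target split (the variable names are assumed):

X_train = train_data.drop(columns=['Survived'])
y_train = train_data['Survived']
X_test = test_data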

Machine Learning Algorithms:

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. K-Nearest Neighbor
  5. Random Forest Classifier
  6. Gaussian Naive Bayes Classifier

Import the scikit-learn modules to implement the models.
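For the models listed above, the imports would look like this (a sketch; the article shows them in a screenshot):

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB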

The functionality that scikit-learn includes:

  • Regression, including Linear and Logistic Regression
  • Classification, including K-Nearest Neighbors
  • Clustering, including K-Means and K-Means++
  • Model selection
  • Preprocessing, including Min-Max Normalization

Examples:

1. Logistic Regression

It gives an accuracy of 81.59%.
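A sketch of the fit-and-score pattern; score here computes accuracy on the training data, and the exact figure depends on the preprocessing choices above:

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print(round(logreg.score(X_train, y_train) * 100, 2))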

2. Gaussian Naive Bayes Classifier

It gives an accuracy of 78.79% (a combined code sketch for this and the remaining classifiers appears after the KNN result below).

3. Random Forest Classifier

It gives the highest accuracy, 90.91%.

4. Decision Tree

It gives an accuracy of 90.91%, the same as the Random Forest classifier.

5. KNN (K-Nearest Neighbor)

It gives an accuracy of 90.91%.
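The remaining classifiers follow the same fit-and-score pattern; a combined sketch (the hyperparameters are assumptions):

models = {
    'Gaussian Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Decision Tree': DecisionTreeClassifier(),
    'K-Nearest Neighbor': KNeighborsClassifier(n_neighbors=3),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_train, y_train) * 100, 2))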

Similarly, we can predict and check the accuracy with the other machine learning models.

Submitting the prediction.

Click on "Save Version" in the top-right corner to commit the code. Use the following code to convert the predictions to a CSV file, then save and submit the prediction on the competition homepage.
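A sketch of the standard submission file for this competition; the model choice is assumed, with models['Random Forest'] the fitted classifier and test_ids the saved passenger IDs from the earlier snippets:

predictions = models['Random Forest'].predict(X_test)
output = pd.DataFrame({'PassengerId': test_ids, 'Survived': predictions})
output.to_csv('submission.csv', index=False)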
