Titanic disaster analysis

Posted on Чт 17 Ноябрь 2016 in data analysis

I'm newbie at the Kaggle and I'm new to machine learning. I'll try to make this exploration interesting and detailed.

1. Data analysis

1.1. Expectations

What I do expect from this analysis? I’ll create a model predicting a survival on the Titanic. And on the way to prediction I'll make illustrations for all found dependencies.
First of all, I want to understand what kind of variables do I have.

Variable Description
survived Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

I guess important variables are sex, age, family size, passenger class, fare, and embarked. Last three might show a distance to lifeboats, corridors width and other important conveniences.

1.2. Loading and examination the dataset

I loaded necessary libraries and now let's take a look at the data and check whether the data is full and clean or not.

In [101]:
import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
%matplotlib inline

#scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale
from sklearn import svm

#read the data
train = pd.read_csv('../kaggle/titanic/data/train.csv', index_col='PassengerId')
test = pd.read_csv('../kaggle/titanic/data/test.csv', index_col='PassengerId')

train.info()
test.info()
train.describe()

Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB
Out[101]:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

1.3. Data managing

As you can see above, there are some values missing in the data. I should fill in age, fare and embarked. The fare is the simplest one, I filled it in with a mean value. The embarked is necessary to analyse.

In [102]:
#fill one missed value with mean
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())
#take a look at the embarked
sns.countplot(x='Embarked', data=train);
In [103]:
#fill an empty values with the most common
train['Embarked'] = train['Embarked'].fillna('S')

I filled in age with random values around a mean value. Then I compared the results to make sure there is no skew.

In [104]:
def fill_age(data):
    max_data = data['Age'].mean() + data['Age'].std()
    min_data = data['Age'].mean() - data['Age'].std()
    rand_data = np.random.randint(min_data, max_data, size=data['Age'].isnull().count())
    data.loc[np.isnan(data['Age']), ['Age']] = rand_data
    return data


fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15, 4))
axis1.set_title('Original')
axis2.set_title('Filled')
train['Age'].dropna().hist(bins=70, ax=axis1);

train = fill_age(train)
test = fill_age(test)

train['Age'].hist(bins=70, ax=axis2);

2. Features analisys

2.1. Passenger Class, Sex

The survival should obviously depend on sex and passenger class. Let's check that out.

In [106]:
g1 = sns.factorplot('Survived', col='Pclass', col_wrap=4, data=train, kind="count", order=[1,0])
(g1.set_xticklabels(['Yes', 'No'])
 .set_titles('{col_name} Class'));
g2 = sns.factorplot('Survived', col='Sex', col_wrap=4, data=train, kind="count", order=[1,0])
(g2.set_xticklabels(['Yes', 'No'])
 .set_titles('{col_name}'));
g3 = sns.factorplot(x='Sex', y='Survived', col='Pclass', data=train, kind='bar', ci=None)
(g3.set_axis_labels('', 'Survival Rate')
 .set_xticklabels(['Men', 'Women'])
 .set_titles('{col_name} Class')
 .set(ylim=(0, 1)));

So, the unluckiest were men, especially in a third class.
Now let's see what do sex and age show us.

In [107]:
g4 = sns.violinplot(x='Survived', y='Age', hue='Sex', data=train, split=True, order=[1,0])
(g4.set_xticklabels(['Yes', 'No']));

Titanic lost a lot of young men and women. The matrix below shows how many people were sunk in each age-class group.

In [108]:
def f(Age):
    for i in range(10, int(train['Age'].max()+1), 10):
        if Age >= (i-10) and Age <= i:
            return str(i-10) + '-' + str(i)

group_data = train[train['Survived'] == 0].loc[:,['Age', 'Pclass', 'Survived']]
group_data['AgeRange'] = group_data['Age'].apply(f)
del(group_data['Age'])
group_data = group_data.groupby(['AgeRange', 'Pclass']).count().reset_index()

pivoted = group_data.pivot(index='AgeRange', columns='Pclass', values='Survived').fillna(0)
sns.heatmap(pivoted);

2.2. Age

We already saw that age is an important feature. Now we can take a closer look at it.

In [109]:
f, ax = plt.subplots()

sunk = train.loc[train['Survived']==0, ['Age']]
sns.distplot(sunk,
            label='Sunk', bins=80, kde=False, color='r')

survived = train.loc[train['Survived']==1, ['Age']]
sns.distplot(survived,
            label='Survived', bins=80, kde=False, color='b')

ax.legend(ncol=2, loc='upper right', frameon=True)
ax.set(xlabel='Ages');

I decided to split data into age group and examine it in these groups.

In [111]:
from scipy.stats import spearmanr

def demographic_category(p):
    age, sex = p
    if age < 18:
        return 'Child'
    elif age > 65:
        return 'Elderly'
    else:
        return sex

train['Demographic'] = train[['Age', 'Sex']].apply(demographic_category, axis=1)
test['Demographic'] = test[['Age', 'Sex']].apply(demographic_category, axis=1)

g5 = sns.countplot(x='Survived', hue='Demographic', data=train, order=[1,0], palette='Set2')
(g5.set_xticklabels(['Yes', 'No']));
In [117]:
g6 = sns.factorplot('Survived', col='Demographic', 
                    data=train, kind='count', palette='Set2', order=[1,0])
(g6.set_xticklabels(['Yes', 'No'])
 .set_titles('{col_name}'));
In [116]:
g7 = sns.factorplot(x='Demographic', y='Survived', col='Pclass', 
                    data=train, kind='bar', palette='Set2', ci=None)
(g7.set_axis_labels('', 'Survival Rate')
 .set_titles('{col_name} Class')
 .set(ylim=(0, 1)));
In [119]:
sns.factorplot('Demographic', col='Embarked', 
                    data=train, kind='count', palette='Set2', ci=None);

2.3. Parch, SibSp

Let's see how many relatives did people have.

In [85]:
g8 = sns.countplot(x='Survived', hue='Parch', data=train, order=[1,0], palette='Set2')
(g8.set_xticklabels(['Yes', 'No']));
In [86]:
g7 = sns.countplot(x='Survived', hue='SibSp', data=train, order=[1,0], palette='Set2')
(g7.set_xticklabels(['Yes', 'No']));

I splited people into groups "With family" and "Alone" and made a new chart in 3.1. Data cleaning.

2.4. Embarked, Fare

In [88]:
g8 = sns.countplot(x='Survived', hue='Embarked', data=train, order=[1,0], palette='Set2')
(g8.set_xticklabels(['Yes', 'No']));
In [123]:
f, ax = plt.subplots()

sunk = train.loc[train['Survived']==0, ['Fare']]
sns.distplot(sunk,
            label='Sunk', bins=50, kde=False, color='r')

survived = train.loc[train['Survived']==1, ['Fare']]
sns.distplot(survived,
            label='Survived', bins=50, kde=False, color='b')

ax.legend(ncol=2, loc='upper right', frameon=True)
ax.set(xlabel='Fares');

3. Predictions

3.1. Data cleaning

I extracted embarked and demographic into a few new binary variables. Then I removed some unnecessary variables because of their linear dependence (for example, Male is Female).

In [124]:
embarked = pd.get_dummies(train['Embarked'])
embarked.columns = ['S', 'C', 'Q']
demographic = pd.get_dummies(train['Demographic'])
demographic.columns = ['Child', 'Elderly', 'Female', 'Male']
train = train.join(embarked)
train = train.join(demographic)

embarked = pd.get_dummies(test['Embarked'])
embarked.columns = ['S', 'C', 'Q']
demographic = pd.get_dummies(test['Demographic'])
demographic.columns = ['Child', 'Elderly', 'Female', 'Male']
test = test.join(embarked)
test = test.join(demographic)

def family(p):
    parch, sibsp = p
    if (parch + sibsp) > 0:
        return 1
    else:
        return 0

train['Family'] = train[['Parch', 'SibSp']].apply(family, axis=1)
test['Family'] = test[['Parch', 'SibSp']].apply(family, axis=1)

y_train = train.loc[:, ['Survived']]
X_train = train.drop(['Sex', 'Ticket', 'Cabin', 'Name', 'Demographic', 
                      'Embarked', 'Parch', 'SibSp', 'Male', 'S', 'Survived'], axis=1)
X_test = test.drop(['Sex', 'Ticket', 'Cabin', 'Name', 'Demographic', 
                    'Embarked', 'Parch', 'SibSp', 'Male', 'S'], axis=1)

g = sns.countplot(x='Survived', hue='Family', data=train, order=[1,0], palette='Set2')
(g.set_xticklabels(['Yes', 'No']));

3.2. Results

I applied 4 prediction methods to the data and compared them with each other. For the prediction I chose the method with the best score.

In [125]:
regression = LogisticRegression()
regression.fit(X_train, np.ravel(y_train))
lr_score = regression.score(X_train, y_train)

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
prediction = tree.predict(X_test)
tree_score = tree.score(X_train,y_train)

knc = KNeighborsClassifier(n_neighbors=25)
knc.fit(X_train, np.ravel(y_train))
knc_score = knc.score(X_train, y_train)

svc = svm.SVC()
svc.fit(X_train, np.ravel(y_train))
svc_score = svc.score(X_train,y_train)

#write prediction to file
submission = pd.DataFrame({
        'PassengerId': test.index,
        'Survived': prediction
    })
submission.to_csv('titanic.csv', index=False)

#look at the scores
#I choose KNC
print("LogisticRegression score:\t\t\t%0.5f" % lr_score)
print("DecisionTreeClassifier score:\t\t\t%0.5f" % tree_score)
print("KNeighborsClassifier score:\t\t\t%0.5f" % knc_score)
print("SVC score:\t\t\t\t\t%0.5f\n" % svc_score)
Out[125]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Out[125]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
Out[125]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=25, p=2,
           weights='uniform')
Out[125]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
LogisticRegression score:			0.79574
DecisionTreeClassifier score:			0.98653
KNeighborsClassifier score:			0.71156
SVC score:					0.86195

4. Conclusion

Now we see important variables:

In [126]:
importance = pd.DataFrame(X_train.columns)
importance.columns = ['Features']
importance["Importance"] = pd.Series(tree.feature_importances_)
importance
Out[126]:
Features Importance
0 Pclass 0.111904
1 Age 0.323860
2 Fare 0.281423
3 C 0.020218
4 Q 0.021831
5 Child 0.000000
6 Elderly 0.000356
7 Female 0.225926
8 Family 0.014482

Thank you for reading this analysis. And I appreciate any feedback. You also may look at the notebook and write comments here.