Titanic disaster analysis

Posted on Чт 17 Ноябрь 2016 in data analysis

I'm newbie at the Kaggle and I'm new to machine learning. I'll try to make this exploration interesting and detailed.

Content¶

1. Data analysis
2. Features analisys
3. Predictions
- 3.1. Data cleaning
- 3.2. Results
4. Conclusion

1. Data analysis¶

1.1. Expectations¶

What I do expect from this analysis? I’ll create a model predicting a survival on the Titanic. And on the way to prediction I'll make illustrations for all found dependencies.
First of all, I want to understand what kind of variables do I have.

Variable	Description
survived	Survival (0 = No; 1 = Yes)
pclass	Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name	Name
sex	Sex
age	Age
sibsp	Number of Siblings/Spouses Aboard
parch	Number of Parents/Children Aboard
ticket	Ticket Number
fare	Passenger Fare
cabin	Cabin
embarked	Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

I guess important variables are sex, age, family size, passenger class, fare, and embarked. Last three might show a distance to lifeboats, corridors width and other important conveniences.

1.2. Loading and examination the dataset¶

I loaded necessary libraries and now let's take a look at the data and check whether the data is full and clean or not.

In [101]:

import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
%matplotlib inline

#scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale
from sklearn import svm

#read the data
train = pd.read_csv('../kaggle/titanic/data/train.csv', index_col='PassengerId')
test = pd.read_csv('../kaggle/titanic/data/test.csv', index_col='PassengerId')

train.info()
test.info()
train.describe()


Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB

Out[101]:

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

1.3. Data managing¶

As you can see above, there are some values missing in the data. I should fill in age, fare and embarked. The fare is the simplest one, I filled it in with a mean value. The embarked is necessary to analyse.

In [102]:

#fill one missed value with mean
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())
#take a look at the embarked
sns.countplot(x='Embarked', data=train);

In [103]:

#fill an empty values with the most common
train['Embarked'] = train['Embarked'].fillna('S')

I filled in age with random values around a mean value. Then I compared the results to make sure there is no skew.

In [104]:

def fill_age(data):
    max_data = data['Age'].mean() + data['Age'].std()
    min_data = data['Age'].mean() - data['Age'].std()
    rand_data = np.random.randint(min_data, max_data, size=data['Age'].isnull().count())
    data.loc[np.isnan(data['Age']), ['Age']] = rand_data
    return data


fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15, 4))
axis1.set_title('Original')
axis2.set_title('Filled')
train['Age'].dropna().hist(bins=70, ax=axis1);

train = fill_age(train)
test = fill_age(test)

train['Age'].hist(bins=70, ax=axis2);

2. Features analisys¶

2.1. Passenger Class, Sex¶

The survival should obviously depend on sex and passenger class. Let's check that out.

In [106]:

g1 = sns.factorplot('Survived', col='Pclass', col_wrap=4, data=train, kind="count", order=[1,0])
(g1.set_xticklabels(['Yes', 'No'])
 .set_titles('{col_name} Class'));
g2 = sns.factorplot('Survived', col='Sex', col_wrap=4, data=train, kind="count", order=[1,0])
(g2.set_xticklabels(['Yes', 'No'])
 .set_titles('{col_name}'));
g3 = sns.factorplot(x='Sex', y='Survived', col='Pclass', data=train, kind='bar', ci=None)
(g3.set_axis_labels('', 'Survival Rate')
 .set_xticklabels(['Men', 'Women'])
 .set_titles('{col_name} Class')
 .set(ylim=(0, 1)));

So, the unluckiest were men, especially in a third class.
Now let's see what do sex and age show us.

In [107]:

g4 = sns.violinplot(x='Survived', y='Age', hue='Sex', data=train, split=True, order=[1,0])
(g4.set_xticklabels(['Yes', 'No']));

Titanic lost a lot of young men and women. The matrix below shows how many people were sunk in each age-class group.

In [108]:

def f(Age):
    for i in range(10, int(train['Age'].max()+1), 10):
        if Age >= (i-10) and Age <= i:
            return str(i-10) + '-' + str(i)

group_data = train[train['Survived'] == 0].loc[:,['Age', 'Pclass', 'Survived']]
group_data['AgeRange'] = group_data['Age'].apply(f)
del(group_data['Age'])
group_data = group_data.groupby(['AgeRange', 'Pclass']).count().reset_index()

pivoted = group_data.pivot(index='AgeRange', columns='Pclass', values='Survived').fillna(0)
sns.heatmap(pivoted);

2.2. Age¶

We already saw that age is an important feature. Now we can take a closer look at it.

In [109]:

f, ax = plt.subplots()

sunk = train.loc[train['Survived']==0, ['Age']]
sns.distplot(sunk,
            label='Sunk', bins=80, kde=False, color='r')

survived = train.loc[train['Survived']==1, ['Age']]
sns.distplot(survived,
            label='Survived', bins=80, kde=False, color='b')

ax.legend(ncol=2, loc='upper right', frameon=True)
ax.set(xlabel='Ages');

I decided to split data into age group and examine it in these groups.

In [111]:

from scipy.stats import spearmanr

def demographic_category(p):
    age, sex = p
    if age < 18:
        return 'Child'
    elif age > 65:
        return 'Elderly'
    else:
        return sex

train['Demographic'] = train[['Age', 'Sex']].apply(demographic_category, axis=1)
test['Demographic'] = test[['Age', 'Sex']].apply(demographic_category, axis=1)

g5 = sns.countplot(x='Survived', hue='Demographic', data=train, order=[1,0], palette='Set2')
(g5.set_xticklabels(['Yes', 'No']));

In [117]:

g6 = sns.factorplot('Survived', col='Demographic', 
                    data=train, kind='count', palette='Set2', order=[1,0])
(g6.set_xticklabels(['Yes', 'No'])
 .set_titles('{col_name}'));

In [116]:

g7 = sns.factorplot(x='Demographic', y='Survived', col='Pclass', 
                    data=train, kind='bar', palette='Set2', ci=None)
(g7.set_axis_labels('', 'Survival Rate')
 .set_titles('{col_name} Class')
 .set(ylim=(0, 1)));

In [119]:

sns.factorplot('Demographic', col='Embarked', 
                    data=train, kind='count', palette='Set2', ci=None);

2.3. Parch, SibSp¶

Let's see how many relatives did people have.

In [85]:

g8 = sns.countplot(x='Survived', hue='Parch', data=train, order=[1,0], palette='Set2')
(g8.set_xticklabels(['Yes', 'No']));

In [86]:

g7 = sns.countplot(x='Survived', hue='SibSp', data=train, order=[1,0], palette='Set2')
(g7.set_xticklabels(['Yes', 'No']));

I splited people into groups "With family" and "Alone" and made a new chart in 3.1. Data cleaning.

2.4. Embarked, Fare¶

In [88]:

g8 = sns.countplot(x='Survived', hue='Embarked', data=train, order=[1,0], palette='Set2')
(g8.set_xticklabels(['Yes', 'No']));

In [123]:

f, ax = plt.subplots()

sunk = train.loc[train['Survived']==0, ['Fare']]
sns.distplot(sunk,
            label='Sunk', bins=50, kde=False, color='r')

survived = train.loc[train['Survived']==1, ['Fare']]
sns.distplot(survived,
            label='Survived', bins=50, kde=False, color='b')

ax.legend(ncol=2, loc='upper right', frameon=True)
ax.set(xlabel='Fares');

3. Predictions¶

3.1. Data cleaning¶

I extracted embarked and demographic into a few new binary variables. Then I removed some unnecessary variables because of their linear dependence (for example, Male is Female).

In [124]:

embarked = pd.get_dummies(train['Embarked'])
embarked.columns = ['S', 'C', 'Q']
demographic = pd.get_dummies(train['Demographic'])
demographic.columns = ['Child', 'Elderly', 'Female', 'Male']
train = train.join(embarked)
train = train.join(demographic)

embarked = pd.get_dummies(test['Embarked'])
embarked.columns = ['S', 'C', 'Q']
demographic = pd.get_dummies(test['Demographic'])
demographic.columns = ['Child', 'Elderly', 'Female', 'Male']
test = test.join(embarked)
test = test.join(demographic)

def family(p):
    parch, sibsp = p
    if (parch + sibsp) > 0:
        return 1
    else:
        return 0

train['Family'] = train[['Parch', 'SibSp']].apply(family, axis=1)
test['Family'] = test[['Parch', 'SibSp']].apply(family, axis=1)

y_train = train.loc[:, ['Survived']]
X_train = train.drop(['Sex', 'Ticket', 'Cabin', 'Name', 'Demographic', 
                      'Embarked', 'Parch', 'SibSp', 'Male', 'S', 'Survived'], axis=1)
X_test = test.drop(['Sex', 'Ticket', 'Cabin', 'Name', 'Demographic', 
                    'Embarked', 'Parch', 'SibSp', 'Male', 'S'], axis=1)

g = sns.countplot(x='Survived', hue='Family', data=train, order=[1,0], palette='Set2')
(g.set_xticklabels(['Yes', 'No']));

3.2. Results¶

I applied 4 prediction methods to the data and compared them with each other. For the prediction I chose the method with the best score.

In [125]:

regression = LogisticRegression()
regression.fit(X_train, np.ravel(y_train))
lr_score = regression.score(X_train, y_train)

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
prediction = tree.predict(X_test)
tree_score = tree.score(X_train,y_train)

knc = KNeighborsClassifier(n_neighbors=25)
knc.fit(X_train, np.ravel(y_train))
knc_score = knc.score(X_train, y_train)

svc = svm.SVC()
svc.fit(X_train, np.ravel(y_train))
svc_score = svc.score(X_train,y_train)

#write prediction to file
submission = pd.DataFrame({
        'PassengerId': test.index,
        'Survived': prediction
    })
submission.to_csv('titanic.csv', index=False)

#look at the scores
#I choose KNC
print("LogisticRegression score:\t\t\t%0.5f" % lr_score)
print("DecisionTreeClassifier score:\t\t\t%0.5f" % tree_score)
print("KNeighborsClassifier score:\t\t\t%0.5f" % knc_score)
print("SVC score:\t\t\t\t\t%0.5f\n" % svc_score)

Out[125]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Out[125]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Out[125]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=25, p=2,
           weights='uniform')

Out[125]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

LogisticRegression score:			0.79574
DecisionTreeClassifier score:			0.98653
KNeighborsClassifier score:			0.71156
SVC score:					0.86195

4. Conclusion¶

Now we see important variables:

In [126]:

importance = pd.DataFrame(X_train.columns)
importance.columns = ['Features']
importance["Importance"] = pd.Series(tree.feature_importances_)
importance

Out[126]:

	Features	Importance
0	Pclass	0.111904
1	Age	0.323860
2	Fare	0.281423
3	C	0.020218
4	Q	0.021831
5	Child	0.000000
6	Elderly	0.000356
7	Female	0.225926
8	Family	0.014482

Thank you for reading this analysis. And I appreciate any feedback. You also may look at the notebook and write comments here.