Titanic disaster analysis
Posted on Thu 17 November 2016 in data analysis
I'm a newbie on Kaggle and new to machine learning. I'll try to make this exploration interesting and detailed.
1. Data analysis¶
1.1. Expectations¶
What do I expect from this analysis? I'll build a model predicting survival on the Titanic, and along the way I'll illustrate every dependency I find.
First of all, I want to understand what kind of variables I have.
Variable | Description |
---|---|
survived | Survival (0 = No; 1 = Yes) |
pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
name | Name |
sex | Sex |
age | Age |
sibsp | Number of Siblings/Spouses Aboard |
parch | Number of Parents/Children Aboard |
ticket | Ticket Number |
fare | Passenger Fare |
cabin | Cabin |
embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
I guess the important variables are sex, age, family size, passenger class, fare, and port of embarkation. The last three might reflect distance to the lifeboats, corridor width, and other conditions that mattered.
1.2. Loading and examining the dataset¶
I loaded the necessary libraries; now let's take a look at the data and check whether it is complete and clean.
import pandas as pd
import numpy as np
#visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
%matplotlib inline
#scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale
from sklearn import svm
#read the data
train = pd.read_csv('../kaggle/titanic/data/train.csv', index_col='PassengerId')
test = pd.read_csv('../kaggle/titanic/data/test.csv', index_col='PassengerId')
train.info()
test.info()
train.describe()
1.3. Data preparation¶
As you can see above, some values are missing: Age, Fare, and Embarked need to be filled in. Fare is the simplest, with a single missing value that I filled with the mean. Embarked needs a closer look first.
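To see exactly where the gaps are, a quick count of missing values per column does the job (a small check along these lines):
#count missing values per column
print(train.isnull().sum())
print(test.isnull().sum())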
#fill the single missing fare with the mean
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())
#take a look at Embarked
sns.countplot(x='Embarked', data=train);
#fill the missing values with the most common port
train['Embarked'] = train['Embarked'].fillna('S')
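As a side note, the most common port can also be picked programmatically instead of hardcoding 'S'; an equivalent variant:
#the same fill without hardcoding the port
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])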
I filled in the missing ages with random values within one standard deviation of the mean, then compared the distributions to make sure the fill didn't skew them.
def fill_age(data):
    #fill missing ages with random values within one std of the mean
    mean_age = data['Age'].mean()
    std_age = data['Age'].std()
    n_missing = data['Age'].isnull().sum()
    rand_ages = np.random.randint(int(mean_age - std_age), int(mean_age + std_age), size=n_missing)
    data.loc[data['Age'].isnull(), 'Age'] = rand_ages
    return data
fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(15, 4))
axis1.set_title('Original')
axis2.set_title('Filled')
train['Age'].dropna().hist(bins=70, ax=axis1);
train = fill_age(train)
test = fill_age(test)
train['Age'].hist(bins=70, ax=axis2);
2. Feature analysis¶
2.1. Passenger Class, Sex¶
Survival should obviously depend on sex and passenger class. Let's check that.
g1 = sns.factorplot('Survived', col='Pclass', col_wrap=4, data=train, kind="count", order=[1,0])
(g1.set_xticklabels(['Yes', 'No'])
.set_titles('{col_name} Class'));
g2 = sns.factorplot('Survived', col='Sex', col_wrap=4, data=train, kind="count", order=[1,0])
(g2.set_xticklabels(['Yes', 'No'])
.set_titles('{col_name}'));
g3 = sns.factorplot(x='Sex', y='Survived', col='Pclass', data=train, kind='bar', ci=None)
(g3.set_axis_labels('', 'Survival Rate')
.set_xticklabels(['Men', 'Women'])
.set_titles('{col_name} Class')
.set(ylim=(0, 1)));
So the unluckiest were men, especially in third class.
Now let's see what sex and age show us.
g4 = sns.violinplot(x='Survived', y='Age', hue='Sex', data=train, split=True, order=[1,0])
(g4.set_xticklabels(['Yes', 'No']));
The Titanic lost a lot of young men and women. The matrix below shows how many people died in each age-class group.
def age_range(age):
    #map an age to a ten-year bucket, e.g. 23 -> '20-30'
    for i in range(10, int(train['Age'].max() + 1), 10):
        if (i - 10) <= age <= i:
            return '{0}-{1}'.format(i - 10, i)
group_data = train.loc[train['Survived'] == 0, ['Age', 'Pclass', 'Survived']].copy()
group_data['AgeRange'] = group_data['Age'].apply(age_range)
group_data = group_data.drop('Age', axis=1)
group_data = group_data.groupby(['AgeRange', 'Pclass']).count().reset_index()
pivoted = group_data.pivot(index='AgeRange', columns='Pclass', values='Survived').fillna(0)
sns.heatmap(pivoted);
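As an aside, pandas can do this kind of bucketing directly; a sketch of the same counts with pd.cut (not used further below):
#the same counts via pd.cut instead of a custom function
sunk = train[train['Survived'] == 0]
buckets = pd.cut(sunk['Age'], bins=range(0, int(sunk['Age'].max()) + 10, 10))
sunk.groupby([buckets, 'Pclass']).size().unstack(fill_value=0)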
2.2. Age¶
We already saw that age is an important feature. Now we can take a closer look at it.
f, ax = plt.subplots()
sunk = train.loc[train['Survived'] == 0, 'Age']
sns.distplot(sunk, label='Sunk', bins=80, kde=False, color='r')
survived = train.loc[train['Survived'] == 1, 'Age']
sns.distplot(survived, label='Survived', bins=80, kde=False, color='b')
ax.legend(ncol=2, loc='upper right', frameon=True)
ax.set(xlabel='Age');
I decided to split the data into demographic groups (children, the elderly, and adult men and women) and examine survival within them.
def demographic_category(p):
    #children and the elderly are separate groups; adults keep their sex
    age, sex = p
    if age < 18:
        return 'Child'
    elif age > 65:
        return 'Elderly'
    else:
        return sex
train['Demographic'] = train[['Age', 'Sex']].apply(demographic_category, axis=1)
test['Demographic'] = test[['Age', 'Sex']].apply(demographic_category, axis=1)
g5 = sns.countplot(x='Survived', hue='Demographic', data=train, order=[1,0], palette='Set2')
(g5.set_xticklabels(['Yes', 'No']));
g6 = sns.factorplot('Survived', col='Demographic',
data=train, kind='count', palette='Set2', order=[1,0])
(g6.set_xticklabels(['Yes', 'No'])
.set_titles('{col_name}'));
g7 = sns.factorplot(x='Demographic', y='Survived', col='Pclass',
data=train, kind='bar', palette='Set2', ci=None)
(g7.set_axis_labels('', 'Survival Rate')
.set_titles('{col_name} Class')
.set(ylim=(0, 1)));
sns.factorplot('Demographic', col='Embarked',
data=train, kind='count', palette='Set2', ci=None);
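The same picture in plain numbers, survival rate per demographic group within each class:
#survival rate by demographic group and class
train.groupby(['Demographic', 'Pclass'])['Survived'].mean().unstack()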
2.3. Parch, SibSp¶
Let's see how many relatives people had aboard.
g8 = sns.countplot(x='Survived', hue='Parch', data=train, order=[1,0], palette='Set2')
(g8.set_xticklabels(['Yes', 'No']));
g9 = sns.countplot(x='Survived', hue='SibSp', data=train, order=[1,0], palette='Set2')
(g9.set_xticklabels(['Yes', 'No']));
I split people into "With family" and "Alone" groups and made a new chart in section 3.1 (Data cleaning).
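A quick numeric preview of that split:
#survival rate with vs. without relatives aboard
train.groupby((train['Parch'] + train['SibSp']) > 0)['Survived'].mean()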
2.4. Embarked, Fare¶
g10 = sns.countplot(x='Survived', hue='Embarked', data=train, order=[1,0], palette='Set2')
(g10.set_xticklabels(['Yes', 'No']));
f, ax = plt.subplots()
sunk = train.loc[train['Survived'] == 0, 'Fare']
sns.distplot(sunk, label='Sunk', bins=50, kde=False, color='r')
survived = train.loc[train['Survived'] == 1, 'Fare']
sns.distplot(survived, label='Survived', bins=50, kde=False, color='b')
ax.legend(ncol=2, loc='upper right', frameon=True)
ax.set(xlabel='Fare');
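The histograms hint that survivors tended to pay higher fares; the medians make that easy to check:
#median fare for the sunk and the survived
train.groupby('Survived')['Fare'].median()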
3. Predictions¶
3.1. Data cleaning¶
I expanded Embarked and Demographic into new binary (dummy) variables. Then I dropped the redundant ones, because each is linearly dependent on the others (for example, Male is just 1 - Female).
embarked = pd.get_dummies(train['Embarked'])
#get_dummies sorts the categories, so the columns come out as C, Q, S
embarked.columns = ['C', 'Q', 'S']
demographic = pd.get_dummies(train['Demographic'])
demographic.columns = ['Child', 'Elderly', 'Female', 'Male']
train = train.join(embarked)
train = train.join(demographic)
embarked = pd.get_dummies(test['Embarked'])
embarked.columns = ['C', 'Q', 'S']
demographic = pd.get_dummies(test['Demographic'])
demographic.columns = ['Child', 'Elderly', 'Female', 'Male']
test = test.join(embarked)
test = test.join(demographic)
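As an aside, newer pandas versions can drop one redundant level per variable automatically via the drop_first parameter of get_dummies; note it drops the first category alphabetically, which is not necessarily the same column I drop by hand below:
#alternative sketch: get_dummies with one level dropped automatically
embarked_alt = pd.get_dummies(train['Embarked'], drop_first=True)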
def family(p):
    #1 if the passenger had any relatives aboard, 0 otherwise
    parch, sibsp = p
    if (parch + sibsp) > 0:
        return 1
    else:
        return 0
train['Family'] = train[['Parch', 'SibSp']].apply(family, axis=1)
test['Family'] = test[['Parch', 'SibSp']].apply(family, axis=1)
y_train = train.loc[:, ['Survived']]
X_train = train.drop(['Sex', 'Ticket', 'Cabin', 'Name', 'Demographic',
'Embarked', 'Parch', 'SibSp', 'Male', 'S', 'Survived'], axis=1)
X_test = test.drop(['Sex', 'Ticket', 'Cabin', 'Name', 'Demographic',
'Embarked', 'Parch', 'SibSp', 'Male', 'S'], axis=1)
g = sns.countplot(x='Survived', hue='Family', data=train, order=[1,0], palette='Set2')
(g.set_xticklabels(['Yes', 'No']));
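Before modelling, a last glance at what is left in the feature matrix:
#sanity check of the final features
print(X_train.columns.tolist())
X_train.head()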
3.2. Results¶
I applied four classification methods to the data and compared their training scores, then made the submission with the model I chose (k-nearest neighbours).
regression = LogisticRegression()
regression.fit(X_train, np.ravel(y_train))
lr_score = regression.score(X_train, y_train)
tree = DecisionTreeClassifier()
tree.fit(X_train, np.ravel(y_train))
tree_score = tree.score(X_train, y_train)
knc = KNeighborsClassifier(n_neighbors=25)
knc.fit(X_train, np.ravel(y_train))
knc_score = knc.score(X_train, y_train)
svc = svm.SVC()
svc.fit(X_train, np.ravel(y_train))
svc_score = svc.score(X_train, y_train)
#look at the scores
print("LogisticRegression score:\t\t\t%0.5f" % lr_score)
print("DecisionTreeClassifier score:\t\t\t%0.5f" % tree_score)
print("KNeighborsClassifier score:\t\t\t%0.5f" % knc_score)
print("SVC score:\t\t\t\t\t%0.5f\n" % svc_score)
#I choose KNC, so the submission uses its predictions
prediction = knc.predict(X_test)
#write the prediction to a file
submission = pd.DataFrame({
    'PassengerId': test.index,
    'Survived': prediction
})
submission.to_csv('titanic.csv', index=False)
4. Conclusion¶
Now let's see which variables the decision tree considered important:
importance = pd.DataFrame(X_train.columns)
importance.columns = ['Features']
importance["Importance"] = pd.Series(tree.feature_importances_)
importance
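Sorting makes the ranking easier to read:
importance.sort_values('Importance', ascending=False)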
Thank you for reading this analysis; I appreciate any feedback. You can also look at the notebook and leave comments here.