{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#
Topic 2. Supervised learning. Classification methods\n", "##
Practice. A decision tree for predicting the survival of Titanic passengers. Solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Kaggle competition \"Titanic: Machine Learning from Disaster\".**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.tree import DecisionTreeClassifier, export_graphviz\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix\n", "%matplotlib inline\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**A function that writes a Kaggle submission csv file:**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def write_to_submission_file(predicted_labels, out_file, train_num=891,\n", "                             target='Survived', index_label=\"PassengerId\"):\n", "    # turn predictions into a data frame and save it as a csv file\n", "    predicted_df = pd.DataFrame(predicted_labels,\n", "                                index=np.arange(train_num + 1,\n", "                                                train_num + 1 + predicted_labels.shape[0]),\n", "                                columns=[target])\n", "    predicted_df.to_csv(out_file, index_label=index_label)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Read the training and test sets.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "train_df = pd.read_csv(\"data/titanic_train.csv\")\n", "test_df = pd.read_csv(\"data/titanic_test.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "y = train_df['Survived']" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
count891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889
uniqueNaNNaNNaN8912NaNNaNNaN681NaN1473
topNaNNaNNaNMurdlin, Mr. JosephmaleNaNNaNNaNCA. 2343NaNG6S
freqNaNNaNNaN1577NaNNaNNaN7NaN4644
mean446.0000000.3838382.308642NaNNaN29.6991180.5230080.381594NaN32.204208NaNNaN
std257.3538420.4865920.836071NaNNaN14.5264971.1027430.806057NaN49.693429NaNNaN
min1.0000000.0000001.000000NaNNaN0.4200000.0000000.000000NaN0.000000NaNNaN
25%223.5000000.0000002.000000NaNNaN20.1250000.0000000.000000NaN7.910400NaNNaN
50%446.0000000.0000003.000000NaNNaN28.0000000.0000000.000000NaN14.454200NaNNaN
75%668.5000001.0000003.000000NaNNaN38.0000001.0000000.000000NaN31.000000NaNNaN
max891.0000001.0000003.000000NaNNaN80.0000008.0000006.000000NaN512.329200NaNNaN
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name Sex \\\n", "count 891.000000 891.000000 891.000000 891 891 \n", "unique NaN NaN NaN 891 2 \n", "top NaN NaN NaN Murdlin, Mr. Joseph male \n", "freq NaN NaN NaN 1 577 \n", "mean 446.000000 0.383838 2.308642 NaN NaN \n", "std 257.353842 0.486592 0.836071 NaN NaN \n", "min 1.000000 0.000000 1.000000 NaN NaN \n", "25% 223.500000 0.000000 2.000000 NaN NaN \n", "50% 446.000000 0.000000 3.000000 NaN NaN \n", "75% 668.500000 1.000000 3.000000 NaN NaN \n", "max 891.000000 1.000000 3.000000 NaN NaN \n", "\n", " Age SibSp Parch Ticket Fare Cabin \\\n", "count 714.000000 891.000000 891.000000 891 891.000000 204 \n", "unique NaN NaN NaN 681 NaN 147 \n", "top NaN NaN NaN CA. 2343 NaN G6 \n", "freq NaN NaN NaN 7 NaN 4 \n", "mean 29.699118 0.523008 0.381594 NaN 32.204208 NaN \n", "std 14.526497 1.102743 0.806057 NaN 49.693429 NaN \n", "min 0.420000 0.000000 0.000000 NaN 0.000000 NaN \n", "25% 20.125000 0.000000 0.000000 NaN 7.910400 NaN \n", "50% 28.000000 0.000000 0.000000 NaN 14.454200 NaN \n", "75% 38.000000 1.000000 0.000000 NaN 31.000000 NaN \n", "max 80.000000 8.000000 6.000000 NaN 512.329200 NaN \n", "\n", " Embarked \n", "count 889 \n", "unique 3 \n", "top S \n", "freq 644 \n", "mean NaN \n", "std NaN \n", "min NaN \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.describe(include='all')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
count418.000000418.000000418418332.000000418.000000418.000000418417.00000091418
uniqueNaNNaN4182NaNNaNNaN363NaN763
topNaNNaNDavison, Mr. Thomas HenrymaleNaNNaNNaNPC 17608NaNB57 B59 B63 B66S
freqNaNNaN1266NaNNaNNaN5NaN3270
mean1100.5000002.265550NaNNaN30.2725900.4473680.392344NaN35.627188NaNNaN
std120.8104580.841838NaNNaN14.1812090.8967600.981429NaN55.907576NaNNaN
min892.0000001.000000NaNNaN0.1700000.0000000.000000NaN0.000000NaNNaN
25%996.2500001.000000NaNNaN21.0000000.0000000.000000NaN7.895800NaNNaN
50%1100.5000003.000000NaNNaN27.0000000.0000000.000000NaN14.454200NaNNaN
75%1204.7500003.000000NaNNaN39.0000001.0000000.000000NaN31.500000NaNNaN
max1309.0000003.000000NaNNaN76.0000008.0000009.000000NaN512.329200NaNNaN
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex Age \\\n", "count 418.000000 418.000000 418 418 332.000000 \n", "unique NaN NaN 418 2 NaN \n", "top NaN NaN Davison, Mr. Thomas Henry male NaN \n", "freq NaN NaN 1 266 NaN \n", "mean 1100.500000 2.265550 NaN NaN 30.272590 \n", "std 120.810458 0.841838 NaN NaN 14.181209 \n", "min 892.000000 1.000000 NaN NaN 0.170000 \n", "25% 996.250000 1.000000 NaN NaN 21.000000 \n", "50% 1100.500000 3.000000 NaN NaN 27.000000 \n", "75% 1204.750000 3.000000 NaN NaN 39.000000 \n", "max 1309.000000 3.000000 NaN NaN 76.000000 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "count 418.000000 418.000000 418 417.000000 91 418 \n", "unique NaN NaN 363 NaN 76 3 \n", "top NaN NaN PC 17608 NaN B57 B59 B63 B66 S \n", "freq NaN NaN 5 NaN 3 270 \n", "mean 0.447368 0.392344 NaN 35.627188 NaN NaN \n", "std 0.896760 0.981429 NaN 55.907576 NaN NaN \n", "min 0.000000 0.000000 NaN 0.000000 NaN NaN \n", "25% 0.000000 0.000000 NaN 7.895800 NaN NaN \n", "50% 0.000000 0.000000 NaN 14.454200 NaN NaN \n", "75% 1.000000 0.000000 NaN 31.500000 NaN NaN \n", "max 8.000000 9.000000 NaN 512.329200 NaN NaN " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Заполним пропуски медианными значениями.**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "train_df['Age'].fillna(train_df['Age'].median(), inplace=True)\n", "test_df['Age'].fillna(train_df['Age'].median(), inplace=True)\n", "train_df['Embarked'].fillna('S', inplace=True)\n", "test_df['Fare'].fillna(train_df['Fare'].median(), inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Кодируем категориальные признаки `Pclass`, `Sex`, `SibSp`, `Parch` и `Embarked` с помощью техники One-Hot-Encoding.**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "train_df = pd.concat([train_df, pd.get_dummies(train_df['Pclass'], \n", " prefix=\"PClass\"),\n", " pd.get_dummies(train_df['Sex'], prefix=\"Sex\"),\n", " pd.get_dummies(train_df['SibSp'], prefix=\"SibSp\"),\n", " pd.get_dummies(train_df['Parch'], prefix=\"Parch\"),\n", " pd.get_dummies(train_df['Embarked'], prefix=\"Embarked\")],\n", " axis=1)\n", "test_df = pd.concat([test_df, pd.get_dummies(test_df['Pclass'], \n", " prefix=\"PClass\"),\n", " pd.get_dummies(test_df['Sex'], prefix=\"Sex\"),\n", " pd.get_dummies(test_df['SibSp'], prefix=\"SibSp\"),\n", " pd.get_dummies(test_df['Parch'], prefix=\"Parch\"),\n", " pd.get_dummies(test_df['Embarked'], prefix=\"Embarked\")],\n", " axis=1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "train_df.drop(['Survived', 'Pclass', 'Name', 'Sex', 'SibSp', \n", " 'Parch', 'Ticket', 'Cabin', 'Embarked', 'PassengerId'], \n", " axis=1, inplace=True)\n", "test_df.drop(['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'PassengerId'], \n", " axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**В тестовой выборке появляется новое значение Parch = 9, которого нет в обучающей выборке. 
We simply drop the corresponding `Parch_9` column.**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((891, 24), (418, 25))" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.shape, test_df.shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Parch_9'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(test_df.columns) - set(train_df.columns)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "test_df.drop(['Parch_9'], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeFarePClass_1PClass_2PClass_3Sex_femaleSex_maleSibSp_0SibSp_1SibSp_2...Parch_0Parch_1Parch_2Parch_3Parch_4Parch_5Parch_6Embarked_CEmbarked_QEmbarked_S
022.07.250000101010...1000000001
138.071.283310010010...1000000100
226.07.925000110100...1000000001
335.053.100010010010...1000000001
435.08.050000101100...1000000001
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " Age Fare PClass_1 PClass_2 PClass_3 Sex_female Sex_male SibSp_0 \\\n", "0 22.0 7.2500 0 0 1 0 1 0 \n", "1 38.0 71.2833 1 0 0 1 0 0 \n", "2 26.0 7.9250 0 0 1 1 0 1 \n", "3 35.0 53.1000 1 0 0 1 0 0 \n", "4 35.0 8.0500 0 0 1 0 1 1 \n", "\n", " SibSp_1 SibSp_2 ... Parch_0 Parch_1 Parch_2 Parch_3 Parch_4 \\\n", "0 1 0 ... 1 0 0 0 0 \n", "1 1 0 ... 1 0 0 0 0 \n", "2 0 0 ... 1 0 0 0 0 \n", "3 1 0 ... 1 0 0 0 0 \n", "4 0 0 ... 1 0 0 0 0 \n", "\n", " Parch_5 Parch_6 Embarked_C Embarked_Q Embarked_S \n", "0 0 0 0 0 1 \n", "1 0 0 1 0 0 \n", "2 0 0 0 0 1 \n", "3 0 0 0 0 1 \n", "4 0 0 0 0 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeFarePClass_1PClass_2PClass_3Sex_femaleSex_maleSibSp_0SibSp_1SibSp_2...Parch_0Parch_1Parch_2Parch_3Parch_4Parch_5Parch_6Embarked_CEmbarked_QEmbarked_S
034.57.829200101100...1000000010
147.07.000000110010...1000000001
262.09.687501001100...1000000010
327.08.662500101100...1000000001
422.012.287500110010...0100000001
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " Age Fare PClass_1 PClass_2 PClass_3 Sex_female Sex_male SibSp_0 \\\n", "0 34.5 7.8292 0 0 1 0 1 1 \n", "1 47.0 7.0000 0 0 1 1 0 0 \n", "2 62.0 9.6875 0 1 0 0 1 1 \n", "3 27.0 8.6625 0 0 1 0 1 1 \n", "4 22.0 12.2875 0 0 1 1 0 0 \n", "\n", " SibSp_1 SibSp_2 ... Parch_0 Parch_1 Parch_2 Parch_3 Parch_4 \\\n", "0 0 0 ... 1 0 0 0 0 \n", "1 1 0 ... 1 0 0 0 0 \n", "2 0 0 ... 1 0 0 0 0 \n", "3 0 0 ... 1 0 0 0 0 \n", "4 1 0 ... 0 1 0 0 0 \n", "\n", " Parch_5 Parch_6 Embarked_C Embarked_Q Embarked_S \n", "0 0 0 0 1 0 \n", "1 0 0 0 0 1 \n", "2 0 0 0 1 0 \n", "3 0 0 0 0 1 \n", "4 0 0 0 0 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Дерево решений без настройки параметров " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Обучите на имеющейся выборке дерево решений (`DecisionTreeClassifier`) максимальной глубины 2. Используйте параметр `random_state=17` для воспроизводимости результатов.**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "temp = DecisionTreeClassifier(random_state=17, max_depth = 2)\n", "tree = temp.fit(train_df, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Сделайте с помощью полученной модели прогноз для тестовой выборки **" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "predictions = tree.predict(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Сформируйте файл посылки и отправьте на Kaggle**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(predictions, 'Kaggle_1.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Вопрос 1. Каков результат первой посылки (дерево решений без настройки параметров) в публичном рейтинге соревнования Titanic?\n", "- 0.746\n", "- 0.756\n", "- 0.766\n", "- 0.776\n", "- 0.77033 <--" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Отобразите дерево с помощью `export_graphviz` и `dot`.**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "dot_data = export_graphviz(tree, out_file=\"tree.dot\", feature_names=train_df.columns)\n", "import pydot\n", "(graph,) = pydot.graph_from_dot_file('tree.dot')\n", "graph.write_png('tree_depth2.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Вопрос 2. Сколько признаков задействуются при прогнозе деревом решений глубины 2?\n", "- 2\n", "- 3 <--\n", "- 4\n", "- 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Дерево решений с настройкой параметров " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Обучите на имеющейся выборке дерево решений (`DecisionTreeClassifier`). Также укажите `random_state=17`. 
Tune the maximum depth and the minimum number of samples in a leaf on 5-fold cross-validation with `GridSearchCV`.**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=17),\n", " param_grid={'max_depth': [1, 2, 3, 4],\n", " 'min_samples_leaf': [1, 2, 3, 4]})" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# tree params for grid search: max depth and min number of samples in a leaf\n", "tree_params = {'max_depth': list(range(1, 5)), 'min_samples_leaf': list(range(1, 5))}\n", "best_tree = GridSearchCV(DecisionTreeClassifier(random_state=17), tree_params, cv=5)\n", "best_tree.fit(train_df, y)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "q3s: {'max_depth': 3, 'min_samples_leaf': 3}\n", "q4: 0.81\n" ] } ], "source": [ "print(\"q3s: \", best_tree.best_params_)\n", "print(\"q4: \", best_tree.best_score_.round(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question 3. What are the best tree parameters found on cross-validation with `GridSearchCV`?\n", "- max_depth=2, min_samples_leaf=1\n", "- max_depth=2, min_samples_leaf=4\n", "- max_depth=3, min_samples_leaf=2\n", "- max_depth=3, min_samples_leaf=3 <--" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question 4. What is the mean cross-validation accuracy of the decision tree with the best combination of the `max_depth` and `min_samples_leaf` hyperparameters?\n", "- 0.77\n", "- 0.79\n", "- 0.81 <--\n", "- 0.83" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make predictions for the test set with the fitted model.**" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "predictions = best_tree.predict(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Write the submission file and submit it to Kaggle.**" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(predictions, 'Kaggle_2.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question 5. What is the public leaderboard score of the second submission (a decision tree with tuned hyperparameters)?\n", "- 0.7499\n", "- 0.7599\n", "- 0.7699\n", "- 0.7799\n", "- 0.77751 <--" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. k nearest neighbors without parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a k nearest neighbors classifier (`KNeighborsClassifier`) with k=5 neighbors (`n_neighbors=5`) on the available data. Write the submission file and submit it to Kaggle.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question 6. What is the public leaderboard score of this submission (k nearest neighbors without parameter tuning)?"
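, "\n", "\n", "Before submitting, a quick local sanity check is 5-fold cross-validation on the training set. The snippet below is only a sketch (it was not executed in this notebook); it reuses `train_df` and `y` defined above, and `knn` is a throwaway name used only here:\n", "\n", "```python\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# rough local estimate of accuracy before submitting to Kaggle\n", "knn = KNeighborsClassifier(n_neighbors=5)\n", "print(cross_val_score(knn, train_df, y, cv=5, scoring='accuracy').mean())\n", "```\n", "\n", "Since kNN relies on distances between objects, the unscaled `Age` and `Fare` columns dominate the metric here; the next section addresses this with `StandardScaler` inside a pipeline."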
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "new_tree = KNeighborsClassifier(n_neighbors=5)\n", "new_tree.fit(train_df, y)\n", "predictions = new_tree.predict(test_df)\n", "write_to_submission_file(predictions, 'Kaggle_3.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0.64114 <--" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Метод ближайших соседей с настройкой параметров" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Обучите на имеющейся выборке метод ближайших соседей (`KNeighborsClassifier`).   Количество соседей настройте на 5-кратной кросс-валидации с помощью `GridSearchCV` в диапозоне от 1 до 20 (range(1, 20)).**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Вопрос 7. Каково лучшее значение количества соседей (`n_neighbors`) на кросс-валидации с помощью `GridSearchCV`?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 19 candidates, totalling 95 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 2.3s\n", "[Parallel(n_jobs=-1)]: Done 95 out of 95 | elapsed: 2.6s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('scaler', StandardScaler()),\n", " ('knn',\n", " KNeighborsClassifier(n_jobs=-1))]),\n", " n_jobs=-1, param_grid={'knn__n_neighbors': range(1, 20)},\n", " verbose=True)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree_params = {'knn__n_neighbors': range(1, 20)}\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "pip = Pipeline([('scaler', StandardScaler()),\n", " ('knn', KNeighborsClassifier(n_jobs=-1))])\n", "new_tree = GridSearchCV(pip, tree_params, cv=5, n_jobs=-1, verbose=True)\n", "new_tree.fit(train_df, y)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Лучшее значение 12\n" ] } ], "source": [ "print(\"Лучшее значение\", new_tree.best_params_['knn__n_neighbors'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Вопрос 8. Какой получилась средняя доля верных ответов на кросс-валидации с лучшим сочетанием гиперпараметра `n_neighbors`?" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8148148148148148" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y, new_tree.predict(train_df))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Сделайте с помощью полученной модели прогноз для тестовой выборки. Сформируйте файл посылки и отправьте на Kaggle**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Вопрос 9. Каков результат посылки (метод ближайших соседей с настройкой параметров) в публичном рейтинге соревнования Titanic?" 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "predictions = new_tree.predict(test_df)\n", "write_to_submission_file(predictions, 'Kaggle_4.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0.73684 <--" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Ссылки:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " - Соревнование Kaggle \"Titanic: Machine Learning from Disaster\"\n", " - Тьюториал Dataquest по задаче Kaggle \"Titanic: Machine Learning from Disaster\"\n", " - https://habr.com/ru/company/ods/blog/322534/" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "name": "lesson3_homework_trees_titanic_solution.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }