{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> Тема 4. Композиции алгоритмов, случайный лес\n",
    "## <center>Практика. Деревья решений и случайный лес в соревновании Kaggle Inclass по кредитному скорингу"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ориентируйтесь на рейтинг [соревнования](https://inclass.kaggle.com/c/beeline-credit-scoring-competition-2), [ссылка](https://www.kaggle.com/t/115237dd8c5e4092a219a0c12bf66fc6) для участия.\n",
    "\n",
    "Решается задача кредитного скоринга. \n",
    "\n",
    "Признаки клиентов банка:\n",
    "- Age - возраст (вещественный)\n",
    "- Income - месячный доход (вещественный)\n",
    "- BalanceToCreditLimit - отношение баланса на кредитной карте к лимиту по кредиту (вещественный)\n",
    "- DIR - Debt-to-income Ratio (вещественный)\n",
    "- NumLoans - число заемов и кредитных линий\n",
    "- NumRealEstateLoans - число ипотек и заемов, связанных с недвижимостью (натуральное число)\n",
    "- NumDependents - число членов семьи, которых содержит клиент, исключая самого клиента (натуральное число)\n",
    "- Num30-59Delinquencies - число просрочек выплат по кредиту от 30 до 59 дней (натуральное число)\n",
    "- Num60-89Delinquencies - число просрочек выплат по кредиту от 60 до 89 дней (натуральное число)\n",
    "- Delinquent90 - были ли просрочки выплат по кредиту более 90 дней (бинарный) - имеется только в обучающей выборке"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import roc_auc_score\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Загружаем данные.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.read_csv('data/credit_scoring_train.csv', index_col='client_id')\n",
    "test_df = pd.read_csv('data/credit_scoring_test.csv', index_col='client_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = train_df['Delinquent90']\n",
    "train_df.drop('Delinquent90', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DIR</th>\n",
       "      <th>Age</th>\n",
       "      <th>NumLoans</th>\n",
       "      <th>NumRealEstateLoans</th>\n",
       "      <th>NumDependents</th>\n",
       "      <th>Num30-59Delinquencies</th>\n",
       "      <th>Num60-89Delinquencies</th>\n",
       "      <th>Income</th>\n",
       "      <th>BalanceToCreditLimit</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>client_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.496289</td>\n",
       "      <td>49.1</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>5298.360639</td>\n",
       "      <td>0.387028</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.433567</td>\n",
       "      <td>48.0</td>\n",
       "      <td>9</td>\n",
       "      <td>2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>6008.056256</td>\n",
       "      <td>0.234679</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2206.731199</td>\n",
       "      <td>55.5</td>\n",
       "      <td>21</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.348227</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>886.132793</td>\n",
       "      <td>55.3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.971930</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>52.3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2504.613105</td>\n",
       "      <td>1.004350</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   DIR   Age  NumLoans  NumRealEstateLoans  NumDependents  \\\n",
       "client_id                                                                   \n",
       "0             0.496289  49.1        13                   0            0.0   \n",
       "1             0.433567  48.0         9                   2            2.0   \n",
       "2          2206.731199  55.5        21                   1            NaN   \n",
       "3           886.132793  55.3         3                   0            0.0   \n",
       "4             0.000000  52.3         1                   0            0.0   \n",
       "\n",
       "           Num30-59Delinquencies  Num60-89Delinquencies       Income  \\\n",
       "client_id                                                              \n",
       "0                              2                      0  5298.360639   \n",
       "1                              1                      0  6008.056256   \n",
       "2                              1                      0          NaN   \n",
       "3                              0                      0          NaN   \n",
       "4                              0                      0  2504.613105   \n",
       "\n",
       "           BalanceToCreditLimit  \n",
       "client_id                        \n",
       "0                      0.387028  \n",
       "1                      0.234679  \n",
       "2                      0.348227  \n",
       "3                      0.971930  \n",
       "4                      1.004350  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Посмотрим на число пропусков в каждом признаке.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 75000 entries, 0 to 74999\n",
      "Data columns (total 9 columns):\n",
      " #   Column                 Non-Null Count  Dtype  \n",
      "---  ------                 --------------  -----  \n",
      " 0   DIR                    75000 non-null  float64\n",
      " 1   Age                    75000 non-null  float64\n",
      " 2   NumLoans               75000 non-null  int64  \n",
      " 3   NumRealEstateLoans     75000 non-null  int64  \n",
      " 4   NumDependents          73084 non-null  float64\n",
      " 5   Num30-59Delinquencies  75000 non-null  int64  \n",
      " 6   Num60-89Delinquencies  75000 non-null  int64  \n",
      " 7   Income                 60153 non-null  float64\n",
      " 8   BalanceToCreditLimit   75000 non-null  float64\n",
      "dtypes: float64(5), int64(4)\n",
      "memory usage: 5.7 MB\n"
     ]
    }
   ],
   "source": [
    "train_df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 75000 entries, 75000 to 149999\n",
      "Data columns (total 9 columns):\n",
      " #   Column                 Non-Null Count  Dtype  \n",
      "---  ------                 --------------  -----  \n",
      " 0   DIR                    75000 non-null  float64\n",
      " 1   Age                    75000 non-null  float64\n",
      " 2   NumLoans               75000 non-null  int64  \n",
      " 3   NumRealEstateLoans     75000 non-null  int64  \n",
      " 4   NumDependents          72992 non-null  float64\n",
      " 5   Num30-59Delinquencies  75000 non-null  int64  \n",
      " 6   Num60-89Delinquencies  75000 non-null  int64  \n",
      " 7   Income                 60116 non-null  float64\n",
      " 8   BalanceToCreditLimit   75000 non-null  float64\n",
      "dtypes: float64(5), int64(4)\n",
      "memory usage: 5.7 MB\n"
     ]
    }
   ],
   "source": [
    "test_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Заменим пропуски медианными значениями.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df['NumDependents'].fillna(train_df['NumDependents'].median(), inplace=True)\n",
    "train_df['Income'].fillna(train_df['Income'].median(), inplace=True)\n",
    "test_df['NumDependents'].fillna(test_df['NumDependents'].median(), inplace=True)\n",
    "test_df['Income'].fillna(test_df['Income'].median(), inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Дерево решений без настройки параметров"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Обучите дерево решений максимальной глубины 3, используйте параметр random_state=17 для воспроизводимости результатов.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(max_depth=3, random_state=17)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_tree = DecisionTreeClassifier(max_depth=3, random_state=17) # Классификатор дерева решений\n",
    "first_tree.fit(train_df, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Сделайте прогноз для тестовой выборки.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_tree_pred = first_tree.predict(test_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Запишем прогноз в файл.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def write_to_submission_file(predicted_labels, out_file, target='Delinquent90', index_label=\"client_id\"):\n",
    "    predicted_df = pd.DataFrame(predicted_labels, index = np.arange(75000, predicted_labels.shape[0] + 75000), columns=[target])\n",
    "    predicted_df.to_csv(out_file, index_label=index_label)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_to_submission_file(first_tree_pred, 'credit_scoring_first_tree.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score 0.53810"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Если предсказывать вероятности дефолта для клиентов тестовой выборки, результат будет намного лучше.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_tree_pred_probs = first_tree.predict_proba(test_df)[:, 1] # Вероятностная оценка"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_to_submission_file(first_tree_pred_probs, '2nd_tree.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score 0.80468"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Дерево решений с настройкой параметров с помощью GridSearch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Настройте параметры дерева с помощью `GridSearhCV`, посмотрите на лучшую комбинацию параметров и среднее качество на 5-кратной кросс-валидации. Используйте параметр `random_state=17` (для воспроизводимости результатов), не забывайте про распараллеливание (`n_jobs=-1`).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=17), n_jobs=-1,\n",
       "             param_grid={'max_depth': [3, 4, 5, 6, 7],\n",
       "                         'min_samples_leaf': [5, 6, 7, 8, 9, 10, 11, 12]})"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tree_params = {'max_depth': list(range(3, 8)), 'min_samples_leaf': list(range(5, 13))}\n",
    "# Поиск параметров по заданным значениям для оценщика\n",
    "locally_best_tree = GridSearchCV(DecisionTreeClassifier(random_state=17), tree_params, cv=5, n_jobs=-1)\n",
    "locally_best_tree.fit(train_df, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'max_depth': 5, 'min_samples_leaf': 11}, 0.935)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "locally_best_tree.best_params_, round(locally_best_tree.best_score_, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Сделайте прогноз для тестовой выборки и пошлите решение на Kaggle.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuned_tree_pred_probs = locally_best_tree.predict(test_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_to_submission_file(tuned_tree_pred_probs, '3rd_tree.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score 0.54898"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Случайный лес без настройки параметров"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Обучите случайный лес из деревьев неограниченной глубины, используйте параметр `random_state=17` для воспроизводимости результатов.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(random_state=17)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_forest = RandomForestClassifier(random_state=17) # Классификатор случайных лесов\n",
    "first_forest.fit(train_df, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_forest_pred = first_forest.predict(test_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Сделайте прогноз для тестовой выборки и пошлите решение на Kaggle.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_to_submission_file(first_forest_pred, '4th_tree.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score 0.56146"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Случайный лес c настройкой параметров"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Настройте параметр `max_features` леса с помощью `GridSearhCV`, посмотрите на лучшую комбинацию параметров и среднее качество на 5-кратной кросс-валидации. Используйте параметр random_state=17 (для воспроизводимости результатов), не забывайте про распараллеливание (n_jobs=-1).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 2min 4s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=17), n_jobs=-1,\n",
       "             param_grid={'max_depth': [3, 4, 5, 6, 7],\n",
       "                         'min_samples_leaf': [5, 6, 7, 8, 9, 10, 11, 12]})"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "forest_params = {'max_features': np.linspace(.3, 1, 7)}\n",
    "# Поиск параметров по заданным значениям для оценщика\n",
    "locally_best_forest = GridSearchCV(RandomForestClassifier(random_state=17), tree_params, cv=5, n_jobs=-1)\n",
    "locally_best_forest.fit(train_df, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'max_depth': 7, 'min_samples_leaf': 8}, 0.935)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "locally_best_forest.best_params_, round(locally_best_forest.best_score_, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuned_forest_pred = locally_best_forest.predict(test_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_to_submission_file(tuned_forest_pred, '5th_tree.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score 0.54041"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Обычно увеличение количества деревьев только улучшает результат. Так что напоследок обучите случайный лес из 300 деревьев с найденными лучшими параметрами. Это может занять несколько минут.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 49.9 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "final_forest = RandomForestClassifier(n_estimators=300, random_state=17)\n",
    "final_forest.fit(train_df, y)\n",
    "final_forest_pred = final_forest.predict_proba(test_df)[:, 1]\n",
    "write_to_submission_file(final_forest_pred, 'credit_scoring_final_forest.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Score 0.83045"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Сделайте посылку на Kaggle.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Ссылки:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://habr.com/ru/company/ods/blog/324402/"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}