{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic 4. Ensembles of algorithms, random forest\n", "## Practice. Decision trees and random forest in a Kaggle Inclass credit scoring competition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the [competition](https://inclass.kaggle.com/c/beeline-credit-scoring-competition-2) leaderboard as a reference; follow this [link](https://www.kaggle.com/t/115237dd8c5e4092a219a0c12bf66fc6) to take part.\n", "\n", "The task is credit scoring.\n", "\n", "Features of the bank's clients:\n", "- Age - age (real-valued)\n", "- Income - monthly income (real-valued)\n", "- BalanceToCreditLimit - ratio of the credit card balance to the credit limit (real-valued)\n", "- DIR - Debt-to-income Ratio (real-valued)\n", "- NumLoans - number of loans and credit lines\n", "- NumRealEstateLoans - number of mortgages and real-estate-related loans (non-negative integer)\n", "- NumDependents - number of family members supported by the client, excluding the client (non-negative integer)\n", "- Num30-59Delinquencies - number of loan payments 30 to 59 days past due (non-negative integer)\n", "- Num60-89Delinquencies - number of loan payments 60 to 89 days past due (non-negative integer)\n", "- Delinquent90 - whether any loan payment was more than 90 days past due (binary); present only in the training set" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.metrics import roc_auc_score\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Load the data.**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train_df = pd.read_csv('data/credit_scoring_train.csv', index_col='client_id')\n", "test_df = pd.read_csv('data/credit_scoring_test.csv', index_col='client_id')" ] }, { "cell_type": "code", 
"execution_count": 3, "metadata": {}, "outputs": [], "source": [ "y = train_df['Delinquent90']\n", "train_df.drop('Delinquent90', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th>client_id</th><th>DIR</th><th>Age</th><th>NumLoans</th><th>NumRealEstateLoans</th><th>NumDependents</th><th>Num30-59Delinquencies</th><th>Num60-89Delinquencies</th><th>Income</th><th>BalanceToCreditLimit</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0.496289</td><td>49.1</td><td>13</td><td>0</td><td>0.0</td><td>2</td><td>0</td><td>5298.360639</td><td>0.387028</td></tr>\n",
"    <tr><th>1</th><td>0.433567</td><td>48.0</td><td>9</td><td>2</td><td>2.0</td><td>1</td><td>0</td><td>6008.056256</td><td>0.234679</td></tr>\n",
"    <tr><th>2</th><td>2206.731199</td><td>55.5</td><td>21</td><td>1</td><td>NaN</td><td>1</td><td>0</td><td>NaN</td><td>0.348227</td></tr>\n",
"    <tr><th>3</th><td>886.132793</td><td>55.3</td><td>3</td><td>0</td><td>0.0</td><td>0</td><td>0</td><td>NaN</td><td>0.971930</td></tr>\n",
"    <tr><th>4</th><td>0.000000</td><td>52.3</td><td>1</td><td>0</td><td>0.0</td><td>0</td><td>0</td><td>2504.613105</td><td>1.004350</td></tr>\n",
"  </tbody>\n", "</table>
\n", "
" ], "text/plain": [ " DIR Age NumLoans NumRealEstateLoans NumDependents \\\n", "client_id \n", "0 0.496289 49.1 13 0 0.0 \n", "1 0.433567 48.0 9 2 2.0 \n", "2 2206.731199 55.5 21 1 NaN \n", "3 886.132793 55.3 3 0 0.0 \n", "4 0.000000 52.3 1 0 0.0 \n", "\n", " Num30-59Delinquencies Num60-89Delinquencies Income \\\n", "client_id \n", "0 2 0 5298.360639 \n", "1 1 0 6008.056256 \n", "2 1 0 NaN \n", "3 0 0 NaN \n", "4 0 0 2504.613105 \n", "\n", " BalanceToCreditLimit \n", "client_id \n", "0 0.387028 \n", "1 0.234679 \n", "2 0.348227 \n", "3 0.971930 \n", "4 1.004350 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's look at the number of missing values in each feature.**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 75000 entries, 0 to 74999\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 DIR 75000 non-null float64\n", " 1 Age 75000 non-null float64\n", " 2 NumLoans 75000 non-null int64 \n", " 3 NumRealEstateLoans 75000 non-null int64 \n", " 4 NumDependents 73084 non-null float64\n", " 5 Num30-59Delinquencies 75000 non-null int64 \n", " 6 Num60-89Delinquencies 75000 non-null int64 \n", " 7 Income 60153 non-null float64\n", " 8 BalanceToCreditLimit 75000 non-null float64\n", "dtypes: float64(5), int64(4)\n", "memory usage: 5.7 MB\n" ] } ], "source": [ "train_df.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 75000 entries, 75000 to 149999\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 DIR 75000 non-null float64\n", " 1 Age 75000 non-null float64\n", " 2 NumLoans 75000 non-null int64 \n", " 3 
NumRealEstateLoans 75000 non-null int64 \n", " 4 NumDependents 72992 non-null float64\n", " 5 Num30-59Delinquencies 75000 non-null int64 \n", " 6 Num60-89Delinquencies 75000 non-null int64 \n", " 7 Income 60116 non-null float64\n", " 8 BalanceToCreditLimit 75000 non-null float64\n", "dtypes: float64(5), int64(4)\n", "memory usage: 5.7 MB\n" ] } ], "source": [ "test_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Replace the missing values with the median of each feature.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train_df['NumDependents'].fillna(train_df['NumDependents'].median(), inplace=True)\n", "train_df['Income'].fillna(train_df['Income'].median(), inplace=True)\n", "test_df['NumDependents'].fillna(test_df['NumDependents'].median(), inplace=True)\n", "test_df['Income'].fillna(test_df['Income'].median(), inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision tree without parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a decision tree with a maximum depth of 3; use `random_state=17` for reproducibility.**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(max_depth=3, random_state=17)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_tree = DecisionTreeClassifier(max_depth=3, random_state=17) # decision tree classifier\n", "first_tree.fit(train_df, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make a prediction for the test set.**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "first_tree_pred = first_tree.predict(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Write the prediction to a file.**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": 
[ "def write_to_submission_file(predicted_labels, out_file, target='Delinquent90', index_label=\"client_id\"):\n", "    # Test-set client ids start at 75000 (see test_df.info() above)\n", "    predicted_df = pd.DataFrame(predicted_labels, index=np.arange(75000, predicted_labels.shape[0] + 75000), columns=[target])\n", "    predicted_df.to_csv(out_file, index_label=index_label)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(first_tree_pred, 'credit_scoring_first_tree.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score 0.53810" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**If we predict default probabilities for the test-set clients instead of class labels, the result is much better.**" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "first_tree_pred_probs = first_tree.predict_proba(test_df)[:, 1] # probability of the positive class" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(first_tree_pred_probs, '2nd_tree.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score 0.80468" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision tree with parameter tuning via GridSearch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tune the tree parameters with `GridSearchCV`, then look at the best parameter combination and the mean quality on 5-fold cross-validation. 
Use `random_state=17` (for reproducibility), and don't forget about parallelization (`n_jobs=-1`).**" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=17), n_jobs=-1,\n", " param_grid={'max_depth': [3, 4, 5, 6, 7],\n", " 'min_samples_leaf': [5, 6, 7, 8, 9, 10, 11, 12]})" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree_params = {'max_depth': list(range(3, 8)), 'min_samples_leaf': list(range(5, 13))}\n", "# Grid search over the given parameter values (default scoring for a classifier is accuracy)\n", "locally_best_tree = GridSearchCV(DecisionTreeClassifier(random_state=17), tree_params, cv=5, n_jobs=-1)\n", "locally_best_tree.fit(train_df, y)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'max_depth': 5, 'min_samples_leaf': 11}, 0.935)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "locally_best_tree.best_params_, round(locally_best_tree.best_score_, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make a prediction for the test set and submit the solution to Kaggle.**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "tuned_tree_pred = locally_best_tree.predict(test_df) # class labels, not probabilities" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(tuned_tree_pred, '3rd_tree.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score 0.54898" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random forest without parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a random forest of trees with unlimited depth; use `random_state=17` for reproducibility.**" ] }, { "cell_type": "code", 
"execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(random_state=17)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_forest = RandomForestClassifier(random_state=17) # random forest classifier\n", "first_forest.fit(train_df, y)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "first_forest_pred = first_forest.predict(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make a prediction for the test set and submit the solution to Kaggle.**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(first_forest_pred, '4th_tree.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score 0.56146" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random forest with parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tune the forest's `max_features` parameter with `GridSearchCV`, then look at the best parameter combination and the mean quality on 5-fold cross-validation. 
Use `random_state=17` (for reproducibility), and don't forget about parallelization (`n_jobs=-1`).**" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 2min 4s\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=17), n_jobs=-1,\n", " param_grid={'max_depth': [3, 4, 5, 6, 7],\n", " 'min_samples_leaf': [5, 6, 7, 8, 9, 10, 11, 12]})" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "forest_params = {'max_features': np.linspace(.3, 1, 7)}\n", "# Grid search over the given parameter values.\n", "# Note: this run searches tree_params (max_depth, min_samples_leaf); forest_params is defined but not passed.\n", "locally_best_forest = GridSearchCV(RandomForestClassifier(random_state=17), tree_params, cv=5, n_jobs=-1)\n", "locally_best_forest.fit(train_df, y)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'max_depth': 7, 'min_samples_leaf': 8}, 0.935)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "locally_best_forest.best_params_, round(locally_best_forest.best_score_, 3)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "tuned_forest_pred = locally_best_forest.predict(test_df)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "write_to_submission_file(tuned_forest_pred, '5th_tree.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score 0.54041" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Increasing the number of trees usually only improves the result. So, as a final step, train a random forest of 300 trees with the best parameters found. 
This may take a few minutes.**" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 49.9 s\n" ] } ], "source": [ "%%time\n", "final_forest = RandomForestClassifier(n_estimators=300, random_state=17)\n", "final_forest.fit(train_df, y)\n", "final_forest_pred = final_forest.predict_proba(test_df)[:, 1]\n", "write_to_submission_file(final_forest_pred, 'credit_scoring_final_forest.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score 0.83045" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make a submission to Kaggle.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "References:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://habr.com/ru/company/ods/blog/324402/" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }