PythonでCatBoostの解説

関連記事：　決定木分析、ランダムフォレスト、Xgboost

Kaggleなどのデータ分析競技といえば、XGBoost, Light GBM, CatBoost の決定木アルゴリズムをよく使われています。分類分析系と予測分析系の競技のKaggleの上位にランクされています。今回の記事はCatBoostの新しい決定木アルゴリズムを解説します。

1. CatBoostとは
2. 実験・コード
＿＿2.1 データロード
＿＿2.2 10,000件くらいサンプルデータを作成
＿＿2.3. XGBoost グリッドサーチで 81モデルから最適なパラメータを探索
＿＿2.4 XGBoost　最適なパラメータのモデルを作成
＿＿2.5. Light GBM グリッドサーチで 81モデルから最適なパラメータを探索
＿＿2.6 Light GBM最適なパラメータのモデルを作成（Categorial Feature除く）
＿＿2.7 Light GBM最適なパラメータのモデルを作成（Categorial Feature含む）
＿＿2.8. CatBoost グリッドサーチで 81モデルから最適なパラメータを探索
＿＿2.9 CatBoost　最適なパラメータのモデルを作成（Categorial Feature除く）
＿＿2.10 CatBoost　最適なパラメータのモデルを作成（Categorial Feature含む
3. モデル評価評価：学習時間　AUC

1. CatBoostとは

CatBoostはCategory　Boostingの略で、決定木ベースの勾配ブースティングに基づく機械学習ライブラリ。2017にYandex社からCatBoostが発表されました。

特徴：
１）回帰予測、分類の教師あり学習に対応
２）過学習を減らして、高い精度、学習速度を誇る
３）GPU、マルチGPUに対応

決定木ベースのアルゴリズムの歴史

CatBoostは、オーバーフィットを減らし、データセット全体をトレーニングに使用できるようにする、より効率的な戦略を使用します。

Pおよびパラメーターa> 0（事前分布の重み）。

最初のイテレーションで、アルゴリズムは最初のツリーを学習してトレーニングエラーを減らします。通常、このモデルには重大なエラーがあります。データをオーバーフィットするため、ブースティングで非常に大きなツリーを構築することはお勧めできません

2番目のイテレーション。アルゴリズムはもう1つのツリーを学習して、最初のツリーで発生したエラーを減らします。アルゴリズムは、適切な品質モードを構築するまでこの手順を繰り返します。

CatBoostの論文: http://learningsys.org/nips17/assets/papers/paper_11.pdf

CatBoostのライブラリ – https://catboost.ai/

2. 実験・コード

環境：google colab Python3 GPU
ライブラリ：XGboost、Light GBM、CatBoost

2.1 データロード

データセット：2015 Flight Delays (5,819,079行、565MB)　The U.S. Department of Transportation’s (DOT) US米国の交通統計局により、フライトの遅延とキャンセルのデータ

（https://www.kaggle.com/usdot/flight-delays）

%%time
import pandas as pd
data_path = "/content/drive/My Drive/dataset/flights/flights.csv"
data = pd.read_csv(data_path)

CPU times: user 15.3 s, sys: 362 ms, total: 15.7 s
Wall time: 16.2 s

data.head(3)

2.2　サンプルデータ作成

data = data.sample(frac = 0.002, random_state=10)
print(data.count())

YEAR 11638
MONTH 11638
DAY 11638
DAY_OF_WEEK 11638
AIRLINE 11638
FLIGHT_NUMBER 11638
TAIL_NUMBER 11594
ORIGIN_AIRPORT 11638
DESTINATION_AIRPORT 11638
SCHEDULED_DEPARTURE 11638
DEPARTURE_TIME 11427
DEPARTURE_DELAY 11427
TAXI_OUT 11424
WHEELS_OFF 11424
SCHEDULED_TIME 11638
ELAPSED_TIME 11394
AIR_TIME 11394
DISTANCE 11638
WHEELS_ON 11421
TAXI_IN 11421
SCHEDULED_ARRIVAL 11638
ARRIVAL_TIME 11421
ARRIVAL_DELAY 11394
DIVERTED 11638
CANCELLED 11638
CANCELLATION_REASON 214
AIR_SYSTEM_DELAY 2148
SECURITY_DELAY 2148
AIRLINE_DELAY 2148
LATE_AIRCRAFT_DELAY 2148
WEATHER_DELAY 2148
dtype: int64

データ加工

%%time
import numpy as np
import time
from sklearn.model_selection import train_test_split

data = data[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
"ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1

cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
data[item] = data[item].astype("category").cat.codes +1

train, test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"],
random_state=10, test_size=0.25)

CPU times: user 70.6 ms, sys: 823 µs, total: 71.4 ms
Wall time: 71.6 ms

XGBoost

2.3. XGBoost グリッドサーチで 81モデルから最適なパラメータを探索

import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from datetime import datetime

start_time = datetime.now()

# パラメータ調整
model = xgb.XGBClassifier()
param_dist = {"max_depth": [10,30,50],
"min_child_weight" : [1,3,6],
"n_estimators": [200],
"learning_rate": [0.05, 0.1,0.16],}

# グリッドサーチの設定
xgb_grid_search = GridSearchCV(model, param_grid=param_dist, cv = 3, verbose=3, n_jobs=-1, scoring="roc_auc")
xgb_grid_search.fit(train, y_train)

print(xgb_grid_search.best_params_)
print(xgb_grid_search.best_index_)
print(xgb_grid_search.best_score_)

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 1.4min
[Parallel(n_jobs=-1)]: Done 81 out of 81 | elapsed: 3.7min finished
{'learning_rate': 0.05, 'max_depth': 10, 'min_child_weight': 6, 'n_estimators': 200}
2
0.6766358303944008
Duration: 0:03:46.873801

2.4 XGBoost　最適なパラメータのモデルを作成

start_time = datetime.now()

def auc(m, train, test): 
return (metrics.roc_auc_score(y_train,m.predict_proba(train)[:,1]),
metrics.roc_auc_score(y_test,m.predict_proba(test)[:,1]))

xgb_model = xgb.XGBClassifier(max_depth=10, min_child_weight=6, n_estimators=200, n_jobs=-1 , verbose=1,learning_rate=0.05)
# モデルを学習
xgb_model.fit(train,y_train)

# AUCのモデル評価
print("AUC =", auc(xgb_model, train, test))

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

AUC = (0.977650094493346, 0.6839265704161672)
Duration: 0:00:03.215976

Light GBM

2.5. Light GBM グリッドサーチで 81モデルから最適なパラメータを探索

import lightgbm as lgb
from sklearn import metrics
from datetime import datetime

start_time = datetime.now()

# パラメータ調整
lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [25,50, 75],
"learning_rate" : [0.01,0.05,0.1],
"num_leaves": [300,900,1200],
"n_estimators": [200]
}

# グリッドサーチの設定
lg_grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv = 3, scoring="roc_auc", verbose=5)
lg_grid_search.fit(train,y_train)

print(lg_grid_search.best_params_)
print(lg_grid_search.best_index_)
print(lg_grid_search.best_score_)

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 20.1s
/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done 68 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 81 out of 81 | elapsed: 1.9min finished
{'learning_rate': 0.05, 'max_depth': 75, 'n_estimators': 200, 'num_leaves': 300}
15
0.6704275313151552
Duration: 0:01:58.034424

2.6 Light GBM最適なパラメータのモデルを作成（Categorial Feature除く）

# Categorical Features除く

start_time = datetime.now()

def auc2(m, train, test): 
return (metrics.roc_auc_score(y_train,m.predict(train)),
metrics.roc_auc_score(y_test,m.predict(test)))

# データセットの設定
d_train = lgb.Dataset(train, label=y_train, free_raw_data=False)
params = {"max_depth": 75, "learning_rate" : 0.05, "num_leaves": 300, "n_estimators": 200}

# Categorical Features除く
lgb_model = lgb.train(params, d_train)
print("AUC =", auc2(lgb_model, train, test))

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

AUC = (1.0, 0.6596555518539831)
Duration: 0:00:02.807290

2.7 Light GBM最適なパラメータのモデルを作成（Categorial Feature含む）

# Catgeorical Features含む
start_time = datetime.now()

cate_features_name = ["MONTH","DAY","DAY_OF_WEEK","AIRLINE","DESTINATION_AIRPORT",
"ORIGIN_AIRPORT"]
lgb_model = lgb.train(params, d_train, categorical_feature = cate_features_name)

print("AUC =", auc2(lgb_model, train, test))

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

AUC = (1.0, 0.6521800579464349)
Duration: 0:00:02.723219

CatBoost

2.8. CatBoost グリッドサーチで 81モデルから最適なパラメータを探索

!pip install catboost

Collecting catboost
Downloading https://files.pythonhosted.org/packages/39/51/bfab1d94e2bed6302e3e58738b1135994888b09f29c7cee8686d431b9281/catboost-0.17.3-cp36-none-manylinux1_x86_64.whl (62.5MB)
|████████████████████████████████| 62.5MB 74.6MB/s
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost) (4.1.1)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from catboost) (3.0.3)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (0.24.2)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.16.5)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost) (1.12.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost) (1.3.1)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.4.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.5.3)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from kiwisolver>=1.0.1->matplotlib->catboost) (41.2.0)
Installing collected packages: catboost
Successfully installed catboost-0.17.3

2.8. CatBoost グリッドサーチで 81モデルから最適なパラメータを探索

import catboost as cb
from datetime import datetime

start_time = datetime.now()

# パラメータ調整
params = {'depth': [3, 5, 7],
'learning_rate' : [0.03, 0.1, 0.15],
'l2_leaf_reg': [1,3,7],
'iterations': [300]}

ctb = cb.CatBoostClassifier(eval_metric="AUC", logging_level='Silent')

ctb_grid_search = GridSearchCV(ctb, params, scoring="roc_auc", cv = 3, verbose=2)
ctb_grid_search.fit(train, y_train)

print(ctb_grid_search.best_params_)
print(ctb_grid_search.best_index_)
print(ctb_grid_search.best_score_)

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.03 ......
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.03, total= 1.4s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.03 ......
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.4s remaining: 0.0s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.03, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.03 ......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.03, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.1 .......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.1, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.1 .......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.1, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.1 .......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.1, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.15 ......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.15, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.15 ......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.15, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.15 ......
[CV] depth=3, iterations=300, l2_leaf_reg=1, learning_rate=0.15, total= 1.3s
[CV] depth=3, iterations=300, l2_leaf_reg=3, learning_rate=0.03 ......
[CV] depth=3, iterations=300, l2_leaf_reg=3, learning_rate=0.03, total= 1.4s
[CV] depth=3, iterations=300, l2_leaf_reg=3, learning_rate=0.03 ......
...........

[CV] depth=7, iterations=300, l2_leaf_reg=7, learning_rate=0.15, total= 3.2s
[Parallel(n_jobs=1)]: Done 81 out of 81 | elapsed: 2.9min finished
{'eval_metric': 'AUC', 'logging_level': 'Silent', 'depth': 3, 'iterations': 300, 'l2_leaf_reg': 7, 'learning_rate': 0.15}
{'depth': 3, 'iterations': 300, 'l2_leaf_reg': 7, 'learning_rate': 0.15}
8
0.6888360823801398
Duration: 0:02:58.684757

2.9 CatBoost　最適なパラメータのモデルを作成（Categorial Feature除く）

# Categorical features除く

start_time = datetime.now()

def auc(m, train, test): 
return (metrics.roc_auc_score(y_train,m.predict_proba(train)[:,1]),
metrics.roc_auc_score(y_test,m.predict_proba(test)[:,1]))

clf = cb.CatBoostClassifier(eval_metric="AUC", depth=3, iterations= 300, l2_leaf_reg= 7, learning_rate= 0.15, logging_level='Silent')
clf.fit(train,y_train)
print("AUC =", auc(clf, train, test))

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

UC = (0.7868498404068541, 0.6939698499709401)
Duration: 0:00:01.753843

2.10 CatBoost　最適なパラメータのモデルを作成（Categorial Feature含む）

# Categorical features含む

start_time = datetime.now()

cat_features_index = [0,1,2,3,4,5,6]

clf = cb.CatBoostClassifier(eval_metric="AUC", depth=3, iterations= 300, l2_leaf_reg= 7, learning_rate= 0.15, logging_level='Silent')

clf.fit(train,y_train, cat_features= cat_features_index)
print("AUC =", auc(clf, train, test))

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

AUC = (0.7894243204961788, 0.7039958234724883)
Duration: 0:00:04.504826

3. モデル評価評価：学習時間　AUC

XGNoost, Light GBM, CatBoostの決定木モデルを比較しました。予測精度は CatBoostが最も良くて、特にCategorial Feature含むのモデルです（0.7039）。実行時間を見ると、早いモデルはCategorial Feacture除くのCatBoostだが、一番遅いモデルはCategorial Feature含むのCatBoostingです。

目次

1. CatBoostとは

2. 実験・コード

2.1 データロード

2.2 サンプルデータ作成

XGBoost

2.3. XGBoost グリッドサーチで 81モデルから最適なパラメータを探索

2.4 XGBoost 最適なパラメータのモデルを作成

Light GBM

2.5. Light GBM グリッドサーチで 81モデルから最適なパラメータを探索

2.6 Light GBM最適なパラメータのモデルを作成（Categorial Feature除く）

2.7 Light GBM最適なパラメータのモデルを作成（Categorial Feature含む）

CatBoost

2.8. CatBoost グリッドサーチで 81モデルから最適なパラメータを探索

2.8. CatBoost グリッドサーチで 81モデルから最適なパラメータを探索

2.9 CatBoost 最適なパラメータのモデルを作成（Categorial Feature除く）

2.10 CatBoost 最適なパラメータのモデルを作成（Categorial Feature含む）

3. モデル評価評価：学習時間 AUC

2.2　サンプルデータ作成

2.4 XGBoost　最適なパラメータのモデルを作成

2.9 CatBoost　最適なパラメータのモデルを作成（Categorial Feature除く）

2.10 CatBoost　最適なパラメータのモデルを作成（Categorial Feature含む）

3. モデル評価評価：学習時間　AUC