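Anomaly Detection with PyCaret

1. Overview of Anomaly Detection in PyCaret

1.1 What Is Anomaly Detection?

Anomaly detection means finding abnormal states in data, that is, behavior that differs from the usual pattern.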
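As a minimal illustration of "behavior that differs from the usual pattern", the sketch below flags values that lie far from the median, using the median absolute deviation (MAD). The readings, the function name mad_outliers, and the 3.5 cutoff are assumptions chosen for this example; it is not how PyCaret detects anomalies.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be ranked as unusual
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical sensor readings: one value breaks the usual pattern.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2, 9.7]
outlier_indices = mad_outliers(readings)
```

Only the reading 25.0 (index 5) ends up in outlier_indices; the other values stay close to the median.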

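1.2 Types of Anomaly Detection

There are three main types of anomaly detection methods.

1) Unsupervised anomaly detection: the model is trained on unlabeled data.

2) Supervised anomaly detection: the model is trained on a dataset labeled "normal" and "abnormal".

3) Semi-supervised anomaly detection: the model is trained on normal data only (no anomalies). Test data is then judged by how far it deviates from the learned normal behavior.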
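The semi-supervised case can be sketched as follows: a profile is learned from normal data only, and each test value is then compared against it. The helper names (fit_normal_profile, is_anomaly), the sample values, and the 3-sigma rule are assumptions made for illustration.

```python
import statistics

def fit_normal_profile(normal_values):
    """Learn what 'normal' looks like from normal-only training data."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomaly(value, profile, n_sigmas=3.0):
    """Flag a test value that lies too far from the learned normal profile."""
    mean, std = profile
    return abs(value - mean) > n_sigmas * std

normal_train = [50.2, 49.8, 50.1, 49.9, 50.0, 50.3, 49.7]  # hypothetical values
profile = fit_normal_profile(normal_train)
test_flags = [is_anomaly(v, profile) for v in [50.1, 53.0, 49.6]]
```

Of the three test values, only 53.0 is flagged, because it lies more than three standard deviations from the training mean.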

1.3 The PyCaret Library

The pycaret.anomaly module provides unsupervised anomaly detection methods.

PyCaret offers 12 anomaly detection algorithms:

abod Angle-base Outlier Detection

cluster Clustering-Based Local Outlier

cof Connectivity-Based Local Outlier

iforest Isolation Forest

histogram Histogram-based Outlier Detection

knn K-Nearest Neighbors Detector

lof Local Outlier Factor

svm One-class SVM detector

pca Principal Component Analysis

mcd Minimum Covariance Determinant

sod Subspace Outlier Detection

sos Stochastic Outlier Selection
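To give a feel for how one of these detectors scores data, here is a rough sketch of the idea behind a k-nearest-neighbors detector: each point is scored by the distance to its k-th nearest neighbor, so isolated points get large scores. The function knn_scores, the sample points, and k=2 are assumptions for this example; it is not PyOD's implementation.

```python
import math

def knn_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
```

The isolated point (5.0, 5.0) receives the largest score, while the four clustered points all score about 0.1.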

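2. Experiment

Environment: Google Colab

Dataset: mice, a dataset bundled with PyCaret. It contains the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of the cortex, with a total of 1,080 measurements per protein.

Model: Isolation Forest (anomaly detection)

2.1 Environment Setup

Install PyCaret.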

!pip install pycaret

Enable PyCaret in Colab.

# Colab enable

from pycaret.utils import enable_colab

enable_colab()

2.2 Loading the Data

Load a dataset that ships with PyCaret.

# Download dataset

from pycaret.datasets import get_data

dataset = get_data('mice')

Check the data.

# Check the shape of the data

dataset.shape

(1080, 82)

Split the data into training and test sets.

# Train and test split

data = dataset.sample(frac=0.95, random_state=786)

data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)

data_unseen.reset_index(drop=True, inplace=True)

print('Train: ' + str(data.shape))

print('Test: ' + str(data_unseen.shape))

Train: (1026, 82)

Test: (54, 82)
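The split above can be pictured with plain Python: 95% of the row indices are sampled at random for training and the remaining rows are held out. The helper split_rows is hypothetical, and because it uses Python's random module rather than pandas, the selected rows will not match those picked by dataset.sample; only the sizes are comparable.

```python
import random

def split_rows(n_rows, frac, seed):
    """Randomly pick round(n_rows * frac) row indices; the rest are held out."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    unseen_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, unseen_idx

train_idx, unseen_idx = split_rows(n_rows=1080, frac=0.95, seed=786)
```

The two index lists contain 1026 and 54 entries, matching the row counts printed for Train and Test.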

2.3 Model Training

Set up the experiment.

from pycaret.anomaly import *

exp_ano101 = setup(data, normalize=True,
                   ignore_features=['MouseID'],
                   session_id=123)
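The normalize = True option rescales the numeric features before modeling. The sketch below shows the idea, assuming z-score scaling (believed to be PyCaret's default normalize_method). The helper functions and sample column are hypothetical and do not reproduce PyCaret's internal pipeline.

```python
import statistics

def zscore_fit(column):
    """Learn the mean and standard deviation of one numeric column."""
    return statistics.mean(column), statistics.pstdev(column)

def zscore_transform(column, mean, std):
    """Rescale values to zero mean and unit variance using fitted statistics."""
    return [(v - mean) / std for v in column]

train_col = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
mean, std = zscore_fit(train_col)
scaled_train = zscore_transform(train_col, mean, std)
# Unseen data reuses the statistics learned from the training column.
scaled_unseen = zscore_transform([5.0], mean, std)
```

Reusing the training statistics for unseen data keeps both sets on the same scale, which is why the transformation is stored together with the model.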

Create the Isolation Forest model.

# Create model

iforest = create_model('iforest')

print(iforest)

IForest(behaviour='new', bootstrap=False, contamination=0.05,
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
        random_state=123, verbose=0)
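The intuition behind Isolation Forest is that anomalies can be separated from the rest of the data with fewer random splits than normal points. The following is a simplified one-dimensional sketch of that idea. The function names, sample values, and parameters are assumptions, and it omits the subsampling, random feature selection, and score normalization of the actual algorithm.

```python
import random

def isolation_depth(x, values, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate x from the other values (1-D)."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as x.
    same_side = [v for v in values if (v < split) == (x < split)]
    return isolation_depth(x, same_side, rng, depth + 1, max_depth)

def mean_depth(x, values, n_trees=200, seed=123):
    """Average isolation depth over many random trees; shallow means anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, values, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: seven clustered values and one outlier (9.0).
values = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 9.0]
outlier_depth = mean_depth(9.0, values)
inlier_depth = mean_depth(1.0, values)
```

Here outlier_depth comes out smaller than inlier_depth, because a single random split usually cuts 9.0 off from the cluster, whereas 1.0 needs several splits to be separated from its neighbors.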

List all available models.

# Model list

models()

abod	Angle-base Outlier Detection	pyod.models.abod.ABOD
cluster	Clustering-Based Local Outlier	pyod.models.cblof.CBLOF
cof	Connectivity-Based Local Outlier	pyod.models.cof.COF
iforest	Isolation Forest	pyod.models.iforest.IForest
histogram	Histogram-based Outlier Detection	pyod.models.hbos.HBOS
knn	K-Nearest Neighbors Detector	pyod.models.knn.KNN
lof	Local Outlier Factor	pyod.models.lof.LOF
svm	One-class SVM detector	pyod.models.ocsvm.OCSVM
pca	Principal Component Analysis	pyod.models.pca.PCA
mcd	Minimum Covariance Determinant	pyod.models.mcd.MCD
sod	Subspace Outlier Detection	pyod.models.sod.SOD
sos	Stochastic Outlier Selection	pyod.models.sos.SOS

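2.4 Model Inference

Use assign_model to attach anomaly labels to the dataset. The label column indicates outliers (1 = outlier, 0 = inlier). The score is the value computed by the algorithm; outliers are assigned larger anomaly scores.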

iforest_results = assign_model(iforest)

iforest_results.head()
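The relationship between scores, labels, and the contamination=0.05 setting can be sketched as below: roughly the top 5% of scores are labeled 1. With 1026 training rows that corresponds to about 51 rows. The function label_by_contamination and the scores are hypothetical, and the thresholding is a simplification rather than PyOD's exact rule.

```python
def label_by_contamination(scores, contamination=0.05):
    """Mark roughly the top `contamination` fraction of scores as outliers (1)."""
    n_outliers = max(1, round(len(scores) * contamination))
    # The threshold is the smallest score among the top n_outliers.
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical anomaly scores for 20 rows; larger means more anomalous.
scores = [0.02 * i for i in range(19)] + [0.95]
labels = label_by_contamination(scores)
```

With 20 rows and a contamination of 0.05, exactly one row (the one scored 0.95) is labeled as an outlier.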

t-SNE visualization

# T-distributed Stochastic Neighbor Embedding (t-SNE)

plot_model(iforest)

Inference on the test data

# Predict on Unseen Data

unseen_predictions = predict_model(iforest, data=data_unseen)

unseen_predictions.head()

Inference on the training data

# Predict on train Data

data_predictions = predict_model(iforest, data = data)

data_predictions.head()

2.5 Saving and Loading the Model

Save the model to a pkl file.

# Saving the Model

save_model(iforest, 'Final IForest Model 2021')

Transformation Pipeline and Model Successfully Saved

(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True,
                                       features_todrop=['MouseID'],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[],
                                       target='UNSUPERVISED_DUMMY_TARGET',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='most frequent',
                                 fill_value_categorical=None,
                                 fill_value_numer…
                 ('fix_perfect', 'passthrough'),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  IForest(behaviour='new', bootstrap=False, contamination=0.05,
                          max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
                          random_state=123, verbose=0)]],
          verbose=False), 'Final IForest Model 2021.pkl')

Load the model.

# Loading the Saved Model

saved_iforest = load_model('Final IForest Model 2021')

Transformation Pipeline and Model Successfully Loaded
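The save and load steps amount to serializing a Python object to a file and rebuilding it later. The sketch below shows that round trip with the standard pickle module, using a plain dict as a stand-in for the fitted pipeline. The dict, its keys, and the file name are hypothetical, and PyCaret's own serialization may use a different library internally.

```python
import os
import pickle
import tempfile

# A stand-in for a fitted pipeline; the dict and its keys are hypothetical.
pipeline = {"steps": ["normalize", "iforest"], "contamination": 0.05}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)   # serialize the object to disk
    with open(path, "rb") as f:
        restored = pickle.load(f)  # rebuild an equal object from the file
```

The restored object is equal to the original but is a separate copy, which is what allows a saved model to be reused in a new session.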

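Author: KW
Data scientist from Bangkok, Thailand
Specialist in PSI production management, inventory forecasting and optimization analysis, customer loyalty analysis, sentiment analysis, and building SaaS, PaaS, IaaS, and AI at the Edge environments for industries including manufacturing, marketing, finance, and AI research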
