The easiest lightGBM models [classification, regression]: better accuracy (by 바죠)

tabular data


The years since 2012 are commonly called the era of deep learning.

One common misconception is that deep learning is always the best at classification and regression.
That is a complete misunderstanding.
Deep learning delivers top performance only on image, text, and audio data,
that is, only on naturally, continuously varying data.

For everything else, namely the typical discontinuous tabular data, the picture changes once you use decision trees and, going one step further, so-called ensemble methods.
On tabular data, machine learning methods such as lightGBM, XGBoost (battle tested), and CatBoost outperform so-called deep learning methods.
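As a quick, hedged illustration of this claim (the MLP architecture and every setting below are arbitrary choices, and results will vary with the data), one can compare a default LGBMClassifier against a small scikit-learn MLP on a synthetic tabular task:

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic tabular data: 20 features, half of them informative
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)
gbm = LGBMClassifier(n_estimators=200)
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 64),
                                  max_iter=500, random_state=0))
print('LightGBM accuracy:', cross_val_score(gbm, X, y, cv=5).mean())
print('MLP accuracy:     ', cross_val_score(mlp, X, y, cv=5).mean())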

For those who embraced deep learning after its meteoric rise in 2012, this is a rather humbling state of affairs.
It is, however, a fact that simply has to be accepted.
Each of these methods (lightGBM, XGBoost, CatBoost) performs even better when its hyperparameters are optimized.
Going one step further, those hyperparameters can easily be optimized with Bayesian optimization, itself a machine learning method; full examples follow below.


Also, an artificial neural network is closer to a black box, while a decision tree is closer to a white box; in fact the two sit at opposite extremes.
In other words, decision-tree methods hold a clear advantage when it comes to interpreting results.
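As a small sketch of what "white box" means in practice (the regression setup here is only for illustration), a fitted LightGBM model can be dumped as a plain table of its actual split rules:

import lightgbm as lgb
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
booster = lgb.train({'objective': 'regression', 'verbose': -1},
                    lgb.Dataset(X, y), num_boost_round=5)
# every split of every tree, as rows of a pandas DataFrame
rules = booster.trees_to_dataframe()
print(rules[['tree_index', 'split_feature', 'threshold', 'value']].head())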

In effect, the best performance of one machine learning method is obtained easily by means of another machine learning method.
Every solution ultimately comes down to optimization, but the details and the point of departure can differ.
After all, this is a research field with fifty years of history.
Artificial intelligence, machine learning, deep learning: none of it is that simple.

Hyperparameter optimization using Bayesian optimization: http://incredible.egloos.com/7479039
Besides Bayesian optimization, Optuna can also be used.
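A minimal sketch of the Optuna route (the search ranges below are illustrative, not tuned for any particular dataset):

import optuna
from lightgbm import LGBMRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

def objective(trial):
    # sample one hyperparameter configuration per trial
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 5, 90),
        'max_depth': trial.suggest_int('max_depth', 2, 30),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.4, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0),
    }
    reg = LGBMRegressor(n_estimators=100, **params)
    # maximize the (negative) cross-validated mean squared error
    return cross_val_score(reg, X, y, cv=5,
                           scoring='neg_mean_squared_error').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)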

When lightGBM and Bayesian optimization are used together as below, you can send most of the scikit-learn library home.
It really works: the model performs better, and computation time drops dramatically.

XGBoost and lightGBM can also be used at the same time.
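A hedged sketch of one way to do that: train both and simply average their predictions (the 50/50 blend weight is an arbitrary choice):

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lgbm = LGBMRegressor(n_estimators=100).fit(X_train, y_train)
xgbr = XGBRegressor(n_estimators=100).fit(X_train, y_train)

# a simple 50/50 blend of the two model outputs
blend = 0.5 * lgbm.predict(X_test) + 0.5 * xgbr.predict(X_test)
print('lightGBM MSE:', mean_squared_error(y_test, lgbm.predict(X_test)))
print('XGBoost  MSE:', mean_squared_error(y_test, xgbr.predict(X_test)))
print('Blend    MSE:', mean_squared_error(y_test, blend))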

---------------------------------------------------------------------------------------------------------------------

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf_regressor = RandomForestRegressor(n_estimators=30)
rf_regressor.fit(X_train, y_train)
train_error = mse(y_train, rf_regressor.predict(X_train))
test_error = mse(y_test, rf_regressor.predict(X_test))
print('RF test error:', test_error)
print('RF overfit ratio:', test_error/train_error)

boosting_regressor = LGBMRegressor(n_estimators=30, learning_rate=0.12)
boosting_regressor.fit(X_train, y_train)
train_error = mse(y_train, boosting_regressor.predict(X_train))
test_error = mse(y_test, boosting_regressor.predict(X_test))
print('Boosting test error:', test_error)
print('Boosting overfit ratio:', test_error/train_error)

dart_regressor = LGBMRegressor(
    n_estimators=30, learning_rate=0.12, boosting_type='dart', skip_drop=0.7)
dart_regressor.fit(X_train, y_train)
train_error = mse(y_train, dart_regressor.predict(X_train))
test_error = mse(y_test, dart_regressor.predict(X_test))
print('Dart test error:', test_error)
print('Dart overfit ratio:', test_error/train_error)

RF test error: 3165.8182521847684
RF overfit ratio: 5.702067450175707
Boosting test error: 2856.1684664692157
Boosting overfit ratio: 2.16337987534278
Dart test error: 2756.3126781510805
Dart overfit ratio: 1.775351505966991
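Note how much less the boosted models overfit than the random forest: their test/train error ratio is far closer to 1, and DART, which randomly drops trees during boosting, does best of the three here.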

---------------------------------------------------------------------------------------------------------------------

import lightgbm as lgb
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this example needs an older version
import numpy as np

def lgb_evaluate(numLeaves, maxDepth, scaleWeight, minChildWeight, subsample, colSam):
    # build the regressor from the sampled hyperparameters so that the
    # Bayesian optimizer actually controls every dimension it searches over
    reg = lgb.LGBMRegressor(num_leaves=int(numLeaves),
                            max_depth=int(maxDepth),
                            scale_pos_weight=scaleWeight,
                            min_child_weight=minChildWeight,
                            subsample=subsample,
                            colsample_bytree=colSam,
                            learning_rate=0.05,
                            n_estimators=20)
    scores = cross_val_score(reg, train_x, train_y, cv=5, scoring='neg_mean_squared_error')
    return np.mean(scores)

def bayesOpt(train_x, train_y):
    lgbBO = BayesianOptimization(lgb_evaluate, {'numLeaves': (5, 90),
                                                'maxDepth': (2, 90),
                                                'scaleWeight': (1, 10000),
                                                'minChildWeight': (0.01, 70),
                                                'subsample': (0.4, 1),
                                                'colSam': (0.4, 1)})
    lgbBO.maximize(init_points=5, n_iter=50)
    print(lgbBO.res)

boston = load_boston()
X, y = boston.data, boston.target
train_x, X_test, train_y, y_test = train_test_split(X, y, test_size=0.2)
bayesOpt(train_x, train_y)

---------------------------------------------------------------------------------------------------------------------

import lightgbm as lgb
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from numpy import loadtxt
from sklearn.metrics import accuracy_score,confusion_matrix
import numpy as np
def lgb_evaluate(numLeaves, maxDepth, scaleWeight, minChildWeight, subsample, colSam):
    clf = lgb.LGBMClassifier(
        objective = 'binary',
        metric= 'auc',
        reg_alpha= 0,
        reg_lambda= 2,
#       bagging_fraction= 0.999,
        min_split_gain= 0,
        min_child_samples= 10,
        subsample_freq= 3,
#       subsample_for_bin= 50000,
#       n_estimators= 9999999,
        n_estimators= 99,
        num_leaves= int(numLeaves),
        max_depth= int(maxDepth),
        scale_pos_weight= scaleWeight,
        min_child_weight= minChildWeight,
        subsample= subsample,
        colsample_bytree= colSam,
        verbose =-1)
    scores = cross_val_score(clf, train_x, train_y, cv=5, scoring='roc_auc')
    return np.mean(scores)
def bayesOpt(train_x, train_y):
    lgbBO = BayesianOptimization(lgb_evaluate, {'numLeaves': (5, 90),
                                                'maxDepth': (2, 90),
                                                'scaleWeight': (1, 10000),
                                                'minChildWeight': (0.01, 70),
                                                'subsample': (0.4, 1),
                                                'colSam': (0.4, 1)})
    lgbBO.maximize(init_points=5, n_iter=50)
    print(lgbBO.res)
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
y = dataset[:,8]
train_x, X_test, train_y, y_test = train_test_split(X, y, test_size=0.2)
bayesOpt(train_x, train_y)

---------------------------------------------------------------------------------------------------------------------

# coding: utf-8
import numpy as np
import pandas as pd
import lightgbm as lgb

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

print('Loading data...')
# load or create your dataset
df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

print('Starting training...')
# train
gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)  # lightgbm < 4.0; with lightgbm >= 4.0 use callbacks=[lgb.early_stopping(5)]

print('Starting predicting...')
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# eval
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

# feature importances
print('Feature importances:', list(gbm.feature_importances_))


# self-defined eval metric
# f(y_true: array, y_pred: array) -> name: string, eval_result: float, is_higher_better: bool
# Root Mean Squared Logarithmic Error (RMSLE)
def rmsle(y_true, y_pred):
    return 'RMSLE', np.sqrt(np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true), 2))), False


print('Starting training with custom eval function...')
# train
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=rmsle,
        early_stopping_rounds=5)


# another self-defined eval metric
# f(y_true: array, y_pred: array) -> name: string, eval_result: float, is_higher_better: bool
# Relative Absolute Error (RAE)
def rae(y_true, y_pred):
    return 'RAE', np.sum(np.abs(y_pred - y_true)) / np.sum(np.abs(np.mean(y_true) - y_true)), False


print('Starting training with multiple custom eval functions...')
# train
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=[rmsle, rae],
        early_stopping_rounds=5)

print('Starting predicting...')
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# eval
print('The rmsle of prediction is:', rmsle(y_test, y_pred)[1])
print('The rae of prediction is:', rae(y_test, y_pred)[1])

# other scikit-learn modules
estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}

gbm = GridSearchCV(estimator, param_grid, cv=3)
gbm.fit(X_train, y_train)

print('Best parameters found by grid search are:', gbm.best_params_)


# coding: utf-8
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

print('Loading data...')
# load or create your dataset
df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

print('Starting training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)  # lightgbm < 4.0; with lightgbm >= 4.0 use callbacks=[lgb.early_stopping(5)]

print('Saving model...')
# save model to file
gbm.save_model('model.txt')

print('Starting predicting...')
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)


# coding: utf-8
import lightgbm as lgb
import pandas as pd

if lgb.compat.MATPLOTLIB_INSTALLED:
    import matplotlib.pyplot as plt
else:
    raise ImportError('You need to install matplotlib for plot_example.py.')

print('Loading data...')
# load or create your dataset
df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'num_leaves': 5,
    'metric': ('l1', 'l2'),
    'verbose': 0
}

evals_result = {}  # to record eval results for plotting

print('Starting training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_train, lgb_test],
                feature_name=['f' + str(i + 1) for i in range(X_train.shape[-1])],
                categorical_feature=[21],
                evals_result=evals_result,  # lightgbm < 4.0; newer versions record via
                verbose_eval=10)            # callbacks=[lgb.record_evaluation(evals_result), lgb.log_evaluation(10)]

print('Plotting metrics recorded during training...')
ax = lgb.plot_metric(evals_result, metric='l1')
plt.show()

print('Plotting feature importances...')
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

print('Plotting split value histogram...')
ax = lgb.plot_split_value_histogram(gbm, feature='f26', bins='auto')
plt.show()

print('Plotting 54th tree...')  # one tree use categorical feature to split
ax = lgb.plot_tree(gbm, tree_index=53, figsize=(15, 15), show_info=['split_gain'])
plt.show()

print('Plotting 54th tree with graphviz...')
graph = lgb.create_tree_digraph(gbm, tree_index=53, name='Tree54')
graph.render(view=True)

A quick interactive aside (assuming numpy is imported as np and pandas as pd), showing what the drop(0, axis=1) calls above do to the label column:

>>> aa = np.zeros((3, 3))
>>> aa[0, 0] = 1
>>> aa[0, 1] = 2
>>> aa[0, 2] = 3
>>> aa[1, 0] = 4
>>> aa[1, 1] = 5
>>> aa[1, 2] = 6
>>> aa[2, 0] = 7
>>> aa[2, 1] = 8
>>> aa[2, 2] = 9
>>> aa
array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
>>> a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> a
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
>>> a[0]
0    1
1    4
2    7
Name: 0, dtype: int64
>>> a.drop(0, axis=1)
   1  2
0  2  3
1  5  6
2  8  9


# Load Library
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
# Step 1: Create the data set
X, y = make_moons(n_samples=10000, noise=.5, random_state=0)
# Step 2: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Fit a Decision Tree model as comparison
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
OUTPUT: 0.756
# Step 4: Fit a Random Forest model; compared to the Decision Tree model, accuracy goes up by about 5%
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)  # "sqrt" equals the old "auto", which newer scikit-learn removed
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
OUTPUT: 0.797
# Step 5: Fit an AdaBoost model; compared to the Decision Tree model, accuracy goes up by about 10%
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
OUTPUT: 0.833
# Step 6: Fit a Gradient Boosting model; compared to the Decision Tree model, accuracy goes up by about 10%
clf = GradientBoostingClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
OUTPUT: 0.834
Note: the n_estimators parameter sets how many trees we want to grow.


https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html#sphx-glr-auto-examples-hyperparameter-optimization-py
--------------------------------------------------------------------------------------------------------------------
"""
============================================
Tuning a scikit-learn estimator with `skopt`
============================================

Gilles Louppe, July 2016
Katie Malone, August 2016
Reformatted by Holger Nahrstaedt 2020

.. currentmodule:: skopt

If you are looking for a :obj:`sklearn.model_selection.GridSearchCV` replacement checkout
:ref:`sphx_glr_auto_examples_sklearn-gridsearchcv-replacement.py` instead.

Problem statement
=================

Tuning the hyper-parameters of a machine learning model is often carried out
using an exhaustive exploration of (a subset of) the space of all hyper-parameter
configurations (e.g., using :obj:`sklearn.model_selection.GridSearchCV`), which
often results in a very time-consuming operation.

In this notebook, we illustrate how to couple :class:`gp_minimize` with sklearn's
estimators to tune hyper-parameters using sequential model-based optimisation,
hopefully resulting in equivalent or better solutions, but within fewer
evaluations.

Note: scikit-optimize provides a dedicated interface for estimator tuning via
:class:`BayesSearchCV` class which has a similar interface to those of
:obj:`sklearn.model_selection.GridSearchCV`. This class uses functions of skopt to perform hyperparameter
search efficiently. For example usage of this class, see
:ref:`sphx_glr_auto_examples_sklearn-gridsearchcv-replacement.py`
example notebook.
"""
print(__doc__)
import numpy as np

#############################################################################
# Objective
# =========
# To tune the hyper-parameters of our model we need to define a model,
# decide which parameters to optimize, and define the objective function
# we want to minimize.

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this example needs an older version
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

boston = load_boston()
X, y = boston.data, boston.target
n_features = X.shape[1]

# gradient boosted trees tend to do well on problems like this
reg = GradientBoostingRegressor(n_estimators=50, random_state=0)

#############################################################################
# Next, we need to define the bounds of the dimensions of the search space
# we want to explore and pick the objective. In this case the cross-validation
# mean absolute error of a gradient boosting regressor over the Boston
# dataset, as a function of its hyper-parameters.

from skopt.space import Real, Integer
from skopt.utils import use_named_args


# The list of hyper-parameters we want to optimize. For each one we define the
# bounds, the corresponding scikit-learn parameter name, as well as how to
# sample values from that dimension (`'log-uniform'` for the learning rate)
space  = [Integer(1, 5, name='max_depth'),
          Real(10**-5, 10**0, "log-uniform", name='learning_rate'),
          Integer(1, n_features, name='max_features'),
          Integer(2, 100, name='min_samples_split'),
          Integer(1, 100, name='min_samples_leaf')]

# this decorator allows your objective function to receive the parameters as
# keyword arguments. This is particularly convenient when you want to set
# scikit-learn estimator parameters
@use_named_args(space)
def objective(**params):
    reg.set_params(**params)

    return -np.mean(cross_val_score(reg, X, y, cv=5, n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))

#############################################################################
# Optimize all the things!
# ========================
# With these two pieces, we are now ready for sequential model-based
# optimisation. Here we use gaussian process-based optimisation.

from skopt import gp_minimize
res_gp = gp_minimize(objective, space, n_calls=50, random_state=0)

"Best score=%.4f" % res_gp.fun

#############################################################################

print("""Best parameters:
- max_depth=%d
- learning_rate=%.6f
- max_features=%d
- min_samples_split=%d
- min_samples_leaf=%d""" % (res_gp.x[0], res_gp.x[1],
                            res_gp.x[2], res_gp.x[3],
                            res_gp.x[4]))

#############################################################################
# Convergence plot
# ================

from skopt.plots import plot_convergence
import matplotlib.pyplot as plt

plot_convergence(res_gp)
plt.show()
--------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------------------------------------
import numpy as np
import xgboost as xgb
from sklearn import datasets
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn.datasets import dump_svmlight_file
#from sklearn.externals import joblib
import joblib
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# use DMatrix for xgboost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# use svmlight file for xgboost
dump_svmlight_file(X_train, y_train, 'dtrain.svm', zero_based=True)
dump_svmlight_file(X_test, y_test, 'dtest.svm', zero_based=True)
dtrain_svm = xgb.DMatrix('dtrain.svm')
dtest_svm = xgb.DMatrix('dtest.svm')

# set xgboost params
param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'verbosity': 0,  # quiet logging ('silent' was removed from newer xgboost)
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this dataset
num_round = 20  # the number of training iterations

#------------- numpy array ------------------
# training and testing - numpy matrices
bst = xgb.train(param, dtrain, num_round)
preds = bst.predict(dtest)

# extracting most confident predictions
best_preds = np.asarray([np.argmax(line) for line in preds])
print("Numpy array precision:", precision_score(y_test, best_preds, average='macro'))

# ------------- svm file ---------------------
# training and testing - svm file
bst_svm = xgb.train(param, dtrain_svm, num_round)
preds = bst_svm.predict(dtest_svm)  # predict with the model trained on the svm files

# extracting most confident predictions
best_preds_svm = [np.argmax(line) for line in preds]
print("Svm file precision:",precision_score(y_test, best_preds_svm, average='macro'))
# --------------------------------------------

# dump the models
bst.dump_model('dump.raw.txt')
bst_svm.dump_model('dump_svm.raw.txt')


# save the models for later
joblib.dump(bst, 'bst_model.pkl', compress=True)
joblib.dump(bst_svm, 'bst_svm_model.pkl', compress=True)

--------------------------------------------------------------------------------------------------------------
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print("Train data length:",len(X_train));
print("Test data length:",len(X_test));
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
parameters = {
    'eta': 0.3,
    'verbosity': 0,  # logging option ('silent' was removed from newer xgboost)
    'objective': 'multi:softprob',  # error evaluation for multiclass tasks
    'num_class': 3,  # number of classes to predict
    'max_depth': 3  # depth of the trees in the boosting process
    }
num_round = 20  # the number of training iterations
bst = xgb.train(parameters, dtrain, num_round)
preds = bst.predict(dtest)
print( preds[:5])
'''
Selecting the column that represents the highest probability
(note that, for each line, there is 3 columns, indicating the probability for each class)
'''
import numpy as np
best_preds = np.asarray([np.argmax(line) for line in preds])
print(best_preds)
from sklearn.metrics import precision_score
print(precision_score(y_test, best_preds, average='macro'))

--------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA
import numpy as np
import lightgbm as lgb

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# To get a better understanding of the interaction of the dimensions
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(projection='3d')  # Axes3D(fig, ...) no longer auto-registers in newer matplotlib
ax.view_init(elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
plt.show()

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
X = iris.data  # this time use all four features
y = iris.target
print(np.shape(X))
print(np.shape(y))
le = preprocessing.LabelEncoder()
y_label = le.fit_transform(y)
classes = le.classes_

X_train, X_test, y_train, y_test = train_test_split(X, y_label, test_size=0.30, random_state=42)

params = {
          "objective" : "multiclass",
          "num_class" : 3,           # iris has 3 classes
          "num_leaves" : 60,
          "max_depth": -1,
          "learning_rate" : 0.01,
          "bagging_fraction" : 0.9,  # subsample
          "feature_fraction" : 0.9,  # colsample_bytree
          "bagging_freq" : 5,        # subsample_freq
          "bagging_seed" : 2018,
          "verbosity" : -1 }


lgtrain, lgval = lgb.Dataset(X_train, y_train), lgb.Dataset(X_test, y_test)
lgbmodel = lgb.train(params, lgtrain, 2000,
                     valid_sets=[lgtrain, lgval],
                     early_stopping_rounds=100,  # lightgbm < 4.0; newer versions use callbacks=[lgb.early_stopping(100)]
                     verbose_eval=200)           # lightgbm < 4.0; newer versions use callbacks=[lgb.log_evaluation(200)]


from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.utils.multiclass import unique_labels
y_pred = np.argmax(lgbmodel.predict(X_test), axis=1)
y_true = y_test


def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

plot_confusion_matrix(y_true, y_pred, classes=classes,
                      title='Confusion matrix, without normalization')
from sklearn.metrics import accuracy_score
print( accuracy_score(y_true, y_pred))

--------------------------------------------------------------------------------------------------------------------

import lightgbm as lgb
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this example needs an older version
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from pandas import DataFrame

boston = load_boston()
x, y = boston.data, boston.target

x_df = DataFrame(x, columns=boston.feature_names)
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.15)

# defining parameters
params = {
    'task': 'train',
    'boosting': 'gbdt',
    'objective': 'regression',
    'num_leaves': 10,
    'learning_rate': 0.05,
    'metric': {'l2', 'l1'},
    'verbose': -1
}

# loading data
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)

# fitting the model
model = lgb.train(params,
                  train_set=lgb_train,
                  valid_sets=lgb_eval,
                  early_stopping_rounds=30)  # lightgbm < 4.0; newer versions use callbacks=[lgb.early_stopping(30)]

# prediction
y_pred = model.predict(x_test)

# accuracy check
mse = mean_squared_error(y_test, y_pred)
rmse = mse**(0.5)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % rmse)

# visualizing in a plot
x_ax = range(len(y_test))
plt.figure(figsize=(12, 6))
plt.plot(x_ax, y_test, label="original")
plt.plot(x_ax, y_pred, label="predicted")
plt.title("Boston dataset test and predicted data")
plt.xlabel('X')
plt.ylabel('Price')
plt.legend(loc='best', fancybox=True, shadow=True)
plt.grid(True)
plt.show()

# plotting feature importance
lgb.plot_importance(model, height=.5)
plt.show()


--------------------------------------------------------------------------------------------------------------------

import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from pandas import DataFrame
from numpy import argmax


iris = load_iris()
x, y = iris.data, iris.target

x_df = DataFrame(x, columns=iris.feature_names)
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.15)

# defining parameters
params = {
    'boosting': 'gbdt',
    'objective': 'multiclass',
    'num_leaves': 10,
    'num_class': 3
}

# loading data
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)

# fitting the model
model = lgb.train(params,
                  train_set=lgb_train,
                  valid_sets=lgb_eval,
                  early_stopping_rounds=30)  # lightgbm < 4.0; newer versions use callbacks=[lgb.early_stopping(30)]

# prediction
y_pred = model.predict(x_test)

y_pred = argmax(y_pred, axis=1)
cr = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(cr)
print(cm)

lgb.plot_importance(model, height=.5)

--------------------------------------------------------------------------------------------------------------------
# lightgbm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from matplotlib import pyplot
# define dataset
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = LGBMClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(
    model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

# fit the model on a train/test split of the dataset
model = LGBMClassifier(learning_rate=0.05, max_depth=-5, random_state=42)  # max_depth < 0 disables the depth limit
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
model.fit(x_train, y_train, eval_set=[
          (x_test, y_test), (x_train, y_train)], eval_metric='logloss')
# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -
        1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])
lgb.plot_metric(model)
print('Training accuracy {:.4f}'.format(model.score(x_train, y_train)))
print('Testing accuracy {:.4f}'.format(model.score(x_test, y_test)))
lgb.plot_importance(model)
lgb.plot_tree(model)  # plotting trees requires the graphviz package
pyplot.show()
---------------------------------------------------------------------------------------------------------------------

