The easiest regression (regression) by 바죠

The easiest regression
https://en.wikipedia.org/wiki/Regression_analysis
The English geneticist Francis Galton studied the relationship between the heights of parents and those of their children.
He hypothesized that the relationship between parents' and children's heights is linear, and that heights tend to move back toward the overall mean rather than keep getting taller or shorter.
This way of analyzing data came to be called "regression analysis."
After this empirical work, Karl Pearson derived a functional relationship from measurements of fathers' and sons' heights and gave regression analysis a rigorous mathematical foundation.
Regression analysis is a topic of statistics. It usually serves two purposes:

Prediction
Estimation of correlations

The simplest regression has a single independent variable and a single dependent variable.
In particular, when a linear relationship is assumed, regression analysis becomes as simple as it gets; a minimal sketch follows below.
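As a minimal sketch of this simplest case (my own illustration, not part of the original post), ordinary least squares with one independent variable has a closed-form solution for the slope and intercept:

import numpy as np

# toy data: y = 3x + noise, the same setup as in the Keras example further below
x = np.linspace(-1, 1, 101)
y = 3 * x + np.random.randn(*x.shape) * 0.33

# closed-form least squares: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print('slope: %.2f, intercept: %.2f' % (slope, intercept))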

The Gaussian process method uses previously computed results to estimate the value expected at a new position.
The prediction is made with a similarity measure (a kernel function, or similarity function), and the uncertainty of the prediction is computed along with it.
It is a typical machine-learning algorithm; note that it works from existing data.
Going further, one can discuss the precision of the prediction and, in the end, identify the most promising position. That position can of course be explored directly:
an actual calculation can be carried out there, and the new result is added back to the data set.
The augmented data set can then be used again.
Repeating these steps is what is called Bayesian optimization.
http://incredible.egloos.com/7480714
http://incredible.egloos.com/7479039
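The scikit-learn sketch below is my own addition (not taken from the posts linked above); it shows how a Gaussian process regressor returns a mean prediction together with its uncertainty at new positions, which is the ingredient that Bayesian optimization builds on:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# existing data: a handful of evaluations of some expensive function
X_train = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y_train = np.sin(X_train).ravel()

# similarity is measured by the kernel function (here an RBF kernel)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), n_restarts_optimizer=5)
gp.fit(X_train, y_train)

# predict at new positions; return_std=True also returns the uncertainty of each prediction
X_new = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y_mean, y_std = gp.predict(X_new, return_std=True)

# the most promising position (chosen by some acquisition rule) would be evaluated next,
# added to the data set, and the whole loop repeated: this is Bayesian optimization
print(X_new[np.argmax(y_std)])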

More complicated regression problems involve multiple independent variables.
For the most common data format, so-called tabular data, the LightGBM method generally shows the best performance.
Deep-learning methods do not improve on it for such data.  http://incredible.egloos.com/7479081
This holds for classification as well as for regression.
Deep learning can, on the other hand, perform very well on so-called continuous data.
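As a rough sketch of regression on tabular data with LightGBM (my own example on synthetic data, not from the original post):

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic tabular data with many independent variables
X, y = make_regression(n_samples=5000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# gradient-boosted trees; the defaults are already a strong baseline on tabular data
model = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("RMSE: %.2f" % np.sqrt(mean_squared_error(y_test, pred)))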



import keras
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
trX = np.linspace(-1, 1, 101).reshape(-1, 1)   # column vector: one sample per row, one feature
trY = 3 * trX + np.random.randn(*trX.shape) * 0.33
model = Sequential()
model.add(Dense(1, input_shape=(1,), kernel_initializer='uniform', activation='linear'))  # a single unit: y = w*x + b

weights = model.layers[0].get_weights()
w_init = weights[0][0][0]
b_init = weights[1][0]
print('Linear regression model is initialized with weights w: %.2f, b: %.2f' % (w_init, b_init))
#     Linear regression model is initialized with weight w: -0.03, b: 0.00
model.compile(optimizer='sgd',  loss='mse')
model.fit(trX, trY, epochs=200, verbose=1)
weights = model.layers[0].get_weights()
w_final = weights[0][0][0]
b_final = weights[1][0]
print('Linear regression model is trained to have weight w: %.2f, b: %.2f' % (w_final, b_final))
#    Linear regression model is trained to have weight w: 2.94, b: 0.08
model.save_weights("mymodel.h5")
model.load_weights("mymodel.h5")


from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#     Step 1: Create the data set
X, y = make_moons(n_samples=10000, noise=.5, random_state=0)
#     Step 2: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#     Step 3: Fit a Decision Tree model as comparison
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
#     Step 4: Fit a Random Forest model; compared to the Decision Tree model, accuracy goes up by about 5-6%
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)  # "sqrt" replaces the removed "auto" option
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))


import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2; an older scikit-learn (or a different data set) is needed to run this
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.15)
xgbr = xgb.XGBRegressor()
print(xgbr)
xgbr.fit(xtrain, ytrain)
#     - cross validation
scores = cross_val_score(xgbr, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbr, xtrain, ytrain, cv=kfold )
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())
ypred = xgbr.predict(xtest)
mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % np.sqrt(mse))
x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()


#     import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
#     load the boston dataset
boston = datasets.load_boston()
#     predictor variable - X in our regression: number of rooms
Boston_NumRoom = boston.data[:,np.newaxis,5]
#     target variable - y in our regression: Housing Price
Boston_HousingPrice = boston.target
#     Split the data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(Boston_NumRoom, Boston_HousingPrice, test_size=0.2, random_state=42)
#     Create a linear regression object
model = linear_model.LinearRegression()
#     Fit the model using the training set
model.fit(X_train, y_train)
#     Make prediction using test set
y_test_pred = model.predict(X_test)
#     Model Output
#     a. Coefficient - the slope of the line

print("Coefficients(slope of the line):", model.coef_  )
#     b. the error - the mean square error
print("Mean squared error: %.2f"% mean_squared_error(y_test,y_test_pred))
#     c. R-square - how well X accounts for the variance of y
print("R-square: %.2f" % r2_score(y_test, y_test_pred))
#     Plot the line
plt.scatter(X_test, y_test,linewidth=0.1,alpha=0.5)
plt.plot(X_test, y_test_pred, color='green', linewidth=3)
plt.xlabel('average number of rooms per dwelling')
plt.ylabel('housing price')
plt.title('Fitted Line:  housing price  = %.2f + %.2f * number of rooms' % (model.intercept_ , model.coef_[0] ))
plt.show()

Coefficients(slope of the line): [9.34830141]
Mean squared error: 46.14
R-square: 0.37

https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html#sphx-glr-auto-examples-hyperparameter-optimization-py
--------------------------------------------------------------------------------------------------------------------
"""
============================================
Tuning a scikit-learn estimator with `skopt`
============================================

Gilles Louppe, July 2016
Katie Malone, August 2016
Reformatted by Holger Nahrstaedt 2020

.. currentmodule:: skopt

If you are looking for a :obj:`sklearn.model_selection.GridSearchCV` replacement, check out
:ref:`sphx_glr_auto_examples_sklearn-gridsearchcv-replacement.py` instead.

Problem statement
=================

Tuning the hyper-parameters of a machine learning model is often carried out
using an exhaustive exploration of (a subset of) the space of all hyper-parameter
configurations (e.g., using :obj:`sklearn.model_selection.GridSearchCV`), which
is often a very time-consuming operation.

In this notebook, we illustrate how to couple :class:`gp_minimize` with sklearn's
estimators to tune hyper-parameters using sequential model-based optimisation,
hopefully resulting in equivalent or better solutions, but with fewer
evaluations.

Note: scikit-optimize provides a dedicated interface for estimator tuning via
the :class:`BayesSearchCV` class, which has an interface similar to that of
:obj:`sklearn.model_selection.GridSearchCV`. This class uses functions of skopt to perform hyperparameter
search efficiently. For example usage of this class, see the
:ref:`sphx_glr_auto_examples_sklearn-gridsearchcv-replacement.py`
example notebook.
"""
print(__doc__)
import numpy as np

#############################################################################
# Objective
# =========
# To tune the hyper-parameters of our model we need to define a model,
# decide which parameters to optimize, and define the objective function
# we want to minimize.

from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

boston = load_boston()
X, y = boston.data, boston.target
n_features = X.shape[1]

# gradient boosted trees tend to do well on problems like this
reg = GradientBoostingRegressor(n_estimators=50, random_state=0)

#############################################################################
# Next, we need to define the bounds of the dimensions of the search space
# we want to explore and pick the objective. In this case the cross-validation
# mean absolute error of a gradient boosting regressor over the Boston
# dataset, as a function of its hyper-parameters.

from skopt.space import Real, Integer
from skopt.utils import use_named_args


# The list of hyper-parameters we want to optimize. For each one we define the
# bounds, the corresponding scikit-learn parameter name, as well as how to
# sample values from that dimension (`'log-uniform'` for the learning rate)
space  = [Integer(1, 5, name='max_depth'),
          Real(10**-5, 10**0, "log-uniform", name='learning_rate'),
          Integer(1, n_features, name='max_features'),
          Integer(2, 100, name='min_samples_split'),
          Integer(1, 100, name='min_samples_leaf')]

# this decorator allows your objective function to receive the parameters as
# keyword arguments. This is particularly convenient when you want to set
# scikit-learn estimator parameters
@use_named_args(space)
def objective(**params):
    reg.set_params(**params)

    return -np.mean(cross_val_score(reg, X, y, cv=5, n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))

#############################################################################
# Optimize all the things!
# ========================
# With these two pieces, we are now ready for sequential model-based
# optimisation. Here we use gaussian process-based optimisation.

from skopt import gp_minimize
res_gp = gp_minimize(objective, space, n_calls=50, random_state=0)

"Best score=%.4f" % res_gp.fun

#############################################################################

print("""Best parameters:
- max_depth=%d
- learning_rate=%.6f
- max_features=%d
- min_samples_split=%d
- min_samples_leaf=%d""" % (res_gp.x[0], res_gp.x[1],
                            res_gp.x[2], res_gp.x[3],
                            res_gp.x[4]))

#############################################################################
# Convergence plot
# ================

from skopt.plots import plot_convergence

plot_convergence(res_gp)

--------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------------------------------------
#     Blending: a linear model is trained on the predictions of base models.
#     xgb, dl, and lgb are assumed to be already-fitted XGBoost, neural-network, and LightGBM regressors.
import pandas as pd
from sklearn import linear_model

xtrain2 = pd.DataFrame(
    {'XGB': xgb.predict(xtrain),
     'NN': dl.predict(xtrain),
     'LGB': lgb.predict(xtrain)
     }
)
print(xtrain2.head())

reg = linear_model.LinearRegression()
reg.fit(xtrain2, ytrain)
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------
#     Stacking with scikit-learn: model (XGBoost) and model_lgb (LightGBM) are assumed to be fitted earlier;
#     a shallow decision tree serves as the final (meta) estimator.
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.ensemble import StackingRegressor as SR

model_sc = SR(estimators=[('xgb', model), ('lgb', model_lgb)],
              final_estimator=DTR(random_state=17, max_depth=4, criterion="squared_error"),  # "squared_error" replaces the old "mse" name
              cv=8)
model_sc.fit(X_train, y_train)
pred = model_sc.predict(X_test)
print(regression_error(y_test, pred))  # regression_error: a user-defined error metric, assumed to be defined elsewhere
--------------------------------------------------------------------------------------------------------------

