# Imagination is more important than knowledge

## Stigler's law by 바죠

Stigler's law:

"No scientific discovery is named after its original discoverer."

Statistically, this holds in most cases; the quotation above states it too strongly.

Of course, Stigler's law itself is no exception.

Most of the examples appear in textbooks; they are things we learned in school.

This is quite interesting, and at the same time somewhat disheartening.

It is a winner-take-all world: the more famous you are, the more credit you can take from others.

The really amusing fact is that many papers claim that their content is original to the authors.

If you know Stigler's law, you will think again about how meaningless such claims are. Examples I learned about recently:
* Ising model (Lenz)
* Lorentz transformation (Larmor)
* Pauli exclusion principle (Stoner)

Traditionally famous examples:

* Cartesian coordinates
* Cooley-Tukey algorithm (FFT)
* Fermi's golden rule
* Fibonacci numbers
* Gaussian distribution
* Gauss's theorem
* Gaussian elimination
* L'Hopital's rule
* Stokes' theorem
* Euler's formula
* Hubble's law
* Metropolis algorithm
* Aharonov-Bohm effect
* Auger effect

## Bayesian optimization by 바죠

Bayesian optimization
https://en.wikipedia.org/wiki/Bayesian_optimization

BO is a method that performs optimization with the help of machine learning.
The key idea is to use machine learning to predict where the optimal solution is likely to be.
BO is applied in many settings, but you should understand which problems it actually suits.
The conditions under which BO applies are:
* serial algorithm [sequential model-based optimization]
* gradient free
* relatively low dimensional systems
* very expensive objective function, even noisy
* surrogate model

The algorithm is useful only when too many objective-function evaluations must be avoided.
It does not assume that a large amount of measured data is available;
that is, it targets the case where each objective-function evaluation is very expensive.
Building the surrogate model can itself become a burden.
It is not meant for very high-dimensional problems,
and it is not a method that relies on gradient-based local minimization.
It can also be made parallel: evaluate points in batches and train the Gaussian process batch by batch.
In other words, instead of proposing and evaluating one point at a time, the work can proceed in chunks.
The key point is that the surrogate model, i.e., the machine-learning model, can propose new locations to evaluate.
https://journals.aps.org/prmaterials/pdf/10.1103/PhysRevMaterials.2.013803
https://github.com/Tomoki-YAMASHITA/CrySPY/tree/master/CrySPY/BO
http://krasserm.github.io/2018/03/21/bayesian-optimization/
https://arxiv.org/pdf/1807.02811.pdf
https://en.wikipedia.org/wiki/Thompson_sampling
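Since the last link above points at Thompson sampling, here is a minimal sketch of the idea for a Bernoulli bandit. The function name, the prior, and the three arm probabilities are my own choices for illustration; the principle is simply: sample one plausible parameter per arm from its posterior, then act greedily on that sample.

```python
import numpy as np

def thompson_sampling(true_probs, n_rounds=2000, seed=0):
    """Bernoulli bandit solved by Thompson sampling with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    wins = np.ones(n_arms)    # Beta alpha parameter (successes + 1)
    losses = np.ones(n_arms)  # Beta beta parameter (failures + 1)
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior,
        # then play the arm whose draw is largest.
        theta = rng.beta(wins, losses)
        arm = int(np.argmax(theta))
        reward = rng.random() < true_probs[arm]
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8])
print(pulls)  # the 0.8 arm receives the vast majority of the pulls
```

Exploration happens automatically: arms with few observations have wide posteriors, so they occasionally produce the largest draw and get tried again.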

From the abstract of the tutorial at https://arxiv.org/pdf/1807.02811.pdf:

Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best-suited for optimization over continuous domains of less than 20 dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample. In this tutorial, we describe how Bayesian optimization works, including Gaussian process regression and three common acquisition functions: expected improvement, entropy search, and knowledge gradient. We then discuss more advanced techniques, including running multiple function evaluations in parallel, multi-fidelity and multi-information source optimization, expensive-to-evaluate constraints, random environmental conditions, multi-task Bayesian optimization, and the inclusion of derivative information. We conclude with a discussion of Bayesian optimization software and future research directions in the field. Within our tutorial material we provide a generalization of expected improvement to noisy evaluations, beyond the noise-free setting where it is more commonly applied. This generalization is justified by a formal decision-theoretic argument, standing in contrast to previous ad hoc modifications.
Surrogate model
https://towardsdatascience.com/the-intuitions-behind-bayesian-optimization-with-gaussian-processes-7e00fcc898a0
https://github.com/tsudalab
https://www.sciencedirect.com/science/article/pii/S2352924516300035
https://github.com/tsudalab/combo3
Bayesian optimization has been proven as an effective tool in accelerating scientific discovery.
A standard implementation (e.g., scikit-learn),
however, can accommodate only small training data.
COMBO is highly scalable due to an efficient protocol that employs Thompson sampling,
random feature maps, one-rank Cholesky update and automatic hyperparameter tuning.
https://raw.githubusercontent.com/thuijskens/bayesian-optimization/master/python/plotters.py
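COMBO's scalability rests in part on random feature maps. As a sketch of that idea (my own illustration, not COMBO's code), random Fourier features replace the RBF kernel by an explicit finite-dimensional feature map whose dot products approximate the kernel:

```python
import numpy as np

def random_fourier_features(X, n_features=2000, lengthscale=1.0, seed=0):
    """Map X so that dot products of the features approximate the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))  (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Spectral sample of the RBF kernel: Gaussian frequencies, uniform phases.
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Z = random_fourier_features(X)
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
K_approx = Z @ Z.T
print(np.max(np.abs(K_exact - K_approx)))  # small, roughly O(1/sqrt(n_features))
```

With an explicit feature map, GP-style regression reduces to Bayesian linear regression in the feature space, which is what makes batch updates and one-rank Cholesky updates cheap.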

```python
""" gp.py

Bayesian optimisation of loss function.
"""

import numpy as np
import sklearn.gaussian_process as gp

from scipy.stats import norm
from scipy.optimize import minimize


def expected_improvement(x, gaussian_process, evaluated_loss, greater_is_better=False, n_params=1):
    """ expected_improvement

    Expected improvement acquisition function.

    Arguments:
    ----------
    x: array-like, shape = [n_samples, n_hyperparams]
        The point for which the expected improvement needs to be computed.
    gaussian_process: GaussianProcessRegressor object.
        Gaussian process trained on previously evaluated hyperparameters.
    evaluated_loss: Numpy array.
        Numpy array that contains the values of the loss function for the previously
        evaluated hyperparameters.
    greater_is_better: Boolean.
        Boolean flag that indicates whether the loss function is to be maximised or minimised.
    n_params: int.
        Dimension of the hyperparameter space.
    """

    x_to_predict = x.reshape(-1, n_params)

    mu, sigma = gaussian_process.predict(x_to_predict, return_std=True)

    if greater_is_better:
        loss_optimum = np.max(evaluated_loss)
    else:
        loss_optimum = np.min(evaluated_loss)

    scaling_factor = (-1) ** (not greater_is_better)

    # In case sigma equals zero
    with np.errstate(divide='ignore'):
        Z = scaling_factor * (mu - loss_optimum) / sigma
        expected_improvement = scaling_factor * (mu - loss_optimum) * norm.cdf(Z) + sigma * norm.pdf(Z)
        expected_improvement[sigma == 0.0] = 0.0  # no improvement where the GP is certain

    return -1 * expected_improvement


def sample_next_hyperparameter(acquisition_func, gaussian_process, evaluated_loss, greater_is_better=False,
                               bounds=(0, 10), n_restarts=25):
    """ sample_next_hyperparameter

    Proposes the next hyperparameter to sample the loss function for.

    Arguments:
    ----------
    acquisition_func: function.
        Acquisition function to optimise.
    gaussian_process: GaussianProcessRegressor object.
        Gaussian process trained on previously evaluated hyperparameters.
    evaluated_loss: array-like, shape = [n_obs,]
        Numpy array that contains the values of the loss function for the previously
        evaluated hyperparameters.
    greater_is_better: Boolean.
        Boolean flag that indicates whether the loss function is to be maximised or minimised.
    bounds: array-like, shape = [n_params, 2].
        Bounds for the L-BFGS optimiser.
    n_restarts: integer.
        Number of times to run the minimiser with different starting points.
    """
    best_x = None
    best_acquisition_value = 1
    n_params = bounds.shape[0]

    for starting_point in np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_restarts, n_params)):

        res = minimize(fun=acquisition_func,
                       x0=starting_point.reshape(1, -1),
                       bounds=bounds,
                       method='L-BFGS-B',
                       args=(gaussian_process, evaluated_loss, greater_is_better, n_params))

        if res.fun < best_acquisition_value:
            best_acquisition_value = res.fun
            best_x = res.x

    return best_x


def bayesian_optimisation(n_iters, sample_loss, bounds, x0=None, n_pre_samples=5,
                          gp_params=None, random_search=False, alpha=1e-5, epsilon=1e-7):
    """ bayesian_optimisation

    Uses Gaussian Processes to optimise the loss function sample_loss.

    Arguments:
    ----------
    n_iters: integer.
        Number of iterations to run the search algorithm.
    sample_loss: function.
        Function to be optimised.
    bounds: array-like, shape = [n_params, 2].
        Lower and upper bounds on the parameters of the function sample_loss.
    x0: array-like, shape = [n_pre_samples, n_params].
        Array of initial points to sample the loss function for. If None, randomly
        samples from the loss function.
    n_pre_samples: integer.
        If x0 is None, samples n_pre_samples initial points from the loss function.
    gp_params: dictionary.
        Dictionary of parameters to pass on to the underlying Gaussian Process.
    random_search: integer.
        Flag that indicates whether to perform random search or L-BFGS-B optimisation
        over the acquisition function.
    alpha: double.
        Variance of the error term of the GP.
    epsilon: double.
        Precision tolerance for floats.
    """

    x_list = []
    y_list = []

    n_params = bounds.shape[0]

    if x0 is None:
        for params in np.random.uniform(bounds[:, 0], bounds[:, 1], (n_pre_samples, n_params)):
            x_list.append(params)
            y_list.append(sample_loss(params))
    else:
        for params in x0:
            x_list.append(params)
            y_list.append(sample_loss(params))

    xp = np.array(x_list)
    yp = np.array(y_list)

    # Create the GP
    if gp_params is not None:
        model = gp.GaussianProcessRegressor(**gp_params)
    else:
        kernel = gp.kernels.Matern()
        model = gp.GaussianProcessRegressor(kernel=kernel,
                                            alpha=alpha,
                                            n_restarts_optimizer=10,
                                            normalize_y=True)

    for n in range(n_iters):

        model.fit(xp, yp)

        # Sample next hyperparameter; the driver below minimises sample_loss,
        # hence greater_is_better=False.
        if random_search:
            x_random = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(random_search, n_params))
            ei = -1 * expected_improvement(x_random, model, yp, greater_is_better=False, n_params=n_params)
            next_sample = x_random[np.argmax(ei), :]
        else:
            next_sample = sample_next_hyperparameter(expected_improvement, model, yp, greater_is_better=False,
                                                     bounds=bounds, n_restarts=100)

        # Duplicates will break the GP. In case of a duplicate, we will randomly sample a next query point.
        if np.any(np.all(np.abs(next_sample - xp) <= epsilon, axis=1)):
            next_sample = np.random.uniform(bounds[:, 0], bounds[:, 1], n_params)

        # Sample loss for new set of parameters
        cv_score = sample_loss(next_sample)

        # Update lists
        x_list.append(next_sample)
        y_list.append(cv_score)

        # Update xp and yp
        xp = np.array(x_list)
        yp = np.array(y_list)

    return xp, yp


def sample_loss(x):
    # Quadratic test function, minimised at x = (0, 1, ..., n_params - 1).
    loss = 0.
    for j in range(x.shape[0]):
        loss = loss + (x[j] - j) ** 2
    return loss


n_iters = 300
n_params = 2
bounds = np.zeros((n_params, 2))
for i in range(n_params):
    bounds[i, 0] = -n_params * 2.
    bounds[i, 1] = n_params * 2.
xp, yp = bayesian_optimisation(n_iters, sample_loss, bounds, x0=None, n_pre_samples=5,
                               gp_params=None, random_search=False, alpha=1e-5, epsilon=1e-7)
idx = yp.argsort()[::-1]  # worst first, so the best (smallest) loss is printed last
xp = xp[idx]
yp = yp[idx]
for i in range(xp.shape[0]):
    print(xp[i, :])
for i in range(yp.shape[0]):
    print(yp[i])
print(xp.shape, yp.shape)
```
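For reference, the quantity computed in `expected_improvement` above is, in the maximisation case with best observed value $f^{*}$:

```latex
\mathrm{EI}(x) = \left(\mu(x) - f^{*}\right)\Phi(Z) + \sigma(x)\,\phi(Z),
\qquad Z = \frac{\mu(x) - f^{*}}{\sigma(x)},
```

where $\mu(x)$ and $\sigma(x)$ are the Gaussian-process posterior mean and standard deviation, $\Phi$ and $\phi$ are the standard normal CDF and PDF, and EI is set to zero wherever $\sigma(x) = 0$. The minimisation case follows by flipping the sign of $\mu(x) - f^{*}$, which is what the `scaling_factor` in the code does.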

## Professor Kimyeong Lee (Korea Institute for Advanced Study) by 바죠
