Objective Function and Optimization


싶만생각

2022. 4. 10. 21:26

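What is an objective function?

The function used to evaluate a machine learning model during training goes by several names: loss function, cost function, objective function, and so on.

There is no strict definition separating the three terms, but it is generally convenient to think of them as follows: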

A loss function is a part of a cost function which is a type of an objective function.
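Source

In short, a machine learning model tries to minimise its loss and cost functions and to optimise its objective function.

Which objective function we optimise depends on whether the best model is defined from a probabilistic viewpoint or from an error-minimization viewpoint.

From the probabilistic viewpoint the objective is maximized (Maximum Likelihood Estimation); from the error viewpoint the objective is minimized (Mean Squared Error, etc.).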
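The two viewpoints can pick the same model. The following is a minimal sketch, assuming Gaussian noise with a fixed standard deviation (an assumption not made in the post) and a hypothetical one-parameter model `y = w * x`: the slope that maximizes the log-likelihood is the slope that minimizes the MSE, because the log-likelihood is a constant minus a positive multiple of the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # hypothetical data, true slope 2.0

def mse(w):
    """Error view: mean squared error of the model y = w * x."""
    return np.mean((y - w * x) ** 2)

def gaussian_log_likelihood(w, sigma=0.5):
    """Probabilistic view: log-likelihood under N(w * x, sigma^2) noise (assumed)."""
    resid = y - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# Evaluate both objectives on the same grid of candidate slopes
candidates = np.linspace(0.0, 4.0, 401)
w_min_mse = candidates[np.argmin([mse(w) for w in candidates])]
w_max_ll = candidates[np.argmax([gaussian_log_likelihood(w) for w in candidates])]

print(w_min_mse, w_max_ll)
```

Under a different noise assumption the likelihood changes, and the matching error function is no longer the MSE.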

Convexity

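When we want to minimize the objective function of a machine learning (deep learning) model, a convex objective makes finding the optimum (the minimizer of the objective) much simpler, because every local minimum is also a global minimum.

A convex function is one for which the secant line joining any two points on the graph lies on or above the graph between those points.

Precisely, a function $f$ is convex if, for all $x, y$ in its domain and all $t \in [0,1]$,

$f(t x+(1-t) y) \leq t f(x)+(1-t) f(y)$

always holds.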
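The inequality can be probed numerically. This is a rough sketch, with a hypothetical helper name `satisfies_convexity`, that samples random $x$, $y$, $t$ on an interval and checks the inequality. A failed check proves the function is not convex on that interval; a passed check is only evidence, not a proof.

```python
import numpy as np

def satisfies_convexity(f, lo=-3.0, hi=3.0, trials=10_000, seed=0):
    """Randomly test f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on [lo, hi]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, trials)
    y = rng.uniform(lo, hi, trials)
    t = rng.uniform(0.0, 1.0, trials)
    lhs = f(t * x + (1 - t) * y)        # value of f at the interpolated point
    rhs = t * f(x) + (1 - t) * f(y)     # height of the secant line at that point
    return bool(np.all(lhs <= rhs + 1e-12))  # small tolerance for rounding

print(satisfies_convexity(lambda v: v ** 2))  # x^2 is convex
print(satisfies_convexity(np.sin))            # sin is concave on [0, pi], so this fails
```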

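Plotting a convex function shows that its local minimum and global minimum coincide.

For a convex function there is no risk of converging to a local minimum that is not global, so plain gradient descent (with a suitably small learning rate) is enough to reach the optimum.

Optimizing a deep learning model, however, is a non-convex problem: plain gradient descent is not guaranteed to reach the global minimum, and many optimization methods have been proposed as a result.

As an actual example, the (non-convex) objective function of VGG56 looks like this when visualized:

Visualizing the Loss Landscape of Neural Nets: objective function of VGG56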

Optimization

Optimization methods fall into two groups: direct methods and iterative methods.

Direct method

$\hat{\theta}=\left(X^{T} X\right)^{-1} X^{T} y$

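A direct method requires the objective function to be convex and to admit a closed-form solution. The formula above is the closed-form solution of linear least squares.

Its advantage is that the optimum is obtained in a single step, without iteration.

The computation, however, needs a matrix inverse, and with as many parameters as a deep learning model has, the cost of that inversion becomes too high for the method to be practical.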
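As a small illustration of the direct method, the sketch below applies $\hat{\theta}=\left(X^{T} X\right)^{-1} X^{T} y$ to synthetic linear regression data. The data, sizes, and `theta_true` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
theta_true = np.array([1.5, -2.0, 0.5])  # hypothetical ground-truth parameters
y = X @ theta_true + rng.normal(scale=0.1, size=n_samples)

# The formula exactly as written, with an explicit inverse
theta_hat_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same solution from the linear system (X^T X) theta = X^T y,
# which avoids forming the inverse and is usually more stable numerically
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_hat)
```

Either way, the cost of factoring the $d \times d$ matrix $X^{T} X$ grows cubically with the number of parameters $d$, which is why this does not scale to deep learning models.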

Iterative method

The optimum ${\theta}$ is obtained by updating it repeatedly.

If ${\theta_t}$ is the current estimate of the optimum, one iteration gives ${\theta_{t+1}} = {\theta_t} + {\delta_t}$.

Here the update is $\delta_{t}=\underset{\delta}{\arg \min } L\left(\theta_{t}+\delta\right)$.

Depending on how $L\left(\theta_{t}+\delta\right)$ is approximated, this yields gradient descent, Newton's method, and so on.

The iterative method that uses the first-order Taylor approximation is called gradient descent:

$\theta_{t+1}=\theta_{t}-\alpha \nabla L\left(\theta_{t}\right)$

Here ${\alpha}$ is the learning rate, a hyperparameter.
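A minimal sketch of this update rule, assuming a hypothetical convex quadratic loss with its minimum at $(3, -1)$; the loss, starting point, and step count are illustrative choices only.

```python
import numpy as np

def loss(theta):
    """Hypothetical convex loss with its minimum at (3, -1)."""
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    """Analytic gradient of the loss above."""
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

def gradient_descent(theta0, alpha=0.1, steps=200):
    """Repeat theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

theta_gd = gradient_descent([0.0, 0.0])
print(theta_gd, loss(theta_gd))
```

The learning rate matters: for this particular quadratic the iterates diverge once `alpha` exceeds 0.5, because the second coordinate is then multiplied by a factor larger than 1 in magnitude at every step.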

The iterative method that uses the second-order Taylor approximation is called Newton's method:

$\theta_{t+1}=\theta_{t}-\nabla^{2} L\left(\theta_{t}\right)^{-1} \nabla L\left(\theta_{t}\right)$
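Both updates follow from minimizing a local model of $L$ over $\delta$. The sketch below is a standard derivation rather than something spelled out in the source. In particular, the proximity term $\frac{1}{2\alpha}\|\delta\|^{2}$ is an assumption added here, because the first-order model alone is linear in $\delta$ and has no minimizer.

First order (gradient descent):

$L\left(\theta_{t}+\delta\right) \approx L\left(\theta_{t}\right)+\nabla L\left(\theta_{t}\right)^{T} \delta+\frac{1}{2 \alpha}\|\delta\|^{2}$

$\nabla L\left(\theta_{t}\right)+\frac{1}{\alpha} \delta_{t}=0 \;\Rightarrow\; \delta_{t}=-\alpha \nabla L\left(\theta_{t}\right)$

Second order (Newton's method):

$L\left(\theta_{t}+\delta\right) \approx L\left(\theta_{t}\right)+\nabla L\left(\theta_{t}\right)^{T} \delta+\frac{1}{2} \delta^{T} \nabla^{2} L\left(\theta_{t}\right) \delta$

$\nabla L\left(\theta_{t}\right)+\nabla^{2} L\left(\theta_{t}\right) \delta_{t}=0 \;\Rightarrow\; \delta_{t}=-\nabla^{2} L\left(\theta_{t}\right)^{-1} \nabla L\left(\theta_{t}\right)$

The second stationary point is a minimizer of the local model only when $\nabla^{2} L\left(\theta_{t}\right)$ is positive definite.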

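Newton's method has the advantage of converging faster than gradient descent, typically needing fewer iterations near a minimum.

Even so, Newton's method is not well suited to deep learning.

It requires the inverse Hessian matrix, which is far too expensive to compute, and because deep learning objectives are generally non-convex, Newton's method is also likely to converge to a saddle point.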
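Both points can be seen on toy functions. The sketch below uses a hypothetical helper `newton_step` and two made-up objectives: on a convex quadratic one Newton step lands exactly on the minimum, while on the non-convex $f(x, y)=x^{2}-y^{2}$ one step lands exactly on the saddle point at the origin.

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton update: theta - H^{-1} g, computed by solving H d = g."""
    return theta - np.linalg.solve(hess(theta), grad(theta))

# Convex quadratic (x - 3)^2 + 2 (y + 1)^2, minimum at (3, -1)
def grad_quadratic(th):
    return np.array([2.0 * (th[0] - 3.0), 4.0 * (th[1] + 1.0)])

def hess_quadratic(th):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

# Non-convex x^2 - y^2, saddle point at (0, 0)
def grad_saddle(th):
    return np.array([2.0 * th[0], -2.0 * th[1]])

def hess_saddle(th):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

theta_min = newton_step(np.array([0.0, 0.0]), grad_quadratic, hess_quadratic)
theta_saddle = newton_step(np.array([1.0, 1.0]), grad_saddle, hess_saddle)

print(theta_min)     # the minimum of the quadratic
print(theta_saddle)  # the saddle point, not a minimum
```

Newton's method looks for a point where the gradient vanishes, and a saddle point satisfies that condition just as a minimum does.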

Optimizer (iterative method)

Gradient descent searches for the optimum with $\theta_{t+1}=\theta_{t}-\alpha \nabla L\left(\theta_{t}\right)$.

On a convex function gradient descent is enough to reach the global minimum, but on a non-convex function it may converge to a local minimum or a saddle point.
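A one-dimensional sketch of this behaviour, using the made-up non-convex function $f(x)=x^{4}-3 x^{2}+x$, which has a global minimum near $x \approx-1.30$ and a shallower local minimum near $x \approx 1.13$. The same gradient descent loop ends in different minima depending only on where it starts.

```python
def f(x):
    """Hypothetical non-convex objective with two minima."""
    return x**4 - 3.0 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4.0 * x**3 - 6.0 * x + 1.0

def gradient_descent_1d(x0, alpha=0.01, steps=500):
    x = float(x0)
    for _ in range(steps):
        x = x - alpha * df(x)
    return x

x_from_left = gradient_descent_1d(-2.0)   # slides into the global minimum
x_from_right = gradient_descent_1d(2.0)   # gets stuck in the local minimum

print(x_from_left, f(x_from_left))
print(x_from_right, f(x_from_right))
```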

https://www.pinterest.co.kr/pin/672232681861372201/

So gradient descent has to be modified before it can be used for deep learning (non-convex).

The parts of the update that can be modified are ${\alpha}$ and $\nabla L\left(\theta_{t}\right)$.

${\alpha}$ determines how far to move in one step, and $\nabla L\left(\theta_{t}\right)$ determines the direction in which to move.
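As one illustration of modifying the direction term, the sketch below replaces the raw gradient with a running accumulation of past gradients (momentum). The post itself does not name a specific optimizer at this point, so the choice of momentum, the coefficient `beta`, and the toy quadratic loss are all assumptions made for the example.

```python
import numpy as np

def grad(theta):
    """Gradient of a hypothetical quadratic loss with its minimum at (3, -1)."""
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

def momentum_descent(theta0, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent whose direction is a decayed sum of past gradients."""
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)  # modified direction
        theta = theta - alpha * velocity          # step size still set by alpha
    return theta

theta_momentum = momentum_descent([0.0, 0.0])
theta_plain = momentum_descent([0.0, 0.0], beta=0.0, steps=200)  # beta=0 is plain GD

print(theta_momentum, theta_plain)
```

Setting `beta=0.0` recovers plain gradient descent, which makes it clear that only the direction term has changed.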

The lineage of optimizers that grew out of these ideas is summarized in the following slide:

자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다. - 하용호

References

Visualizing the Loss Landscape of Neural Nets

Convex function - Wikipedia


PyTorch Lecture 03: Gradient Descent

자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.
