This blog post is the result of a discussion (read: a borderline heated argument) that I had with a friend about whether the machine learning terms "loss function" and "objective function" mean the same thing. We will find out whether they do by the end of this post. I am writing this post as a dialogue between Pinfy and Scooby, to let you decide if it was a discussion or an argument :'). Also, I am using a Colab notebook to write this post because the post is going to get mathematical.

Pinfy: So, I am implementing a neural network for a certain part of my work. I keep seeing the terms loss function and objective function being used interchangeably. It puts me off. Do you think they are really the same?

Scooby: Hmm. They are the same. What makes you doubtful?

Pinfy : My problem is of the form: For a given function $f(.)$, solve: $\max_{\theta} f(r_{\theta})$. A NN is used to model $\theta \in \mathbb{R}^{M}$, and $r \in \mathbb{R}^{N}$. This is clearly an optimization problem with $f(.)$ as the objective. What is even meant by loss here?

Scooby: Okay. Here, $-f$ is like a loss function, right?

Pinfy : No, $f$ is the objective function. As per my understanding, an objective function gives you an absolute measure of goodness or badness of the model parameters. In the problem that I mention above, $f$ measures the goodness or badness of $\theta$. An example of an objective function is a likelihood function. Loss is a relative quantity. You measure loss with respect to a certain fixed quantity. What is that quantity in $f$?

Scooby: I don't understand. Aren't the terms loss function and objective function interchangeable?

Pinfy : It is certainly debatable. I don't think they are the same, at least the way I understand the terms. For discriminative modelling, the two terms are the same. Not for generative modelling.

Scooby: From what I know, negative of the likelihood function is considered as the loss function for maximum likelihood generative modelling. Loss function is just the negative of the objective if the optimization problem is finding the max. And, an objective function is called loss or cost function when the optimization problem is a minimization problem.

Pinfy : Negative or positive is just a change of sign to accommodate whether the optimization problem is cast as a minimization problem or a maximization problem. Because most implementations use gradient descent to optimize the models, the negative of the objective is used. So, I am not sure if it is right to say 'Loss function is just the negative of the objective if the optimization problem is finding the max.'

Scooby: Why are you not sure? I don't understand your doubt.

Pinfy : I think it is incorrect to say 'Negative of objective function is the loss function' as a general statement. Loss is a term that is used when there is an explicit sense of difference or distance. Doing ascent or descent over the same objective does not make one thing an objective function and another a loss function.
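Pinfy's point about ascent versus descent can be illustrated with a minimal sketch. The concave function `f` below and its maximizer at $\theta = 2$ are hypothetical choices made only for illustration; they are not the $f$ from Pinfy's actual problem. Gradient ascent on `f` and gradient descent on `-f` apply the same update, so they trace the same iterates.

```python
def f(theta):
    # Hypothetical concave objective with its maximum at theta = 2.
    return -(theta - 2.0) ** 2


def grad_f(theta):
    # Derivative of f with respect to theta.
    return -2.0 * (theta - 2.0)


def gradient_ascent(theta, lr=0.1, steps=100):
    # Maximize f by stepping along its gradient.
    for _ in range(steps):
        theta = theta + lr * grad_f(theta)
    return theta


def gradient_descent_on_negative(theta, lr=0.1, steps=100):
    # Minimize -f by stepping against the gradient of -f.
    for _ in range(steps):
        theta = theta - lr * (-grad_f(theta))
    return theta


theta_up = gradient_ascent(0.0)
theta_down = gradient_descent_on_negative(0.0)
print(theta_up, theta_down)
```

Both runs end at the same value of $\theta$, which is why flipping the sign alone says nothing about whether the quantity deserves the name "loss".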

Scooby: Hmm. So, you are saying $-f$ would be called a loss function only if it can be interpreted as minimizing some distance? Otherwise, one cannot call $-f$ a loss function?

Pinfy : Yes, that is exactly what I am saying.

Scooby: Okay, I get it now. You have a problem with the terminology. Let us try to understand it with an example of a generative modelling problem. Take the example of fitting a Gaussian to some data. Suppose we are given some $x_i, i = 1,2,...,N$, and we want to determine the mean, $\mu$, of the Gaussian that fits the data. Let us keep the standard deviation of the Gaussian fixed, say $\sigma$. Let us pose the problem as a maximum likelihood problem.

Pinfy : Alright, let us write the problem mathematically:

IID input data: $x_i \in \mathcal{X}, i = 1,2,...,N$

Gaussian to be fit to data: $p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp(-\frac{(x-\theta)^2}{2\sigma^2})$

Here, $\theta = \mu$ is the mean of the Gaussian (and hence, the parameter of our mean fitting model) that we want to fit to the data $x_i$.

The likelihood function, $L(\theta) = p(x|\theta) = p(x_1|\theta)\, p(x_2|\theta) \cdots p(x_N|\theta) = \prod_{i = 1}^{N} p(x_i|\theta)$.

The optimization problem to find $\theta$ then becomes:

$\theta = \arg \max_{\theta} L(\theta)$

$\quad = \arg \max_{\theta} \log(L(\theta)) $

$\quad = \arg \max_{\theta} l(\theta) $

Now, $l(\theta) = \log(\frac{1}{(2\pi\sigma^{2})^{N/2}} \prod_{i = 1}^{N}\exp(-\frac{(x_i-\theta)^2}{2\sigma^2})) $

$l(\theta) = -\frac{N}{2}\log(2\pi\sigma^{2}) - \sum_{i = 1}^{N}\frac{(x_i-\theta)^2}{2\sigma^2} $

$\theta = \arg \max_{\theta} -\sum_{i = 1}^{N} \frac{(x_i-\theta)^2}{2\sigma^2}$

$\quad = \arg \max_{\theta} -\sum_{i = 1}^{N} (x_i-\theta)^2 \quad \quad \quad (1)$

Scooby: Do you see it now? The last line you wrote is like minimizing the sum of squared $L^{2}$ distances between the data points $x_i$ and the mean $\mu$. So, you are minimizing the $L^{2}$ loss.
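The equivalence behind Eqn. (1) can be checked numerically. The sketch below is only an illustration under assumed values: the data is synthetic, and the true mean `3.0`, the fixed `sigma = 2.0`, and the search grid are all arbitrary choices that do not come from the conversation. It compares the maximizer of the Gaussian log-likelihood with the maximizer of the negative sum of squared distances, and with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                     # fixed, known standard deviation (assumed)
x = rng.normal(loc=3.0, scale=sigma, size=500)  # synthetic IID data (assumed true mean 3.0)


def log_likelihood(theta, x, sigma):
    # l(theta) = -(N/2) log(2 pi sigma^2) - sum_i (x_i - theta)^2 / (2 sigma^2)
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((x - theta) ** 2) / (2 * sigma**2)


def neg_sum_sq(theta, x):
    # Right-hand side of Eqn. (1): -sum_i (x_i - theta)^2
    return -np.sum((x - theta) ** 2)


# Brute-force search over a grid of candidate means (step 0.001).
thetas = np.linspace(0.0, 6.0, 6001)
ll = np.array([log_likelihood(t, x, sigma) for t in thetas])
ss = np.array([neg_sum_sq(t, x) for t in thetas])

theta_ll = thetas[np.argmax(ll)]
theta_ss = thetas[np.argmax(ss)]
print(theta_ll, theta_ss, x.mean())
```

A grid search is used instead of an optimizer so that nothing beyond NumPy is needed. Up to the grid resolution, both maximizers coincide with the sample mean.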

Pinfy : No, I still don't see it. I agree $\theta$ can be obtained by minimizing the sum of squared $L^{2}$ distances. It would be called a loss if I were predicting the mean and had a true mean to compare it against, something like: $\mu_{\theta} - {\mu}$.

Scooby: I don't think you understand it right. Loss function is always calculated over a set of inputs. What is your definition of a loss function?

Pinfy : A loss function, according to me, quantifies how well predictions match the true labels, how well estimated parameters match the true parameters, and so on.

Scooby: That sounds vague. State it formally.

Pinfy : A loss function is of the form: $\sum_{i} (y_i - y_i^t)$, where $i$ iterates over the data, $y_i$ is the estimated output for the input $x_i$, and $y_i^t$ is the ground truth output. I have a problem with calling Eqn. (1) a loss function. The term loss should not be used to specify the goodness or badness of parameters.

Scooby: Why? Why not? What really matters is the underlying mathematics. If the mathematics is consistent, then that is all that matters. What you obtained in Eqn. (1) is the $L^2$ loss for the problem. It doesn't have to look like $\mu^{*} - \mu$.

Pinfy : I beg to differ. It should not be called a loss if it is not of the form $\mu^{*} - \mu$.

Scooby: Wait, I think Eqn. (1) can be expressed in that form. Expand and see.

Pinfy : Okay.

$\arg \max_{\theta} -\sum_{i = 1}^{N} (x_i-\theta)^2$

$\quad = \arg \max_{\theta} -\sum_{i = 1}^{N} (x_i^{2} + \theta^2 - 2x_i\theta)$

$\quad = \arg \max_{\theta} -\sum_{i = 1}^{N} (\theta^2 - 2x_i\theta) \quad$ (Dropping $\sum_{i = 1}^{N} x_i^{2}$, which does not depend on $\theta$)

$\quad = \arg \max_{\theta} -\left(\sum_{i = 1}^{N} \theta^2 - \sum_{i = 1}^{N} 2x_i\theta\right)$

$\quad = \arg \max_{\theta} -\left(N\theta^2 - 2\theta\sum_{i = 1}^{N} x_i\right)$

$\quad = \arg \max_{\theta} -N ( \theta^2 - 2\theta x_{mean}) \quad$ (Where $x_{mean} = \frac{1}{N}\sum_{i = 1}^{N} x_i$)

$\quad = \arg \max_{\theta} -N( \theta^2 - 2\theta x_{mean} + x_{mean}^{2} - x_{mean}^{2})$

$\quad = \arg \max_{\theta} -N( \theta - x_{mean})^{2} \quad$ (Dropping the constant $N x_{mean}^{2}$)
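As a numerical sanity check on completing the square, the sketch below compares $-\sum_{i} (x_i-\theta)^2$ with $-N(\theta - x_{mean})^2 - \sum_{i} (x_i - x_{mean})^2$, taking $x_{mean}$ to be the arithmetic mean $\frac{1}{N}\sum_{i} x_i$. The data is synthetic and the function names are illustrative assumptions. The two expressions should agree for every $\theta$, so they differ from $-N(\theta - x_{mean})^2$ only by a term that does not depend on $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)   # arbitrary synthetic data
n = x.size
x_mean = x.mean()         # arithmetic mean, (1/N) * sum_i x_i


def objective(theta):
    # Original form: -sum_i (x_i - theta)^2
    return -np.sum((x - theta) ** 2)


def completed_square(theta):
    # Distance-to-the-mean form plus a term that is constant in theta.
    return -n * (theta - x_mean) ** 2 - np.sum((x - x_mean) ** 2)


thetas = np.linspace(-2.0, 2.0, 9)
gaps = [abs(objective(t) - completed_square(t)) for t in thetas]
max_gap = max(gaps)
print(max_gap)
```

The gap stays at floating-point rounding level, and the distance term is largest (zero) exactly at $\theta = x_{mean}$.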

Scooby: Now, do you see it?

Pinfy : Yes. So, an objective function can be expressed as a loss function. However, it is not right to blindly say one is the negative of the other.

Scooby: Yeah, agreed.

I am sure you have judged by now if the conversation was an argument or not :'). Irrespective of what it was, let us list the takeaways from the conversation above:

For general machine learning (ML) problems, the loss function and the objective function mean almost the same thing.
It is important to be clear about the definitions of the quantities that you deal with on a daily basis as an ML researcher. Nobody really defines what a loss function actually means. I believe the ML community would greatly benefit if an IEEE-like standards body could unify ML terminology.
Even when things are given to you on a platter, be critical about them. In the discussion above, Scooby knew that the loss function was equivalent to the negative of the objective function for general ML problems, but did not have concrete reasoning for it.