Let \(f: X \rightarrow \mathbb{R}\) be a convex function. Jensen’s Inequality states that for all \(\lambda \in [0, 1]\) and all \(x_1, x_2 \in X\),
\[f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)\]
This generalizes to \(n\) elements of the function’s domain, so long as the coefficients \(\lambda_i\) are non-negative and sum to 1:
\[f\Big(\sum_i \lambda_i x_i\Big) \leq \sum_i \lambda_i f(x_i)\]
The inequality is reversed if the function is concave.
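As a quick numerical sanity check, here is a minimal sketch that samples non-negative coefficients summing to 1 and verifies the inequality for one convex function. The choice of \(f(x) = x^2\) and the random points are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # An example convex function; any convex f satisfies Jensen's Inequality.
    return x ** 2

n = 5
x = rng.uniform(-10, 10, size=n)   # arbitrary points in the domain
lam = rng.random(n)
lam /= lam.sum()                   # non-negative coefficients summing to 1

lhs = f(np.dot(lam, x))            # f(sum_i lambda_i x_i)
rhs = np.dot(lam, f(x))            # sum_i lambda_i f(x_i)
assert lhs <= rhs                  # Jensen's Inequality for a convex f
print(f"f(weighted mean) = {lhs:.4f} <= weighted mean of f = {rhs:.4f}")
```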
Jensen’s Inequality appears everywhere in information theory (e.g., in results about KL divergence and mutual information) due to the frequent appearance of \(\log\), a concave function. For instance, we can bound a random variable’s expected log by the log of its expected value:
\[\mathbf{E}[\log X] = \sum_x p(x) \log x \leq \log \sum_x p(x) x = \log (\mathbf{E}[X])\]
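The same bound can be checked numerically. This sketch assumes a small hand-picked discrete distribution with positive support (so \(\log\) is defined); the specific values are illustrative only:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])   # support of X (positive, so log is defined)
p = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities summing to 1

e_log_x = np.sum(p * np.log(x))       # E[log X]
log_e_x = np.log(np.sum(p * x))       # log(E[X])
assert e_log_x <= log_e_x             # Jensen's Inequality for the concave log
print(f"E[log X] = {e_log_x:.4f} <= log E[X] = {log_e_x:.4f}")
```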