I remember back in 2006, when I first heard of the Least Squares Problem during a physics laboratory at the National Technical University of Athens. The problem at that time was pretty simple: given a control variable x∈R and a measurement y∈R, identify the line y = f(x) that best fits the data. This line could be employed to evaluate the linear relationship between the variables but, most importantly, it could be used to interpolate the unknown variable (y) for values of the independent variable (x) that are not part of the measurements.
The naive intuition described above is the academic foundation and the ancestor of modern machine learning. In the general case, the data may contain a significant number of control variables and the “fitting” function may not be linear, but polynomial or exponential instead. However, through proper transformations, many such problems can be modeled in a linear form, and the method to solve the Least Squares Problem is the so-called Linear Regression method: the process of fitting a linear function y = f(X) to the experimental data (X, y). Consider an experiment during which we control (n) independent variables and perform (m) measurements of the dependent variable. It stands:
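In the notation above, with w₁, …, wₙ denoting the unknown coefficients (the symbol w is assumed here), the fitted linear model reads:

```latex
y_i = w_1 X_{i1} + w_2 X_{i2} + \dots + w_n X_{in}, \qquad i = 1, \dots, m
```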
As noted by Joseph Rocca (2018), modern neural networks, designed to model significantly more complex structures in data (including non-linear relationships), are the natural evolution of linear regression.
Considering a linear system of (m) equations with (n) unknowns:
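Written out explicitly, such a system reads:

```latex
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n &= b_1 \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n &= b_2 \\
&\;\;\vdots \\
a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n &= b_m
\end{aligned}
```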
This can be written in the matrix form: A·x = b
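where, in standard notation:

```latex
A = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \dots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}, \quad
x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n}, \quad
b = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \in \mathbb{R}^{m}
```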
Consider again the experiment during which (n) independent variables are controlled and (m) measurements are performed for the dependent variable. In the previous notation, the matrix A holds the (m) values recorded for the (n) independent variables and the vector b holds the (m) measured values of the dependent variable. Moreover, the vector x holds the (n) unknown coefficients of the linear function fitting the data. Employing the Linear Regression method already discussed is equivalent to solving the linear system of equations noted above.
From a Linear Algebra perspective, there are (3) scenarios to be discussed:
- When n = m, the linear system is well-determined and therefore it may have: (a) a unique solution, (b) infinitely many solutions or (c) no solution.
- When m < n, the linear system is under-determined (it has fewer equations than unknowns) and therefore it has infinitely many solutions or no solution.
- When m > n, the linear system is over-determined (it has more equations than unknowns) and therefore it typically has no exact solution.
The Linear Regression method most often deals with over-determined systems by identifying the solution x that “approximately” satisfies the initial (m) equations. The linear system A·x = b is better written in the form A·x ≃ b, since the equality is not exactly satisfiable and a residual vector r = b – A·x exists. The solution is approached by minimizing the squared Euclidean norm of the residual vector. The Least Squares Problem is therefore stated as below:
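In standard form:

```latex
\min_{x \in \mathbb{R}^n} \| r \|_2^2 = \min_{x \in \mathbb{R}^n} \| b - A x \|_2^2
```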
In order to minimize the norm, we perform the following calculations:
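Expanding the squared norm:

```latex
\| b - A x \|_2^2 = (b - A x)^T (b - A x) = b^T b - 2\, x^T A^T b + x^T A^T A\, x
```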
By computing the partial derivative:
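Setting the gradient with respect to x equal to zero:

```latex
\frac{\partial}{\partial x} \| b - A x \|_2^2 = -2\, A^T b + 2\, A^T A\, x = 0 \;\;\Rightarrow\;\; A^T A\, x = A^T b
```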
The last equation represents the “transformed” linear system, which is now well-determined, since the number of variables (n) is equal to the number of equations (n). This linear system is known as the normal equations. The analytical solution obtained by solving this linear system is the so-called closed-form solution of the Least Squares Problem.
It is clarified that the Linear Regression method will also provide the best approximate solution to a linear system A·x = b that is either well-determined or under-determined. In particular, in the case of consistent linear systems (those having at least one solution), the Least Squares Problem solution will satisfy the system exactly, because the residual vector r will be equal to zero.
However, the Linear Regression method is most often applied to over-determined systems because they are of the greatest practical importance. The reason is simple: the more measurements performed during an experiment, the better the chances to average out the measurement error (noise) and map the “real” mechanism that generates the data into a fitting function. This is explained by the Law of Large Numbers in statistics.
The Least Squares Problem A·x ≃ b, which “approximately” satisfies the initial (m) equations, always has a solution. Under certain circumstances that will be discussed later on, this solution can be obtained by solving the normal equations. The figure below provides the geometric interpretation of the Least Squares Problem.
Least Squares Problem Geometry | Modified by: Stanford University (2020)
For an over-determined linear system, the vector b does not belong to the column space (span) of the matrix A. The vectors A·x and b therefore do not lie in the same subspace. As a result, they can never be “exactly” equal and a residual vector r always exists. The best possible solution of the “approximation” problem is obtained when the vector b is closest to the vector A·x. This occurs when the residual vector r = b – A·x is orthogonal to the span of the matrix A.
Due to orthogonality, this occurs when:
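The orthogonality condition, written out, recovers the normal equations:

```latex
A^T r = A^T (b - A x) = 0 \;\;\Rightarrow\;\; A^T A\, x = A^T b
```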
The geometric interpretation thus leads to the same normal equations that were previously obtained through analytical calculations. These normal equations provide a closed-form solution to the initial Least Squares Problem under specific circumstances.
In order to find the optimal solution of the Least Squares Problem analytically, the linear system of normal equations has to be solved by inverting the covariance matrix Aᵀ·A. Provided that the covariance matrix is invertible, the closed-form solution of the Least Squares Problem is the following:
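That is:

```latex
x = (A^T A)^{-1} A^T b
```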
The covariance matrix Aᵀ·A is by definition:
- Non-negative definite: all its eigenvalues are ≥ 0 (non-negative)
The above assures that the normal equations have either a unique solution or infinitely many solutions. When the linear system has infinitely many solutions, though, the covariance matrix Aᵀ·A cannot be inverted. It is only when the covariance matrix is positive definite, meaning that all its eigenvalues are > 0 (positive), that it is also invertible, and therefore the normal equations have a unique solution that can be obtained analytically.
It is also useful to know that:
- When Aᵀ·A is non-negative definite but singular (it has at least one zero eigenvalue) ⇒ the initial matrix A is rank-deficient ⇒ rank(A) < n ⇒ the columns of A are not linearly independent ⇒ the Least Squares Problem has infinitely many solutions.
- When Aᵀ·A is positive definite ⇒ the initial matrix A is full rank ⇒ rank(A) = n ⇒ the columns of A are linearly independent ⇒ the Least Squares Problem has a unique solution.
To conclude, the unique solution of the Least Squares Problem can be obtained analytically by solving the normal equations only when the matrix A is full rank, meaning that the covariance matrix Aᵀ·A is invertible. Matrix inversion is a computationally expensive process, and it is therefore always important to consider cost-efficient algorithms for this task (e.g. LU decomposition, Cholesky decomposition etc.).
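A minimal sketch of the closed-form approach in NumPy, solving the normal equations with a Cholesky factorization instead of an explicit inverse (the data here are synthetic and all numeric values are illustrative assumptions):

```python
import numpy as np

# Hypothetical experiment: m = 100 noisy measurements, n = 2 coefficients
# (an intercept column plus one regressor)
rng = np.random.default_rng(0)
A = np.column_stack([np.ones(100), rng.uniform(0, 10, 100)])
x_true = np.array([2.0, 0.5])
b = A @ x_true + rng.normal(0, 0.1, 100)

# Normal equations: (A^T A) x = A^T b, solved via Cholesky, AtA = L L^T
# (valid because A is full rank here, so AtA is positive definite)
AtA = A.T @ A
Atb = A.T @ b
L = np.linalg.cholesky(AtA)
y = np.linalg.solve(L, Atb)      # forward substitution: L y = A^T b
x_hat = np.linalg.solve(L.T, y)  # back substitution:    L^T x = y

# Cross-check against NumPy's SVD-based least-squares routine
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

The Cholesky route avoids forming (Aᵀ·A)⁻¹ explicitly, which is both cheaper and numerically safer than direct inversion.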
As noted by Sebastian Raschka (2013), the closed-form solution previously discussed is often preferred for “smaller” datasets, for which the computation of the “costly” inverse of the covariance matrix is not a concern.
For datasets, though, that:
- are significantly large ⇒ meaning that the inversion of the covariance matrix Aᵀ·A is very costly in terms of computations, or
- yield a covariance matrix Aᵀ·A whose inverse does not exist ⇒ the case discussed before, where A is rank-deficient and therefore perfect multi-collinearity between the problem variables exists,
it is preferable, or even necessary, to employ an optimization algorithm (e.g. Gradient Descent, SGD, MB-GD, Newton’s Method etc.) in order to obtain the best numerical solution to the Least Squares Problem. This approach involves modelling the mean squared error as an objective (cost) function and minimizing its value by optimizing the weights of the fitting model.
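The optimization approach can be sketched with plain (batch) gradient descent on the mean squared error; the learning rate, iteration count and synthetic data below are illustrative assumptions, not tuned values:

```python
import numpy as np

# Synthetic data: intercept column plus one regressor
rng = np.random.default_rng(1)
A = np.column_stack([np.ones(200), rng.uniform(0, 1, 200)])
b = A @ np.array([1.0, 3.0]) + rng.normal(0, 0.05, 200)

m = len(b)
x = np.zeros(A.shape[1])  # initial weights
lr = 0.1                  # learning rate (assumed)
for _ in range(5000):
    # Gradient of the mean squared error (1/m)||b - Ax||^2
    grad = (2.0 / m) * A.T @ (A @ x - b)
    x -= lr * grad

# For a full-rank A, gradient descent converges toward the same
# solution the normal equations would give
x_closed, *_ = np.linalg.lstsq(A, b, rcond=None)
```

For large datasets the gradient would typically be computed on mini-batches (SGD, MB-GD) rather than on the full matrix A at every step.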
Gradient Descent Optimization | Source: Sebastian Raschka (2013)
Solving the Least Squares Problem numerically, by employing an optimization algorithm, is most often the case when using the Linear Regression method to model real-world data. However, it is important to ensure that the numerical approach is not treated as a “black box” heuristic process and that the principal assumptions arising from the theoretical concepts are well respected.
Each Linear Regression model is built on the following assumptions:
- Linearity Assumption: The underlying relationship between the regressors X and the dependent variable y is linear.
- Independence Assumption: The random errors arising from predictions are independent of the regressors X.
- Normality Assumption: The random errors arising from predictions follow the normal distribution N(0, σ²).
- Homoscedasticity Assumption: The random errors arising from predictions have constant variance.
These principal assumptions need to be tested after training a Linear Regression model in order to justify the explainability, the predictive power and the generalization capability of the model to the unknown population.
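A hedged sketch of such post-fit diagnostics on synthetic data (the tests chosen here, Shapiro–Wilk for normality and a correlation check for homoscedasticity, are illustrative; in practice one would also inspect residual plots):

```python
import numpy as np
from scipy import stats

# Synthetic one-regressor dataset with normally distributed noise
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)
y = 1.5 * x + 4.0 + rng.normal(0, 1.0, 200)

# Fit y = f(x) by least squares and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Normality assumption: Shapiro-Wilk test on the residuals
_, p_normal = stats.shapiro(residuals)

# Homoscedasticity (informal check): the spread of the residuals should
# not grow with x, i.e. |residuals| should be uncorrelated with x
corr, p_homo = stats.pearsonr(x, np.abs(residuals))
```

A small p_normal would flag a violation of the normality assumption, while a strong correlation between |residuals| and x would suggest heteroscedasticity.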
NOTE: In addition to the in-text citations quoted above, the personal notes taken during MSc in Data Science classes at the Athens University of Economics & Business should be referenced and acknowledged.