## Introduction

In 1950, Alan Turing was the first computer scientist to propose a “learning machine” that could learn from experience and become artificially intelligent. Today, Machine Learning and Artificial Intelligence have gained significant interest in the business world.

**Artificial intelligence (AI)** is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings (Encyclopedia Britannica). Alan Turing proposed that a computer can be said to possess artificial intelligence if it can mimic human responses under specific conditions. From a theoretical perspective, AI has two branches:

- **General AI** (or strong AI): The machine has all the characteristics of human intelligence and can therefore understand and reason about its environment just as a human would.
- **Narrow AI** (or weak AI): The machine can perform a single task (e.g. image recognition, text translation, purchase recommendation). It therefore exhibits only some aspects of human intelligence and is lacking in other areas.

**Machine learning (ML)** is the field of computer science that employs statistical inference to enable computers to “learn” from data how to perform specific tasks without being explicitly programmed (Arthur Samuel). ML evolved from the fields of computational statistics and pattern recognition; today it is the “vehicle” that drives the development of AI forward. Most ML algorithms have strong foundations in linear algebra and mathematical optimization.

Although some computer scientists do not regard a capability as AI unless it involves ML, the concept of artificial intelligence is theoretically broader and includes multiple scientific domains from Computer Science & Engineering. In other words, it has been correctly said that all machine learning is AI, but not all AI is machine learning (Gautam, 2019).

## Framework

Machine Learning tasks can be primarily categorized into two groups based on their learning approach: **supervised** and **unsupervised learning**. The main difference between them is that supervised learning models use prior knowledge of the dependent variable they aim to predict. The goal of supervised learning is therefore to infer the function y = f(X) + e that best approximates the relationship between the independent variables (X) and the dependent variable (y), given a sample of values for both. The noise (e) in the data is the portion of the dependent variable (y) that cannot be explained by the independent variables (X). *Typical Example: predict the real estate price of a property (y) given its characteristics (X).*

On the other hand, unsupervised learning models have no prior knowledge of any dependent variable (y), so the target here is to infer the natural structure present within the given set of observations (X). *Typical Example: group company customers into a finite number of clusters based on their characteristics (X).*
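To make the customer-grouping example concrete, here is a minimal sketch of clustering without labels. It uses k-means, which is only one possible choice and is an assumption of this sketch; the `spend` figures and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain k-means on 1-D data; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Hypothetical annual spend per customer; note that no target y is given.
spend = [120, 150, 130, 900, 950, 1010, 140, 980]
centroids, labels = kmeans_1d(spend, k=2)
print(centroids, labels)
```

The algorithm only ever sees X (the spend values); the two groups emerge from the structure of the data itself.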

Machine Learning algorithms can also be categorized into two groups based on the assumptions they make regarding the inference function (supervised case) or the data pattern (unsupervised case): **parametric** and **non-parametric models**. According to Russell and Norvig (2016), a learning model that summarizes data with a set of parameters of fixed size (independent of the number of training observations) is called a parametric model. No matter how much data is fed to a parametric model for training, this won’t change its opinion about how many parameters are needed for inference. In the case of supervised learning, the algorithm of a parametric model typically involves two steps:

- Select the form for the inference function: y = f(X) + e
- Learn the parameters for the function from training observations
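The two steps above can be sketched with the simplest parametric choice, a straight line. This is an illustration under stated assumptions: the linear form, the synthetic price data and the function name `fit_line` are all hypothetical, and ordinary least squares is just one way to learn the parameters.

```python
import random


def fit_line(xs, ys):
    """Step 2: least-squares estimates for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Step 1: we assume the form f(X) = intercept + slope * X.
# Hypothetical data generated as price = 50 + 2 * size + noise (e).
rng = random.Random(0)
sizes = [rng.uniform(40, 200) for _ in range(200)]
prices = [50 + 2 * s + rng.gauss(0, 10) for s in sizes]

intercept, slope = fit_line(sizes, prices)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Whether the model is trained on 200 or 200,000 observations, it still summarizes the data with exactly two numbers, which is what makes it parametric.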

Machine Learning algorithms that do not make strong assumptions regarding the form of the inference function are called non-parametric models. Because they do not fix the form in advance, these algorithms are free to learn any functional form from the training data provided. A very simple example of a non-parametric method is to predict the target variable (y) by choosing the closest observation in the data with respect to the independent variable (X).
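The nearest-observation idea can be sketched in a few lines. The data and the function name `nearest_neighbor_predict` are hypothetical, and distance is taken as the absolute difference on a single feature for simplicity.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Return the target of the training observation closest to `query`."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[closest]


# Hypothetical property sizes (X) and prices (y).
sizes = [50, 80, 120, 200]
prices = [150, 210, 300, 480]

print(nearest_neighbor_predict(sizes, prices, 110))  # closest size is 120
```

There are no parameters to learn here: the "model" is the training data itself, so it grows with every new observation.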

Finally, Machine Learning tasks are generally classified into the following categories based on the problem’s nature: *Regression*, *Classification*, *Clustering* and *Anomaly Detection*.

*Supervised Learning* | Inference function: y = f(X) + e

**Regression**: Each observation of the independent variables X is assigned a continuous value, so the dependent variable y is continuous.

*Parametric Models:* Linear regression, Linear SVM, Neural Networks (with a fixed architecture)

*Non-parametric Models:* Regression trees, Random Forests, Kernel SVM

**Classification**: Each observation of the independent variables X is assigned a label, so the dependent variable y is categorical, either binary or multi-class.

*Parametric Models*: Bayes Classifier, Naïve Bayes Classifier, Logistic Regression, Linear SVM

*Non-parametric Models*: Decision trees, Random Forests, Kernel SVM, Nearest Neighbor classifier, K-Nearest Neighbor classifier

*Unsupervised Learning* | Data Pattern: g(X)

**Clustering**: The data is not labelled, but it can be divided into groups based on similarity and other measures of structure inherent in the data.

**Anomaly Detection**: The data is not labelled, but on the assumption that the given instances follow a specific distribution, rare or suspicious observations can be identified by detecting outliers.
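As a minimal sketch of the anomaly detection idea, assume the observations are roughly normally distributed and flag any value that lies more than a chosen number of standard deviations from the mean. The z-score rule, the threshold of 2.0, the sample values and the function name `find_outliers` are all assumptions made for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Flag values whose z-score exceeds `threshold` (assumes near-normal data)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical daily transaction counts with one suspicious spike.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 50]
print(find_outliers(values))
```

The threshold is a judgment call: a lower value flags more observations, and on small samples a single extreme value inflates the standard deviation, which limits how large a z-score can get.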

## Bias & Variance

Machine Learning predictions typically have two sources of error: **Bias** and **Variance**. A comprehensive article on the bias-variance trade-off has been written by Seema Singh (2018). She describes bias in terms of the accuracy of a model’s predictions and variance in terms of the difference between the predictions of many models. The bulls-eye diagram below is often used in the literature to explain these concepts.

Source: Prateek Joshi (2015)

The red dot is the “true” value (Y) that the model aims to predict. The blue dots are the predictions made by the inference function y = f(X), which is trained repeatedly on different data sets sampled from the real population, possibly with noise. The bulls-eye diagram can be interpreted as follows:

- *Low bias and low variance*: The ML algorithm is stable and consistently predicts values y = f(X) that are close to the real value Y.
- *Low bias and high variance*: The ML algorithm is unstable and inconsistently predicts values y = f(X) that only on average are close to the real value Y.
- *High bias and low variance*: The ML algorithm is stable and consistently predicts values y = f(X) that are far from the real value Y.
- *High bias and high variance*: The ML algorithm is unstable and inconsistently predicts values y = f(X) that even on average are far from the real value Y.
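The four cases can be reproduced with a small simulation. The sketch below is purely illustrative: the true value, the noise level and the four toy "models" are hypothetical choices made so that each one lands in a different quadrant of the bulls-eye.

```python
import random
import statistics

TRUE_VALUE = 10.0  # the "red dot" Y (hypothetical)


def simulate(predictor, rounds=2000, n=25, noise=2.0, seed=0):
    """Train `predictor` on many resampled data sets; return (bias, variance)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(rounds):
        sample = [TRUE_VALUE + rng.gauss(0, noise) for _ in range(n)]
        predictions.append(predictor(sample))  # one "blue dot"
    bias = statistics.mean(predictions) - TRUE_VALUE
    variance = statistics.pvariance(predictions)
    return bias, variance


# Four toy models, one per quadrant of the diagram.
predictors = {
    "low bias, low variance": lambda s: sum(s) / len(s),
    "low bias, high variance": lambda s: s[0],
    "high bias, low variance": lambda s: sum(s) / len(s) - 5.0,
    "high bias, high variance": lambda s: s[0] - 5.0,
}

results = {name: simulate(p) for name, p in predictors.items()}
for name, (bias, variance) in results.items():
    print(f"{name}: bias={bias:.2f}, variance={variance:.2f}")
```

Averaging all 25 observations gives stable predictions, while relying on a single observation makes them scatter; subtracting a constant shifts the whole cloud away from the target without changing its spread.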

The above demonstrates that high bias means inaccurate predictions because the model makes wrong assumptions about the data. For instance, this occurs when a simple function is used to describe data of higher complexity. In this case, the model does not adequately capture the training data and, as a result, it makes inaccurate predictions for unknown data. Forman (2015) described this problem as under-fitting:

“Bias is the algorithm’s tendency to consistently learn the wrong thing by not taking into account all the information in the data (under-fitting).”

The above also demonstrates that high variance means instability of the training algorithm. The model is overly sensitive to the training data and has limited ability to learn the true signal of the “population”. For instance, this occurs when a complex function is used to describe data of lower complexity. In this case, the model memorizes noise from the training data and, as a result, it is not able to make accurate predictions for unknown data. Forman (2015) described this problem as over-fitting:

“Variance is the algorithm’s tendency to learn random things irrespective of the real signal by fitting highly flexible models that follow the error/noise in the data too closely (over-fitting).”

As noted by Scott-Fortmann (2012), dealing with bias and variance errors is about dealing with over-fitting and under-fitting problems. Increasing the complexity of the training model reduces bias and increases variance. When more parameters are added to the training model, its complexity rises and, as a result, variance becomes the primary concern while bias steadily falls. The opposite occurs when the number of model parameters is reduced.

Source: Scott-Fortmann (2012)
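For squared-error loss, the trade-off can also be written as the standard decomposition of the expected prediction error. The notation is an assumption of this sketch: $\hat{f}$ denotes the function learned from a random training set, the expectation is taken over training sets and noise, and the noise $e$ is assumed to have zero mean, variance $\sigma_e^2$, and to be independent of $\hat{f}$.

$$
\mathbb{E}\big[(y - \hat{f}(X))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(X)] - f(X)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(X) - \mathbb{E}[\hat{f}(X)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma_e^2}_{\text{irreducible noise}}
$$

The first two terms depend on the model and move in opposite directions as complexity grows, while the third is a property of the data that no model can remove.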

Since bias and variance errors cannot be simultaneously reduced, the target is to identify the optimal trade-off between them. Andrew Ng (2014) has provided some practical guidelines for improving the trade-off between bias and variance:

In order to reduce variance he suggests:

- get more training examples
- decrease model complexity
- increase the regularization parameter *(will be discussed further)*

In order to reduce bias he suggests:

- get additional features
- increase model complexity
- decrease the regularization parameter *(will be discussed further)*
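To give a first feel for what the regularization parameter does, here is a minimal sketch using a one-parameter ridge (L2-penalized) regression. The choice of ridge, the no-intercept model y = w * x, the data and the name `ridge_slope` are assumptions made for illustration and are not taken from the guidelines above.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w * x)^2) + lam * w^2 (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data lying close to y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 10.0, 100.0):
    print(f"lambda={lam}: slope={ridge_slope(xs, ys, lam):.3f}")
```

With `lam = 0` the result is the ordinary least-squares slope. As `lam` grows, the slope is pulled toward zero: the fit reacts less to the particular training sample (lower variance) but drifts away from the true relationship (higher bias).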