Chapter 3 Machine Learning Fundamentals
3.1 Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern. This leads to several issues:
Poor Generalization: The model performs exceptionally well on training data but poorly on unseen test data, indicating it fails to generalize to new data.
Excessive Complexity: The model is overly complex, often due to an excessive number of parameters relative to the number of observations.
Poor Predictive Performance: An overfitted model has low predictive power on new data, resulting in unreliable predictions.
Sensitivity to Noise: The model becomes overly sensitive to minor variations in the training data, which can result in unstable and inaccurate predictions on new data.
To avoid overfitting, it is important to simplify the model, use regularization techniques, or apply cross-validation to ensure better generalization.
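As a concrete illustration, here is a minimal sketch (assuming scikit-learn is installed; the synthetic high-dimensional dataset and the chosen alpha are made up for illustration) comparing an unregularized linear model with a ridge-regularized one using cross-validation. On this noisy, feature-rich data the regularized model typically shows a lower cross-validated error.

```python
# A minimal sketch: regularization + cross-validation to curb overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))              # few samples, many features -> easy to overfit
y = X[:, 0] + 0.1 * rng.normal(size=50)    # only the first feature truly matters

for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    # 5-fold cross-validation estimates out-of-sample (generalization) error
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean CV MSE = {-scores.mean():.3f}")
```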
3.2 Underfitting
Underfitting occurs when a model is too simple to capture the underlying patterns in the data, leading to several problems:
Overly Simple Model: The model lacks the complexity needed to learn from the data effectively.
Poor Predictive Performance: The model performs poorly not only on test data but also on training data, indicating it has not captured the data’s structure well.
Insensitive to Patterns: The model fails to respond even to significant trends or patterns in the data, resulting in inaccurate and unreliable predictions.
To prevent underfitting, consider using a more complex model, adding relevant features, or reducing regularization.
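As a quick sketch of the "use a more complex model / add relevant features" advice (assuming scikit-learn; the synthetic quadratic data are made up for illustration), a plain linear model underfits a quadratic relationship, while adding polynomial features fixes it:

```python
# A minimal sketch: adding features to fix an underfit linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)   # truly quadratic relationship

linear = LinearRegression()                                       # too simple: underfits
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

for name, model in [("linear", linear), ("quadratic", quadratic)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.3f}")
```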
3.3 Bias-Variance Trade-off
3.3.1 Bias
Bias is an error term defined as the difference between the average prediction of the model and the actual value we aim to predict. High bias arises when the machine learning algorithm oversimplifies the problem, leading to inaccurate predictions.
High-bias models result in underfitting and generally perform poorly on both the training and test data due to their inability to learn the underlying patterns in the data properly.
Causes of High Bias:
High bias is often caused by using overly simplistic models for complex data,
or when there is insufficient data to build a robust model.
Effect on Model Performance:
Models with high bias do not learn well from training data and therefore have high errors on both the training and test data.
3.3.2 Variance
Variance is the variability of model predictions for a given data point; it tells us how much the predictions would spread out if the model were trained on different samples of the data.
Models with high variance are typically too complex and capture all the variation in the outcome, including the noise. These models generally perform very well on the training data but poorly on the test data, with high error rates (the overfitting issue).
3.3.3 Bias vs. Variance Trade-Off
High Bias Models:
Models with high bias make strong assumptions about the data, leading to oversimplification. These models, like linear regression with few features, often underfit the data, meaning they fail to capture important patterns and relationships.
This results in both high training error and high test error. Such models are characterized by high bias and low variance—they perform consistently but poorly on different datasets because they are not flexible enough to model the true underlying structure of the data.
High Variance Models:
On the other end, models with low bias but high variance (such as decision trees without pruning) are highly flexible and can learn intricate details of the training data. However, they tend to overfit, meaning they capture noise and random fluctuations in the training data rather than the general pattern. These models have low bias but high variance, leading to good performance on training data but poor generalization to unseen data.
Finding the Balance:
The key challenge in machine learning is finding the right balance between bias and variance to minimize the overall error (which is the sum of bias error, variance error, and irreducible error).
This is often done through techniques like cross-validation, regularization (e.g., Ridge or Lasso for linear models), or by using ensemble methods like Random Forests and Gradient Boosting, which combine multiple models to reduce variance without significantly increasing bias.
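For reference, the expected test error at a point \(x_0\) decomposes (for squared-error loss, as in ISLR) into exactly these three pieces:

\[
E\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big]
= \mathrm{Bias}\big[\hat{f}(x_0)\big]^2 + \mathrm{Var}\big[\hat{f}(x_0)\big] + \mathrm{Var}(\varepsilon),
\]

where \(\mathrm{Var}(\varepsilon)\) is the irreducible error.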
Mitigation Techniques:
Discuss techniques like cross-validation, regularization (L1/L2), pruning for decision trees, or using ensemble methods like bagging and boosting, which are commonly used to manage the trade-off.
Visual Representation:
If possible, describe or draw the bias-variance trade-off curve, showing how training and test errors change with model complexity.
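One way to produce such a curve is sketched below (assuming scikit-learn and matplotlib are installed; the synthetic sine data and the polynomial-degree range are made up for illustration). Training error keeps falling as complexity grows, while validation error is U-shaped.

```python
# A minimal sketch of the bias-variance trade-off curve:
# training vs. validation error as model complexity (polynomial degree) increases.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=150)

degrees = np.arange(1, 13)
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, valid_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree", param_range=degrees,
    cv=5, scoring="neg_mean_squared_error",
)

plt.plot(degrees, -train_scores.mean(axis=1), label="training MSE")
plt.plot(degrees, -valid_scores.mean(axis=1), label="validation MSE")
plt.xlabel("model complexity (polynomial degree)")
plt.ylabel("MSE")
plt.legend()
plt.show()
```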
Discuss the Decision Tree algorithm A decision tree is a popular supervised machine learning algorithm, used mainly for regression and classification. It works by repeatedly breaking a dataset down into smaller subsets based on feature splits. A decision tree can handle both categorical and numerical data.
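A minimal sketch of fitting and inspecting a decision tree (assuming scikit-learn and its bundled iris dataset; note that scikit-learn's implementation expects numeric features, so categorical columns would need to be encoded first):

```python
# A minimal sketch: train a shallow decision tree and print its learned splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # limit depth to avoid overfitting
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=load_iris().feature_names))
```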
What is prior probability and likelihood? The prior probability is the proportion of each class of the dependent variable in the data set, while the likelihood is the probability of observing a given observation's feature values conditional on a particular class.
Explain Recommender Systems, along with an application. Recommender Systems are a subclass of information filtering systems, meant for predicting the preferences or ratings a user is likely to give to a product. An application of a recommender system is the product recommendations section on Amazon, which suggests items based on the user’s search history and past orders.
Name three disadvantages of using a linear model Three disadvantages of the linear model are: • It relies on the assumption of a linear relationship (and well-behaved errors), which may not hold. • It cannot be used directly for binary or count outcomes. • It is prone to overfitting problems that it cannot address on its own.
Why do you need to perform resampling?
Resampling is done in the following cases: • Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or using subsets of the accessible data • Substituting labels on data points when performing significance tests (e.g., permutation tests) • Validating models by using random subsets (e.g., bootstrapping, cross-validation)
- List out the libraries in Python used for Data Analysis and Scientific Computations. NumPy, SciPy, Pandas, scikit-learn, Matplotlib, Seaborn
- What is Power Analysis? Power analysis is an integral part of experimental design. It helps you determine the sample size required to detect an effect of a given size with a specific level of confidence (statistical power). It also allows you to work out the probability of detecting an effect under a given sample-size constraint.
- Explain Collaborative filtering Collaborative filtering finds patterns of interest by combining the viewpoints of many users, multiple data sources, and various agents; items are recommended to a user based on the preferences of similar users.
- Discuss ‘Naive’ in the Naive Bayes algorithm The Naive Bayes model is based on Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to that event. It is called ‘naive’ because it assumes the features are conditionally independent of one another given the class, an assumption that rarely holds exactly in practice.
- What is Linear Regression? Linear regression is a statistical method in which the score of a variable ‘A’ is predicted from the score of a second variable ‘B’. B is referred to as the predictor variable and A as the criterion (response) variable.
- State the difference between the expected value and mean value There are not many differences, but the two terms are used in different contexts. Mean value is generally used when discussing a probability distribution or a sample, whereas expected value is used in the context of a random variable.
- What is the aim of conducting A/B Testing? A/B testing is used to run randomized experiments with two variants, A and B. The goal of this testing method is to identify changes to a web page that maximize or improve the outcome of interest (e.g., a conversion rate).
- What is Ensemble Learning? Ensembling is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:
Bagging: Bagging trains similar learners on small sample populations (bootstrap samples) and averages their predictions, which helps you make more stable predictions.
Boosting: Boosting is an iterative method that adjusts the weight of each observation depending on the last classification. Boosting decreases the bias error and helps you build strong predictive models.
Explain Eigenvalue and Eigenvector Eigenvectors help in understanding linear transformations. In data analysis, they are typically calculated for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by compressing, flipping, or stretching, and eigenvalues are the strengths of the transformation in those directions, i.e., the factors by which the compression or stretching happens.
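As a quick illustration (a sketch assuming NumPy; the small random dataset is made up), the eigenvalues and eigenvectors of a covariance matrix can be computed directly:

```python
# A minimal sketch: eigen-decomposition of a covariance matrix, as done in PCA.
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(100, 3))           # 100 observations, 3 variables
cov = np.cov(data, rowvar=False)           # 3x3 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices
print("eigenvalues (strength of the transformation in each direction):", eigenvalues)
print("eigenvectors (directions), one per column:")
print(eigenvectors)
```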
Define the term cross-validation Cross-validation is a technique for evaluating how the results of a statistical analysis will generalize to an independent dataset. It is used in settings where the objective is prediction and one needs to estimate how accurately a model will perform in practice. Question: Can you compare the validation set with the test set? Answer: A validation set is part of the training data used for parameter (hyperparameter) selection and for avoiding overfitting of the machine learning model being developed. In contrast, a test set is meant for evaluating the performance of the final, trained machine learning model.
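A minimal sketch of a train/validation/test workflow (assuming scikit-learn and its bundled iris dataset; the candidate C values are made up for illustration): the validation set picks the hyperparameter, the test set is only touched for the final evaluation.

```python
# A minimal sketch: training/validation/test split with a small hyperparameter search.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:                       # hyperparameters chosen on the validation set
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
print("chosen C:", best_C, "| test accuracy:", final.score(X_test, y_test))
```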
Explain the steps for a Data analytics project The following are important steps involved in an analytics project: • Understand the business problem • Explore the data and study it carefully • Prepare the data for modeling by handling missing values and transforming variables • Run the model and analyze the results • Validate the model with a new data set • Implement the model and track the results to analyze its performance over a specific period
Question: What do you mean by cluster sampling and systematic sampling? Answer: When studying the target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample, in which each of the sampling units is a collection or cluster of elements. Following the technique of systematic sampling, elements are chosen from an ordered sampling frame. The list is advanced in a circular fashion. This is done in such a way so that once the end of the list is reached, the same is progressed from the start, or top, again.
- What is a Random Forest? Random forest is an ensemble machine learning method that can be used for both regression and classification tasks. Because it averages many decision trees built on bootstrap samples, it is also relatively robust to missing values and outlier values.
- What is selection bias? Selection bias occurs when no proper randomization is achieved while picking the individuals, groups, or data to be analyzed. It means that the sample obtained does not accurately represent the population that was intended to be analyzed.
- What is the K-means clustering method? K-means clustering is an important unsupervised learning method. It is a technique for partitioning data into a chosen number of clusters, K. It is used to group observations so that similar data points end up in the same cluster.
- Explain the difference between Data Science and Data Analytics Data scientists need to slice data to extract valuable insights that a data analyst can apply to real-world business scenarios. The main difference between the two is that data scientists generally have deeper technical knowledge (programming, modeling) than data analysts, whereas data analysts focus more on business understanding and on communicating findings, for example through data visualization.
- Explain p-value When you conduct a hypothesis test in statistics, the p-value lets you assess the strength of your results. It is a number between 0 and 1: the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
- Define the term deep learning Deep learning is a subfield of machine learning. It is concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks (ANNs), typically with many layers.
- Explain a method to collect and analyze social media data to predict the weather. You can collect social media data using the Facebook, Twitter, and Instagram APIs. For example, from Twitter you can construct features from each tweet, such as the tweet date, number of retweets, list of followers, etc. Then you can use a multivariate time series model to predict the weather condition.
- When do you need to update the algorithm in Data Science? You need to update an algorithm in the following situations: • You want your data model to evolve as data streams in through the infrastructure • The underlying data source is changing • The data is non-stationary
- What is Normal Distribution A normal distribution describes a continuous variable whose values are spread in the shape of a bell curve. It is a continuous probability distribution that is widely used in statistics and is useful for analyzing variables and their relationships.
- Which language is best for text analytics? R or Python? Python is generally more suitable for text analytics because it has rich libraries such as pandas, which provide high-level data analysis tools and data structures that R does not offer as conveniently.
- Explain the benefits of using statistics for Data Scientists Statistics help Data Scientists get a better idea of customers’ expectations. Using statistical methods, Data Scientists can gain knowledge about consumer interest, behavior, engagement, retention, etc. Statistics also help you build powerful data models and validate inferences and predictions.
- Name various types of Deep Learning Frameworks • PyTorch • Microsoft Cognitive Toolkit • TensorFlow • Caffe • Chainer • Keras
- Explain why Data Cleansing is essential and which methods you use to maintain clean data Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail. Common methods for keeping data clean include handling missing values, removing duplicates, correcting inconsistent formats, and validating values against known ranges.
- What is a skewed distribution & a uniform distribution? A skewed distribution occurs when the data are concentrated on one side of the plot, whereas a uniform distribution is one in which the data are spread evenly across the range.
- When does underfitting occur in a statistical model? Underfitting occurs when a statistical model or machine learning algorithm is not able to capture the underlying trend of the data.
- Name commonly used algorithms. The four algorithms most commonly used by data scientists are: • Linear regression • Logistic regression • Random Forest • KNN
- What is precision? Precision is one of the most commonly used error metrics in classification. It is the proportion of positive predictions that are actually correct, TP / (TP + FP). Its range is from 0 to 1, where 1 represents 100%.
- What is a univariate analysis? An analysis that is applied to one attribute at a time is known as univariate analysis. The boxplot is a widely used univariate visualization.
- How do you overcome challenges to your findings? To overcome challenges to your findings, you need to encourage discussion, demonstrate leadership, and respect differing opinions.
- Explain the cluster sampling technique in Data Science A cluster sampling method is used when it is challenging to study a target population spread across a wide area and simple random sampling cannot be applied effectively.
- State the difference between a Validation Set and a Test Set A validation set is usually considered part of the training data, as it is used for parameter selection and helps you avoid overfitting the model being built, while a test set is used for testing or evaluating the performance of a trained machine learning model.
- Explain the term Binomial Probability Formula? “The binomial distribution contains the probabilities of every possible success on N trials for independent events that have a probability of π of occurring.”
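In symbols (the standard binomial probability mass function), with \(N\) independent trials each having success probability \(\pi\), the probability of exactly \(k\) successes is:

\[
P(X = k) = \binom{N}{k}\,\pi^{k}\,(1-\pi)^{N-k}, \qquad k = 0, 1, \dots, N.
\]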
- What is recall? Recall is the proportion of actual positives that are correctly identified, TP / (TP + FN). It ranges from 0 to 1.
- Discuss normal distribution The normal distribution is symmetric about its center, so the mean, median, and mode are all equal.
- While working on a data set, how can you select important variables? Explain You can use the following methods of variable selection: • Remove the correlated variables before selecting important variables • Use linear regression and select variables based on their p-values • Use Backward Selection, Forward Selection, and Stepwise Selection • Use XGBoost or Random Forest and plot the variable importance chart • Measure information gain for the given set of features and select the top n features accordingly
- Is it possible to capture the correlation between a continuous and a categorical variable? Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association between continuous and categorical variables.
- Would treating a categorical variable as a continuous variable result in a better predictive model? A categorical variable should be treated as continuous only when it is ordinal in nature; in that case doing so can give a better predictive model.
Question: Define recall and precision. Answer: Recall: what proportion of actual positives was identified correctly? TP / (TP + FN). Precision: what proportion of positive identifications was actually correct? TP / (TP + FP).
Question: What are false positives and false negatives? Answer: A false positive is an incorrect indication that a condition is present when it is actually absent. A false negative is an incorrect indication that a condition is absent when it is actually present.
Question: Please explain the goal of A/B Testing. Answer: A/B Testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B Testing is to maximize the likelihood of an outcome of interest by identifying the changes to a webpage that improve it. A highly reliable method for finding the best online marketing and promotional strategies for a business, A/B Testing can be employed for testing everything, ranging from sales emails to search ads and website copy.
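To tie the recall and precision formulas above to code, here is a minimal sketch (assuming scikit-learn; the label vectors are made up for illustration) that computes both manually from the confusion matrix and checks them against the library functions:

```python
# A minimal sketch: precision = TP/(TP+FP), recall = TP/(TP+FN) from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision (manual):", tp / (tp + fp), "| sklearn:", precision_score(y_true, y_pred))
print("recall    (manual):", tp / (tp + fn), "| sklearn:", recall_score(y_true, y_pred))
```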
Question: Could you explain how to define the number of clusters in a clustering algorithm? Answer: The primary objective of clustering is to group similar entities together in such a way that entities within a group are similar to each other while the groups remain different from one another. Generally, the Within Sum of Squares (WSS) is used to measure the homogeneity within a cluster. To choose the number of clusters, WSS is plotted for a range of numbers of clusters. The resulting graph is known as the Elbow Curve. It contains a bending point after which the WSS stops decreasing appreciably; this point represents K in K-Means. Although this is the most widely used approach, another important approach is hierarchical clustering, in which dendrograms are created first and distinct groups are then identified from them.
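A minimal sketch of the Elbow Curve (assuming scikit-learn and matplotlib; the synthetic blob data are made up for illustration): plot the within-cluster sum of squares (KMeans' `inertia_`) against k and look for the bend.

```python
# A minimal sketch: elbow curve for choosing k in K-Means.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares (WSS)")
plt.title("Elbow curve: pick k at the bend")
plt.show()
```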
Question: Please explain Gradient Descent. Answer: The degree of change in the output of a function relative to changes in its inputs is known as the gradient; it measures the change in the error with respect to a change in each weight. A gradient can also be understood as the slope of a function. Gradient Descent can be pictured as descending to the bottom of a valley, as opposed to climbing up a hill. It is a minimization algorithm that iteratively adjusts the parameters in the direction of the negative gradient in order to minimize a given cost (loss) function.
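A minimal sketch of batch gradient descent (NumPy only; the synthetic data, learning rate, and iteration count are made up for illustration), minimizing the mean squared error of a simple linear model y ≈ w·x + b:

```python
# A minimal sketch: batch gradient descent on a one-feature linear regression.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)   # true w=3, b=2 plus noise

w, b = 0.0, 0.0
lr = 0.5                                              # learning rate (step size)
for _ in range(500):
    y_hat = w * x + b
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)                   # d(MSE)/dw
    grad_b = 2 * np.mean(error)                       # d(MSE)/db
    w -= lr * grad_w                                  # step in the negative gradient direction
    b -= lr * grad_b

print(f"estimated w ~ {w:.2f}, b ~ {b:.2f}")
```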
Question: Please enumerate the various steps involved in an analytics project. Answer: Following are the numerous steps involved in an analytics project: • Understanding the business problem • Exploring the data and familiarizing with the same • Preparing the data for modeling by means of detecting outlier values, transforming variables, treating missing values, et cetera • Running the model and analyzing the result for making appropriate changes or modifications to the model (an iterative step that repeats until the best possible outcome is gained) • Validating the model using a new dataset • Implementing the model and tracking the result for analyzing the performance of the same
Question: What are outlier values and how do you treat them? Answer: Outlier values, or simply outliers, are data points in statistics that don’t belong to a certain population. An outlier value is an abnormal observation that is very different from the other values in the set. Outlier values can be identified using univariate or other graphical analysis methods. A few outliers can be assessed individually, but assessing a large set of outlier values requires substituting them with the 99th or 1st percentile values. There are two popular ways of treating outlier values: 1. Change the value so that it is brought within a range 2. Simply remove the value Note: not all extreme values are outlier values.
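A minimal sketch of the percentile treatment mentioned above (NumPy only; the synthetic data and injected outliers are made up for illustration): cap extreme values at the 1st and 99th percentiles.

```python
# A minimal sketch: winsorize outliers by clipping at the 1st/99th percentiles.
import numpy as np

rng = np.random.default_rng(5)
values = rng.normal(loc=50, scale=5, size=1000)
values[:5] = [500, -300, 400, 450, -250]              # inject a few extreme outliers

low, high = np.percentile(values, [1, 99])
capped = np.clip(values, low, high)                   # pull extremes back to the bounds

print("before:", values.min(), values.max())
print("after :", capped.min(), capped.max())
```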
Discuss Artificial Neural Networks Artificial Neural Networks (ANNs) are a family of algorithms that have revolutionized machine learning. They adapt to changing inputs, so the network generates the best possible result without the output criteria having to be redesigned.
What is Back Propagation? Back-propagation is the essence of neural net training. It is the method of tuning the weights of a neural net based on the error rate obtained in the previous epoch (iteration). Proper tuning of the weights helps you reduce error rates and make the model reliable by improving its generalization.
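Below is a minimal, self-contained sketch of the idea (NumPy only; the toy dataset, layer sizes, and learning rate are made up for illustration): the error at the output is propagated backwards with the chain rule and every weight is nudged against its gradient each epoch.

```python
# A minimal sketch: backpropagation for a tiny one-hidden-layer network.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # XOR-like target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)   # output layer
lr = 0.5

for epoch in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: gradients of the (mean) binary cross-entropy w.r.t. each parameter
    d_out = (out - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    # weight update (gradient descent step)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", ((out > 0.5) == y).mean())
```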
Explain Auto-Encoder
Autoencoders are neural networks that learn to reproduce their inputs at the output with as little error as possible: the output is trained to be as close to the input as possible, typically by compressing the input into a lower-dimensional code (the encoder) and then reconstructing it (the decoder).
Define Boltzmann Machine A Boltzmann machine is a simple learning algorithm that helps you discover features representing complex regularities in the training data. The algorithm allows you to optimize the weights and other quantities for the given problem.
What is reinforcement learning? Reinforcement Learning is a learning mechanism for mapping situations to actions so as to maximize a numerical reward signal. The learner is not told which action to take but instead must discover which actions yield the maximum reward; the method is based on a reward/penalty mechanism.
Training-Validation-Test
- We typically train our model but get evaluation metrics on the test data.
From ISLR:
In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen data. Why is this what we care about? Suppose that we are interested in developing an algorithm to predict a stock’s price based on previous stock returns. We can train the method using stock returns from the past 6 months. But we don’t really care how well our method predicts last week’s stock price. We instead care about how well it will predict tomorrow’s price or next month’s price.
We can use patients data to train a statistical learning method to predict risk of diabetes based on clinical measurements. In practice, we want this method to accurately predict diabetes risk for future patients based on their clinical measurements. We are not very interested in whether or not the method accurately predicts diabetes risk for patients used to train the model, since we already know which of those patients have diabetes.
We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE.
The problem is that many statistical methods specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be quite small, but the test MSE is often much larger.
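A quick sketch of this point (assuming scikit-learn; the synthetic sine data and polynomial degrees are made up for illustration): as the polynomial degree grows, the training MSE keeps shrinking while the test MSE eventually grows.

```python
# A minimal sketch: small training MSE does not imply small test MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```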
R-squared
- In simple linear regression \(r^2 = R^2\)
From ISLR:
A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance \(\sigma^2\) is high, or both.
It can still be challenging to determine what is a good \(R^2\) value, and in general, this will depend on the application.
In certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an \(R^2\) value that is extremely close to 1, and a substantially smaller \(R^2\) value might indicate a serious problem with the experiment in which the data were generated.
On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an \(R^2\) value well below 0.1 might be more realistic!
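For reference, \(R^2\) is defined (as in ISLR) in terms of the total and residual sums of squares:

\[
R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},
\qquad \mathrm{TSS} = \sum_i (y_i - \bar{y})^2, \quad \mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2.
\]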
F-Test
Testing whether all of the regression coefficients are zero, i.e. \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
\(H_a:\) at least one \(\beta_j\) is non-zero.
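The corresponding F-statistic (as in ISLR), with \(n\) observations and \(p\) predictors, is:

\[
F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)},
\]

where \(\mathrm{TSS}\) and \(\mathrm{RSS}\) are the total and residual sums of squares defined above. A large F-statistic provides evidence against \(H_0\).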