Chapter 2 Machine Learning Fundamentals

2.1 Definitions

2.1.1 Data Science

2.1.1.1 What is data science?

At its core, data science is using data to answer questions. This is a pretty broad definition, and that’s because it’s a pretty broad field!

Data science can involve:

• Statistics, computer science, mathematics

• Data cleaning and formatting

• Data visualization

An Economist Special Report sums up this mélange of skills well - they state that a data scientist is broadly defined as someone: “who combines the skills of software programmer, statistician and storyteller slash artist to extract the nuggets of gold hidden under mountains of data.” And by the end of these courses, hopefully you will feel equipped to do just that!

2.1.1.2 Why do we need data science?

One of the reasons for the rise of data science in recent years is the vast amount of data currently available and being generated. Not only are massive amounts of data being collected about many aspects of the world and our lives, but we simultaneously have the rise of inexpensive computing. This has created the perfect storm in which we have rich data and the tools to analyse it: Rising computer memory capabilities, better processors, more software and now, more data scientists with the skills to put this to use and answer questions using this data! There is a little anecdote that describes the truly exponential growth of data generation we are experiencing. In the third century BC, the Library of Alexandria was believed to house the sum of human knowledge. Today, there is enough information in the world to give every person alive 320 times as much of it as historians think was stored in Alexandria’s entire collection. And that is still growing.

2.1.1.3 What is big data?

Big data has been integral to the rise of data science. There are a few qualities that characterize big data. The first is volume. As the name implies, big data involves large datasets - and these large datasets are becoming more and more routine. For example, say you had a question about online video - well, YouTube has approximately 300 hours of video uploaded every minute! You would certainly have a lot of data available to analyse, but you can see how wrangling all of that data might be a difficult problem!

And this brings us to the second quality of big data: velocity. Data is being generated and collected faster than ever before. In our YouTube example, new data is coming at you every minute! In a completely different example, say you have a question about shipping times or routes. Well, most transport trucks have real-time GPS data available - you could analyse the trucks’ movements in real time… if you have the tools and skills to do so!

The third quality of big data is variety. In the examples I’ve mentioned so far, you have different types of data available to you. In the YouTube example, you could be analysing video or audio, which is a very unstructured data set, or you could have a database of video lengths, views or comments, which is a much more structured dataset to analyse.

2.1.1.4 1. Descriptive analysis

The goal of descriptive analysis is to describe or summarize a set of data. Whenever you get a new dataset to examine, this is usually the first kind of analysis you will perform. Descriptive analysis will generate simple summaries about the samples and their measurements. You may be familiar with common descriptive statistics: measures of central tendency (eg: mean, median, mode) or measures of variability (eg: range, standard deviations or variance). This type of analysis is aimed at summarizing your sample – not for generalizing the results of the analysis to a larger population or trying to make conclusions. Description of data is separated from making interpretations; generalizations and interpretations require additional statistical steps. Some examples of purely descriptive analysis can be seen in censuses. Here, the government collects a series of measurements on all of the country’s citizens, which can then be summarized – for example, the age distribution of the US population, stratified by sex.

2.1.1.5 2. Exploratory analysis

The goal of exploratory analysis is to examine or explore the data and find relationships that weren’t previously known. Exploratory analyses explore how different measures might be related to each other but do not confirm that relationship as causative. You’ve probably heard the phrase “Correlation does not imply causation” and exploratory analyses lie at the root of this saying. Just because you observe a relationship between two variables during exploratory analysis, it does not mean that one necessarily causes the other. Because of this, exploratory analyses, while useful for discovering new connections, should not be the final say in answering a question! It can allow you to formulate hypotheses and drive the design of future studies and data collection, but exploratory analysis alone should never be used as the final say on why or how data might be related to each other.

2.1.1.6 3. Inferential analysis

The goal of inferential analyses is to use a relatively small sample of data to infer or say something about the population at large. Inferential analysis is commonly the goal of statistical modelling, where you have a small amount of information to extrapolate and generalize that information to a larger group.

Inferential analysis typically involves using the data you have to estimate that value in the population and then give a measure of your uncertainty about your estimate. Since you are moving from a small amount of data and trying to generalize to a larger population, your ability to accurately infer information about the larger population depends heavily on your sampling scheme - if the data you collect is not from a representative sample of the population, the generalizations you infer won’t be accurate for the population.

2.1.1.7 4. Predictive analysis

The goal of predictive analysis is to use current data to make predictions about future data. Essentially, you are using current and historical data to find patterns and predict the likelihood of future outcomes. Like in inferential analysis, your accuracy in predictions is dependent on measuring the right variables. If you aren’t measuring the right variables to predict an outcome, your predictions aren’t going to be accurate. Additionally, there are many ways to build up prediction models with some being better or worse for specific cases, but in general, having more data and a simple model generally performs well at predicting future outcomes. All this being said, much like in exploratory analysis, just because one variable may predict another, it does not mean that one causes the other; you are just capitalizing on this observed relationship to predict the second variable. A common saying is that prediction is hard, especially about the future. There aren’t easy ways to gauge how well you are going to predict an event until that event has come to pass; so evaluating different approaches or models is a challenge.

We spend a lot of time trying to predict things - the upcoming weather, the outcomes of sports events, and in the example we’ll explore here, the outcomes of elections. We’ve previously mentioned Nate Silver of FiveThirtyEight, where they try to predict the outcomes of U.S. elections (and sports matches, too!). Using historical polling data, trends, and current polling, FiveThirtyEight builds models to predict the outcomes in the next US Presidential vote - and has been fairly accurate at doing so! FiveThirtyEight’s models accurately predicted the 2008 and 2012 elections and were widely considered an outlier in the 2016 US election, as they were among the few models to suggest Donald Trump had a real chance of winning.

2.1.1.8 Causal analysis

The caveat to a lot of the analyses we’ve looked at so far is that we can only see correlations and can’t get at the cause of the relationships we observe. Causal analysis fills that gap; the goal of causal analysis is to see what happens to one variable when we manipulate another variable - looking at the cause and effect of a relationship. Generally, causal analyses are fairly complicated to do with observed data alone; there will always be questions as to whether it is correlation driving your conclusions or that the assumptions underlying your analysis are valid. More often, causal analyses are applied to the results of randomized studies that were designed to identify causation. Causal analysis is often considered the gold standard in data analysis, and is seen frequently in scientific studies where scientists are trying to identify the cause of a phenomenon, but often getting appropriate data for doing a causal analysis is a challenge. One thing to note about causal analysis is that the data is usually analysed in aggregate and observed relationships are usually average effects; so, while on average giving a certain population a drug may alleviate the symptoms of a disease, this causal relationship may not hold true for every single affected individual.

2.1.1.9 Experimental Design

Now that we’ve looked at the different types of data science questions, we are going to spend some time looking at experimental design concepts. As a data scientist, you are a scientist and as such, need to have the ability to design proper experiments to best answer your data science questions! What does experimental design mean? Experimental design is organizing an experiment so that you have the correct data (and enough of it!) to clearly and effectively answer your data science question. This process involves clearly formulating your question in advance of any data collection, designing the best set-up possible to gather the data to answer your question, identifying problems or sources of error in your design, and only then, collecting the appropriate data. Why should you care?

2.1.1.10 Confounder:

An extraneous variable that may affect the relationship between the dependent and independent variables. In our example, since age affects foot size and literacy is affected by age, if we see any relationship between shoe size and literacy, the relationship may actually be due to age – age is “confounding” our experimental design! To control for this, we can make sure we also measure the age of each individual so that we can take into account the effects of age on literacy, as well. Another way we could control for age’s effect on literacy would be to fix the age of all participants. If everyone we study is the same age, then we have removed the possible effect of age on literacy.

Age is confounding my experimental design - we need to control for it! In other experimental design paradigms, a control group may be appropriate. This is when you have a group of experimental subjects that are not manipulated. So if you were studying the effect of a drug on survival, you would have a group that received the drug (treatment) and a group that did not (control). This way, you can compare the effects of the drug in the treatment versus the control group.

A control group is a group of subjects that do not receive the treatment but still have their dependent variables measured. In these study designs, there are other strategies we can use to control for confounding effects. One, we can blind the subjects to their assigned treatment group. Sometimes, when a subject knows that they are in the treatment group (eg: receiving the experimental drug), they can feel better, not from the drug itself, but from knowing they are receiving treatment. This is known as the placebo effect. To combat this, often participants are blinded to the treatment group they are in; this is usually achieved by giving the control group a mock treatment (eg: given a sugar pill they are told is the drug). In this way, if the placebo effect is causing a problem with your experiment, both groups should experience it equally.

Blinding your study means that your subjects don’t know what group they belong to - all participants receive a “treatment.” And this strategy is at the heart of many of these studies: spreading any possible confounding effects equally across the groups being compared. For example, if you think age is a possible confounding effect, making sure that both groups have similar ages and age ranges will help to mitigate any effect age may be having on your dependent variable - the effect of age is equal between your two groups. This “balancing” of confounders is often achieved by randomization. Generally, we don’t know what will be a confounder beforehand; to help lessen the risk of accidentally biasing one group to be enriched for a confounder, you can randomly assign individuals to each of your groups. This means that any potential confounding variables should be distributed between each group roughly equally, to help eliminate/reduce systematic errors.

Randomizing subjects to either the control or treatment group is a great strategy to reduce confounders’ effects. There is one final concept of experimental design that we need to cover in this lesson, and that is replication. Replication is pretty much what it sounds like: repeating an experiment with different experimental subjects. A single experiment’s results may have occurred by chance; a confounder was unevenly distributed across your groups, there was a systematic error in the data collection, there were some outliers, etc. However, if you can repeat the experiment and collect a whole new set of data and still come to the same conclusion, your study is much stronger. Also at the heart of replication is that it allows you to measure the variability of your data more accurately, which allows you to better assess whether any differences you see in your data are significant.

2.1.1.11 Beware p-hacking!

One of the many things often reported in experiments is a value called the p-value. This value tells you the probability of observing results at least as extreme as yours purely by chance, assuming there is no real effect. This is a very important concept in statistics that we won’t be covering in depth here; if you want to know more, check out this video explaining more about p-values. What you need to look out for is manipulating p-values toward your own ends. Often, when your p-value is less than 0.05 (in other words, there is a 5 percent chance that differences this large would be observed by chance), a result is considered significant. But if you do 20 tests, you would expect one of the twenty (5%) to come out significant by chance alone. In the age of big data, testing twenty hypotheses is a very easy proposition. And this is where the term p-hacking comes from: exhaustively searching a data set to find patterns and correlations that appear statistically significant by virtue of the sheer number of tests you have performed. These spurious correlations can be reported as significant, and if you perform enough tests, you can find a data set and analysis that will show you what you wanted to see. Check out the FiveThirtyEight activity where you can manipulate and filter data and perform a series of tests such that you can get the data to find whatever relationship you want! XKCD mocks this concept in a comic testing the link between jelly beans and acne - clearly there is no link, but if you test enough jelly bean colours, eventually one of them will be correlated with acne at p < 0.05!
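To see how easily multiple testing produces “significant” results by chance, here is a minimal R sketch. The data are purely simulated (a stand-in for the jelly bean example); no real relationship exists between the groups.

set.seed(123)

# Simulate 20 independent tests where there is truly no effect
n_tests <- 20
p_values <- replicate(n_tests, {
  group_a <- rnorm(30)   # e.g. acne score for people eating one jelly bean colour
  group_b <- rnorm(30)   # acne score for everyone else
  t.test(group_a, group_b)$p.value
})

# How many tests came out "significant" at the 0.05 level purely by chance?
sum(p_values < 0.05)
p_values[p_values < 0.05]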

2.1.1.12 Data types

  • Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example would be something like weight measured in kg.

  • Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered. This could be for example survey responses where the choices are: poor, fair, good.

  • Categorical data are data where there are multiple categories, but they aren’t ordered. One example would be sex: male or female. Coding categories with descriptive text labels (rather than numbers) is attractive because it is self-documenting.

  • Missing data are data that are unobserved and you don’t know the mechanism. You should code missing values as NA.

  • Censored data are data where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit or a patient being lost to follow-up. They should also be coded as NA when you don’t have the data. But you should also add a new column to your tidy data called, “VariableNameCensored” which should have values of TRUE if censored and FALSE if not.
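A minimal R sketch of these coding rules, using a hypothetical lab-measurement dataset where one value falls below a detection limit; the column name LeadCensored follows the “VariableNameCensored” convention described above, and all values are made up for illustration.

# Hypothetical tidy data: blood lead measurements with a detection limit of 1.0
blood_lead <- data.frame(
  id   = 1:5,
  lead = c(2.3, NA, 1.7, NA, 3.1)   # unobserved values coded as NA
)

# Companion column flagging which NA is censored (below the detection limit)
# rather than missing for an unknown reason: here subject 2 is censored,
# while subject 4 is missing with an unknown mechanism
blood_lead$LeadCensored <- c(FALSE, TRUE, FALSE, FALSE, FALSE)

blood_lead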

2.1.1.13 Data scientists in marketing science

Data scientists in marketing science play a crucial role in leveraging data-driven insights to optimize marketing strategies and improve decision-making.

Data scientists in marketing science contribute significantly to the development of targeted, efficient, and impactful marketing campaigns by harnessing the power of data and analytics. Their work helps organizations optimize their marketing spend, enhance customer experiences, and achieve measurable business outcomes.

Here are some key responsibilities and activities that data scientists in marketing science typically engage in:

  1. Data Analysis:
    • Conducting extensive data analysis to understand customer behavior, market trends, and other relevant metrics.
    • Utilizing statistical methods and machine learning algorithms to extract meaningful patterns and insights from large datasets.
  2. Predictive Modeling:
    • Developing and deploying predictive models to forecast future trends, customer behavior, and campaign outcomes.
    • Using machine learning techniques, such as regression analysis, decision trees, and ensemble methods, to build predictive models.
  3. Segmentation and Targeting:
    • Creating customer segments based on demographics, behavior, and other relevant factors.
    • Optimizing marketing strategies by targeting specific segments with personalized and relevant content.
  4. A/B Testing:
    • Designing and conducting A/B tests to evaluate the effectiveness of different marketing strategies, campaigns, or variations.
    • Analyzing A/B test results to make data-driven recommendations for optimization.
  5. Causal Inference:
    • Applying advanced causal inference methods to understand the impact of marketing initiatives on customer behavior.
    • Assessing the causal relationships between marketing activities and business outcomes.
  6. Data Visualization:
    • Creating clear and compelling data visualizations to communicate complex insights to non-technical stakeholders.
    • Using tools like Tableau, Power BI, or custom scripts to visualize data in a meaningful way.
  7. Optimization Strategies:
    • Collaborating with marketing teams to develop and optimize marketing strategies based on data insights.
    • Recommending adjustments to campaigns, targeting strategies, and budget allocations for better performance.
  8. Performance Measurement:
    • Developing key performance indicators (KPIs) and metrics to assess the success of marketing campaigns.
    • Monitoring and evaluating marketing performance against established benchmarks.
  9. Data Management:
    • Ensuring the quality and integrity of marketing data by cleaning, preprocessing, and validating datasets.
    • Collaborating with data engineers to design and implement data pipelines for efficient data processing.
  10. Communication and Collaboration:
    • Effectively communicating findings and insights to non-technical stakeholders, including marketing teams and executives.
    • Collaborating with cross-functional teams to align data-driven strategies with overall business objectives.

Overfitting:

  • Statistical model is too complex

  • Too many parameters when compared to the total number of observations.

  • Poor Predictive Performance

  • Overfitted model overreacts to minor fluctuations in the training data

Underfitting:

  • Statistical model is too primitive

  • Poor predictive performance on the training data

  • The underfit model under-reacts to even bigger fluctuations.

Bias:

Bias is an error introduced in your model because of the oversimplification of a machine learning algorithm. It can lead to underfitting.

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data.

Also, these kinds of models are too simple to capture complex patterns in data - for example, linear and logistic regression.

In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in data.

It happens when we train our model too much on a noisy dataset. These models have low bias and high variance. They are often very complex models, like decision trees, which are prone to overfitting.

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.

What is variance?

Variance is the variability of model predictions for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data (the overfitting issue).

Why is there a Bias-Variance Trade-off?

If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it is going to have high variance and low bias. So we need to find the right balance, neither overfitting nor underfitting the data.
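A small simulated R sketch of this trade-off: a degree-1 polynomial underfits (high bias), while a degree-12 polynomial overfits (high variance). The data-generating function and polynomial degrees are arbitrary choices for illustration.

set.seed(1)

# Simulate a nonlinear relationship with noise
dat <- data.frame(x = runif(100))
dat$y <- sin(2 * pi * dat$x) + rnorm(100, sd = 0.3)
train <- dat[1:70, ]
test  <- dat[71:100, ]

# Fit polynomials of increasing flexibility and compare training vs. test error
for (degree in c(1, 3, 12)) {
  fit <- lm(y ~ poly(x, degree), data = train)
  mse_train <- mean((train$y - predict(fit, newdata = train))^2)
  mse_test  <- mean((test$y  - predict(fit, newdata = test))^2)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n",
              degree, mse_train, mse_test))
}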

Name three types of biases that can occur during sampling

In the sampling process, there are three types of biases, which are:

• Selection bias

• Undercoverage bias

• Survivorship bias

Question: What do you understand by the Selection Bias? What are its various types?

Answer: Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect.

In other words, selection bias is a distortion of statistical analysis that results from the sample collecting method. When selection bias is not taken into account, some conclusions made by a research study might not be accurate. Following are the various types of selection bias:

• Sampling Bias – A systematic error resulting from a non-random sample of a population, causing some members of the population to be less likely to be included than others, which produces a biased sample.

• Time Interval – A trial might be ended early at an extreme value, usually for ethical reasons, but the extreme value is most likely to be reached by the variable with the largest variance, even if all variables have a similar mean.

• Data – Results when specific subsets of the data are selected to support a conclusion, or when “bad” data are rejected arbitrarily.

• Attrition – Caused by attrition (loss of participants), i.e., discounting trial subjects or tests that did not run to completion.

  1. Discuss the Decision Tree algorithm A decision tree is a popular supervised machine learning algorithm. It is mainly used for regression and classification. It breaks a dataset down into smaller and smaller subsets. Decision trees can handle both categorical and numerical data.

  2. What is prior probability and likelihood? The prior probability is the proportion of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation in the presence of some other variable.

  3. Explain Recommender Systems A recommender system is a subclass of information filtering techniques. It helps you predict the preferences or ratings that users are likely to give to a product.

Question: Please explain Recommender Systems along with an application. Answer: Recommender Systems is a subclass of information filtering systems, meant for predicting the preferences or ratings awarded by a user to some product. An application of a recommender system is the product recommendations section in Amazon. This section contains items based on the user’s search history and past orders.

  1. Name three disadvantages of using a linear model Three disadvantages of the linear model are:
    • The assumption of linearity of the errors.
    • You can’t use this model for binary or count outcomes.
    • There are plenty of overfitting problems that it can’t solve.

  2. Why do you need to perform resampling?

Resampling is done in the following cases:
  • Estimating the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or by using subsets of the accessible data
  • Substituting labels on data points when performing significance tests
  • Validating models by using random subsets

  1. List the libraries in Python used for data analysis and scientific computation: SciPy, pandas, Matplotlib, NumPy, scikit-learn, Seaborn.
  2. What is Power Analysis? Power analysis is an integral part of experimental design. It helps you determine the sample size required to detect an effect of a given size with a specified level of confidence (power), and therefore lets you plan your study within a sample size constraint.
  3. Explain Collaborative filtering Collaborative filtering is used to search for patterns by combining viewpoints, multiple data sources, and various agents.
  4. Discuss ‘Naive’ in a Naive Bayes algorithm The Naive Bayes model is based on Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to that event. It is called “naive” because it assumes the features are conditionally independent of each other.
  5. What is Linear Regression? Linear regression is a statistical method in which the score of a variable ‘A’ is predicted from the score of a second variable ‘B’. B is referred to as the predictor variable and A as the criterion variable.
  6. State the difference between the expected value and mean value There are not many differences, but the two terms are used in different contexts. Mean value is generally referred to when discussing a probability distribution, whereas expected value is used in the context of a random variable.
  7. What is the aim of conducting A/B Testing? A/B testing is used to conduct random experiments with two variants, A and B. The goal of this testing method is to identify changes to a web page that maximize or increase the outcome of a strategy.
  8. What is Ensemble Learning? An ensemble is a method of combining a diverse set of learners to improve the stability and predictive power of the model. Two types of ensemble learning methods are:

Bagging: The bagging method fits similar learners on small sample populations drawn with replacement from the data and combines their predictions, which helps you make better predictions.

Boosting: Boosting is an iterative method which adjusts the weight of each observation depending on the last classification. Boosting decreases the bias error and helps you build strong predictive models.

  18. Explain Eigenvalue and Eigenvector Eigenvectors help in understanding linear transformations; in data analysis they are typically calculated for a covariance or correlation matrix. Eigenvectors are the directions along which a particular linear transformation acts by compressing, flipping, or stretching. Eigenvalues can be understood either as the strengths of the transformation in the direction of the eigenvectors, or as the factors by which the compression or stretching happens.
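A minimal R sketch of computing the eigenvalues and eigenvectors of a covariance matrix; the two correlated variables below are simulated purely for illustration.

set.seed(7)

# Simulated data: two correlated variables
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
X  <- cbind(x1, x2)

# Eigen decomposition of the covariance matrix
cov_mat <- cov(X)
decomp  <- eigen(cov_mat)

decomp$values    # eigenvalues: how strongly the transformation stretches each direction
decomp$vectors   # eigenvectors: the directions themselves (one per column)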

  1. Define the term cross-validation Cross-validation is a validation technique for evaluating how the results of a statistical analysis will generalize to an independent dataset. It is used in settings where the objective is prediction and one needs to estimate how accurately a model will perform in practice (a small sketch of k-fold cross-validation follows this list). Question: Can you compare the validation set with the test set? Answer: A validation set is part of the training data used for parameter selection as well as for avoiding overfitting of the machine learning model being developed. By contrast, a test set is meant for evaluating or testing the performance of a trained machine learning model.

  2. Explain the steps for a Data analytics project The following are important steps involved in an analytics project:
    • Understand the business problem
    • Explore the data and study it carefully
    • Prepare the data for modeling by finding missing values and transforming variables
    • Run the model and analyze the results
    • Validate the model with a new data set
    • Implement the model and track the results to analyze the model’s performance over a specific period
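A minimal sketch of k-fold cross-validation written by hand in base R, using the built-in mtcars data and an arbitrary linear model purely for illustration.

set.seed(10)

# Hypothetical task: predict mpg from weight and horsepower (built-in mtcars data)
data(mtcars)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold

cv_errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)    # held-out fold MSE
})

mean(cv_errors)   # cross-validated estimate of prediction error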

Question: What do you mean by cluster sampling and systematic sampling? Answer: When studying a target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Following the technique of systematic sampling, elements are chosen from an ordered sampling frame. The list is advanced in a circular fashion, so that once the end of the list is reached, it is progressed again from the start, or top.

  1. What is a Random Forest? Random forest is an ensemble machine learning method which helps you perform regression and classification tasks. It is also used for treating missing values and outlier values.
  2. What is the importance of recognizing selection bias? Selection bias occurs when there is no proper randomization achieved while picking individuals, groups, or data to be analyzed. It means that the given sample does not exactly represent the population that was intended to be analyzed.
  3. What is the K-means clustering method? K-means clustering is an important unsupervised learning method. It partitions the data into a chosen number of clusters, K, grouping observations by their similarity.
  4. Explain the difference between Data Science and Data Analytics Data scientists need to slice data to extract valuable insights that a data analyst can then apply to real-world business scenarios. The main difference between the two is that data scientists typically have more technical knowledge than business analysts, while analysts focus more on the business understanding needed for reporting and visualization.
  5. Explain p-value When you conduct a hypothesis test in statistics, a p-value helps you determine the strength of your results. It is a number between 0 and 1; based on this value, you can gauge the strength of evidence for the specific result.
  6. Define the term deep learning Deep learning is a subtype of machine learning. It is concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks (ANNs).
  7. Explain a method to collect and analyze data that uses social media to predict the weather. You can collect social media data using the Facebook, Twitter, or Instagram APIs. For Twitter, for example, you can construct features from each tweet, such as the tweet date, number of retweets, number of followers, and so on. Then you can use a multivariate time series model to predict the weather.
  8. When do you need to update an algorithm in Data Science? You need to update an algorithm in the following situations:
    • You want your data model to evolve as data streams through the infrastructure
    • The underlying data source is changing
    • The data is non-stationary
  9. What is Normal Distribution? A normal distribution describes a continuous variable whose values spread across a bell-shaped curve. It is a continuous probability distribution that is widely used in statistics and is useful for analyzing variables and their relationships.
  10. Which language is best for text analytics, R or Python? Python is often more suitable for text analytics because of rich libraries such as pandas, which provide high-level data analysis tools and data structures that make this kind of work convenient.
  11. Explain the benefits of statistics for Data Scientists Statistics help data scientists get a better idea of customers’ expectations. Using statistical methods, data scientists can gain knowledge about consumer interest, behavior, engagement, retention, and so on. Statistics also help you build powerful data models and validate inferences and predictions.
  12. Name various types of Deep Learning Frameworks • Pytorch • Microsoft Cognitive Toolkit • TensorFlow • Caffe • Chainer • Keras
  13. Explain why data cleansing is essential and which method you use to maintain clean data Dirty data often leads to incorrect insights, which can damage the prospects of any organization. For example, suppose you want to run a targeted marketing campaign, but your data incorrectly tells you that a specific product will be in demand with your target audience; the campaign will fail.
  14. What are skewed and uniform distributions? A skewed distribution occurs when the data is concentrated on one side of the plot, whereas a uniform distribution is one in which the data is spread equally across its range.
  15. When does underfitting occur in a statistical model? Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
  16. Name commonly used algorithms. The four algorithms most commonly used by data scientists are: • Linear regression • Logistic regression • Random forest • KNN
  17. What is precision? Precision is one of the most commonly used error metrics in classification. It ranges from 0 to 1, where 1 represents 100%.
  18. What is a univariate analysis? An analysis that is applied to one attribute at a time is known as univariate analysis. The boxplot is a widely used univariate visualization.
  19. How do you overcome challenges to your findings? To overcome challenges to your findings, you need to encourage discussion, demonstrate leadership, and respect differing opinions.
  20. Explain the cluster sampling technique in Data Science Cluster sampling is used when it is challenging to study a target population spread across a wide area and simple random sampling cannot be applied.
  21. State the difference between a Validation Set and a Test Set A validation set is usually considered part of the training set, as it is used for parameter selection and for avoiding overfitting of the model being built. A test set is used for testing or evaluating the performance of a trained machine learning model.
  22. Explain the term Binomial Probability Formula The binomial distribution gives the probability of every possible number of successes in N trials for independent events that each have probability π of occurring.
  23. What is recall? Recall is the ratio of true positives to all actual positives, i.e. TP / (TP + FN). It ranges from 0 to 1.
  24. Discuss normal distribution The normal distribution is symmetric about its center, so the mean, median and mode are equal.
  25. While working on a data set, how can you select important variables? Explain. You can use the following methods of variable selection:
    • Remove correlated variables before selecting important variables
    • Use linear regression and select variables based on their p-values
    • Use backward, forward, or stepwise selection
    • Use XGBoost or Random Forest and plot a variable importance chart
    • Measure information gain for the given set of features and select the top n features accordingly
  26. Is it possible to capture the correlation between a continuous and a categorical variable? Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association between continuous and categorical variables.
  27. Would treating a categorical variable as a continuous variable result in a better predictive model? A categorical variable should be treated as continuous only when the variable is ordinal in nature; in that case it can yield a better predictive model.

Question: What are recall and precision? Recall is the proportion of actual positives that were identified correctly: TP / (TP + FN). Precision is the proportion of positive identifications that were actually correct: TP / (TP + FP).

Question: What are false positives and false negatives? A false positive is an incorrect identification of the presence of a condition when it is actually absent. A false negative is an incorrect identification of the absence of a condition when it is actually present.

Question: Please explain the goal of A/B Testing. Answer: A/B Testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B Testing is to maximize the likelihood of an outcome of interest by identifying the changes to a webpage that improve it. A highly reliable method for finding the best online marketing and promotional strategies for a business, A/B Testing can be employed to test everything from sales emails to search ads and website copy.

Question: Could you explain how to define the number of clusters in a clustering algorithm? Answer: The primary objective of clustering is to group similar entities in such a way that entities within a group are similar to each other while the groups remain different from one another. Generally, the Within Sum of Squares (WSS) is used to measure the homogeneity within a cluster. To define the number of clusters, WSS is plotted for a range of candidate numbers of clusters. The resulting graph is known as the Elbow Curve: it contains a point after which there are no substantial decreases in the WSS. This bending point represents K in K-means (a small R sketch follows this block). Although this is the widely used approach, another important approach is hierarchical clustering, in which dendrograms are created first and distinct groups are then identified from them.

Question: Please explain Gradient Descent. Answer: The degree of change in the output of a function relative to the changes made to the inputs is known as a gradient; it measures the change in all weights with respect to the change in error, and can also be understood as the slope of a function. Gradient Descent refers to descending to the bottom of a valley, as opposed to climbing up a hill: it is a minimization algorithm used to minimize a given cost function.

Question: Please enumerate the various steps involved in an analytics project. Answer: The steps involved in an analytics project are:
• Understanding the business problem
• Exploring the data and familiarizing yourself with it
• Preparing the data for modeling by detecting outlier values, transforming variables, treating missing values, et cetera
• Running the model and analyzing the result, making appropriate changes or modifications to the model (an iterative step that repeats until the best possible outcome is obtained)
• Validating the model using a new dataset
• Implementing the model and tracking the result to analyze its performance
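A minimal R sketch of the elbow approach described in the clustering answer above, using the built-in iris measurements purely for illustration.

set.seed(3)

# Scale the numeric columns of a sample dataset
dat <- scale(iris[, 1:4])

# Total within-cluster sum of squares (WSS) for K = 1..10
wss <- sapply(1:10, function(k) kmeans(dat, centers = k, nstart = 25)$tot.withinss)

# The "elbow" in this curve suggests a reasonable choice of K
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")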

Question: What are outlier values and how do you treat them? Answer: Outlier values, or simply outliers, are data points in statistics that don’t belong to a certain population. An outlier value is an abnormal observation that is very different from the other values belonging to the set. Identification of outlier values can be done using univariate or other graphical analysis methods. A few outlier values can be assessed individually, but assessing a large set of outliers requires substituting them with either the 99th or the 1st percentile values. There are two popular ways of treating outlier values: 1. Change the value so that it is brought within a range; 2. Simply remove the value. Note: not all extreme values are outliers.
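A minimal R sketch of the percentile-capping treatment described above, using simulated data with two artificial extreme points.

set.seed(5)
x <- c(rnorm(100), 15, -12)          # simulated values with two extreme points

# Cap values outside the 1st and 99th percentiles
lower <- quantile(x, 0.01)
upper <- quantile(x, 0.99)
x_capped <- pmin(pmax(x, lower), upper)

summary(x)
summary(x_capped)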

  1. Discuss Artificial Neural Networks Artificial Neural Networks (ANNs) are a special set of algorithms that have revolutionized machine learning. They help the model adapt to changing input, so the network generates the best possible result without redesigning the output criteria.
  2. What is Back Propagation? Back-propagation is the essence of neural net training. It is the method of tuning the weights of a neural net based on the error rate obtained in the previous epoch. Proper tuning of the weights helps you reduce error rates and makes the model more reliable by improving its generalization.

Explain Auto-Encoder

Autoencoders are learning networks that transform inputs into outputs with as few errors as possible; that is, the output is as close to the input as possible.

36. Define Boltzmann Machine A Boltzmann machine is a simple learning algorithm. It helps you discover features that represent complex regularities in the training data, and it allows you to optimize the weights and quantities for the given problem.

40. What is reinforcement learning? Reinforcement learning is about learning how to map situations to actions so as to maximize a reward signal. In this method, a learner is not told which action to take but instead must discover which actions offer the maximum reward; the method is based on a reward/penalty mechanism.

2.1.2 Bootstrapping

2.1.2.1 Jack-knife

• The jackknife is a tool for estimating standard errors and the bias of estimators

• As its name suggests, the jackknife is a small, handy tool; in contrast to the bootstrap, which is then the moral equivalent of a giant workshop full of tools

• Both the jackknife and the bootstrap involve re-sampling data; that is, repeatedly creating new data sets from the original data

• The jackknife deletes each observation and calculates an estimate based on the remaining n − 1 of them

• It uses this collection of estimates to do things like estimate the bias and the standard error

• Note that estimating the bias and the standard error is not needed for statistics like the sample mean, which we know is an unbiased estimate of the population mean and whose standard error we know

• It has been shown that the jackknife is a linear approximation to the bootstrap

• Generally, do not use the jackknife for sample quantiles like the median, as it has been shown to have poor properties in that setting

The bootstrap

• The bootstrap is a tremendously useful tool for constructing confidence intervals and calculating standard errors for difficult statistics

• For example, how would one derive a confidence interval for the median?

• The bootstrap procedure follows from the so-called bootstrap principle

Suppose that I have a statistic that estimates some population parameter, but I don’t know its sampling distribution.

• The bootstrap principle suggests using the distribution defined by the data to approximate its sampling distribution

• In practice, the bootstrap principle is always carried out using simulation

• The general procedure follows by first simulating complete data sets from the observed data with replacement

• This is approximately drawing from the sampling distribution of that statistic, at least as far as the data is able to approximate the true population distribution

• Calculate the statistic for each simulated data set

• Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error

Example

• Consider again the data set of 630 measurements of gray matter volume for workers from a lead manufacturing plant

• The median gray matter volume is around 589 cubic centimeters

• We want a confidence interval for the median of these measurements

Bootstrap procedure for calculating a confidence interval for the median from a data set of n observations:

i. Sample n observations with replacement from the observed data, resulting in one simulated complete data set

ii. Take the median of the simulated data set

iii. Repeat these two steps B times, resulting in B simulated medians

iv. These medians are approximately draws from the sampling distribution of the median of n observations; therefore we can:

• Draw a histogram of them

• Calculate their standard deviation to estimate the standard error of the median

• Take the 2.5th and 97.5th percentiles as a confidence interval for the median
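A minimal R sketch of this bootstrap procedure for the median; simulated values are used in place of the actual gray-matter measurements, which are not included here.

set.seed(2)

# Simulated stand-in for the 630 gray matter volume measurements
volumes <- rnorm(630, mean = 589, sd = 50)

B <- 10000
boot_medians <- replicate(B, median(sample(volumes, replace = TRUE)))

sd(boot_medians)                              # bootstrap estimate of the standard error
quantile(boot_medians, c(0.025, 0.975))       # percentile 95% confidence interval
hist(boot_medians, main = "Bootstrap distribution of the median")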

Summary

• The bootstrap is non-parametric

• However, the theoretical arguments proving the validity of the bootstrap rely on large samples

• Better percentile bootstrap confidence intervals correct for bias

• There are lots of variations on bootstrap procedures; the book “An Introduction to the Bootstrap” by Efron and Tibshirani is a great place to start for both bootstrap and jackknife information

2.1.3 Classification Models: Evaluation

Sources: my Medium story; Google Developers.

2.1.3.1 Thresholding

Logistic regression returns a probability. You can use the returned probability “as is” (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).

A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic regression model is very likely not spam.

However, what about an email message with a prediction score of 0.6? In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold).

A value above that threshold indicates “spam”; a value below indicates “not spam.” It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.

Note: “Tuning” a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you’ll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.
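A tiny R sketch of applying a classification threshold to predicted probabilities; the scores below are hypothetical.

# Hypothetical predicted spam probabilities from a logistic regression model
scores <- c(0.9995, 0.0003, 0.6, 0.45, 0.72)

threshold <- 0.5   # problem-dependent: tune it based on the cost of each kind of mistake
labels <- ifelse(scores > threshold, "spam", "not spam")
labels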

2.1.3.2 Confusion Matrix

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

True Positive: Model predicted positive and it is true.

True negative: Model predicted negative and it is true.

False positive (Type 1 Error): Model predicted positive but it is false.

False negative (Type 2 Error): Model predicted negative but it is false (the actual class is positive).

False Positive Rate (FPR):

The False Positive Rate is the ratio of false positive predictions to the total number of actual negatives. It measures the rate at which the model incorrectly predicts the positive class among the instances that are actually negative.

\(FPR = \frac{FP}{TN + FP}\)

True Positive Rate (TPR), Sensitivity, or Recall:

The True Positive Rate is the ratio of true positive predictions to the total number of actual positives. It measures the ability of the model to correctly predict the positive class among instances that are actually positive.

\(Recall (TPR) = \frac{TP}{TP + FN}\)

Accuracy:

It represents the ratio of correctly predicted instances to the total number of instances. The accuracy metric is suitable for balanced datasets where the classes are evenly distributed. It is calculated using the following formula:

\(Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\)

Accuracy provides a general sense of how well a model is performing across all classes. It is easy to understand and interpret, making it a commonly used metric, especially when the classes are balanced.

However, accuracy may not be an ideal metric in situations where the class distribution is imbalanced. In imbalanced datasets, where one class significantly outnumbers the other, a high accuracy might be achieved by simply predicting the majority class. In such cases, other metrics like precision, recall, F1 score, or area under the receiver operating characteristic (ROC-AUC) curve may be more informative.

Precision:

Precision is the ratio of true positive predictions to the total number of positive predictions made by the model. It represents the accuracy of the positive predictions made by the model.

\(Precision = \frac{TP}{TP + FP}\)

F1 Measure:

The F1 score is a metric commonly used in binary classification to provide a balance between precision and recall. It is the harmonic mean of precision and recall, combining both measures into a single value. The F1 score is particularly useful when there is an uneven class distribution or when both false positives and false negatives are important considerations.

The F1 score is useful in situations where achieving a balance between precision and recall is important, as it penalizes models that have a significant imbalance between these two metrics.

\(F1 score = 2 \times \frac{Precision \times Recall}{Precision + Recall}\)
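A short R sketch computing these metrics from confusion-matrix counts; the counts themselves are made up for illustration.

# Hypothetical confusion matrix counts
TP <- 40; FN <- 10; FP <- 5; TN <- 45

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)          # also called sensitivity or TPR
fpr       <- FP / (FP + TN)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall,
  FPR = fpr, F1 = f1)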

2.1.3.3 In Marketing

Precision:

In marketing, precision is valuable when the cost or impact associated with false positives (incorrectly identifying a non-lookalike as a lookalike) is high. For example, if targeting a non-lookalike with a marketing campaign has significant costs, you want to minimize false positives.

Recall:

In marketing, recall is important when you want to ensure that you are not missing potential opportunities (actual lookalikes). If missing a true lookalike has a high cost or lost opportunity, you want to maximize recall.

2.2 Elastic Net Model

Elastic Net is a regularization technique that combines both L1 (Lasso) and L2 (Ridge) regularization penalties in a linear regression model. This technique is commonly used in machine learning, especially when dealing with high-dimensional datasets or situations where some of the features are highly correlated.

In elastic net regularization, the objective function is a combination of the L1 and L2 regularization terms along with the linear regression loss. The regularization strength is controlled by two hyperparameters, often denoted as α (alpha) and λ (lambda):

α controls the mixing between L1 and L2 regularization. When α = 0, it is equivalent to Ridge regression, and when α = 1, it is equivalent to Lasso regression. Any value between 0 and 1 gives a mixture of both. λ controls the overall strength of the regularization.
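Written out explicitly, a common form of the elastic net objective (this is the parameterization used by the glmnet package for a Gaussian response; other texts scale the penalties slightly differently) is to minimize

\(\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]\)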

In R, you can fit an elastic net model using the glmnet package. Here’s a brief example:

# Install and load the glmnet package if not already installed
# install.packages("glmnet")
library(glmnet)

# Generate some example data
set.seed(42)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta_true <- c(2, 0, 1, 0, 0, 3, 0, 0, -2, 0)
y <- X %*% beta_true + rnorm(n)

# Fit an elastic net model
alpha <- 0.5  # You can adjust alpha to control the mixture of L1 and L2 regularization

# cv.glmnet() selects lambda by cross-validation over a grid of candidate values;
# to fit at a single fixed lambda instead, use glmnet(X, y, alpha = alpha, lambda = 0.1)
enet_model <- cv.glmnet(X, y, alpha = alpha)
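Once fitted, the cross-validated model can be inspected and used for prediction. A short sketch continuing the example above; cv.glmnet stores the lambda that minimized the cross-validated error as lambda.min.

# Lambda value that minimized the cross-validated error
enet_model$lambda.min

# Coefficients at that lambda (zeros indicate features dropped by the L1 penalty)
coef(enet_model, s = "lambda.min")

# Predictions for the training matrix at the selected lambda
head(predict(enet_model, newx = X, s = "lambda.min"))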

2.3 Survival Analysis

2.3.2 Time to event analysis

Time-to-event analysis has also been used widely in the social sciences, where interest is in analyzing time to events such as job changes, marriage, birth of children and so forth.

There are certain aspects of survival analysis data, such as censoring and non-normality, that generate great difficulty when trying to analyze the data using traditional statistical models such as multiple linear regression.

The non-normality aspect of the data violates the normality assumption of the most commonly used statistical models, such as regression or ANOVA.

A censored observation is defined as an observation with incomplete information. When an observation is right censored it means that the information is incomplete because the subject did not have an event during the time that the subject was part of the study.

The point of survival analysis is to follow subjects over time and observe at which point in time they experience the event of interest.

It often happens that the study does not span enough time in order to observe the event for all the subjects in the study. This could be due to a number of reasons. Perhaps subjects drop out of the study for reasons unrelated to the study (i.e. patients moving to another area and leaving no forwarding address).

The common feature of all of these examples is that if the subject had been able to stay in the study then it would have been possible to observe the time of the event eventually.

Types of censoring:

• Right truncation
• Right censoring
• Left truncation
• Left censoring

In survival analysis, censoring refers to situations where the event of interest (e.g., death, failure, or another outcome) is not observed for some subjects during the study period. Two types worth distinguishing are right truncation and right censoring.

  1. Right Truncation:
    • Definition: Right truncation occurs when individuals enter the study at different times, and some individuals have already experienced the event of interest before the study begins.
    • Example: Consider a study on the time until a machine fails. If the study starts at a certain date, and some machines have already failed before that date, those machines are considered right-truncated because their failure times are not observed in the study.
  2. Right Censoring:
    • Definition: Right censoring occurs when individuals are followed for a certain period, but the event of interest does not occur for some of them by the end of the study.
    • Example: In a clinical trial studying the time until disease recurrence, if a patient has not experienced recurrence by the end of the study period or is lost to follow-up, their survival time is right-censored. The exact time of recurrence is not known for these patients.

In summary, right truncation involves incomplete observation due to some subjects entering the study late, whereas right censoring occurs when the event of interest has not occurred for some subjects by the end of the study. Both types of censoring are common in survival analysis and need to be appropriately accounted for in statistical models to obtain unbiased estimates of survival probabilities and hazard rates.

What is survival data?

Time-to-event data that consist of a distinct start time and end time.

Examples from cancer:

• Time from surgery to death
• Time from start of treatment to progression
• Time from response to recurrence

Examples from other fields

Time-to-event data are common in many fields including, but not limited to:

• Time from HIV infection to development of AIDS
• Time to heart attack
• Time to onset of substance abuse
• Time to initiation of sexual activity
• Time to machine malfunction

Types of censoring

A subject may be censored due to:

• Loss to follow-up
• Withdrawal from study
• No event by end of fixed study period

Specifically these are examples of right censoring.

Left censoring and interval censoring are also possible, and methods exist to analyze this type of data.

2.3.3 Kaplan-Meier

Source

The Kaplan-Meier curve is commonly used to analyze time-to-event data, such as the time until death or the time until a specific event occurs. The Kaplan-Meier curve graphically represents the survival rate or survival function: time is plotted on the x-axis and the survival rate is plotted on the y-axis.

2.3.3.1 Survival rate

Suppose you’re a dental technician and you want to study the “survival time” of a filling in a tooth.

So your start time is the moment when a person goes to the dentist for a filling, and your end time, the event, is the moment when the filling breaks. The time between these two events is the focus of your study.

For example, you may be interested in the probability that your filling will last longer than 5 years.

To do this, you read off the value at 5 years on the graph, which is the survival rate.

At 5 years, the Kaplan-Meier curve gives you a value of 0.7.

  So there is a 70% chance that your filling will last longer than 5 years.
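This value can also be computed rather than read off the plot. Below is a minimal sketch in R using the survival package; the filling lifetimes and event indicators are hypothetical, chosen so that the estimate at 5 years matches the 0.7 read from the curve.

library(survival)

# Hypothetical filling lifetimes in years; 1 = filling broke, 0 = still intact
filling_years <- c(3, 4, 4, 6, 7, 8, 9, 11, 12, 13)
broke         <- c(1, 1, 1, 1, 1, 0, 1, 1, 0, 1)

km <- survfit(Surv(filling_years, broke) ~ 1)

# Estimated probability that a filling lasts longer than 5 years
summary(km, times = 5)$surv
## [1] 0.7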

2.3.3.2 Interpreting the Kaplan-Meier curve

The Kaplan-Meier curve shows the cumulative survival probabilities.

A steeper slope indicates a higher event rate (death rate) and therefore a worse survival prognosis. A flatter slope indicates a lower event rate and therefore a better survival prognosis. The curve may have plateaus or flat areas, indicating periods of relatively stable survival.

If there are multiple curves representing different groups, you can compare their shapes and patterns. If the curves are parallel, it suggests that the groups have similar survival experiences. If the curves diverge or cross, it indicates differences in survival between the groups.

At specific time points, you can estimate the survival probability by locating the time point on the horizontal axis and dropping a vertical line to the curve. Then, read the corresponding survival probability from the vertical axis.

2.3.3.3 Calculating the Kaplan-Meier curve

Let’s say the filling lasted 3 years for the first subject, 4 years for the second subject, 4 years for the third subject, and so on.

Let’s assume that none of the cases are “censored”. The data are already arranged so that the shortest survival time is at the top and the longest at the bottom.

Now we create a second table that we can use to draw the Kaplan-Meier curve. To do this, we look at the time points in the left table and add the time zero. So we have the time points 0, 3, 4, 6, 7, 8, 11 and 13. In total we have 10 subjects.

Now we look at how many fillings break at each time point and enter this in column m. At time 0, no fillings had broken. After 3 years there was one broken filling, after 4 years there were two, and after 6 years there was one. We do the same for all the other time points.

Next, we look at the number of cases that have survived to the time plus the number of cases where the event occurs at the exact time. We enter this in column n.

So n is the number of cases that survived beyond that point, plus the cases in which the event occurred at exactly that point.

After zero years we still have all 10 people. After 3 years, we get 10 for n: 9 people still have their filling intact, and one person's filling broke exactly at 3 years.

The easiest way to get n is to take the previous n value and subtract the previous m value. So we get 10 - 1 equals 9. Then 9 minus 2 equals 7, 7 - 1 equals 6… and so on and so forth.

From column n we can now calculate the survival rates. To do this, we simply divide n by the total number, i.e. 10.

So 10 divided by 10 is equal to 1, 9 divided by 10 is equal to 0.9, 7 divided by 10 is equal to 0.7. Now we do the same for all the others.
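The same table can be reproduced in R with survfit() from the survival package. A minimal sketch follows; the individual survival times are hypothetical but chosen to match the counts in the worked example (one event at 3 years, two at 4, one at 6, and so on).

library(survival)

years  <- c(3, 4, 4, 6, 7, 7, 8, 8, 11, 13)   # hypothetical filling lifetimes
status <- rep(1, 10)                          # 1 = filling broke; no censoring here

km <- survfit(Surv(years, status) ~ 1)
summary(km)
## First rows: time 3, n.risk 10, n.event 1, survival 0.9
##             time 4, n.risk  9, n.event 2, survival 0.7
##             time 6, n.risk  7, n.event 1, survival 0.6

Plotting this object (for example with plot(km)) draws the step curve described in the next section.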

2.3.3.4 Draw Kaplan Meier curve

We can now plot the Kaplan-Meier curve. At time 0 we have a value of 1, after 3 years we have a value of 0.9 or 90%. After 4 years we get 0.7, after 6 years 0.6 and so on and so forth.

From the Kaplan-Meier curve, we can now read off what percentage of fillings have not broken after a certain time.

2.3.3.5 Censored data

Now suppose censored observations have been added to the example in three places.

We now need to enter this data into our Kaplan-Meier curve table. We do this as follows: We create our m exactly as we did before, looking at how many cases failed at each time point.

Now we add a column q, in which we enter how many cases were censored at each time.

Note that the time at which each censored case occurred does not get its own row, but is assigned to the previous time.

Consider one of these censored cases: the censoring took place at time 9. There is no event time of nine years in the table, and we do not add one; instead, the censored person is counted at the previous time point, 8.

We can now re-calculate the values for the survival curve. If we have censored data, this is a little more complex.

In the first step, we write down the factor (n - m) / n for each row. In the third row, for example, we get 10/12, i.e. (12 - 2) divided by 12.

The calculation of the actual survival value is iterative: we multiply the survival value from the previous row by the factor we have just calculated.

So, in the first row we get 1, now we calculate 12/13 times 1, which is equal to 0.923. In the next row we calculate 10/12 times 0.923 and get a value of 0.769. We take this value again for the next row.

We do this for all the rows. We can then plot the Kaplan-Meier curve with this data in the same way as before.
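This product-limit calculation is easy to express directly in R. Below is a minimal sketch; the first two rows of at-risk and event counts (13 at risk with 1 event, then 12 at risk with 2 events) follow the worked example, while the remaining rows are hypothetical.

# At each event time, multiply the previous survival value by (n - m) / n
n <- c(13, 12, 9, 7, 5)   # number at risk at each event time
m <- c( 1,  2, 1, 1, 2)   # number of events at each event time

surv <- cumprod((n - m) / n)
round(surv, 3)
## [1] 0.923 0.769 0.684 0.586 0.352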

2.3.3.6 Comparing different groups

If you are comparing several groups or categories (e.g. treatment groups), the Kaplan-Meier curve consists of several lines, each representing a different group. Each line shows the estimated survival rate for that particular group. To test whether there is a statistically significant difference between the groups, the log-rank test can be used.

If you have several factors and want to see whether they have an effect on the curve, you can calculate a log-rank test or fit a Cox regression (both are covered later in this chapter).

2.3.3.7 Kaplan-Meier curve assumptions

Random or Non-informative censoring: This assumption states that the occurrence of censoring is unrelated to the likelihood of experiencing the event of interest. In other words, censoring should be random and not influenced by factors that affect the event outcome. If censoring is not non-informative, the estimated survival probabilities may be biased.

Independence of censoring: This assumption assumes that the censoring times of different individuals are independent of each other. This means that the occurrence or timing of censoring for one participant should not provide any information about the censoring times for other participants.

Survival probabilities do not change over time: The Kaplan-Meier curve assumes that the survival probabilities estimated at each time point remain constant over time. This assumption may not be valid if there are time-varying factors or treatments that can influence survival probabilities.

No competing risks: The Kaplan-Meier curve assumes that the event of interest is the only possible outcome and there are no other competing events that could prevent the occurrence of the event being studied. Competing events can include other causes of death or events that render the occurrence of the event of interest impossible.

2.3.4 The basics of Survival Analysis

library(knitr)
library(dplyr)
library(survival)
library(ggplot2)
library(tibble)

# devtools::install_github("zabore/ezfun")
ezfun::set_ccf_palette("contrast")
## <environment: R_GlobalEnv>

Original article

Survival data are time-to-event data that consist of a distinct start time and end time.

Examples from cancer:

  • Time from surgery to death
  • Time from start of treatment to progression
  • Time from response to recurrence

Time-to-event data are common in many other fields. Some other examples include:

  • Time from HIV infection to development of AIDS
  • Time to heart attack
  • Time to onset of substance abuse
  • Time to initiation of sexual activity
  • Time to machine malfunction

Because time-to-event data are common in many fields, it also goes by names besides survival analysis including:

  • Reliability analysis
  • Duration analysis
  • Event history analysis
  • Time-to-event analysis

A key feature of survival data is censoring.

Censoring occurs if a subject has not experienced the event of interest by the end of data collection.

A subject may be censored due to:

  • Loss to follow-up
  • Withdrawal from study
  • No event by end of fixed study period

Specifically, these are examples of right censoring.

Left censoring and interval censoring are also possible, and methods exist to analyze these types of data, but this tutorial will focus on right censoring.

To illustrate the impact of censoring, suppose we have follow-up data on 10 subjects:

How would we compute the proportion who are event-free at 10 years?

  • Subjects 6 and 7 were event-free at 10 years.
  • Subjects 2, 9, and 10 had the event before 10 years.
  • Subjects 1, 3, 4, 5, and 8 were censored before 10 years, so we don’t know whether they had the event or not at 10 years. But we know something about them - that they were each followed for a certain amount of time without the event of interest prior to being censored.

Survival analysis techniques provide a way to appropriately account for censored patients in the analysis.

# install.packages(c("lubridate", "ggsurvfit", "gtsummary", "tidycmprsk"))
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggsurvfit)
library(gtsummary)
library(tidycmprsk)
## 
## Attaching package: 'tidycmprsk'
## The following object is masked from 'package:gtsummary':
## 
##     trial
# devtools::install_github("zabore/condsurv")
library(condsurv)

The lung dataset

Throughout this section, we will use the lung dataset from the survival package as example data. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. We will focus on the following variables throughout this tutorial:

  • **time:** Observed survival time in days
  • **status:** Censoring status: 1=censored, 2=dead
  • **sex:** 1=Male, 2=Female

head(lung[, c("time", "status", "sex")])
##   time status sex
## 1  306      2   1
## 2  455      2   1
## 3 1010      1   1
## 4  210      2   1
## 5  883      2   1
## 6 1022      1   1

Note that the status is coded in a non-standard way in this dataset. Typically you will see 1=event, 0=censored. Let’s recode it to avoid confusion:

lung1 <- 
  lung %>% 
  mutate(
    status = recode(status, '1' = 0, '2' = 1)
  )

head(lung1[, c("time", "status", "sex")])
##   time status sex
## 1  306      1   1
## 2  455      1   1
## 3 1010      0   1
## 4  210      1   1
## 5  883      1   1
## 6 1022      0   1

Now we have:

  • time: Observed survival time in days
  • status: Censoring status: 0=censored, 1=dead
  • sex: 1=Male, 2=Female

Note: the Surv() function in the {survival} package accepts by default TRUE/FALSE, where TRUE is event and FALSE is censored; 1/0 where 1 is event and 0 is censored; or 2/1 where 2 is event and 1 is censored. Please take care to ensure the event indicator is properly formatted.
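As a quick check, here is a minimal sketch showing that the three codings accepted by Surv() produce the same result (an event at time 5 and a censored observation at time 10):

Surv(c(5, 10), c(TRUE, FALSE))   # TRUE = event, FALSE = censored
Surv(c(5, 10), c(1, 0))          # 1 = event, 0 = censored
Surv(c(5, 10), c(2, 1))          # 2 = event, 1 = censored
## each prints as: [1]  5  10+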

Calculating survival times

Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.

Let’s create a small example dataset with variables sx_date for surgery date and last_fup_date for the last follow-up date:

date_ex <- 
  tibble(
    sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"), 
    last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
    )

date_ex
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <chr>      <chr>        
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

We see these are both character variables, but we need them to be formatted as dates.

We will use the {lubridate} package to work with dates. In this case, we need to use the ymd() function to change the format, since the dates are currently character strings where the year comes first, followed by the month and then the day.

date_ex1 <-
  date_ex %>% 
  mutate(
    sx_date = ymd(sx_date), 
    last_fup_date = ymd(last_fup_date)
    )

date_ex1
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

Now that the dates are formatted, we need to calculate the difference between start and end dates in some units, usually months or years. Using the {lubridate} package, the operator %--% designates a time interval, which is then converted to the number of elapsed seconds using as.duration() and finally converted to years by dividing by dyears(1), which gives the number of seconds in a year.

date_ex2 <-
  date_ex1 %>% 
  mutate(
    observed_yrs = as.duration(sx_date %--% last_fup_date) / dyears(1)
    )

date_ex2
## # A tibble: 3 × 3
##   sx_date    last_fup_date observed_yrs
##   <date>     <date>               <dbl>
## 1 2007-06-22 2017-04-15            9.82
## 2 2004-02-13 2018-07-04           14.4 
## 3 2010-10-27 2016-10-31            6.01

2.3.4.1 Creating survival objects and curves

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.

Let's look at the data again:

lung[, c("time", "status")][1:5, ]
##   time status
## 1  306      2
## 2  455      2
## 3 1010      1
## 4  210      2
## 5  883      2

The Surv() function from the {survival} package creates a survival object for use as the response in a model formula. There will be one entry for each subject that is the survival time, which is followed by a + if the subject was censored.

Let’s look at the first 10 observations:

Surv(lung$time, lung$status)[1:10]
##  [1]  306   455  1010+  210   883  1022+  310   361   218   166

We see that subject 1 had an event at time 306 days, subject 2 had an event at time 455 days, subject 3 was censored at time 1010 days, etc.

The survfit() function creates survival curves using the Kaplan-Meier method based on a formula. Let’s generate the overall survival curve for the entire cohort, assign it to object s1, and look at the structure using str():

s1 <- survfit(Surv(time, status) ~ 1, data = lung)
str(s1)
## List of 16
##  $ n        : int 228
##  $ time     : num [1:186] 5 11 12 13 15 26 30 31 53 54 ...
##  $ n.risk   : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...
##  $ n.event  : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...
##  $ n.censor : num [1:186] 0 0 0 0 0 0 0 0 0 0 ...
##  $ surv     : num [1:186] 0.996 0.982 0.978 0.969 0.965 ...
##  $ std.err  : num [1:186] 0.0044 0.00885 0.00992 0.01179 0.01263 ...
##  $ cumhaz   : num [1:186] 0.00439 0.0176 0.02207 0.03103 0.03556 ...
##  $ std.chaz : num [1:186] 0.00439 0.0088 0.00987 0.01173 0.01257 ...
##  $ type     : chr "right"
##  $ logse    : logi TRUE
##  $ conf.int : num 0.95
##  $ conf.type: chr "log"
##  $ lower    : num [1:186] 0.987 0.966 0.959 0.947 0.941 ...
##  $ upper    : num [1:186] 1 1 0.997 0.992 0.989 ...
##  $ call     : language survfit(formula = Surv(time, status) ~ 1, data = lung)
##  - attr(*, "class")= chr "survfit"

Key components of this survfit object include:

  • n: There are 228 subjects in the data.
  • time: The distinct event and censoring time points.
  • n.risk: The number of cases at risk just before each time point, i.e. those that have survived up to that time plus those with an event at exactly that time.
  • n.event: The number of events that occurred at each time point.

2.3.4.2 Kaplan-Meier plots/Curves

The Kaplan-Meier curve graphically represents the survival rate, or survival function.

We will use the {ggsurvfit} package to generate Kaplan-Meier plots.

This package aims to ease plotting of time-to-event endpoints using the power of the {ggplot2} package. See http://www.danieldsjoberg.com/ggsurvfit/index.html for details.

Note: alternatively, survival plots can be created using base R or the {survminer} package.

The {ggsurvfit} package works best if you create the survfit object using the included ggsurvfit::survfit2() function, which uses the same syntax as the survival::survfit() call we saw previously.

ggsurvfit::survfit2() tracks the environment from the function call, which allows the plot to have better default values for labeling and p-value reporting.

survfit2(Surv(time, status) ~ 1, data = lung) %>% 
  ggsurvfit() +
  labs(
    x = "Days",
    y = "Overall survival probability"
  )

The default plot in ggsurvfit() shows the step function only.

We can add the confidence interval using add_confidence_interval():

survfit2(Surv(time, status) ~ 1, data = lung) %>% 
  ggsurvfit() +
  labs(
    x = "Days",
    y = "Overall survival probability"
  ) + 
  add_confidence_interval()

Typically we will also want to see the numbers at risk in a table below the x-axis.

We can add this using add_risktable():

survfit2(Surv(time, status) ~ 1, data = lung) %>% 
  ggsurvfit() +
  labs(
    x = "Days",
    y = "Overall survival probability"
    ) + 
  add_confidence_interval() +
  add_risktable()

Plots can be customized using many standard {ggplot2} options.

2.3.4.3 Estimating x-year survival

One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number of years, x.

For example, to estimate the probability of surviving to 1 year, use summary() with the times argument. (Note: the time variable in the lung data is in days, so we need to use times = 365.25.)

summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   365     65     121    0.409  0.0358        0.345        0.486

We find that the 1-year probability of survival in this study is 41%.

The associated lower and upper bounds of the 95% confidence interval are also displayed.

The 1-year survival probability is the point on the y-axis that corresponds to 1 year on the x-axis for the survival curve.

What happens if you use a “naive” estimate? Here “naive” means that the patients who were censored prior to 1-year are considered event-free and included in the denominator.

121 of the 228 patients in the lung data died by 1 year so the “naive” estimate is calculated as:

\[\left(1 - \frac{121}{228}\right) \times 100 = 47\%\]

You get an incorrect estimate of the 1-year probability of survival when you ignore the fact that 42 patients were censored before 1-year.

Recall the correct estimate of the 1-year probability of survival, accounting for censoring using the Kaplan-Meier method, was 41%.

Ignoring censoring leads to an overestimate of the overall survival probability.

Imagine two studies, each with 228 subjects. There are 165 deaths in each study. Censoring is ignored in one (blue line), censoring is accounted for in the other (yellow line).

The censored subjects only contribute information for a portion of the follow-up time, and then fall out of the risk set, thus pulling down the cumulative probability of survival. Ignoring censoring erroneously treats patients who are censored as part of the risk set for the entire follow-up period.
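A minimal sketch contrasting the two estimates on the lung data (the Kaplan-Meier value matches the summary() output shown above):

# "Naive" 1-year survival: censored patients counted as event-free
1 - 121 / 228
## [1] 0.4692982

# Kaplan-Meier 1-year survival, accounting for censoring
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)$surv
## about 0.409, versus the "naive" 0.469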

We can produce nice tables of x-time survival probability estimates using the tbl_survfit() function from the {gtsummary} package:

survfit(Surv(time, status) ~ 1, data = lung) %>% 
  tbl_survfit(
    times = 365.25,
    label_header = "**1-year survival (95% CI)**"
  )
Characteristic    1-year survival (95% CI)
Overall           41% (34%, 49%)

2.3.4.4 Estimating median survival time

Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.

We can obtain the median survival directly from the survfit object:

survfit(Surv(time, status) ~ 1, data = lung)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 228    165    310     285     363

We see that the median survival time is 310 days. The lower and upper bounds of the 95% confidence interval are also displayed.

Median survival is the time corresponding to a survival probability of 0.5:
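The median (and other quantiles) can also be extracted directly from the survfit object; a minimal sketch:

# Returns the 50% quantile of the survival curve (310 days) with its 95% CI (285, 363)
quantile(survfit(Surv(time, status) ~ 1, data = lung), probs = 0.5)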

What happens if you use a “naive” estimate? Here “naive” means that you exclude the censored patients from the calculation entirely to estimate median survival time among the patients who have had the event.

Summarize the median survival time among the 165 patients who died:

lung1 %>% 
  filter(status == 1) %>% 
  summarize(median_surv = median(time))
##   median_surv
## 1         284

You get an incorrect estimate of median survival time of 284 days when you ignore the fact that censored patients also contribute follow-up time.

Recall the correct estimate of median survival time is 310 days.

Ignoring censoring will lead to an underestimate of median survival time because the follow-up time that censored patients contribute is excluded (blue line). The true survival curve accounting for censoring in the lung data is shown in yellow for comparison.

We can produce nice tables of median survival time estimates using the tbl_survfit() function from the {gtsummary} package:

survfit(Surv(time, status) ~ 1, data = lung) %>% 
  tbl_survfit(
    probs = 0.5,
    label_header = "**Median survival (95% CI)**"
  )
Characteristic    Median survival (95% CI)
Overall           310 (285, 363)

2.3.4.5 Comparing survival times between groups

We can conduct between-group significance tests using a log-rank test.

The log-rank test equally weights observations over the entire follow-up time and is the most common way to compare survival times between groups. There are versions that more heavily weight the early or late follow-up that could be more appropriate depending on the research question (see ?survdiff for different test options).

We get the log-rank p-value using the survdiff function. For example, we can test whether there was a difference in survival time according to sex in the lung data:

survdiff(Surv(time, status) ~ sex, data = lung)
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

We see that there was a significant difference in overall survival according to sex in the lung data, with a p-value of p = 0.001.
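As mentioned above, weighted variants of the test are available through the rho argument of survdiff(); for example, rho = 1 gives the Peto & Peto modification, which weights earlier follow-up more heavily. A minimal sketch:

# Peto & Peto modified test (rho = 1); rho = 0 (the default) is the standard log-rank test
survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)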

require("survival")
library("survminer")

fit <- survfit(Surv(time, status) ~ sex, data = lung)

ggsurvplot(fit, data = lung)

ggsurvplot(
  fit,
  data = lung,
  size = 1,                 # change line size
  palette =
    c("#E7B800", "#2E9FDF"),# custom color palettes
  conf.int = TRUE,          # Add confidence interval
  pval = TRUE,              # Add p-value
  risk.table = TRUE,        # Add risk table
  risk.table.col = "strata",# Risk table color by groups
  legend.labs =
    c("Male", "Female"),    # Change legend labels
  risk.table.height = 0.25, # Useful to change when you have multiple groups
  ggtheme = theme_bw()      # Change ggplot2 theme
)

2.3.4.6 The Cox regression model

We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables.

The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.

\[h(t|X_i) = h_0(t)\exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]

  • \(h(t|X_i)\): the hazard, or the instantaneous rate at which events occur
  • \(h_0(t)\): the underlying baseline hazard

Some key assumptions of the model:

  • Non-informative censoring
  • Proportional hazards

Note: parametric regression models for survival outcomes are also available, but they won’t be addressed here.

We can fit regression models for survival data using the coxph() function from the {survival} package, which takes a Surv() object on the left hand side and has standard syntax for regression formulas in R on the right hand side.

coxph(Surv(time, status) ~ sex, data = lung)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165

We can obtain tables of results using the tbl_regression() function from the {gtsummary} package, with the option to exponentiate set to TRUE to return the hazard ratio rather than the log hazard ratio:

coxph(Surv(time, status) ~ sex, data = lung) %>% 
  tbl_regression(exp = TRUE)
Characteristic    HR¹     95% CI¹       p-value
sex               0.59    0.42, 0.82    0.001

¹ HR = Hazard Ratio, CI = Confidence Interval

The quantity of interest from a Cox regression model is a hazard ratio (HR). The HR represents the ratio of hazards between two groups at any particular point in time.

The HR is interpreted as the instantaneous rate of occurrence of the event of interest among those who are still at risk for the event. It is not a risk, though it is commonly misinterpreted as such. If you have a regression parameter \(\beta\), then \(HR = \exp(\beta)\).

An HR < 1 indicates a reduced hazard of death, whereas an HR > 1 indicates an increased hazard of death.

So the HR = 0.59 implies that 0.59 times as many females are dying as males, at any given time. Stated differently, females have a significantly lower hazard of death than males in these data.
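A minimal sketch of that relationship using the model fitted above:

fit_sex <- coxph(Surv(time, status) ~ sex, data = lung)
coef(fit_sex)        # log hazard ratio, about -0.531
exp(coef(fit_sex))   # hazard ratio, about 0.588 (the exp(coef) column above)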

2.3.5 Prosper Loan data

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
# web <- "https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv"
# loan <- fread(web)

# head(loan)[, c(51, 65, 6, 7, 19, 18, 50)]

2.3.6 Customer Churn

It usually costs more to acquire a customer than it does to retain a customer.

Focusing on customer retention enables companies to maximize customer revenue over their lifetime.

Churn models, however, are seldom framed optimally, because they rely on binary classification flags (churn: yes or no). Churn classification models do not tell WHEN a customer is likely to leave; they only indicate that it is going to happen within a certain number of days or months.

In the churn classification model, we don’t usually account for the differences in time.

It is probably a mistake to treat a customer who is at risk of leaving in 40 days the same as a customer who remains for over 100 days, yet traditional churn modeling does not make this differentiation.

The only point in time here is the “within 40 days” threshold. Because the model fails to account for time, we have no clear idea at what point a marketing intervention is needed, and this leads to preventable customer attrition.

2.3.6.1 Re-framing the Problem to Know When

Rather than using a binary classifier, we are going to re-frame the problem as a time-dependent one. This enables us to intervene at the right time to stop customer attrition before it happens.

No longer relying on thresholds, we now treat churn as a continuous, time-conditioned event. As the graph below shows, we now know when attrition risk is most likely to occur.

Time is no longer held constant; we track risk over time to determine when a marketing intervention is needed to retain the customer.

If we model both the time and the event, the right moment to intervene and prevent attrition becomes apparent. A modeling technique called survival analysis allows us to do this, and with the advent of modern machine learning it is now a trivial task.

# %reload_ext autoreload
# %autoreload 2
# %matplotlib inline

import xgboost as xgb
import shap
import sksurv.metrics as surv_metrics
from sksurv.datasets import get_x_y
from lifelines import KaplanMeierFitter
from lifelines.plotting import plot_lifetimes
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.exceptions import DataConversionWarning
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler



plt.rcParams['figure.figsize'] = [7.2, 4.8]
pd.set_option("display.float_format", lambda x: "%.4f" % x)

sns.set_style('darkgrid')

SEED = 123
df = pd.read_csv("../../data/churn.txt")
# denoting churn and duration
df["event"] = np.where(df["churn?"] == "False.", 0, 1) 
df = df.rename(columns={"account_length": "duration"})
del df['churn?']
df = df.dropna()
df = df.drop_duplicates()
df.head()
print("Total Records:",df.shape[0],"\n")
print("Percent Churn Rate:",df.event.mean())
print("")
print("Duration Intervals")
print(df['duration'].describe())

For survival models, the data differ from those of a traditional classification problem and require:

  • A censor — for our purposes, these are customers who have yet to churn. Read about right censoring here.
  • Duration — the duration or time t of the customer's activity. In this case, it is Account Length in days.
  • Event — the binary target; in this case, whether they terminated their phone plan, marked by Churn?

ax = plot_lifetimes(df.head(10)['duration'], df.head(10)['event'])
_ = ax.set_xlabel("Duration: Account Length (days)")
_ = ax.set_ylabel("Customer Number")
_ = ax.set_title("Observed Customer Attrition")

In the above plot, the red lines indicate customers who have left, with the dots marking the specific point in time.

Blue lines are customers who are still active up to the time measured on the x-axis (Duration).

Here we see that customer number 8 did not churn until 195 days, with customers 0 and 4 leaving at 163 and 146 days respectively. All other customers are still active.

Notice how all customers are on the same time scale because the data are analytically aligned: each customer may have joined at a different time, but their durations are measured from a common starting point of day zero.

This is what allowed us to right-censor the data on the churn event. Real world data needs both censoring and aligning before modeling can begin.

2.3.6.2 The Risk of Churn

A more informative approach is to estimate the survival function, i.e. the time in days a customer has until they churn. For this purpose, we will use a Kaplan-Meier estimator to calculate how long until attrition occurs. The estimator is defined as:

\[\hat{S}(t) = \prod_{t_i \le t}\left(1 - \frac{d_i}{n_i}\right)\]

where \(d_i\) is the number of churn events at time \(t_i\) and \(n_i\) is the number of customers at risk of churn just prior to time \(t_i\).

We will use the Python package lifelines to plot the survival function, as this function is a component of the final churn model.

kmf = KaplanMeierFitter()
kmf.fit(df['duration'], event_observed=df['event'])
kmf.plot_survival_function()

_ = plt.title('Survival Function for Telco Churn')
_ = plt.xlabel("Duration: Account Length (days)")
_ = plt.ylabel("Churn Risk (Percent Churned)")
_ = plt.axvline(x=kmf.median_survival_time_, color='r', linestyle='--')

Let's look at the median survival time: the point by which half of the customers have churned. According to the graph, marked by the red dotted line, about half of the customers churn by roughly 152 days.

This is helpful because it gives an overall baseline for when intervention is needed. For each individual customer, however, it is uninformative; what is missing is the point in time at which churn risk is highest for each customer.

For that, we will build a model using Cox's proportional hazards, which uses a log-risk function \(h(x)\). The hazard function is conditioned on the customer having remained until time t or later, which allows us to estimate the risk of churn over time.

This will enable us to score each customer and anticipate when a marketing intervention is needed. However, before we proceed to that, we need to preprocess the data.

2.3.6.3 Data Splitting and Preprocessing

First we will split the data into training and testing sets. We will use the test set as the validation set for this example; in practice, you would want all three splits so that you do not tune to the set used for final evaluation.

Next, we take the numeric features and categorical features and preprocess them for downstream modeling.

In the case of categorical features, we first impute missing values with a constant and then one-hot encode them. In the case of numeric features, we fill missing values with the median and then standardize them. This is all wrapped into scikit-learn's Pipeline and ColumnTransformer for simplicity's sake. As part of the churn pipeline, all of these steps are included, with the final preprocessor saved for use at inference time.

2.3.7 Deploying a Scalable End to End Customer Churn Prediction Solution with AWS

Source

Wouldn’t it be great if you could hold onto customers longer, maximizing their lifetime revenue?

In this blog post, you will deploy an End to End Customer Churn Prediction solution using AWS services.

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkFiles
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import VectorAssembler
my_appname = "LogisticRegression with PySpark MLlib"
spark = SparkSession.builder.appName(my_appname).getOrCreate()
# Load data
url = "https://raw.githubusercontent.com/pkmklong/Breast-Cancer-Wisconsin-Diagnostic-DataSet/master/data.csv"
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("data.csv"), header=True, inferSchema=True)

# df.show()
# Rename columns and map diagnosis to label
columns = ['id', 'diagnosis'] + [f'feature_{i}' for i in range(1, 32)]
data = df.toDF(*columns)
data = data.withColumn("label", (data["diagnosis"] == "M").cast("integer")).drop("diagnosis")
# Assemble features into a single vector
feature_columns = [f'feature_{i}' for i in range(1, 25)]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(data)
# Split data into training and test sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

# Build the Logistic Regression model
logistic_regression = LogisticRegression(featuresCol="features", labelCol="label")
model = logistic_regression.fit(train_data)
# Evaluate the model on test data
predictions = model.transform(test_data)
# AUC-ROC
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
auc = evaluator.evaluate(predictions)
# Accuracy, Precision, and Recall
multi_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "accuracy"})
precision = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "weightedPrecision"})
recall = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "weightedRecall"})

print(f"AUC-ROC: {auc:.4f}")
## AUC-ROC: 0.9989
print(f"Accuracy: {accuracy:.4f}")
## Accuracy: 0.9651
print(f"Precision: {precision:.4f}")
## Precision: 0.9653
print(f"Recall: {recall:.4f}")
## Recall: 0.9651