Predictive Analytics & Marketing – The Next Stage of Business Optimization
You must have heard of something called ‘predictive analytics‘ at some point in your life as a marketer/analyst.
The term ‘predictive analysis’ is so heavy that we have conferences (like Predictive Analytics World) organized around it all over the world.
What is Predictive Analytics?
Predictive analytics is an area of statistical analysis in which we extract information from data and then analyze it to identify patterns and make predictions about future outcomes/events.
Both trend analysis and regression analysis are a subset of predictive analysis.
In trend analysis, we extract information from data and then analyze the information to identify patterns in that information.
In regression analysis, we make predictions about future outcomes/events by estimating the relationship between two or more variables.
Regression analysis is based on the idea that what has happened in the past can give us an idea of what can happen in the future.
Predictive analytics can be applied to a wide range of industries like sports, real estate, insurance, marketing, etc.
What is Predictive Marketing?
Predictive marketing is the application of predictive analytics in marketing. It can help you answer the following questions:
What my sales will likely to be for the second half of the year?
Where I should invest my money and resources to get the highest possible ROI?
Where current business operations and marketing practices will take me?
Which customers are most likely to respond to our new social media campaign?
….. and the questions can be endless.
My article is an attempt to unravel the mysteries of the techniques used in ‘Predictive Marketing’ and bring statistics into marketing for the greater good.
At the time of writing this article, I know only a handful of marketers who sell ‘predictive marketing’ as a service.
The reason behind this is simple.
Predictive marketing is not everybody’s cup of tea.
It is too difficult to learn and equally difficult to implement.
This is the main reason for writing this article. I want to make these techniques a bit easier for you to learn and implement.
I want to be honest with you. There is a steep learning curve, and this is the reason why this post is so long.
In order to learn predictive marketing techniques, you need to climb a pyramid in which each step is progressively more difficult than the last one:
But once you have climbed all the steps and reached the top, you can actually make predictions that are statistically accurate.
All you have to do is to stick with me until the very end of this post.
I know it is a long journey.
It took me almost 1, 1/2 month to write this post.
So it will probably take you a couple of days to absorb all the material in this post.
But there is no hurry. Take your time.
If you can read and absorb all the material in one go, then you are just too smart and hats off to you.
I spent an awful lot of time deciding how to present the course material so that you can quickly learn the basic predictive marketing techniques and apply them to your job/work straight away.
In order to learn predictive marketing you need to know or learn ‘linear algebra’ first followed by ‘linear relationship’, ‘correlation’, ‘trends’ and ‘regression’.
Once you know all these concepts, only then can you proceed to make predictions in a scientific way.
Feel free to skip the section(s) with which you think you are familiar.
But remember that each section is built on the knowledge acquired through the previous section.
So if you miss something important in one section, it may cost you dearly in the subsequent sections. I have warned you.
I won’t go down into great details but will just present to you the bare minimum you need to know in order to get you going with predictive marketing techniques.
Hopefully, my article will build a solid foundation for you, and you can then dive deeper on your own depending upon your passion level and current situation (marketer, analyst, student, etc).
Get weekly practical tips on GA4 and/or BigQuery to accurately track and read your analytics data.
Linear Algebra
Slope (Gradient) of a straight line
The slope of a straight line defines how steep the line is.
Slope = change in vertical distance/change in horizontal distance
Note: Change in height is also called ‘rise’ or ‘fall’. Whereas change in horizontal distance is also called ‘run’.
The slope of this line =3/3 =1
Note: Large slope means a steeper line. For example:
Here the slope of line = 4/2 = 2
A slope can be positive, negative, zero, or undefined.
Positive Slope – A slope is positive if both change in horizontal distance and vertical distance is positive. For example:
Here the slope of line = 4/2 = 2
Negative Slope – A slope is negative if the change in horizontal distance is negative or the change in vertical distance is negative. For example:
Here slope of the line = -4/2 = -2
Zero Slope – A slope is zero if the change in vertical distance is zero. For example:Here slope of the line = 0/2 =0
Undefined Slope – A slope is undefined if the change in horizontal distance is zero. For example:Here slope of the line = 4/0 = undefined
Y Intercept of a Straight Line
The y-intercept of a straight line is the point where a line crosses the y-axis. For example:
In the diagram, the line crosses the y-axis at 1. So here, the y-intercept is 1.
Equation of a straight line
The equation of a straight line is: Y=mX+b
Here, ‘m’ is the slope of the straight line, and ‘b’ is the y-intercept.
For example, let us calculate the equation of the following straight line:
Slope = 2/2 =2. So m =1
Y intercept = 1. So b= 1
So the equation of the straight line is: y=x + 1
Linear Relationship
If a linear relationship exist between two variables then increase or decrease in the value of one variable will cause corresponding increase or decrease in the value of other variable.
This type of relationship exists between dependent and independent variables.
For example:
Suppose you want to forecast sales for your company, and you have concluded that your sales go up and down depending on changes in average order value (AOV).
The sales you are forecasting would be the dependent variable because its value “depends” on the value of AOV, and the AOV would be the independent variable.
Independent variable is generally denoted by the letter ‘x’. Whereas dependant variable is generally denoted by the letter ‘y’.
In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect.
You can express linear relationship between ‘x’ and ‘y’ via equation of the straight line: y=mx+b
Here, ‘m’ is the slope of the straight line, and ‘b’ is the y-intercept.
You can also express a linear relationship in a graphical format via an Excel scatter chart like this:
Note: In regression analysis, we predict the value of the dependent variable from one or more independent variables.
How you can determine whether the relationship between two variables is linear or not?
There are two methods to determine a linear relationship between two variables.
One is through Excel Scatter Chart, and one is through a residual plot.
Excel Scatter Chart
Step-1: To determine how two variables are related, graph the data points using Excel ‘scatter chart ‘so that the independent variable is on the x-axis and the dependent variable is on the ‘y’ axis.
Make sure that the column of data that you want to display on the x-axis is located to the left of the column of data you want to display on the y-axis in your excel table.
Something like this:
Step-2: Once you have plotted the data points. You will see a graph, something like Fig.1 above. If you look at the data points in Fig.1, you can see that they resemble a straight line.
I have added a trendline to make this look more obvious in Fig.2.
The trendlines are used to graphically display trends in data.
To add a trendline to your chart, follow the steps below in MS excel 2013:
- Select the chart and then click on the Design tab in the Excel ribbon.
- Click on the ‘Add Chart Element’ button and then select Trendline > ‘Linear’
Note: A trendline is also known as a ‘regression line’ or ‘least square line’
Residual Plot
A residual (denoted by ‘e’) is the difference between the observed value and predicted value of a dependent variable in regression analysis. Each data point has one residual.
Residual (e) = Observed value – Predicted value
For example:
Note: Residuals are also called ‘errors’ or ‘Regression Residuals’
In the table above, column F contains all the observed values of the dependent variable ‘y’.
Column G contains all the predicted values of the dependent variable ‘y’.
Column H contains the residuals.
Note: Both the sum and the mean of the residuals are equal to zero.
A residual plot is a graph that shows all the residuals on the vertical axis and the independent variable on the horizontal axis.
If the points in a residual plot are randomly dispersed around the horizontal axis, then the relationship between two variables is linear otherwise, it is non-linear.
For example, the following residual plot shows a linear relationship as the data points are randomly dispersed around the horizontal axis:
Now, look at the following residual plots:
Here the first residual plot shows a linear relationship. The other two plots show a non-linear relationship as the data points are not randomly dispersed around the horizontal axis.
Note(1): The sum of the residuals is always zero, whether the data set is linear or nonlinear.
Note(2): You need to conduct a regression analysis in order to create residual plots. I will teach you later in the post how to create a residual plot.
Correlation
Correlation is a statistical measurement of relationship between two variables. Correlation measures how two variables change in relation to each other.
The formula used to calculate the relationship between two variables is called covariance:
Covariance measures the strength and direction of the relationship between two variables. It is a measure of how much two variables change together.
Let us suppose ‘x’ and ‘y’ are two variables.
If as ‘x’ goes up, ‘y’ goes up then covariance would be positive.
However, if as ‘x’ goes up, ‘y’ goes down, then covariance would be negative.
The problem with calculating covariance is that the number you get from the calculation is hard to interpret.
This happens because covariance is not normalized (statistically adjusted).
Correlation Coefficient
The correlation coefficient is the normalized version of covariance and is calculated as:
Here we are simply dividing the covariance by the product of the standard deviation of the two variables.
Note: The correlation coefficient is not useful if the relationship between variables is non-linear.
Types of Correlation Coefficient
The most common correlation coefficient is ‘The Pearson product-moment correlation coefficient’ which measures the strength and direction of the linear relationship (a relationship which resembles a straight line) between variables.
When we speak simply of a correlation coefficient, we are generally referring to the ‘Pearson product-moment correlation’.
‘Spearman’s Rank Correlation’ and ‘Kendall Rank Correlation’ are some other types of correlations.
The correlation coefficient of a sample (sample is set of observations drawn from a population) is denoted by ‘r’.
The correlation coefficient of a population (population is total set of observations) is denoted by ‘R’.
The value of the correlation coefficient can range from +1 to -1.
The sign (positive or negative) of the correlation coefficient describes the direction of the relationship between two variables.
Whereas the absolute value of the correlation coefficient describes the strength of the relationship between two variables.
Correlation of +1 means that the variables ‘x’ and ‘y’ have strongest positive linear relationship i.e. when ‘x’ is larger than average, ‘y’ tends to be larger than average. Similarly when ‘x’ is smaller than average, ‘y’ tends to be smaller than average.
Correlation of -1 means that the variables ‘x’ and ‘y’ have strongest negative linear relationship i.e. when ‘x’ is larger than average, ‘y’ tends to be smaller than average. Similarly when ‘x’ is smaller than average, ‘y’ tends to be larger than average.
Correlation of 0 means that no linear relationship exist between the two variables. Keep in mind that the ‘Pearson product-moment correlation coefficient’ only measures linear relationships.
Therefore, a correlation coefficient of 0 does not mean zero relationships exist between two variables.
It simply means no linear relationship exists between the two variables.
It is possible for two variables to have curvilinear relationship (it is a non linear relationship which doesn’t resemble a straight line) instead of linear relationship.
Note: A correlation greater than 0.8 is generally considered a strong correlation.Whereas a correlation less than 0.5 is generally considered a weak correlation. However, these values can vary based on the type of data being analysed.
Now consider the following scatter charts:
We can conclude the following facts from the above scatter charts:
1. In the case of a positive correlation, the slope of the straight line is positive. Whereas in the case of negative correlation, the slope of the straight line is negative.
2. In the case of the strongest positive correlation (r=1) or strongest negative correlation (r=-1) the data points form an exact straight line.
3. The correlation becomes weaker as the data points become more scattered. So in the case of zero correlation, the data points are completely scattered (in a random manner).
Outliers effect correlation
Outliers are extreme values that differ greatly from other values.
In regression analysis, the data points that diverge greatly from the overall pattern of the data are called outliers.
Note: If the values of the two variables are unreliable (because of measurement error or other errors), then the correlation coefficient could be lower than expected.
Correlation Coefficient and Predictive Power
A correlation coefficient (R) is a good indicator of the predictive relationship between two variables.
For example:
If the value of R is zero then the dependent variable cannot be predicted from the independent variable as the relationship between two variables is non-linear.
If the value of R is 1 or -1 then the dependent variable can be predicted without any error from the independent variable.
The value of R between 0 and 1 or 0 or -1 measures the extent to which the dependant variable is predictable from the independent variable.
For example, an R of 0.50 means that 50% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 50% variation in ‘y’ can’t be explained.
Similarly, an R of 0.20 means that 20% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 80% variation in ‘y’ can’t be explained.
Therefore it is important that you determine the correlation between two variables before you run a regression analysis.
If there is no linear relationship or weak linear relationship between two variables, or in other words, the correlation between the two variables is zero or weak, then such a relationship is not good to predict anything.
So there is no point running a regression analysis, then.
Coefficient of Determination or R-Squared Value
1st Definition – The coefficient of determination (denoted by R2 or R-Squared value) is a number from 0 to 1 that describes how well a regression line (i.e. trendline) fits a data set.
A trend line is most reliable/accurate when its R2 value is 1 or near 1.
2nd Definition – The coefficient of determination (denoted by R2) is a number from 0 to 1 that measures the percentage of variation in the dependent variable that is predictable from the independent variable(s).
3rd Definition – The coefficient of determination (denoted by R2) is the square of the correlation coefficient (R).
4th Definition – The coefficient of determination allows to interpret the relationship between two variables in terms of variation.
5th Definition – The coefficient of determination is the ratio of ‘explained variation’ to the total variation.
If the value of R2 is zero then the dependent variable cannot be predicted from the independent variable.
If the value of R2 is 1 then the dependent variable can be predicted without any error from the independent variable.
The value of R2 between 0 and 1 measures the extent to which the dependant variable is predictable from the independent variable.
For example, an R2 of 0.50 means that 50% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 50% variation in ‘y’ can’t be explained.
Similarly, an R2 of 0.20 means that 20% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 80% variation in ‘y’ can’t be explained.
For example, in the chart below, 62% of the variation in sales can be explained by the variation in AOV. The remaining 38% (100-62) variation in sales can be explained by other factors:
The other factors that can explain the variation in sales can be determined by running multiple regression analyses.
Note: Coefficient of determination is a key output of regression analysis.
How to Display R-Squared Value on Your Chart
To display this value on your chart, follow the steps below:
Step-1: In MS Excel 2013, right-click on the trendline and then select ‘Format Trendline’ from the drop-down menu. This will open up the ‘Format Trendline’ box on the right-hand side of the excel window.
Step-2: In the ‘Format Trendline’ box, select the checkbox ‘Display R-Squared Value on chart’:
You can now see the R-Squared value in your chart like the one below:
Note: When you add a trendline to a chart, excel automatically calculates the R-squared value of the trendline.
Methods to Find a Correlation Between Two Sets of Data
There are three methods of finding a correlation between two sets of data in Excel:
1. Through Analysis Tool Pack
2. By using CORREL function
3. By using the R2 Value
Using Analysis Tool Pack for Finding a Correlation Between Two Data Sets
The ‘Analysis tool pack’ is an excel add-in through which you can do complex statistical analysis (like multiple regressions) in Excel. It is available in Excel 2007 and above versions.
To install Analysis Tool Pack in Excel 2013, follow the steps below:
Step-1: Open MS Excel and then click on the File Tab.
Step-2: Go to Options > Excel Options > Add-ins
Step-3: Select the ‘Analysis ToolPak’ option and then click on the ‘go’ button:
Step-4: Select the ‘Analysis ToolPak’ checkbox and then click on the ‘ok’ button:
Once the ‘Analysis ToolPak’ is installed, you will see the new ‘Data Analysis’ button on the extreme right under the ‘Data’ tab:
Now click on the ‘Data Analysis’ button, select ‘correlation’ from the list and then click on the ‘ok’ button:
Select the ‘input range’ (the two columns in this case) and output range (where you want to display the results), and then click on the ok button:
You can now see the correlation between two sets of data:
Here the correlation between AOV and Sales is 0.792282
Using CORREL Function for Finding Correlation Between Two Data Sets
Through this Excel function, you can calculate the correlation between two data sets without using the ‘Analysis Tool Pack’.
Syntax: CORREL (array1, array2)
For example:
Here the correlation between AOV and Sales is 0.792282. This is the same value we got when we used the ‘Analysis Tool Pack’ to calculate the correlation between two data sets.
Using the R2 Value for Finding a Correlation Between Two Data Sets
The correlation between two data sets can also be defined as the square root of the R2 value:
You can calculate the R2 value by using the RSQ excel function.
Syntax: RSQ (known_y’s, known_x’s)
Here known_y’s are the values of the dependent variable (in our case, ‘Sales’), and known_x’s are the values of the independent variable (in our case ‘AOV’). For example:
Once you have determined the R2 value, then take its square root via the excel function SQRT to determine the correlation between two data sets.
Note: The correlation determined through the r2 value matches the correlation determined through the CORREL function and the correlation determined through the ‘Analysis Tool Pack’.
Trend Analysis Basics
A trend is a movement in a particular direction.
We do trend analysis for the following main reasons:
#1 Measure the performance of a marketing campaign over a period of time. Through trend analysis, you can identify areas where your campaign is performing well (so that you replicate success) and the areas where your campaign is underperforming.
#2 Trend analysis help in getting answers to questions like:
- What are my top selling products?
- What are my top converting keywords?
- Which keywords I should bid on?
- Where I should invest my money and resources to get highest possible ROI?
- Which are the most effective marketing channels in terms of conversions and revenue?
#3 Trend analysis help in determining where current business operations and marketing practices could take you. Negative trends act as a warning of potential problems with current practices.
#4 Trend analysis can help in refining business decisions and strategies. We develop strategies that respond to the identified trends. Moving with a positive trend and not against them can lead to more profit.
#5 Trend analysis can help in identifying risks and opportunities.
#6 Through trend analysis, we can predict future outcomes and events, like what the sales will likely be for the second half of the year.
Importance of Accurately Analyzing and Interpreting Data Trends
How we analyze and interpret the ‘data trends play a very important role in optimizing our marketing campaigns and making predictions about future outcomes.
One wrong interpretation, and we can end up losing a hundred thousand pounds (depending upon the size of your business).
Important Points to Keep in Mind while Analyzing, Interpreting and Reporting Data Trends
#1 Always question how the data is collected
#2 Understand that historical data is in fact “dated”
#3 Select the right time period to analyse your data trends
#4 Add comparison to your data trends
#5 Never report standalone metric in your data trends
#6 Segment your data before you analyse/report data trends
#7 Report something business bottomline Impacting
#8 Make the insight obvious
#9 Use sparklines and conditional formatting.
#10 Don’t jump to conclusions
#11 Analysing single trend is rarely useful.
Related Post: How to analyze, interpret and report data trends in Google Analytics
Types of Trend
1 Uptrend – moving higher
2 Downtrend – moving lower
3 Sideways/horizontal trend – moving sideways.
Note: Sideways trend is technically not a trend. It is a lack of well-defined trends in either direction.
Trend Length
A trend can be short (or seasonal), intermediate, or long-term. The longer the trend more significant it is.
For example, a 3 months trend is not as significant as 3 years trend.
Regression Analysis
Regression analysis is a statistical technique which is used to estimate relationship between a dependant variable and one or more independent variables.
In regression analysis, we try to predict the value of a dependent variable (y) from one or more independent variables (x1,x2,x3,…)
When we use one independent variable to predict the value of a dependent variable then such type of regression is known as linear regression.
The analysis, which is done in linear regression, is known as the linear regression analysis.
When we use two or more independent variables to predict the value of a dependent variable, then such type of regression is known as multiple regression.
The analysis, which is done in multiple regression, is known as the multiple regression analysis.
Equation used to calculate linear regression
Y = a + bX
Here ‘y’ is the dependant variable (the variable whose value we are trying to predict)
‘X’ is the independent variable (the variable whose value we are using to predict the value of ‘y’)
‘a’ is the ‘y-intercept’
‘b’ is the slope of the regression line
Equation used to calculate multiple regression
Y = a + b1X1+ b2X2 + b3X3 +…..bnXn
Here ‘y’ is the dependant variable (the variable whose value we are trying to predict)
‘X1’, ‘X2’, ‘X3’….’Xn’ are the independent variable (the variables whose values we are using to predict the value of ‘y’)
‘a’ is the ‘y-intercept’
‘b1’,’b2’,’b3’…..’bn’ are the slope of the regression lines
Note: You can use up to 15 independent variables in excel to do multiple regression analysis.
By using regression analysis, you can extend a trendline in a chart beyond the actual data to predict future values of the dependent variable.
For example, the following chart uses a trendline that is forecasting four months ahead and is clearly showing a trend of increase in sales.
However, for regression analysis to work, you need to select the right trendline for your data in the first place.
Choosing the Best Trendline for Your Data
Choosing the right trendline for your data is very important. If you applied the wrong trendline, then your predictions/forecast won’t be accurate.
The R2 value can help you in choosing the best trendline for your data.
There are six types of trendlines (or regression lines) available in Excel:
1. Linear Trendline
2. Logarithmic Trendline
3. Polynomial Trendline
4. Moving Average Trendline
5. Power Trendline
6. Exponential Trendline
Add each of the trendlines in turns and note down its R2 value. The trendline whose R2 value is highest is the best trendline to use.
Linear Trendline
Use this trendline if your data set is linear (resembles a straight line) and the data values are increasing or decreasing at a steady rate.
For example, the following chart contains a linear trendline as visits are increasing at a steady rate:
To add a linear trendline to your chart, follow the steps below:
Step-1: Right-click on the data points in your chart and then select the ‘Add Trendline’ option.
Step-2: From the ‘Format Trendline’ dialogue box, select ‘linear’ and click on the two checkboxes: ‘Display equation on chart’ and ‘Display R-Squared Value on chart’:
Step-3: Click on the ‘close’ button. You will now see the chart with linear trendline, linear trendline equation, and R2 value:
Use the same procedure to add other types of trendlines to your chart.
The equation used to calculate the linear trendline is: y=mx+b
Here,
‘y’ is the dependent variable,
‘m’ is the slope of the line
‘b’ is the ‘y’ intercept.
Excel automatically calculates the linear trendline for you once you apply it to your chart. In the chart above, the equation used to display the linear trendline is:
y = 2819.7x + 38691
Here 2189.7 is the slope of the line, and 38691 is the y-intercept. Also, note the R2 value of 0.9826, which is close to 1. This means this trendline is very reliable.
Logarithmic Trendline
Use this trendline if your data values increase or decrease quickly and then level out.
For example, the following chart contains a logarithmic trendline as visits increases quickly and then level out:
The equation used to calculate the logarithmic trendline is: y= c ln x + b
Here ‘c’ and ‘b’ are constants, and ‘ln’ is the natural logarithm function.
Excel automatically calculates the logarithmic trendline for you once you apply it to your chart.
In the chart above the equation used to display the logarithmic trendline is: y = 22724ln(x) + 39839
Also, note the R2 value of 0.9296, which is close to 1. This means this trendline is very reliable.
Polynomial Trendline
Use this trendline if there is fluctuation in your data.
The order of the polynomial is the number of fluctuations in the data or the number of bands (peaks and valleys) that appeared in a curve.
For example, use an order two polynomial trendline if there is only one hill or valley.
Similarly, use order 3 polynomial trendline if there are one or two hills or valleys.
The following chart contains a polynomial trendline
The equation used to calculate the polynomial trendline is:
Here b and c1, c2,,,c6 are constants
Excel automatically calculates the polynomial trendline for you once you apply it to your chart.
In the chart above, the equation used to display the polynomial trendline is: y = 3750×2 – 18351x + 61130
Also, note the R2 value of 0.8394, which is close to 1. This means this trendline is quite reliable.
Moving Average Trendline
Use this trendline if there is a lot of fluctuation in your data.
You select the number of data points you want the trendline to average (by using the period option in excel) and use it as a point in the line.
For example,
If you set the period to 2, then the average of the first two data points is used as the first point in the moving average trendline.
The average of the 2nd and 3rd data points is used as the second point in the trendline, and so on.
Moving average trendlines are generally used in stock analysis.
Generally, low period values are used to show more immediate trends, and high period values are used to show long-term trends.
The following chart contains a moving average trendline:
Note: The R2 value is not available with a moving average trendline.
Power Trendline
Use this trendline if the data values increase at a specific rate. The following chart contains a power trendline:
The equation used to calculate the power trendline is:
Here ‘c’ and ‘b’ are constants.
Excel automatically calculates the power trendline for you once you apply it to your chart. In the chart above, the equation used to display the power trendline is:
y = 334.09×0.2455
Also, note the R2 value of 0.9319, which is close to 1. This means this trendline is very reliable.
Note: You can not create a power trendline if your data contains zero or negative values.
Exponential Trendline
Use this trendline if data values increase or decrease at increasingly higher rates.
The following chart contains an exponential trendline:
The equation used to calculate the exponential trendline is:
Here ‘c’ and ‘b’ are constants. ‘e’ is the base of the natural logarithm.
Excel automatically calculates the exponential trendline for you once you apply it to your chart.
In the chart above, the equation used to display the exponential trendline is:
y = 41290e0.0461x
Also, note the R2 value of 0.9979, which is very close to 1. This means this trendline is very reliable.
Note: You can not create an exponential trendline if your data contains zero or negative values.
Least Square Line
The trendline is also known as the least square line (or best-fitting line) because Excel uses the method of least square to choose a trendline.
According to this method, Excel chooses that line as a trendline which minimizes the sum of squared vertical distances from each point to the line.
The vertical distance from each point to the line is called residual (or error).
As discussed earlier, each data point has one residual:
A positive error means a data point is above the least square line and a negative error means a data point is below the least square line.
Another definition of residual:
A residual (denoted by ‘e’) is the difference between the observed value and predicted value of a dependent variable in regression analysis.
Residual (e) = Observed value – Predicted value
Forecasting with Trendline
Forecasting with a trendline is one of the easiest ways of running regression analysis in excel.
All you have to do is to select the best trendline for your data and then enter the number of periods for which you want to forecast the data in the ‘Forecast Forward’ text box of the ‘Format Trendline’ dialogue box:
The chart below uses a trendline that is forecasting four months ahead and is clearly showing a trend toward an increase in sales:
By using regression analysis, we have extended the trendline in the chart beyond the actual data to predict future values of the dependent variable (sales).
Note: You can also do backward forecasting by entering the number of periods for which you want to forecast the data in the ‘Forecast Backward’ text box of the ‘Format Trendline’ dialogue box.
Running Linear Regression
In the case of linear regression, we use only one independent variable (x) to predict the value of a dependent variable (y).
Let us suppose we want to predict sales from AOV (average order value). So sales become the dependent variable, and AOV becomes the independent variable.
Now follow the steps below:
Step-1: Collect the data for both Sales and AOV over a period of 3 or more months and then put the data in a table in Excel:
Step-2: Determine the nature of the relationship (linear or nonlinear) between AOV and Sales by graphing the data points using Excel ‘scatter chart ‘so that the independent variable (AOV) is on the x-axis and the dependent variable (Sales) is on the y axis:
Step-3: Determine the correlation between AOV and Sales.
If there is no or weak linear relationship between two variables, or in other words, the correlation between the two variables is zero or weak, then such a relationship is not good for predicting anything.
So then, there is no point in running a regression analysis. The correlation coefficient between AOV and Sales, according to the CORREL function, turned out to be 0.79, which is a strong positive correlation.
Hence we are in a position to predict the value of Sales from AOV. In other words, we can run regression analysis.
Step-4: Add the best trendline to your chart. From the chart above, we can see that the relationship between AOV and Sales forms a straight line.
In other words, the relationship is linear. So the trendline that will best fit the data is the linear trendline:
The formula used to calculate the linear trendline is: y = 59.364x + 40397
Here ‘y’ is the dependant variable (i.e. Sales), and ‘x’ is the independent variable (i.e. AOV)
Here 59.36 is the slope of the trendline, which indicates that as AOV increases by $1, the sales increase by $59.36.
The number ‘40397’ is the y-intercept which can be interpreted as monthly fixed sales.
So even if AOV during a month is zero, this graph estimates that the business will still generate sales of $40,397 a month.
The R2 value is 0.63, which indicates that 63% of the variation in monthly sales can be predicted from the variation in AOV.
This implies that the remaining 37% (100-63) variation in monthly sales can be predicted by other factors. We need to run multiple regression to determine other factors that influence the sale.
We can now calculate the ‘predicted sales’ for each AOV by using the formula: y = 59.364x + 40397 in our spreadsheet like this:
Step-5: Calculate residuals for each data point.
As stated earlier, a residual (denoted by ‘e’) is the difference between the observed value and predicted value of a dependent variable in regression analysis.
Each data point has one residual:
Step-6: Calculate the standard error of the regression (also known as SER).
SER is a measure of the accuracy of your predictions. It measures the spread of the data points around the least square line.
The function that is used to calculate the standard error of regression in Excel is STEYX:
STEYX (yrange, xrange)
Here yrange contains the values of the dependent variable, and xrange contains the values of the independent variable:
According to descriptive statistics rule of thumb: you would expect about 68% of your predictions to be accurate within one standard error of regression (SER) and about 95% of your predictions to be accurate within two standard errors.
Any prediction/forecast that differs from the actual value by more than two standard errors is considered an outlier (extreme value).
In our case, one standard error of the regression is 14467.26.
Therefore the two standard errors of regression will be 14467.26 * 2 = 28934.52.
So, according to the descriptive statistics rule of thumb, 68% of the residuals (or errors) should be 14467.26 or smaller, and 95% of the residuals (or errors) should be 28934.52 or smaller.
If you look at the residuals in column ‘E’, you can see that 12 residuals out of 15 or 80% of the residuals are equal to or smaller than 14467.26 (.i.e. within one standard error of regression).
Similarly, 15 residuals out of 15 or 100% of the residuals are equal to or smaller than 28934.52 (.i.e. within two standard errors of regression).
This means we could expect 80% of our predictions to be accurate within one standard error of regression (SER) and about 100% of our predictions to be accurate within two standard errors.
If we look at the first observation, the residual of $7922.36 indicates that the predicted sales of $1, 15,195.64 is too low and is low by $7922.36.
Running Multiple Regression
As mentioned earlier, when we use two or more independent variables to predict the value of a dependent variable, then such a type of regression is known as multiple regression.
Let us suppose we want to predict sales from AOV (average order value) and the number of e-commerce transactions.
So sales become a dependent variable and AOV and number of e-commerce transactions become independent variables. Now follow the steps below:
Step-1: Collect the data for both Sales and AOV over a period of 3 or more months and then put the data in a table in excel like this:
Step-2: Now click on the ‘Data’ tab > Data Analysis to open the ‘Data Analysis tool’. In the ‘Data Analysis’ dialogue box, select ‘Regression’ from the list and click on the ‘ok’ button:
Note: You need to install the ‘Analysis Tool Pack’ (an Excel add-in) in order to run multiple regression analyses in Excel. This tool pack is available in Excel 2007 and above versions.
Step-3: Enter the ‘input Y range’ and input X range.
The input Y range contains all the values of a dependent variable (here, Sales).
The input X range contains all the values of independent variables (here, AOV and Transactions).
Click on the checkbox ‘labels’ as we have also included labels in the input Y range and input X ranges.
Select the output range, i.e. the cells where you want to display the results of multiple regression analysis.
Click on the ‘Residuals’ checkbox to include residuals in your multiple regression analysis reports.
Click on the ‘Residual Plots’ checkbox to include residual plots in your multiple regression analysis reports.
Then click on the ‘ok’ button.
You will now see the output of your multiple regression analysis something like this:
One of the most important output of a multiple regression analysis is the R2 value.
If you look at the R2 (or R Square) value, it is 0.95, which is pretty close to 1.
The Adjusted R2 value is 0.94, which is a more reliable statistic as it takes into account your sample size (i.e. number of observations).
The adjusted R2 value of 0.94 indicates that 94% of the variation in monthly sales can be predicted from the variation in AOV and the number of transactions.
This implies that the remaining 6% (100-94) variation in monthly sales can be predicted by other factors.
If you remember, when we ran linear regression, our R2 value was 0.63.
This means that we were able to explain only 63% of the variation in monthly sales.
With the R2 value of 0.94, we now can explain up to 94% variation in monthly sales.
This is the advantage of running multiple regression.
In the real world a single independent variable is rarely responsible for causing all of the variations in a dependent variable.
So most of the time, you will find yourself running multiple regression.
The formula used by excel to calculate the multiple regression is:
Y = a + b1X1+ b2X2
Here ‘y’ is the dependent variable (in our case, Sales)
‘X1’ and ‘X2’ are the values of the independent variable (in our case AOV and number of transactions)
‘a’ is the ‘y-intercept’
‘b1’ is the slope of AOV, and ‘b2’ is the slope of Transactions.
You can see the values of ‘a’, ‘b1’ and ‘b2’ in Fig-3. In the first output of multiple regression analysis:
So here, the y-intercept is -54398.47, ‘b1’ is 96.94 and ‘b2’ is 552.41.
So the equation of multiple regression (or the best prediction equation) is:
y= -54398.47 + 96.94X1 + 552.41X2
By using this formula, Excel calculated the predicted sales for each observation, as shown above under the ‘Predicted Sales’ column. You can also see the residuals for each observation/data point.
The standard error of regression (also known as SER) which measures the accuracy of our predictions, is 5370.16. (see the first output of multiple regression analysis, i.e. summary output).
Now according to the descriptive statistics rule of thumb:
You would expect about 68% of your predictions to be accurate within one standard error of regression (SER) and about 95% of your predictions to be accurate within two standard errors.
Any prediction/forecast that differs from the actual value by more than two standard errors is considered an outlier (.i.e extreme value)
In our case, one standard error of regression is 5370.16. Therefore the two standard errors of regression will be 5370.16 * 2 = 10740.32.
So, according to the descriptive statistics rule of thumb, 68% of the residuals (or errors) should be 5370.16 or smaller, and 95% of the residuals (or errors) should be 10740.32 or smaller.
So if you look at the residuals column in the second output of multiple regression (i.e. Residual Output), you can see that 14 residuals out of 15 or 93% of the residuals are equal to or smaller than 5370.16 (.i.e. within one standard error of regression).
Similarly, 15 residuals out of 15 or 100% of the residuals are equal to or smaller than 10740.32 (.i.e. within two standard errors of regression).
That means we could expect 93% of our predictions to be accurate within one standard error of regression (SER) and about 100% of our predictions to be accurate within two standard errors.
The last two outputs of multiple regressions (AOV Residual Plot and Transactions Residual Plot) are residual plots which I have explained earlier.
The P-Value
When you run a regression analysis, each independent variable has a p-value between 0 and 1.
The p-value is the probability that the results occurred randomly.
If the p-value of an independent variable is less than or equal to 0.05, then it is considered to be useful for predicting the dependent variable.
Thus, the smaller the p-value, the higher the predictive power of the independent variable.
In the table above, you can see that the p-values of the independent variables ‘AOV’ and ‘Transactions’ are very small: 2.742E-09 and 1.0E-06.
Since these p-values are less than 0.05, they are considered to be very useful for predicting the value of a dependent variable (in our case, ‘Sales’).
The other outputs of multiple regression analysis {like ANOVA(Analysis of variance), degree of freedom (df), sum of squares (SS), Mean of Squares (MS) etc.} are not so relevant, and their explanation is beyond the scope of this post.
However, if you want to dive deeper, here is a good video on interpreting regression output:
Note: You can also use the Excel LINEST function to run multiple regression in Excel.
Predictive Marketing Process
This was a lot of statistics. Isn’t it? Finally, we are in a position where we can carry out predictive marketing techniques.
I use the following process to do predictive marketing:
Step-1: Collect data and extract information from it – All modern-day web analytics tools are able to collect data (like visits, page views, etc.) and extract a considerable amount of information from it (like goal conversion rate, visitor behaviour, etc.).
You need to collect at least 3 months of performance data before you start trend analysis. Always strive to collect accurate data. If your data is not accurate, then predictions won’t be accurate either.
Step-2: Identify trends to analyse – Review your KPIs, esp. the ones which drive revenue and cost, in order to determine the performance data you want to analyse and compare.
For example, when reviewing the sales performance of your company, analyse KPIs like conversion volume, conversion rate, and average order value.
Step-3: Choose threshold – Define the level at which a variation is worth investigating. For example, a 30% increase or decrease in sales over a period of time should be worth investigating. Use historical data to choose the threshold.
Step-5: Chart the data and add the best trendline to the chart
Step-6: Analyse and interpret the data and spot patterns
Step-7: Conduct cause analysis – Investigate why the variation occurred so that you can use this information to discount certain peaks or valleys in your data trends.
Step-8: Conduct regression analysis.
Other Posts you may find useful: Beginners Guide to Maths and Stats behind Web Analytics
Bonus Post => Analytics Career Advice from Top Industry Experts
You must have heard of something called ‘predictive analytics‘ at some point in your life as a marketer/analyst.
The term ‘predictive analysis’ is so heavy that we have conferences (like Predictive Analytics World) organized around it all over the world.
What is Predictive Analytics?
Predictive analytics is an area of statistical analysis in which we extract information from data and then analyze it to identify patterns and make predictions about future outcomes/events.
Both trend analysis and regression analysis are a subset of predictive analysis.
In trend analysis, we extract information from data and then analyze the information to identify patterns in that information.
In regression analysis, we make predictions about future outcomes/events by estimating the relationship between two or more variables.
Regression analysis is based on the idea that what has happened in the past can give us an idea of what can happen in the future.
Predictive analytics can be applied to a wide range of industries like sports, real estate, insurance, marketing, etc.
What is Predictive Marketing?
Predictive marketing is the application of predictive analytics in marketing. It can help you answer the following questions:
What my sales will likely to be for the second half of the year?
Where I should invest my money and resources to get the highest possible ROI?
Where current business operations and marketing practices will take me?
Which customers are most likely to respond to our new social media campaign?
….. and the questions can be endless.
My article is an attempt to unravel the mysteries of the techniques used in ‘Predictive Marketing’ and bring statistics into marketing for the greater good.
At the time of writing this article, I know only a handful of marketers who sell ‘predictive marketing’ as a service.
The reason behind this is simple.
Predictive marketing is not everybody’s cup of tea.
It is too difficult to learn and equally difficult to implement.
This is the main reason for writing this article. I want to make these techniques a bit easier for you to learn and implement.
I want to be honest with you. There is a steep learning curve, and this is the reason why this post is so long.
In order to learn predictive marketing techniques, you need to climb a pyramid in which each step is progressively more difficult than the last one:
But once you have climbed all the steps and reached the top, you can actually make predictions that are statistically accurate.
All you have to do is to stick with me until the very end of this post.
I know it is a long journey.
It took me almost 1, 1/2 month to write this post.
So it will probably take you a couple of days to absorb all the material in this post.
But there is no hurry. Take your time.
If you can read and absorb all the material in one go, then you are just too smart and hats off to you.
I spent an awful lot of time deciding how to present the course material so that you can quickly learn the basic predictive marketing techniques and apply them to your job/work straight away.
In order to learn predictive marketing you need to know or learn ‘linear algebra’ first followed by ‘linear relationship’, ‘correlation’, ‘trends’ and ‘regression’.
Once you know all these concepts, only then can you proceed to make predictions in a scientific way.
Feel free to skip the section(s) with which you think you are familiar.
But remember that each section is built on the knowledge acquired through the previous section.
So if you miss something important in one section, it may cost you dearly in the subsequent sections. I have warned you.
I won’t go down into great details but will just present to you the bare minimum you need to know in order to get you going with predictive marketing techniques.
Hopefully, my article will build a solid foundation for you, and you can then dive deeper on your own depending upon your passion level and current situation (marketer, analyst, student, etc).
Linear Algebra
Slope (Gradient) of a straight line
The slope of a straight line defines how steep the line is.
Slope = change in vertical distance/change in horizontal distance
Note: Change in height is also called ‘rise’ or ‘fall’. Whereas change in horizontal distance is also called ‘run’.
The slope of this line =3/3 =1
Note: Large slope means a steeper line. For example:
Here the slope of line = 4/2 = 2
A slope can be positive, negative, zero, or undefined.
Positive Slope – A slope is positive if both change in horizontal distance and vertical distance is positive. For example:
Here the slope of line = 4/2 = 2
Negative Slope – A slope is negative if the change in horizontal distance is negative or the change in vertical distance is negative. For example:
Here slope of the line = -4/2 = -2
Zero Slope – A slope is zero if the change in vertical distance is zero. For example:Here slope of the line = 0/2 =0
Undefined Slope – A slope is undefined if the change in horizontal distance is zero. For example:Here slope of the line = 4/0 = undefined
Y Intercept of a Straight Line
The y-intercept of a straight line is the point where a line crosses the y-axis. For example:
In the diagram, the line crosses the y-axis at 1. So here, the y-intercept is 1.
Equation of a straight line
The equation of a straight line is: Y=mX+b
Here, ‘m’ is the slope of the straight line, and ‘b’ is the y-intercept.
For example, let us calculate the equation of the following straight line:
Slope = 2/2 =2. So m =1
Y intercept = 1. So b= 1
So the equation of the straight line is: y=x + 1
Linear Relationship
If a linear relationship exist between two variables then increase or decrease in the value of one variable will cause corresponding increase or decrease in the value of other variable.
This type of relationship exists between dependent and independent variables.
For example:
Suppose you want to forecast sales for your company, and you have concluded that your sales go up and down depending on changes in average order value (AOV).
The sales you are forecasting would be the dependent variable because its value “depends” on the value of AOV, and the AOV would be the independent variable.
Independent variable is generally denoted by the letter ‘x’. Whereas dependant variable is generally denoted by the letter ‘y’.
In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect.
You can express linear relationship between ‘x’ and ‘y’ via equation of the straight line: y=mx+b
Here, ‘m’ is the slope of the straight line, and ‘b’ is the y-intercept.
You can also express a linear relationship in a graphical format via an Excel scatter chart like this:
Note: In regression analysis, we predict the value of the dependent variable from one or more independent variables.
How you can determine whether the relationship between two variables is linear or not?
There are two methods to determine a linear relationship between two variables.
One is through Excel Scatter Chart, and one is through a residual plot.
Excel Scatter Chart
Step-1: To determine how two variables are related, graph the data points using Excel ‘scatter chart ‘so that the independent variable is on the x-axis and the dependent variable is on the ‘y’ axis.
Make sure that the column of data that you want to display on the x-axis is located to the left of the column of data you want to display on the y-axis in your excel table.
Something like this:
Step-2: Once you have plotted the data points. You will see a graph, something like Fig.1 above. If you look at the data points in Fig.1, you can see that they resemble a straight line.
I have added a trendline to make this look more obvious in Fig.2.
The trendlines are used to graphically display trends in data.
To add a trendline to your chart, follow the steps below in MS excel 2013:
- Select the chart and then click on the Design tab in the Excel ribbon.
- Click on the ‘Add Chart Element’ button and then select Trendline > ‘Linear’
Note: A trendline is also known as a ‘regression line’ or ‘least square line’
Residual Plot
A residual (denoted by ‘e’) is the difference between the observed value and predicted value of a dependent variable in regression analysis. Each data point has one residual.
Residual (e) = Observed value – Predicted value
For example:
Note: Residuals are also called ‘errors’ or ‘Regression Residuals’
In the table above, column F contains all the observed values of the dependent variable ‘y’.
Column G contains all the predicted values of the dependent variable ‘y’.
Column H contains the residuals.
Note: Both the sum and the mean of the residuals are equal to zero.
A residual plot is a graph that shows all the residuals on the vertical axis and the independent variable on the horizontal axis.
If the points in a residual plot are randomly dispersed around the horizontal axis, then the relationship between two variables is linear otherwise, it is non-linear.
For example, the following residual plot shows a linear relationship as the data points are randomly dispersed around the horizontal axis:
Now, look at the following residual plots:
Here the first residual plot shows a linear relationship. The other two plots show a non-linear relationship as the data points are not randomly dispersed around the horizontal axis.
Note(1): The sum of the residuals is always zero, whether the data set is linear or nonlinear.
Note(2): You need to conduct a regression analysis in order to create residual plots. I will teach you later in the post how to create a residual plot.
Correlation
Correlation is a statistical measurement of relationship between two variables. Correlation measures how two variables change in relation to each other.
The formula used to calculate the relationship between two variables is called covariance:
Covariance measures the strength and direction of the relationship between two variables. It is a measure of how much two variables change together.
Let us suppose ‘x’ and ‘y’ are two variables.
If as ‘x’ goes up, ‘y’ goes up then covariance would be positive.
However, if as ‘x’ goes up, ‘y’ goes down, then covariance would be negative.
The problem with calculating covariance is that the number you get from the calculation is hard to interpret.
This happens because covariance is not normalized (statistically adjusted).
Correlation Coefficient
The correlation coefficient is the normalized version of covariance and is calculated as:
Here we are simply dividing the covariance by the product of the standard deviation of the two variables.
Note: The correlation coefficient is not useful if the relationship between variables is non-linear.
Types of Correlation Coefficient
The most common correlation coefficient is ‘The Pearson product-moment correlation coefficient’ which measures the strength and direction of the linear relationship (a relationship which resembles a straight line) between variables.
When we speak simply of a correlation coefficient, we are generally referring to the ‘Pearson product-moment correlation’.
‘Spearman’s Rank Correlation’ and ‘Kendall Rank Correlation’ are some other types of correlations.
The correlation coefficient of a sample (sample is set of observations drawn from a population) is denoted by ‘r’.
The correlation coefficient of a population (population is total set of observations) is denoted by ‘R’.
The value of the correlation coefficient can range from +1 to -1.
The sign (positive or negative) of the correlation coefficient describes the direction of the relationship between two variables.
Whereas the absolute value of the correlation coefficient describes the strength of the relationship between two variables.
Correlation of +1 means that the variables ‘x’ and ‘y’ have strongest positive linear relationship i.e. when ‘x’ is larger than average, ‘y’ tends to be larger than average. Similarly when ‘x’ is smaller than average, ‘y’ tends to be smaller than average.
Correlation of -1 means that the variables ‘x’ and ‘y’ have strongest negative linear relationship i.e. when ‘x’ is larger than average, ‘y’ tends to be smaller than average. Similarly when ‘x’ is smaller than average, ‘y’ tends to be larger than average.
Correlation of 0 means that no linear relationship exist between the two variables. Keep in mind that the ‘Pearson product-moment correlation coefficient’ only measures linear relationships.
Therefore, a correlation coefficient of 0 does not mean zero relationships exist between two variables.
It simply means no linear relationship exists between the two variables.
It is possible for two variables to have curvilinear relationship (it is a non linear relationship which doesn’t resemble a straight line) instead of linear relationship.
Note: A correlation greater than 0.8 is generally considered a strong correlation.Whereas a correlation less than 0.5 is generally considered a weak correlation. However, these values can vary based on the type of data being analysed.
Now consider the following scatter charts:
We can conclude the following facts from the above scatter charts:
1. In the case of a positive correlation, the slope of the straight line is positive. Whereas in the case of negative correlation, the slope of the straight line is negative.
2. In the case of the strongest positive correlation (r=1) or strongest negative correlation (r=-1) the data points form an exact straight line.
3. The correlation becomes weaker as the data points become more scattered. So in the case of zero correlation, the data points are completely scattered (in a random manner).
Outliers effect correlation
Outliers are extreme values that differ greatly from other values.
In regression analysis, the data points that diverge greatly from the overall pattern of the data are called outliers.
Note: If the values of the two variables are unreliable (because of measurement error or other errors), then the correlation coefficient could be lower than expected.
Correlation Coefficient and Predictive Power
A correlation coefficient (R) is a good indicator of the predictive relationship between two variables.
For example:
If the value of R is zero then the dependent variable cannot be predicted from the independent variable as the relationship between two variables is non-linear.
If the value of R is 1 or -1 then the dependent variable can be predicted without any error from the independent variable.
The value of R between 0 and 1 or 0 or -1 measures the extent to which the dependant variable is predictable from the independent variable.
For example, an R of 0.50 means that 50% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 50% variation in ‘y’ can’t be explained.
Similarly, an R of 0.20 means that 20% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 80% variation in ‘y’ can’t be explained.
Therefore it is important that you determine the correlation between two variables before you run a regression analysis.
If there is no linear relationship or weak linear relationship between two variables, or in other words, the correlation between the two variables is zero or weak, then such a relationship is not good to predict anything.
So there is no point running a regression analysis, then.
Coefficient of Determination or R-Squared Value
1st Definition – The coefficient of determination (denoted by R2 or R-Squared value) is a number from 0 to 1 that describes how well a regression line (i.e. trendline) fits a data set.
A trend line is most reliable/accurate when its R2 value is 1 or near 1.
2nd Definition – The coefficient of determination (denoted by R2) is a number from 0 to 1 that measures the percentage of variation in the dependent variable that is predictable from the independent variable(s).
3rd Definition – The coefficient of determination (denoted by R2) is the square of the correlation coefficient (R).
4th Definition – The coefficient of determination allows to interpret the relationship between two variables in terms of variation.
5th Definition – The coefficient of determination is the ratio of ‘explained variation’ to the total variation.
If the value of R2 is zero then the dependent variable cannot be predicted from the independent variable.
If the value of R2 is 1 then the dependent variable can be predicted without any error from the independent variable.
The value of R2 between 0 and 1 measures the extent to which the dependant variable is predictable from the independent variable.
For example, an R2 of 0.50 means that 50% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 50% variation in ‘y’ can’t be explained.
Similarly, an R2 of 0.20 means that 20% of the variation in the dependent variable (y) is predictable from the independent variable (x). The other 80% variation in ‘y’ can’t be explained.
For example, in the chart below, 62% of the variation in sales can be explained by the variation in AOV. The remaining 38% (100-62) variation in sales can be explained by other factors:
The other factors that can explain the variation in sales can be determined by running multiple regression analyses.
Note: Coefficient of determination is a key output of regression analysis.
How to Display R-Squared Value on Your Chart
To display this value on your chart, follow the steps below:
Step-1: In MS Excel 2013, right-click on the trendline and then select ‘Format Trendline’ from the drop-down menu. This will open up the ‘Format Trendline’ box on the right-hand side of the excel window.
Step-2: In the ‘Format Trendline’ box, select the checkbox ‘Display R-Squared Value on chart’:
You can now see the R-Squared value in your chart like the one below:
Note: When you add a trendline to a chart, excel automatically calculates the R-squared value of the trendline.
Methods to Find a Correlation Between Two Sets of Data
There are three methods of finding a correlation between two sets of data in Excel:
1. Through Analysis Tool Pack
2. By using CORREL function
3. By using the R2 Value
Using Analysis Tool Pack for Finding a Correlation Between Two Data Sets
The ‘Analysis tool pack’ is an excel add-in through which you can do complex statistical analysis (like multiple regressions) in Excel. It is available in Excel 2007 and above versions.
To install Analysis Tool Pack in Excel 2013, follow the steps below:
Step-1: Open MS Excel and then click on the File Tab.
Step-2: Go to Options > Excel Options > Add-ins
Step-3: Select the ‘Analysis ToolPak’ option and then click on the ‘go’ button:
Step-4: Select the ‘Analysis ToolPak’ checkbox and then click on the ‘ok’ button:
Once the ‘Analysis ToolPak’ is installed, you will see the new ‘Data Analysis’ button on the extreme right under the ‘Data’ tab:
Now click on the ‘Data Analysis’ button, select ‘correlation’ from the list and then click on the ‘ok’ button:
Select the ‘input range’ (the two columns in this case) and output range (where you want to display the results), and then click on the ok button:
You can now see the correlation between two sets of data:
Here the correlation between AOV and Sales is 0.792282
Using CORREL Function for Finding Correlation Between Two Data Sets
Through this Excel function, you can calculate the correlation between two data sets without using the ‘Analysis Tool Pack’.
Syntax: CORREL (array1, array2)
For example:
Here the correlation between AOV and Sales is 0.792282. This is the same value we got when we used the ‘Analysis Tool Pack’ to calculate the correlation between two data sets.
Using the R2 Value for Finding a Correlation Between Two Data Sets
The correlation between two data sets can also be defined as the square root of the R2 value:
You can calculate the R2 value by using the RSQ excel function.
Syntax: RSQ (known_y’s, known_x’s)
Here known_y’s are the values of the dependent variable (in our case, ‘Sales’), and known_x’s are the values of the independent variable (in our case ‘AOV’). For example:
Once you have determined the R2 value, then take its square root via the excel function SQRT to determine the correlation between two data sets.
Note: The correlation determined through the r2 value matches the correlation determined through the CORREL function and the correlation determined through the ‘Analysis Tool Pack’.
Trend Analysis Basics
A trend is a movement in a particular direction.
We do trend analysis for the following main reasons:
#1 Measure the performance of a marketing campaign over a period of time. Through trend analysis, you can identify areas where your campaign is performing well (so that you replicate success) and the areas where your campaign is underperforming.
#2 Trend analysis help in getting answers to questions like:
- What are my top selling products?
- What are my top converting keywords?
- Which keywords I should bid on?
- Where I should invest my money and resources to get highest possible ROI?
- Which are the most effective marketing channels in terms of conversions and revenue?
#3 Trend analysis help in determining where current business operations and marketing practices could take you. Negative trends act as a warning of potential problems with current practices.
#4 Trend analysis can help in refining business decisions and strategies. We develop strategies that respond to the identified trends. Moving with a positive trend and not against them can lead to more profit.
#5 Trend analysis can help in identifying risks and opportunities.
#6 Through trend analysis, we can predict future outcomes and events, like what the sales will likely be for the second half of the year.
Importance of Accurately Analyzing and Interpreting Data Trends
How we analyze and interpret the ‘data trends play a very important role in optimizing our marketing campaigns and making predictions about future outcomes.
One wrong interpretation, and we can end up losing a hundred thousand pounds (depending upon the size of your business).
Important Points to Keep in Mind while Analyzing, Interpreting and Reporting Data Trends
#1 Always question how the data is collected
#2 Understand that historical data is in fact “dated”
#3 Select the right time period to analyse your data trends
#4 Add comparison to your data trends
#5 Never report standalone metric in your data trends
#6 Segment your data before you analyse/report data trends
#7 Report something business bottomline Impacting
#8 Make the insight obvious
#9 Use sparklines and conditional formatting.
#10 Don’t jump to conclusions
#11 Analysing single trend is rarely useful.
Related Post: How to analyze, interpret and report data trends in Google Analytics
Types of Trend
1 Uptrend – moving higher
2 Downtrend – moving lower
3 Sideways/horizontal trend – moving sideways.
Note: Sideways trend is technically not a trend. It is a lack of well-defined trends in either direction.
Trend Length
A trend can be short (or seasonal), intermediate, or long-term. The longer the trend more significant it is.
For example, a 3 months trend is not as significant as 3 years trend.
Regression Analysis
Regression analysis is a statistical technique which is used to estimate relationship between a dependant variable and one or more independent variables.
In regression analysis, we try to predict the value of a dependent variable (y) from one or more independent variables (x1,x2,x3,…)
When we use one independent variable to predict the value of a dependent variable then such type of regression is known as linear regression.
The analysis, which is done in linear regression, is known as the linear regression analysis.
When we use two or more independent variables to predict the value of a dependent variable, then such type of regression is known as multiple regression.
The analysis, which is done in multiple regression, is known as the multiple regression analysis.
Equation used to calculate linear regression
Y = a + bX
Here ‘y’ is the dependant variable (the variable whose value we are trying to predict)
‘X’ is the independent variable (the variable whose value we are using to predict the value of ‘y’)
‘a’ is the ‘y-intercept’
‘b’ is the slope of the regression line
Equation used to calculate multiple regression
Y = a + b1X1+ b2X2 + b3X3 +…..bnXn
Here ‘y’ is the dependant variable (the variable whose value we are trying to predict)
‘X1’, ‘X2’, ‘X3’….’Xn’ are the independent variable (the variables whose values we are using to predict the value of ‘y’)
‘a’ is the ‘y-intercept’
‘b1’,’b2’,’b3’…..’bn’ are the slope of the regression lines
Note: You can use up to 15 independent variables in excel to do multiple regression analysis.
By using regression analysis, you can extend a trendline in a chart beyond the actual data to predict future values of the dependent variable.
For example, the following chart uses a trendline that is forecasting four months ahead and is clearly showing a trend of increase in sales.
However, for regression analysis to work, you need to select the right trendline for your data in the first place.
Choosing the Best Trendline for Your Data
Choosing the right trendline for your data is very important. If you applied the wrong trendline, then your predictions/forecast won’t be accurate.
The R2 value can help you in choosing the best trendline for your data.
There are six types of trendlines (or regression lines) available in Excel:
1. Linear Trendline
2. Logarithmic Trendline
3. Polynomial Trendline
4. Moving Average Trendline
5. Power Trendline
6. Exponential Trendline
Add each of the trendlines in turns and note down its R2 value. The trendline whose R2 value is highest is the best trendline to use.
Linear Trendline
Use this trendline if your data set is linear (resembles a straight line) and the data values are increasing or decreasing at a steady rate.
For example, the following chart contains a linear trendline as visits are increasing at a steady rate:
To add a linear trendline to your chart, follow the steps below:
Step-1: Right-click on the data points in your chart and then select the ‘Add Trendline’ option.
Step-2: From the ‘Format Trendline’ dialogue box, select ‘linear’ and click on the two checkboxes: ‘Display equation on chart’ and ‘Display R-Squared Value on chart’:
Step-3: Click on the ‘close’ button. You will now see the chart with linear trendline, linear trendline equation, and R2 value:
Use the same procedure to add other types of trendlines to your chart.
The equation used to calculate the linear trendline is: y=mx+b
Here,
‘y’ is the dependent variable,
‘m’ is the slope of the line
‘b’ is the ‘y’ intercept.
Excel automatically calculates the linear trendline for you once you apply it to your chart. In the chart above, the equation used to display the linear trendline is:
y = 2819.7x + 38691
Here 2189.7 is the slope of the line, and 38691 is the y-intercept. Also, note the R2 value of 0.9826, which is close to 1. This means this trendline is very reliable.
Logarithmic Trendline
Use this trendline if your data values increase or decrease quickly and then level out.
For example, the following chart contains a logarithmic trendline as visits increases quickly and then level out:
The equation used to calculate the logarithmic trendline is: y= c ln x + b
Here ‘c’ and ‘b’ are constants, and ‘ln’ is the natural logarithm function.
Excel automatically calculates the logarithmic trendline for you once you apply it to your chart.
In the chart above the equation used to display the logarithmic trendline is: y = 22724ln(x) + 39839
Also, note the R2 value of 0.9296, which is close to 1. This means this trendline is very reliable.
Polynomial Trendline
Use this trendline if there is fluctuation in your data.
The order of the polynomial is the number of fluctuations in the data or the number of bands (peaks and valleys) that appeared in a curve.
For example, use an order two polynomial trendline if there is only one hill or valley.
Similarly, use order 3 polynomial trendline if there are one or two hills or valleys.
The following chart contains a polynomial trendline
The equation used to calculate the polynomial trendline is:
Here b and c1, c2,,,c6 are constants
Excel automatically calculates the polynomial trendline for you once you apply it to your chart.
In the chart above, the equation used to display the polynomial trendline is: y = 3750×2 – 18351x + 61130
Also, note the R2 value of 0.8394, which is close to 1. This means this trendline is quite reliable.
Moving Average Trendline
Use this trendline if there is a lot of fluctuation in your data.
You select the number of data points you want the trendline to average (by using the period option in excel) and use it as a point in the line.
For example,
If you set the period to 2, then the average of the first two data points is used as the first point in the moving average trendline.
The average of the 2nd and 3rd data points is used as the second point in the trendline, and so on.
Moving average trendlines are generally used in stock analysis.
Generally, low period values are used to show more immediate trends, and high period values are used to show long-term trends.
The following chart contains a moving average trendline:
Note: The R2 value is not available with a moving average trendline.
Power Trendline
Use this trendline if the data values increase at a specific rate. The following chart contains a power trendline:
The equation used to calculate the power trendline is:
Here ‘c’ and ‘b’ are constants.
Excel automatically calculates the power trendline for you once you apply it to your chart. In the chart above, the equation used to display the power trendline is:
y = 334.09×0.2455
Also, note the R2 value of 0.9319, which is close to 1. This means this trendline is very reliable.
Note: You can not create a power trendline if your data contains zero or negative values.
Exponential Trendline
Use this trendline if data values increase or decrease at increasingly higher rates.
The following chart contains an exponential trendline:
The equation used to calculate the exponential trendline is:
Here ‘c’ and ‘b’ are constants. ‘e’ is the base of the natural logarithm.
Excel automatically calculates the exponential trendline for you once you apply it to your chart.
In the chart above, the equation used to display the exponential trendline is:
y = 41290e0.0461x
Also, note the R2 value of 0.9979, which is very close to 1. This means this trendline is very reliable.
Note: You can not create an exponential trendline if your data contains zero or negative values.
Least Square Line
The trendline is also known as the least square line (or best-fitting line) because Excel uses the method of least square to choose a trendline.
According to this method, Excel chooses that line as a trendline which minimizes the sum of squared vertical distances from each point to the line.
The vertical distance from each point to the line is called residual (or error).
As discussed earlier, each data point has one residual:
A positive error means a data point is above the least square line and a negative error means a data point is below the least square line.
Another definition of residual:
A residual (denoted by ‘e’) is the difference between the observed value and predicted value of a dependent variable in regression analysis.
Residual (e) = Observed value – Predicted value
Forecasting with Trendline
Forecasting with a trendline is one of the easiest ways of running regression analysis in excel.
All you have to do is to select the best trendline for your data and then enter the number of periods for which you want to forecast the data in the ‘Forecast Forward’ text box of the ‘Format Trendline’ dialogue box:
The chart below uses a trendline that is forecasting four months ahead and is clearly showing a trend toward an increase in sales:
By using regression analysis, we have extended the trendline in the chart beyond the actual data to predict future values of the dependent variable (sales).
Note: You can also do backward forecasting by entering the number of periods for which you want to forecast the data in the ‘Forecast Backward’ text box of the ‘Format Trendline’ dialogue box.
Running Linear Regression
In the case of linear regression, we use only one independent variable (x) to predict the value of a dependent variable (y).
Let us suppose we want to predict sales from AOV (average order value). So sales become the dependent variable, and AOV becomes the independent variable.
Now follow the steps below:
Step-1: Collect the data for both Sales and AOV over a period of 3 or more months and then put the data in a table in Excel:
Step-2: Determine the nature of the relationship (linear or nonlinear) between AOV and Sales by graphing the data points using Excel ‘scatter chart ‘so that the independent variable (AOV) is on the x-axis and the dependent variable (Sales) is on the y axis:
Step-3: Determine the correlation between AOV and Sales.
If there is no or weak linear relationship between two variables, or in other words, the correlation between the two variables is zero or weak, then such a relationship is not good for predicting anything.
So then, there is no point in running a regression analysis. The correlation coefficient between AOV and Sales, according to the CORREL function, turned out to be 0.79, which is a strong positive correlation.
Hence we are in a position to predict the value of Sales from AOV. In other words, we can run regression analysis.
Step-4: Add the best trendline to your chart. From the chart above, we can see that the relationship between AOV and Sales forms a straight line.
In other words, the relationship is linear. So the trendline that will best fit the data is the linear trendline:
The formula used to calculate the linear trendline is: y = 59.364x + 40397
Here ‘y’ is the dependant variable (i.e. Sales), and ‘x’ is the independent variable (i.e. AOV)
Here 59.36 is the slope of the trendline, which indicates that as AOV increases by $1, the sales increase by $59.36.
The number ‘40397’ is the y-intercept which can be interpreted as monthly fixed sales.
So even if AOV during a month is zero, this graph estimates that the business will still generate sales of $40,397 a month.
The R2 value is 0.63, which indicates that 63% of the variation in monthly sales can be predicted from the variation in AOV.
This implies that the remaining 37% (100-63) variation in monthly sales can be predicted by other factors. We need to run multiple regression to determine other factors that influence the sale.
We can now calculate the ‘predicted sales’ for each AOV by using the formula: y = 59.364x + 40397 in our spreadsheet like this:
Step-5: Calculate residuals for each data point.
As stated earlier, a residual (denoted by ‘e’) is the difference between the observed value and predicted value of a dependent variable in regression analysis.
Each data point has one residual:
Step-6: Calculate the standard error of the regression (also known as SER).
SER is a measure of the accuracy of your predictions. It measures the spread of the data points around the least square line.
The function that is used to calculate the standard error of regression in Excel is STEYX:
STEYX (yrange, xrange)
Here yrange contains the values of the dependent variable, and xrange contains the values of the independent variable:
According to descriptive statistics rule of thumb: you would expect about 68% of your predictions to be accurate within one standard error of regression (SER) and about 95% of your predictions to be accurate within two standard errors.
Any prediction/forecast that differs from the actual value by more than two standard errors is considered an outlier (extreme value).
In our case, one standard error of the regression is 14467.26.
Therefore the two standard errors of regression will be 14467.26 * 2 = 28934.52.
So, according to the descriptive statistics rule of thumb, 68% of the residuals (or errors) should be 14467.26 or smaller, and 95% of the residuals (or errors) should be 28934.52 or smaller.
If you look at the residuals in column ‘E’, you can see that 12 residuals out of 15 or 80% of the residuals are equal to or smaller than 14467.26 (.i.e. within one standard error of regression).
Similarly, 15 residuals out of 15 or 100% of the residuals are equal to or smaller than 28934.52 (.i.e. within two standard errors of regression).
This means we could expect 80% of our predictions to be accurate within one standard error of regression (SER) and about 100% of our predictions to be accurate within two standard errors.
If we look at the first observation, the residual of $7922.36 indicates that the predicted sales of $1, 15,195.64 is too low and is low by $7922.36.
Running Multiple Regression
As mentioned earlier, when we use two or more independent variables to predict the value of a dependent variable, then such a type of regression is known as multiple regression.
Let us suppose we want to predict sales from AOV (average order value) and the number of e-commerce transactions.
So sales become a dependent variable and AOV and number of e-commerce transactions become independent variables. Now follow the steps below:
Step-1: Collect the data for both Sales and AOV over a period of 3 or more months and then put the data in a table in excel like this:
Step-2: Now click on the ‘Data’ tab > Data Analysis to open the ‘Data Analysis tool’. In the ‘Data Analysis’ dialogue box, select ‘Regression’ from the list and click on the ‘ok’ button:
Note: You need to install the ‘Analysis Tool Pack’ (an Excel add-in) in order to run multiple regression analyses in Excel. This tool pack is available in Excel 2007 and above versions.
Step-3: Enter the ‘input Y range’ and input X range.
The input Y range contains all the values of a dependent variable (here, Sales).
The input X range contains all the values of independent variables (here, AOV and Transactions).
Click on the checkbox ‘labels’ as we have also included labels in the input Y range and input X ranges.
Select the output range, i.e. the cells where you want to display the results of multiple regression analysis.
Click on the ‘Residuals’ checkbox to include residuals in your multiple regression analysis reports.
Click on the ‘Residual Plots’ checkbox to include residual plots in your multiple regression analysis reports.
Then click on the ‘ok’ button.
You will now see the output of your multiple regression analysis something like this:
One of the most important output of a multiple regression analysis is the R2 value.
If you look at the R2 (or R Square) value, it is 0.95, which is pretty close to 1.
The Adjusted R2 value is 0.94, which is a more reliable statistic as it takes into account your sample size (i.e. number of observations).
The adjusted R2 value of 0.94 indicates that 94% of the variation in monthly sales can be predicted from the variation in AOV and the number of transactions.
This implies that the remaining 6% (100-94) variation in monthly sales can be predicted by other factors.
If you remember, when we ran linear regression, our R2 value was 0.63.
This means that we were able to explain only 63% of the variation in monthly sales.
With the R2 value of 0.94, we now can explain up to 94% variation in monthly sales.
This is the advantage of running multiple regression.
In the real world a single independent variable is rarely responsible for causing all of the variations in a dependent variable.
So most of the time, you will find yourself running multiple regression.
The formula used by excel to calculate the multiple regression is:
Y = a + b1X1+ b2X2
Here ‘y’ is the dependent variable (in our case, Sales)
‘X1’ and ‘X2’ are the values of the independent variable (in our case AOV and number of transactions)
‘a’ is the ‘y-intercept’
‘b1’ is the slope of AOV, and ‘b2’ is the slope of Transactions.
You can see the values of ‘a’, ‘b1’ and ‘b2’ in Fig-3. In the first output of multiple regression analysis:
So here, the y-intercept is -54398.47, ‘b1’ is 96.94 and ‘b2’ is 552.41.
So the equation of multiple regression (or the best prediction equation) is:
y= -54398.47 + 96.94X1 + 552.41X2
By using this formula, Excel calculated the predicted sales for each observation, as shown above under the ‘Predicted Sales’ column. You can also see the residuals for each observation/data point.
The standard error of regression (also known as SER) which measures the accuracy of our predictions, is 5370.16. (see the first output of multiple regression analysis, i.e. summary output).
Now according to the descriptive statistics rule of thumb:
You would expect about 68% of your predictions to be accurate within one standard error of regression (SER) and about 95% of your predictions to be accurate within two standard errors.
Any prediction/forecast that differs from the actual value by more than two standard errors is considered an outlier (.i.e extreme value)
In our case, one standard error of regression is 5370.16. Therefore the two standard errors of regression will be 5370.16 * 2 = 10740.32.
So, according to the descriptive statistics rule of thumb, 68% of the residuals (or errors) should be 5370.16 or smaller, and 95% of the residuals (or errors) should be 10740.32 or smaller.
So if you look at the residuals column in the second output of multiple regression (i.e. Residual Output), you can see that 14 residuals out of 15 or 93% of the residuals are equal to or smaller than 5370.16 (.i.e. within one standard error of regression).
Similarly, 15 residuals out of 15 or 100% of the residuals are equal to or smaller than 10740.32 (.i.e. within two standard errors of regression).
That means we could expect 93% of our predictions to be accurate within one standard error of regression (SER) and about 100% of our predictions to be accurate within two standard errors.
The last two outputs of multiple regressions (AOV Residual Plot and Transactions Residual Plot) are residual plots which I have explained earlier.
The P-Value
When you run a regression analysis, each independent variable has a p-value between 0 and 1.
The p-value is the probability that the results occurred randomly.
If the p-value of an independent variable is less than or equal to 0.05, then it is considered to be useful for predicting the dependent variable.
Thus, the smaller the p-value, the higher the predictive power of the independent variable.
In the table above, you can see that the p-values of the independent variables ‘AOV’ and ‘Transactions’ are very small: 2.742E-09 and 1.0E-06.
Since these p-values are less than 0.05, they are considered to be very useful for predicting the value of a dependent variable (in our case, ‘Sales’).
The other outputs of multiple regression analysis {like ANOVA(Analysis of variance), degree of freedom (df), sum of squares (SS), Mean of Squares (MS) etc.} are not so relevant, and their explanation is beyond the scope of this post.
However, if you want to dive deeper, here is a good video on interpreting regression output:
Note: You can also use the Excel LINEST function to run multiple regression in Excel.
Predictive Marketing Process
This was a lot of statistics. Isn’t it? Finally, we are in a position where we can carry out predictive marketing techniques.
I use the following process to do predictive marketing:
Step-1: Collect data and extract information from it – All modern-day web analytics tools are able to collect data (like visits, page views, etc.) and extract a considerable amount of information from it (like goal conversion rate, visitor behaviour, etc.).
You need to collect at least 3 months of performance data before you start trend analysis. Always strive to collect accurate data. If your data is not accurate, then predictions won’t be accurate either.
Step-2: Identify trends to analyse – Review your KPIs, esp. the ones which drive revenue and cost, in order to determine the performance data you want to analyse and compare.
For example, when reviewing the sales performance of your company, analyse KPIs like conversion volume, conversion rate, and average order value.
Step-3: Choose threshold – Define the level at which a variation is worth investigating. For example, a 30% increase or decrease in sales over a period of time should be worth investigating. Use historical data to choose the threshold.
Step-5: Chart the data and add the best trendline to the chart
Step-6: Analyse and interpret the data and spot patterns
Step-7: Conduct cause analysis – Investigate why the variation occurred so that you can use this information to discount certain peaks or valleys in your data trends.
Step-8: Conduct regression analysis.
Other Posts you may find useful: Beginners Guide to Maths and Stats behind Web Analytics
Bonus Post => Analytics Career Advice from Top Industry Experts
My best selling books on Digital Analytics and Conversion Optimization
Maths and Stats for Web Analytics and Conversion Optimization
This expert guide will teach you how to leverage the knowledge of maths and statistics in order to accurately interpret data and take actions, which can quickly improve the bottom-line of your online business.
Master the Essentials of Email Marketing Analytics
This book focuses solely on the ‘analytics’ that power your email marketing optimization program and will help you dramatically reduce your cost per acquisition and increase marketing ROI by tracking the performance of the various KPIs and metrics used for email marketing.
Attribution Modelling in Google Analytics and BeyondSECOND EDITION OUT NOW!
Attribution modelling is the process of determining the most effective marketing channels for investment. This book has been written to help you implement attribution modelling. It will teach you how to leverage the knowledge of attribution modelling in order to allocate marketing budget and understand buying behaviour.
Attribution Modelling in Google Ads and Facebook
This book has been written to help you implement attribution modelling in Google Ads (Google AdWords) and Facebook. It will teach you, how to leverage the knowledge of attribution modelling in order to understand the customer purchasing journey and determine the most effective marketing channels for investment.