# Predictive Analytics & Marketing – The next stage of Business Optimization

** **

You must have heard of something called ‘**Predictive Analytics**‘ at some point in your life as a marketer/analyst. The term ‘predictive analysis’ is so heavy that we have conferences (like Predictive Analytics World) organized around it all over the world.

## What is Predictive Analytics

Predictive analytics is an area of statistical analysis in which we extract information from data and then analyze it to identify patterns and make predictions about future outcomes/events.

Both trend analysis and regression analysis are subset of predictive analysis. In trend analysis we extract information from data and then analyze the information to identify patterns in that information. In regression analysis we make predictions about future outcomes/events by estimating the relationship among two or more variables. Regression analysis is based on the idea that what has happened in the past can give us an idea of what can happen in the future. Predictive analytics can be applied to a wide range of industries like sports, real estate, insurance, marketing etc.

## What is Predictive Marketing

**Predictive marketing** is the application of predictive analytics in marketing. It can help you answer the following questions:

What my sales will likely to be for the second half of the year?

Where I should invest my money and resources to get the highest possible ROI?

Where current business operations and marketing practices will take me?

Which customers are most likely to respond to our new social media campaign?

….. and the questions can be endless.

My post is an attempt to unravel the mysteries of the techniques used in ‘Predictive Marketing’ and **bringing statistics into marketing** for the greater good. At that time of writing this post I know only handful of marketers who sell ‘predictive marketing’ as a service. The reason behind this is simple. Predictive marketing is not everybody’s cup of tea. It is too difficult to learn and equally difficult to implement. This is the main reason of writing this post. I want to make these techniques a bit easier for you to learn and implement.

I want to be honest with you. There is a steep learning curve and this is the reason why this post is so long. In order to learn predictive marketing techniques you need to climb a pyramid in which each step is progressively difficult than the last one:

But once you climbed all the steps and reached to the top, you can actually make predictions which are statistically accurate. All you have to do is to stick with me till the very end of this post. I know it is a long journey. It took me almost 1, 1/2 month to write this post. So it will probably take you couple of days to absorb all the material in this post. But there is no hurry. Take your time. If you can read and absorb all the material in one go then you are just too smart and hats off to you.

I spent awful lot of time in deciding how to present you the course material so that you can quickly learn the basic predictive marketing techniques and apply it to your job/work straight away. In order to learn predictive marketing you need to know or learn ‘linear algebra’ first followed by ‘linear relationship’, ‘correlation’, ‘trends’ and ‘regression’. Once you know all these concepts **only then** you can proceed to making predictions in a scientific way.

Feel free to skip the section(s) which you think you are familiar with. But remember that each section is built on the knowledge acquired through the previous section. So if you miss something important in one section, it may cost you dearly in the subsequent sections. I have warned you.

I won’t go down into great details but will just present you the bare minimum you need to know in order to get you going with predictive marketing techniques. Hopefully my post will build a solid foundation for you and you can then dive deeper on your own depending upon your passion level and current situation (marketer, analyst, student etc).

## Linear Algebra

### Slope (Gradient) of a straight line

Slope of a straight line defines how steep the line is.

Slope = change in height/change in horizontal distance

**Note:** Change in height is also called ‘rise’ or ‘fall’. Whereas change in horizontal distance is also called ‘run’.

The slope of this line =3/3 =1

**Note:** Large slope means steeper line. For example:

Here the slope of line = 4/2 = 2

A slope can be positive, negative, zero or undefined.

**Positive Slope** – A slope is positive if both change in horizontal distance and vertical distance is positive. For example:

Here the slope of line = 4/2 = 2

**Negative Slope** – A slope is negative if change in horizontal distance is negative or change in vertical distance is negative. For example:

Here slope of the line = -4/2 = -2

**Zero Slope **– A slope is zero if change in vertical distance is zero. For example:

Here slope of the line = 0/5 =0

**Undefined Slope** – A slope is undefined if change in horizontal distance is zero. For example:

Here slope of the line = 3/0 = undefined

**Y Intercept of a Straight Line**

Y intercept of a straight line is the point where a line crosses the y-axis. For example:

In the diagram the line crosses the y-axis at 1. So here the y intercept is 1.

**Equation of a straight line**

The equation of a straight line is: y=mx+b

Here ‘m’ is the slope of the straight line and ‘b’ is the y intercept. For example let us calculate the equation of the following straight line:

Here slope = 2/1 =2. So m =2

Y intercept = 1. So b= 1

So equation of the straight line is: y=2x + 1

## Linear Relationship

If a linear relationship exist between two variables then increase or decrease in the value of one variable will cause corresponding increase or decrease in the value of other variable. This type of relationship exist between **dependent and independent variables**. For example:

Suppose you want to forecast sales for your company and you have concluded that your sales go up and down depending on changes in average order value (AOV). The sales you are forecasting would be the dependent variable because its value “depends” on the value of AOV, and the AOV would be the independent variable.

Independent variable is generally denoted by the letter ‘x’. Whereas dependant variable is generally denoted by the letter ‘y’. In a cause and effect relationship, the independent variable is the cause and the dependent variable is the effect.

You can express linear relationship between ‘x’ and ‘y’ via equation of the straight line: y=mx+b

Here ‘m’ is the slope of the straight line and ‘b’ is the y-intercept.

You can also express linear relationship in a graphical format via Excel scatter chart like this:

**Note:** In regression analysis we predict the value of dependent variable from one or more independent variables.

**How you can determine whether the relationship between two variables is linear or not?**

There are two methods to determine linear relationship between two variables. One is through Excel Scatter Chart and one is through residual plot.

**Excel Scatter Chart**

**Step-1: **To determine how two variables are related, graph the data points using Excel ‘scatter chart ‘so that independent variable is on the x axis and dependent variable is on the ‘y’ axis. Make sure that the column of data that you want to display on the x-axis is located to the left of the column of data you want to display on the y-axis in your excel table. Something like this:

**Step-2: **Once you have plot the data points. You will see a graph something like Fig.1 above. If you look at the data points in Fig.1, you can see that they resemble a straight line. I have added a trendline to make this look more obvious in Fig.2. **The trendlines are used to graphically display trends in data.** **To add a trendline to your chart, follow the steps below:**

1 Select the chart and then click on the layout tab in excel ribbon.

2 Click on the ‘trendline’ button and then select ‘linear trendline’

**Note:** A trendline is also known as ‘regression line’ or ‘least square line’

### Residual Plot

A residual (denoted by ‘e’) is the difference between the observed value (y) and predicted value (?) of a dependant variable in regression analysis. Each data point has one residual.

Residual = Observed value – Predicted value

e = y – ?

For example:

**Note: **Residuals are also called as ‘errors’ or ‘Regression Residuals’

In the table above column F contains all the observed values of the dependant variable ‘y’. The column G contains all the predicted values of the dependant variable ‘y’. The column H contains the residuals.

**Note:** both the sum and the mean of the residuals are equal to zero.

A residual plot is a graph that shows all the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis then the relationship between two variables is linear otherwise it is non-linear.

For example the following residual plot shows a linear relationship as the data points are randomly dispersed around the horizontal axis:

Now look at the following residual plots:

Here the first residual plot shows linear relationship. The other two plots shows non-linear relationship as the data points are not randomly dispersed around the horizontal axis.

**Note: **The sum of the residuals is always zero, whether the data set is linear or nonlinear.

**Note: ** You need to conduct regression analysis in order to create residual plots. I will teach you later in the post how to create a residual plot.

## Correlation

### What is correlation?

It is a statistical measurement of relationship between two variables. Correlation measures how two variables change in relation to each other.

**What is covariance?**

The formula used to calculate the relationship between two variables is called covariance:

Covariance measures the **strength and direction of the relationship** between two variables. It is a measure of how much two variables change together. Let us suppose ‘x’ and ‘y’ are two variables. If as ‘x’ goes up, ‘y’ goes up then covariance would be positive. However if as ‘x’ goes up, ‘y’ goes down then covariance would be negative.

The problem with calculating covariance is that the number you get from the calculation is hard to interpret. This happens because covariance in not normalized (statistically adjusted).

### What is correlation coefficient?

Correlation coefficient is the normalized version of covariance and is calculated as:

Here we are simply dividing the covariance by the product of the standard deviation of the two variables.

**Note: **Correlation coefficient is not useful if the relationship between variables is non-linear.

### Types of Correlation Coefficient

The most common correlation coefficient is **‘Pearson product-moment correlation coefficient’ **which measures the strength and direction of the **linear relationship** (relationship which resembles a straight line) between variables.

When we speak simply of a correlation coefficient, we are generally referring to the ‘Pearson product-moment correlation’.‘Spearman’s Rank Correlation’ and ‘Kendall Rank Correlation’ are some other types of correlations.

The **correlation coefficient of a sample** (sample is set of observations drawn from a population) is denoted by ‘r’.

The **correlation coefficient of a population** (population is total set of observations) is denoted by ‘R’ or ‘?’

The value of correlation coefficient can range from +1 to -1. The sign (positive or negative) of the correlation coefficient describes the direction of the relationship between two variables. Whereas the absolute value of the correlation coefficient describes the strength of the relationship between two variables.

**Correlation of +1 means **that the variables ‘x’ and ‘y’ have strongest positive linear relationship i.e. when ‘x’ is larger than average, ‘y’ tends to be larger than average. Similarly when ‘x’ is smaller than average, ‘y’ tends to be smaller than average.

**Correlation of -1 means **that the variables ‘x’ and ‘y’ have strongest negative linear relationship i.e. when ‘x’ is larger than average, ‘y’ tends to be smaller than average. Similarly when ‘x’ is smaller than average, ‘y’ tends to be larger than average.

**Correlation of 0 means** that no linear relationship exist between the two variables. Keep in mind that the ‘Pearson product-moment correlation coefficient’ only measures linear relationships.

Therefore, a correlation coefficient of 0 does not mean zero relationship exist between two variables. It simply means no linear relationship exist between the two variables. It is possible for two variables to have curvilinear relationship (it is a non linear relationship which doesn’t resemble a straight line) instead of linear relationship.

**Note: ** A correlation greater than 0.8 is generally considered as strong correlation.Whereas a correlation less than 0.5 is generally considered as weak correlation. However these values can vary based upon the type of data being analyzed.

We can conclude following facts from the above scatter charts:

1. In case of positive correlation the slope of the straight line is positive. Whereas in case of negative correlation the slope of the straight line is negative.

2. In case of **perfectly positive correlation (r=1) or perfectly negative correlation (r=-1) **the data points form an exact straight line.

3. The correlation becomes weaker as the data points become more scattered. So in case of zero correlation the data points are completely scattered (in a random manner).

4. **Outliers effect correlation**. Outliers are extreme values that differ greatly from other values. In regression analysis the data points that diverge greatly from the overall pattern of the data are called outliers.

**Note:** If values of the two variables are unreliable (because of measurement error or other errors) then the correlation coefficient could be lower than expected.

### Correlation coefficient and Predictive Power

A correlation coefficient (R) is a good indicator of predictive relationship between two variables. For example:

**If the value of R is zero**then the dependent variable cannot be predicted from the independent variable as the relationship between two variables is nonlinear.**If the value of R is 1 or -1**then the dependent variable can be predicted without any error from the independent variable.**The value of R between 0 and 1 or 0 or -1**measures the extent to which the dependant variable is predictable from the independent variable. For example an R of 0.50 means that 50% of the variation in the dependant variable (y) is predictable from the independent variable (x). The other 50% variation in ‘y’ can’t be explained. Similarly an R of 0.20 means that 20% of the variation in the dependant variable (y) is predictable from the independent variable (x). The other 80% variation in ‘y’ can’t be explained.

**Therefore it is important that you determine the correlation between two variables before you run regression analysis. **If there is no or weak linear relationship between two variables or in other words the correlation between the two variables is zero or weak then such relationship is not good to predict anything. So there is no point running a regression analysis then.

### Coefficient of determination or R-Squared Value

**1st Definition** – The coefficient of determination (denoted by R2 or R-Squared value) is a number from 0 to 1 that describes how well a regression line (i.e. trendline) fits a data set. A trend line is most reliable/accurate when its R2 value is 1 or near 1.

**2nd Definition** – The coefficient of determination (denoted by R2) is a number from 0 to 1 that measures the percentage of variation in the dependent variable that is predictable from the independent variable(s).

**3rd Definition **– The coefficient of determination (denoted by R2) is the square of the correlation coefficient (R).

**4th Definition – **The coefficient of determination allows to interpret the relationship between two variables in terms of variation.

**5th Definition **– The coefficient of determination is the ratio of ‘explained variation’ to the total variation.

**If the value of R2 is zero**then the dependent variable cannot be predicted from the independent variable.**If the value of R2 is 1**then the dependent variable can be predicted without any error from the independent variable.**The value of R2 between 0 and 1**measures the extent to which the dependant variable is predictable from the independent variable. For example an R2 of 0.50 means that 50% of the variation in the dependant variable (y) is predictable from the independent variable (x). The other 50% variation in ‘y’ can’t be explained. Similarly an R2 of 0.20 means that 20% of the variation in the dependant variable (y) is predictable from the independent variable (x). The other 80% variation in ‘y’ can’t be explained.

For example in the chart below, 62% of the variation in sales can be explained by the variation in AOV. The remaining 38% (100-62) variation in sales can be explained by other factors:

The other factors that can explain the variation in sales can be determined by running multiple regression analysis.

**Note:** Coefficient of determination is a key output of regression analysis.

### How to display R-Squared Value on your Chart?

To display this value on your chart, follow the steps below:

**Step-1: **Select your chart and then click on the Layout tab > Trendline Button

**Step-2: **In the format trendline dialog box, select the checkbox ‘Display R-Squared Value on chart’ and then click on the ‘close’ button

You can now see the R-Squared value in your chart like this:

**Note:** When you add a trendline to a chart, the excel automatically calculates the R-squared value of the trendline.

### Ways to find correlation between two sets of data

There are 3 ways of finding correlation between two sets of data in Excel:

1. Through Analysis Tool Pack

2. By using CORREL function

3. By using the R2 Value

**Analysis Tool Pack**

The ‘Analysis tool pack’ is an excel add-in through which you can do complex statistical analysis (like multiple regressions) in Excel. It is available in Excel 2007 and Excel 2010. You can learn to install the analysis tool pack for excel 2010 from here.

Once you have installed the ‘analysis tool pack’, you will see the ‘data analysis’ button under the data tab:

Now click on the ‘Data Analysis’ button, select ‘correlation’ from the list and then click on the ‘ok’ button

Select the ‘input range’ (the two columns in this case) and output range (where you want to display the results) and then click on the ok button:

You can now see the correlation between two sets of data:

Here the correlation between AOV and Sales is 0.792282

**CORREL Function**

Through this excel function you can calculate the correlation between two data sets without using the ‘Analysis Tool Pack’.

Syntax: CORREL (array1, array2)

For example:

Here the correlation between AOV and Sales is 0.792282. This is the same value we got when we used the ‘Analysis Tool Pack’ to calculate the correlation between two data sets.

**By using the R2 Value**

The correlation between two data sets can also be defined as the square root of the R2 value:

You can calculate the R2 value by using the RSQ excel function.

Syntax: RSQ (known_y’s, known_x’s)

Here known_y’s are the values of dependent variable (in our case ‘Sales’) and known_x’s are the values of independent variable (in our case ‘AOV’). For example:

Once you have determined the R2 value then take its square root via the excel function SQRT to determine the correlation between two data sets. For example:

Note the correlation determined through r2 value matches the correlation determined through CORREL function and the correlation determined through ‘Analysis tool pack’.

## Trend Analysis Basics

### What is trend?

It is a movement in a particular direction.

### Why analyze data trends?

We do trend analysis to:

1 Measure performance of a marketing campaign/business over a period of time. Through trend analysis you can identify areas where your business/campaign is performing well (so that you replicate success). You can identify areas where your business/campaign is underperforming.

2 You can get answers to questions like:

- What are my top selling products?
- What are my top converting keywords?
- Which keywords I should bid on?
- Where I should invest my money and resources to get highest possible ROI?
- Which are the most effective marketing channels in terms of conversions and revenue?

3 Determine where current business operations and marketing practices will take you. Negative trends act as a warning of potential problems with current practices.

4 Refine business decisions and strategies. We develop strategies which respond to the identified trends. Moving with positive trend and not against them can lead to more profit.

5 Identify risks and opportunities.

6 Predict future outcomes and events like what the sales will likely be for the second half of the year.

### Importance of accurately analyzing and interpreting data trends

How we analyze and interpret the ‘data trends’ plays a very important role in optimizing our marketing campaigns and making predictions about future outcomes. One wrong interpretation and we can end up losing hundred thousand pounds (depending upon the size of your business).

### Important points to keep in mind while analyzing, interpreting and reporting data trends

1: Always question how the data is collected

2: Understand that historical data is in fact “dated”

3: Select the right time period to analyze your data trends

4: Add comparison to your data trends

5: Never report standalone metric in your data trends

6: Segment your data before you analyze/report data trends

7: Report something business bottomline Impacting

8: Make the insight obvious

9: Use Sparklines and conditional formating.

10: Don’t jump to conclusions

11. Analyzing single trend is rarely useful.

**Related Post**: How to analyze, interpret and report data trends in Google Analytics

### Types of Trend

1 **Uptrend** – moving higher

2 **Downtrend** – moving lower

3 **Sideways/horizontal trend** – moving sideways. Sideways trend is technically not a trend. It is a lack of well defined trend in either direction.

### Trend Length

A trend can be short (or seasonal), intermediate or long term.

**Note: **Longer the trend more significant it is. For example a 3 months trend is not as significant as 3 years trend.

## Regression Analysis

Regression analysis is a statistical technique which is used to estimate relationship between a dependant variable and one or more independent variables. In regression analysis we try to predict the value of a dependant variable (y) from one or more independent variables (x1,x2,x3,…)

When we use one independent variable to predict the value of a dependent variable then such type of regression is known as **linear regression**. The analysis which is done in linear regression is known as the **linear regression analysis**.

When we use two or more independent variables to predict the value of a dependent variable then such type of regression is known as **multiple regression**. The analysis which is done in multiple regression is known as the **multiple regression analysis**.

Equation used to calculate linear regression

Y = a + bX

Here ‘y’ is the dependant variable (the variable whose value we are trying to predict)

‘x’ is the independent variable (the variable whose value we are using to predict the value of ‘y’)

‘a’ is the ‘y-intercept’

‘b’ is the slope of the regression line

Equation used to calculate multiple regression

Y = a + b1X1+ b2X2 + b3X3 +…..bnXn

Here ‘y’ is the dependant variable (the variable whose value we are trying to predict)

‘X1’, ‘X2’, ‘X3’….’Xn’ are the independent variable (the variables whose values we are using to predict the value of ‘y’)

‘a’ is the ‘y-intercept’

‘b1’,’b2’,’b3’…..’bn’ are the slope of the regression lines

**Note:** You can use upto 15 independent variables in excel to do multiple regression analysis.

By using regression analysis, you can extend a trendline in a chart beyond the actual data to predict future values of the dependant variable. For example, the following chart uses a trendline that is forecasting 4 months ahead and is clearly showing a trend toward increasing sales.

However for regression analysis to work you need to select the right trendline for your data in the first place.

### Choosing the best trendline for your data

Choosing the right trendline for your data is very important. If you applied the wrong trendline then your predictions/forecast won’t be accurate. The R2 value can help you in choosing the best trendline for your data.

There are 6 types of trendlines (or regression lines) available in Excel:

1. Linear Trendline

2. Logarithmic Trendline

3. Polynomial Trendline

4. Moving Average Trendline

5. Power Trendline

6. Exponential Trendline

Add each of the trendline in turns and note down its R2 value. The trendline whose R2 value is highest is the best trendline to use.

**Linear Trendline **– Use this trendline if your data set is linear (resembles a straight line) and the data values are increasing or decreasing at a steady rate. For example the following chart contains a linear trendline as visits are increasing at a steady rate:

### To add a linear trendline to your chart follow the steps below:

**Step-1:** Right click on the data points in your chart and then select the ‘Add Trendline’ option.

**Step-2: **From the ‘Format Trend’ dialog box select ‘linear’ and click on the two check boxes: ‘Display equation on chart’ and ‘Display R-Squared Value on chart’:

**Step-3: **Click on the ‘close’ button. You will now see the chart with linear trendline, linear trendline equation and R2 value:

**Note: **Use the same procedure to add other type of trendlines to your chart.

The equation used to calculate the linear trendline is: y=mx+b

Here ‘y’ is the dependant variable, ‘m’ is the slope of the line and ‘b’ is the ‘y’ intercept. Excel automatically calculates the linear trendline for you once you applied it to your chart. In the chart above the equation used to display the linear trendline is:

y = 2819.7x + 38691

Here 2189.7 is the slope of the line and 38691 is the y-intercept. Also note the R2 value of 0.9826 which is close to 1. This means this trendline is very reliable.

**Logarithmic Trendline – **Use this trendline if your data values increases or decreases quickly and then levels out. For example the following chart contains a logarithmic trendline as visits increases quickly and then levels out:

The equation used to calculate the logarithmic trendline is: y= c ln x + b

Here ‘c’ and ‘b’ are constants and ‘ln’ is the natural logarithm function.

Excel automatically calculates the logarithmic trendline for you once you applied it to your chart. In the chart above the equation used to display the logarithmic trendline is: y = 22724ln(x) + 39839

Also note the R2 value of 0.9296 which is close to 1. This means this trendline is very reliable.

**Polynomial Trendline – **Use this trendline is there is fluctuation in your data. The order of the polynomial is the number of fluctuations in the data or how many bands (peak and valleys) appear in the curve. For example use an order 2 polynomial trendline if there is only one hill or valley. Similarly use order 3 polynomial trendline if there is one or two hills or valleys.

The following chart contains a polynomial trendline

The equation used to calculate the polynomial trendline is:

Here b and c1, c2,,,c6 are constants

Excel automatically calculates the polynomial trendline for you once you applied it to your chart. In the chart above the equation used to display the polynomial trendline is: y = 3750×2 – 18351x + 61130

Also note the R2 value of 0.8394 which is close to 1. This means this trendline is quite reliable.

**Moving Average Trendline** – Use this trendline if there is lot of fluctuation in your data. You select the number of data points you want the trendline to average (by using the period option in excel) and use as a point in the line. For example, If you set the period to 2, then the average of the first two data points is used as the first point in the moving average trendline. The average of the 2nd and 3rd data points is used as the second point in the trendline and so on.

Moving average trendlines are generally used in stock analysis. Generally low period values are used to show more immediate trends and high period values are used to show long term trends. The following chart contains a moving average trendline:

**Note:** The R2 value is not available with moving average trendline.

**Power Trendline** – Use this trendline if the data values increases at a specific rate. The following chart contains a power trendline:

The equation used to calculate the power trendline is:

Here ‘c’ and ‘b’ are constants. Excel automatically calculates the power trendline for you once you applied it to your chart. In the chart above the equation used to display the power trendline is: y = 334.09×0.2455

Also note the R2 value of 0.9319 which is close to 1. This means this trendline is very reliable.

**Note:** you can not create a power trendline if your data contains zero or negative values.

**Exponential Trendline **– Use this trendline if data values increases or decreases at increasingly higher rates. The following chart contains an exponential trendline:

The equation used to calculate the exponential trendline is:

Here ‘c’ and ‘b’ are constants. ‘e’ is the base of the natural logarithm. Excel automatically calculates the exponential trendline for you once you applied it to your chart. In the chart above the equation used to display the exponential trendline is: y = 41290e0.0461x

Also note the R2 value of 0.9979 which is very close to 1. This means this trendline is very reliable.

**Note:** you can not create an exponential trendline if your data contains zero or negative values.

### Least Square Line

The trendline is also known as the least square line (or best fitting line) because Excel uses the method of least square to choose a trendline. According to this method Excel chooses that line as a trendline which minimizes the sum of squared vertical distances from each point to the line. **The vertical distance from each point to the line is called residual (or error)**.As discussed earlier, each data point has one residual.

A positive error means a data point is above the least square line and a negative error means a data point is below the least square line.

**Another definition of residual:**A residual (denoted by ‘e’) is the difference between the observed value (y) and predicted value (?) of a dependant variable in regression analysis.

Residual = Observed value – Predicted value

e = y – ?

### Forecasting with Trendline

Forecasting with trendline is one of the easiest way of running regression analysis in excel. All you have to do is to select a best trendline for your data and enter the number of periods for which you want to forecast the data in the ‘Forecast Forward’ text box of the ‘Format Trendline’ dialog box:

By using regression analysis, we have extended the trendline in the chart beyond the actual data to predict future values of the dependant variable (sales). The chart above uses a trendline that is forecasting 4 months ahead and is clearly showing a trend toward increasing sales.

**Note:** You can also do **backward forecasting** by entering the number of periods for which you want to forecast the data in the ‘Forecast Backward’ text box of the ‘Format Trendline’ dialog box:

### Running Linear Regression

In case of linear regression we use only one independent variable (x) to predict the value of a dependent variable (y). Let us suppose we want to predict sales from AOV (average order value). So sales becomes dependent variable and AOV becomes independent variable. Now follow the steps below:

**Step-1:** Collect the data for both Sales and AOV over a period of 3 or more months and then put the data in a table in excel:

**Step-2: **Determine the nature of relationship (linear or nonlinear) between AOV and Sales by graphing the data points using Excel ‘scatter chart ‘so that independent variable (AOV) is on the x axis and dependent variable (Sales) is on the ‘y’ axis.

**Step-3: **Determine the correlation between AOV and Sales. If there is no or weak linear relationship between two variables or in other words the correlation between the two variables is zero or weak then such relationship is not good to predict anything. So then there is no point running a regression analysis.

The correlation coefficient between AOV and Sales according to the CORREL function is 0.79 which is a strong positive correlation. Hence we are in a position to predict the value of Sales from AOV. In other words we can run regression analysis.

**Step-4: **Add the best trendline to your chart. From the chart above we can see that the relationship between AOV and Sales forms a straight line. In other words the relationship is linear. So the trendline that will best fit the data is linear trendline.

The formula used to calculate the linear trendline is: y = 59.364x + 40397

Here ‘y’ is the dependant variable (i.e. Sales) and ‘x’ is the independent variable (i.e. AOV)

Here 59.36 is the slope of the trendline which indicates that as AOV increases by $1 the sales increases by $59.36.

The number ‘40397’ is the y-intercept which can be interpreted as monthly fixed sales. So even if AOV during a month is zero, this graph estimates that the business will still generate a sales of $40397 a month.

The R2 value is 0.63 which indicates that 63% of the variation in monthly sales can be predicted from the variation in AOV. This implies that the remaining 37% (100-63) variation in monthly sales can be predicted by other factors. We need to run multiple regression to determine other factors which influence the sale.

We can now calculate the ‘predicted sales’ for each AOV by using the formula: y = 59.364x + 40397 in our spreadsheet like this:

**Step-5: **Calculate residuals for each data point. As stated earlier, a residual (denoted by ‘e’) is the difference between the observed value (y) and predicted value (?) of a dependant variable in regression analysis. Each data point has one residual:

**Step-6:** **Calculate the standard error of regression (also known as SER)**. It is a measure of the accuracy of your predictions. It measures the spread of the data points around the least square line. The function that is used to calculate the standard error of regression is STEYX:

**STEYX (yrange, xrange)**

here *yrange *contains the values of the dependent variable, and *xrange *contains the values of the independent variable.

According to descriptive statistics rule of thumb: you would expect about 68% of your predictions to be accurate within one standard error of regression (SER) and about 95% of your predictions to be accurate within two standard errors. Any prediction/forecast that differs from the actual value by more than two standard errors is considered as an outlier (.i.e extreme value)

In our case, one standard error of regression is 14467.26. Therefore the two standard error of regression will be 14467.26 * 2 = 28934.52. So according to descriptive statistics rule of thumb, 68% of the residuals (or errors) should be 14467.26 or smaller and 95% of the residuals (or errors) should be 28934.52 or smaller.

So if you look at the residuals in column ‘E’, you can see that 12 residuals out of 15 or 80% of the residuals are equal or smaller than 14467.26 (.i.e. within one standard error of regression). Similarly 15 residuals out of 15 or 100% of the residuals are equal or smaller than 28934.52 (.i.e. within two standard error of regression). Which means we could expect 80% of our predictions to be accurate within one standard error of regression (SER) and about 100% of our predictions to be accurate within two standard errors.

If we look at the first observation, the residual of $7922.36 indicates that the predicted sales of $115195.64 is too low and is low by $7922.36.

### Running Multiple Regression

As mentioned earlier when we use two or more independent variables to predict the value of a dependent variable then such type of regression is known as multiple regression. Let us suppose we want to predict sales from AOV (average order value) and number of e-commerce transactions. So sales becomes dependent variable and AOV and number of e-commerce transactions become independent variables. Now follow the steps below:

**Step-1:** Collect the data for both Sales and AOV over a period of 3 or more months and then put the data in a table in excel like this:

**Step-2: **Now click on ‘Data’ tab > Data Analysis to open the ‘Data Analysis tool’. This will open up the ‘Data Analysis’ dialog box. Then select ‘Regression’ from the list and click on ‘ok’:

**Note: **You need to install the ‘Analysis tool pack’ (an excel add-in) in order to run multiple regression analysis in Excel. This tool pack is available in Excel 2007 and Excel 2010. You can learn to install the analysis tool pack for excel 2010 from here.

**Step-3:** Enter the cell range: $D$3:$D$18 as ‘input Y range’ and the cell range $B$3:$C$18 as input X range. The input Y range contains all the values of a dependant variable (here Sales). The input X range contains all the values of a independent variables (here AOV and Transactions).

Click on the checkbox ‘labels’ as we have also included labels in the input Y range and input X ranges. Select the output range i.e. the cells where you want to display the results of multiple regression analysis. Click on the ‘Residuals’ checkbox to include residuals in your multiple regression analysis report. Click on the ‘Residual Plots’ checkbox to include residual plots in your multiple regression analysis report. Then click on the ‘ok’ button.

You will now see the output of your multiple regression analysis something like this:

**One of the most important output of a multiple regression analysis is the R2 value.** If you look at the R2 value in Fig 3. it is 0.95 which is pretty close to 1. **The Adjusted R2 value** is 0.94 which is a more reliable statistics as it takes into account your sample size (i.e. number of observations).

The adjusted R2 value of 0.94 indicates that 94% of the variation in monthly sales can be predicted from the variation in AOV and number of transactions. This implies that the remaining 6% (100-94) variation in monthly sales can be predicted by other factors. If you remember when we ran linear regression our R2 value was 0.63. Which means that we were able to explain only 63% of the variation in monthly sales. With the R2 value of 0.94 we now can explain up to 94% variation in monthly sales. **This is the advantage of running multiple regression**.

In real world a single independent variable is rarely responsible for causing all of the variation in a dependant variable. So most of the time you will find yourself running multiple regression.

The formula used by excel to calculate the multiple regression is:

Y = a + b1X1+ b2X2

Here ‘y’ is the dependant variable (in our case Sales)

‘X1’ and ‘X2’ are the values of the independent variable (in our case AOV and number of transactions)

‘a’ is the ‘y-intercept’

‘b1’ is the slope of AOV and ‘b2’ is the slope of Transactions. You can see the values of ‘a’, ‘b1’ and ‘b2’ in Fig-3. under the ‘coefficients’ column or in the screenshot below:

So here the y-intercept is -54398.47, ‘b1’ is 96.94 and ‘b2’ is 552.41. So the equation of multiple regression (or the **best prediction equation**) is : y= -54398.47 + 96.94X1 + 552.41X2

By using this formula Excel calculated the predicted sales for each observation as shown in Fig-4 under the ‘Predicted Sales’ column. You can also see the residuals for each observation/data point in Fig.4

The standard error of regression (also known as SER) which measures the accuracy of our predictions is 5370.16. (see Fig-3). Now according to descriptive statistics rule of thumb: you would expect about 68% of your predictions to be accurate within one standard error of regression (SER) and about 95% of your predictions to be accurate within two standard errors. Any prediction/forecast that differs from the actual value by more than two standard errors is considered as an outlier (.i.e extreme value)

In our case, one standard error of regression is 5370.16. Therefore the two standard error of regression will be 5370.16 * 2 = 10740.32. So according to descriptive statistics rule of thumb, 68% of the residuals (or errors) should be 5370.16 or smaller and 95% of the residuals (or errors) should be 10740.32 or smaller.

So if you look at the residuals column in Fig-4, you can see that 14 residuals out of 15 or 93% of the residuals are equal or smaller than 5370.16 (.i.e. within one standard error of regression). Similarly 15 residuals out of 15 or 100% of the residuals are equal or smaller than 10740.32 (.i.e. within two standard error of regression). Which means we could expect 93% of our predictions to be accurate within one standard error of regression (SER) and about 100% of our predictions to be accurate within two standard errors.

Fig.5 and Fig.6 are residual plots which i have explained earlier in the post.

### The P-Value

When you run regression analysis, each independent variable has a p-value between 0 and 1. The p-value is the probability that the results occurred randomly. If the p-value of an independent variable is less than or equal to 0.05 then it is considered to be useful for predicting the dependent variable.Thus, the smaller the p-value, the higher the predictive power of the independent variable.

In Fig.7 you can see the p-values of the independent variables ‘AOV’ and ‘Transactions’, which are very small: 2.742E-09 and 1.0E-06. Since these p-values are less than 0.05 they are considered to be very useful for predicting the value of dependant variable in our case ‘Sales’.

The other outputs of multiple regression analysis {like ANOVA(Analysis of variance), degree of freedom (df), sum of squares (SS), Mean of Squares (MS) etc} are not so relevant and there explanation is beyond the scope of this post. However if you want to dive deeper here is a good video on interpreting regression output:

**Note:** you can also use the Excel LINEST function to run multiple regression in Excel.

## Predictive Marketing Process

This was lot of statistics. Isn’t it? Finally we are in a position where we can carry out predictive marketing techniques. I use the following process to do predictive marketing:

**Step-1: Collect data and extract information from it – **All modern day web analytics tools are able to collect data (like visits, pageviews etc) and extract considerable amount of information from it (like goal coversion rate, visitor behaviour etc). Collect at least 3 months of performance data before you start trend analysis. **Always strive to collect accurate data**. If your data is not accurate than predictions won’t be accurate either.

**Step-2:** **Identify trends to analyze** – Review your KPIs esp. the ones which drive revenue and cost to determine the performance data you want to analyze and compare. For example when reviewing the sales performance of your company, analyze KPIs like conversions volume, Conversion Rate and average order value.

**Step-3: Choose threshold **– Define the level at which a variation is worth investigating. For example a 30% increase or decrease in sales over a period of time should be worth investigating. Use historical data to choose threshold.

**Step-5:** **Chart the data and add the best trendline to the chart**

**Step-6:** **Analyze and interpret the data and spot patterns**

**Step-7:** **Conduct cause analysis** – Investigate why the variation occurred so that you can use this information to discount certain peaks or valleys in your data trends.

**Step-8:** **Conduct regression analysis**.

Now it is your turn. How do you do predictive marketing for your client/business? Is there anything with which you don’t agree or would like to rectify. Please share your thoughts and insight.

**Other Posts you may find useful**: Beginners Guide to Maths and Stats behind Web Analytics

**Bonus Post** => Analytics Career Advice from Top Industry Experts

Follow @analyticsnerd Tweet