Understanding Data Sampling in Google Analytics

Sampled Data

In the context of Google Analytics, a sample (or sampled data) is a subset of traffic data. It is not the complete traffic data set.

Data Sampling

Data sampling is the process of selecting a subset of traffic data for analysis and then reporting on the trends detected in that subset. Data sampling is widely used in statistical analysis to analyse large data sets cost-efficiently and in a reasonable amount of time. As long as the sample is a good representative of all of the data, analysing the sample gives similar results to analysing all of the data. But if the selected sample is not a good representative of all of the data, or if it is too small, then analysing the sample does not give similar results to analysing all of the data.
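To make this concrete, here is a minimal sketch in JavaScript that draws random samples of different sizes from a made-up set of sessions and compares the conversion rate of each sample with that of the full data set. The session data, the 2% conversion rate and all function names are assumptions for illustration only; this is not how Google Analytics selects its samples.

```javascript
// Small seeded pseudo-random generator (mulberry32) so every run is repeatable.
function makeRandom(seed) {
  let state = seed >>> 0;
  return function () {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical traffic: each session either converted (1) or did not (0).
function makeSessions(count, trueRate, random) {
  const sessions = [];
  for (let i = 0; i < count; i++) {
    sessions.push(random() < trueRate ? 1 : 0);
  }
  return sessions;
}

function conversionRate(sessions) {
  const conversions = sessions.reduce((sum, s) => sum + s, 0);
  return conversions / sessions.length;
}

// Simple random sampling: keep each session with probability `fraction`.
function sample(sessions, fraction, random) {
  return sessions.filter(() => random() < fraction);
}

const random = makeRandom(42);
const allSessions = makeSessions(100000, 0.02, random);
console.log("all data:", conversionRate(allSessions));

// Compare a large, a medium and a very small sample against the full data.
for (const fraction of [0.5, 0.05, 0.005]) {
  const subset = sample(allSessions, fraction, random);
  console.log(fraction * 100 + "% sample:", subset.length, "sessions,",
    "conversion rate", conversionRate(subset));
}
```

If you run this a few times with different seeds, the estimates from the smallest sample tend to wander much further from the full-data conversion rate than the estimates from the 50% sample.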

Google Analytics data sampling

Google Analytics has an upper limit on the amount of traffic data it will process without sampling to produce reports. This limit has been set to save resources (computation power and cost). Google Analytics may analyse the complete traffic data set or only a subset of it, depending upon the nature of a user’s query. When GA starts analysing only a small subset of traffic data, you can’t rely on the metrics it reports, as the selected sample may not be a good representative of all of the traffic data.

The smaller the sample size, the more inaccurate the reported traffic data becomes. Ideally you would want GA to analyse the complete traffic data set every time you query the traffic data. If that is not possible, then let it analyse the biggest possible sample to minimize data sampling issues. Your Google Analytics data is only as accurate as the share of traffic data that is actually analysed. This means that if you change the size of the data sample being analysed, your e-commerce conversion rate could change and the revenue reported by Google Analytics could change.

Data sampling can be very damaging for your analysis

If you have got data sampling issues, any or all of your reported metrics, from ‘sessions’, ‘users’, ‘pageviews’, ‘bounce rate’ and ‘conversion rate’ to ‘revenue’, could be anywhere from 10% to 80% off the mark. For example, Google Analytics may report your last month’s revenue as $1.2 million when in fact it is only $950k. You can detect such data discrepancies by comparing a sampled report with its unsampled version and then calculating the percentage difference between various metrics. Make sure that the difference is statistically significant before you draw any conclusion. You need high accuracy in traffic data. Any marketing decision based on inaccurate data would not produce optimum results and may also result in monetary loss.
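The percentage difference mentioned above is simple arithmetic. Here is a small sketch using the revenue figures from the example; the function name is my own and the unsampled figure is treated as the true value.

```javascript
// How far off is the sampled value, as a percentage of the unsampled value?
function percentDifference(sampledValue, unsampledValue) {
  return ((sampledValue - unsampledValue) / unsampledValue) * 100;
}

const revenueGap = percentDifference(1200000, 950000);
console.log(revenueGap.toFixed(1) + "% off"); // prints "26.3% off"
```

You can run the same calculation for sessions, conversion rate or any other metric you have in both the sampled and the unsampled report.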

How can you determine whether you are viewing a sampled GA report?

Whenever you see notifications like the following in your Google Analytics view, you must assume that the GA data you are looking at is being sampled and/or badly sampled:

data sampling warning

data sampling warning3

If you are viewing an unsampled report then you will see the following message at the top of the report: “This report is based on ….. (100% of sessions)”:

unsampled

As long as your report is based on 100% of the sessions, the reported data is unsampled. However, if you are viewing a sampled report then you will see the following message at the top of your report: “This report is based on ….. (less than 100% of sessions)”:

sampled

If your report is based on less than 100% of sessions then you have got data sampling issues. The smaller the sample, the greater the data sampling issue. For example, a report which is based on 45% of sessions has fewer sampling issues than a report which is based on just 4.58% of sessions. When you view a sampled report, you get the option to adjust the data sampling rate:

slower response greater precision

Select ‘Faster response, less precision’ if you don’t mind sacrificing data accuracy for speed, i.e. you want GA to load the report quickly even if that means a smaller data sample is used for analysis.

Select ‘Slower response, greater precision’ if a large sample size is more important to you than speed, i.e. you want GA to use a large data sample even if that means the report will load slowly.

I suggest always using the ‘Slower response, greater precision’ setting. In some GA accounts there is a different option for adjusting the data sampling rate:

faster processing higher precision

Here you drag the button to the left for faster processing and loading of a GA report, or to the right for higher precision but slower loading.

Note: You can receive sampled data even when you are using Google Analytics API.
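If you pull data through the API, you can check each response for sampling yourself. The sketch below assumes a response shaped like the Core Reporting API v3 response, which to my knowledge carries the fields containsSampledData, sampleSize and sampleSpace (the latter two as strings). Other API versions name these fields differently, so treat the field names as assumptions and check the reference for the version you use. The function name and the example response are made up.

```javascript
// Summarise how much of the traffic a response was actually based on.
function describeSampling(response) {
  if (!response.containsSampledData) {
    return { sampled: false, percentOfSessions: 100 };
  }
  const sampleSize = Number(response.sampleSize);   // sessions that were analysed
  const sampleSpace = Number(response.sampleSpace); // sessions the sample was drawn from
  return { sampled: true, percentOfSessions: (sampleSize / sampleSpace) * 100 };
}

// Hypothetical response fragment for a heavily sampled query.
const exampleResponse = {
  containsSampledData: true,
  sampleSize: "45800",
  sampleSpace: "1000000",
};

const summary = describeSampling(exampleResponse);
console.log(summary.sampled, summary.percentOfSessions.toFixed(2) + "% of sessions");
```

A check like this lets you log or reject sampled responses before the numbers reach a dashboard or a spreadsheet.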

User’s Query and data sampling

Data sampling depends upon the type of user’s query, and the sampling rate can vary from query to query. Each GA view has a set of unsampled, pre-aggregated data which is used to quickly display unsampled reports. A user can query the GA data via the reporting interface or via the API. If the query can be wholly satisfied by the existing unsampled and pre-aggregated data then GA does not sample the data; otherwise it may. A user query can be standard or ad-hoc. A standard query can be something like requesting a report for a particular time period or running a report for a particular dimension.

  1. Any user’s query which can be wholly satisfied by the existing unsampled and pre-aggregated data is a standard query.
  2. Any ad-hoc query can not be wholly satisfied by the existing unsampled and pre-aggregated data.

An ad-hoc query can be something like:

  1. Applying advanced segments to a standard report
  2. Applying a secondary dimension to a standard report
  3. Running a custom report.
  4. Applying advanced segment and/or secondary dimension to a custom report.

If the resulting report from an ad-hoc query is sampled, then you will see the following message at the top of your report: “This report is based on ….. (less than 100% of sessions)”:

sampled

If the ad-hoc query can be wholly satisfied by the existing unsampled and pre-aggregated data then GA won’t sample the data. In other words, GA won’t always sample the data just because you have applied an advanced segment or a secondary dimension to a standard report, or just because you are running a custom report. The probability that GA samples the data increases when a user query is based on more than 500,000 sessions (in case of GA standard) or more than 25 million sessions (in case of GA premium).

But if the user query can be wholly satisfied by the existing unsampled and pre-aggregated data then GA won’t sample the data even if the query is based on more than 500,000 or 25 million sessions. In other words, GA won’t always sample the data just because the query is based on more than 500,000 sessions (in case of GA standard) or 25 million sessions (in case of GA premium). Thus data sampling depends upon the type of user’s query, and the sampling rate can vary from query to query.
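The rules of thumb above can be written down as a small decision function. This is only my own sketch of the reasoning in this section, not Google’s actual logic: the function, its field names and the three labels are hypothetical, and the thresholds are the session counts quoted above.

```javascript
// Session counts above which sampling becomes more likely, per GA edition.
const SESSION_THRESHOLDS = { standard: 500000, premium: 25000000 };

// query.satisfiedByPreAggregatedData: can existing unsampled tables answer it?
// query.sessions: number of sessions in the selected date range.
// query.edition: "standard" or "premium".
function samplingRisk(query) {
  if (query.satisfiedByPreAggregatedData) {
    return "unsampled"; // served from pre-aggregated data, whatever the volume
  }
  const threshold = SESSION_THRESHOLDS[query.edition];
  return query.sessions > threshold ? "sampling likely" : "sampling less likely";
}

console.log(samplingRisk({ satisfiedByPreAggregatedData: true, sessions: 3000000, edition: "standard" }));
console.log(samplingRisk({ satisfiedByPreAggregatedData: false, sessions: 3000000, edition: "standard" }));
console.log(samplingRisk({ satisfiedByPreAggregatedData: false, sessions: 3000000, edition: "premium" }));
```

The first call returns "unsampled" even though the query covers 3 million sessions, which is the point of this section: the type of query matters more than the raw volume.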

Google Analytics data tables

GA reports data in the form of tables. The data in these tables are either pre-aggregated or aggregated on the fly depending upon the user’s query. Each data table is made up of rows and columns. Each row represents a dimension and each column represents a metric:

dimensions metrics intro

Each dimension can have a number of values assigned to it. Cardinality is the total number of unique values available for a dimension. For example, the dimension ‘device category’ has 3 values: desktop, tablet and mobile. So the cardinality of this dimension is 3:

device category cardinality
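If you export your data, you can count the cardinality of any dimension yourself. Here is a minimal sketch; the rows and the property names are made up for illustration.

```javascript
// Cardinality = number of unique values a dimension takes across all rows.
function cardinality(rows, dimension) {
  return new Set(rows.map((row) => row[dimension])).size;
}

// Hypothetical exported rows.
const exportedRows = [
  { deviceCategory: "desktop", page: "/" },
  { deviceCategory: "mobile", page: "/pricing" },
  { deviceCategory: "desktop", page: "/blog/post-1" },
  { deviceCategory: "tablet", page: "/blog/post-2" },
  { deviceCategory: "mobile", page: "/" },
];

console.log(cardinality(exportedRows, "deviceCategory")); // prints 3
console.log(cardinality(exportedRows, "page"));           // prints 4
```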

Some dimensions like ‘keyword’ or ‘page’ can have hundreds or tens of thousands of unique values. Such dimensions are known as high-cardinality dimensions. If you see an ‘(other)’ row in a Google Analytics report which contains high-cardinality dimensions, the report is sampled. If a report includes a high-cardinality dimension, Google Analytics will notify you with the following message:

high cardinality dimension

In the context of data sampling, there are two types of Google Analytics data tables:

  1. Visit Tables – Visit tables store raw data about each session. Unsampled reports are generated from the visit tables.
  2. Processed Tables (also known as aggregate tables) – Processed tables store pre-aggregated data for commonly requested reports. They allow commonly requested reports to be loaded quickly and without sampling.

Note: When a user’s query can’t be satisfied with existing processed tables, GA uses the visit tables to report on the requested information.

Single day processed table

A single day processed table contains one day’s worth of data. These tables are processed daily and are also known as daily processed tables. A single day processed table can store up to 50,000 rows of unique data (dimension value combinations) in case of GA standard and up to 75,000 rows of unique data in case of GA premium. If a GA premium user is using custom tables then a single day processed table can store up to 200,000 rows of unique data.

These are the data limits for daily processed tables. When a user’s query breaks these limits, the reported data is sampled and GA rolls up the lower volume dimension value combinations into the (other) row. However, bear in mind that these are reporting limits for single day processed tables, not processing limits.

Google Analytics is still tracking all those lower volume dimension value combinations which are rolled up into the (other) row and are thus not displayed in reports. Since single day processed tables are processed daily, a page/keyword that was rolled up into the (other) row one day, may not necessarily be rolled up into the (other) row the next day.
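The roll-up can be pictured with a short sketch. It assumes that the highest-volume rows are kept and everything else is summed into one (other) row, and that the (other) row is added on top of the limit; I don’t know the exact rules GA applies, so treat both as assumptions. The function name, the row shape and the tiny limit of 2 are for illustration only.

```javascript
// Keep the `maxRows` highest-volume rows and fold the rest into "(other)".
function rollUpToLimit(rows, maxRows) {
  if (rows.length <= maxRows) {
    return rows.slice(); // within the limit, nothing is rolled up
  }
  const sorted = rows.slice().sort((a, b) => b.sessions - a.sessions);
  const kept = sorted.slice(0, maxRows);
  const otherSessions = sorted
    .slice(maxRows)
    .reduce((sum, row) => sum + row.sessions, 0);
  return kept.concat([{ page: "(other)", sessions: otherSessions }]);
}

const pageRows = [
  { page: "/c", sessions: 15 },
  { page: "/a", sessions: 50 },
  { page: "/d", sessions: 5 },
  { page: "/b", sessions: 30 },
];

console.log(rollUpToLimit(pageRows, 2));
// /a (50) and /b (30) are kept; /c and /d become "(other)" with 20 sessions
```

Note that the total number of sessions is unchanged by the roll-up. What you lose is the detail of the individual low-volume rows.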

Multi-day processed table

A multi-day processed table contains multiple days’ worth of data. These tables are processed for multiple days and are made from multiple single day processed tables. A multi-day processed table can store up to 100,000 rows of unique data (dimension value combinations) in case of GA standard and up to 150,000 rows of data in case of GA premium.

These are the data limits for multi-day processed tables. When a user’s query breaks these limits, the reported data is sampled and GA rolls up the lower volume dimension values into the (other) row to stay within the data limits. However, bear in mind that these are reporting limits for multi-day processed tables, not processing limits. Google Analytics is still tracking all those lower volume dimension value combinations which are rolled up into the (other) row and are thus not displayed in reports.

Report Query Limit

In addition to the data limits for single day and multi-day processed tables, there is also a report query limit: for any date range, GA returns a maximum of 1 million rows of data for a report. When a user’s query breaks this limit, the reported data is sampled and GA rolls up the lower volume dimension values into the (other) row to stay within the limit.

Conversion Paths Limit

In addition to the data limits for single day and multi-day processed tables and the report query limit, there is also a conversion paths limit: on any given day, GA returns a maximum of 200,000 unique conversion paths in a report. When a user’s query breaks this limit, the reported data is sampled and GA rolls up the lower volume dimension values into the (other) row to stay within the limit.

When does Google Analytics start sampling the data?

In the following cases, Google Analytics can start sampling the data when calculating the result for your report:

#1 When a user’s query is ad-hoc i.e. it can not be wholly satisfied by the existing unsampled and pre-aggregated data.

#2 For GA Standard, data sampling of non-multi channel funnel reports occurs at the property level. So view filters can impact the sample size.

#3 For GA premium, data sampling of non-multi channel funnel reports occurs at the view level. So view filters do not impact the sample size.

#4 In case of multi channel funnel reports sampling occurs at the view level whether you use GA standard or GA premium. So view filters do not impact the sample size.

#5 When a user’s query breaks the data limits for a single day or multi day processed tables.

#6 When a user’s query breaks the report query or conversion path limit.

#7 When you view a multi channel funnel report which has got more than 1 million conversions.

#8 When you view a flow visualization report (users flow or goal flow) that is based on more than 100k sessions.

Note: It is a common misconception that low traffic websites don’t face data sampling issues. They can also face data sampling issues depending upon the type of user’s query.

How can you fix data sampling issues?

#1 Always keep the data sampling setting on ‘Slower response, greater precision’:

slower response greater precision

The larger the data sample, the more accurate the traffic estimates are. On the other hand, the smaller the data sample, the less accurate the traffic estimates are. But there is one caveat: there is still an upper limit on how big you can make the data sample in Google Analytics.

#2 Avoid applying advanced segments and/or secondary dimensions to your standard reports when you are analyzing metrics, especially ecommerce metrics like ‘e-commerce conversion rate’, and your analytics account has got data sampling issues.

#3 Run your GA report in such a way that you get the following message at the top of your report: “This report is based on ….. (100% of sessions)”:

unsampled

If that is not possible, then run your report in such a way that it is based on the biggest possible percentage of total website sessions. In this way you can minimize data sampling issues. You can do that by:

  • running reports for shorter time frame
  • not using filtered views
  • not applying advanced segments or secondary dimensions
  • not running custom reports.
  • downloading the report into Excel and then doing all the advanced segmentation and calculations there.

#4 Since for standard GA, sampling occurs at the property level, consider tracking different sections/sub-domains of your website via different properties.

#5 Switch to an enterprise level analytics platform like GA premium. You can’t rely on free versions of analytics tools for processing large amounts of data with high accuracy. The data sampling limit of GA premium is approx. 200 times that of standard GA, which means you get more unsampled data in GA premium than in GA standard.

GA premium can handle websites which get 1 billion+ hits/month as opposed to just 10 million hits per month supported by GA standard. However, GA premium will cost you $150,000 per year. I have been using GA premium for quite a long time now and I get a lot of emails from people asking about its capabilities and whether the $150k/year spend is really worth it.

Here is what I suggest. If your monthly online sales are at least $1 million and/or your website gets more than 10 million hits/month then you should definitely invest in enterprise level analytics software like GA premium. In my experience it is hard to justify a $150k/year spend on an analytics tool if online sales are less than $1 million per month.

#6 If your website gets more than 10 million pageviews/hits each month and you can’t afford GA premium but still want to use Google Analytics, then you will have to lower the data sample rate to stay within the GA processing limits for a standard account and avoid violating the Google Analytics TOS (terms of service). According to the Google Analytics TOS, you should not use GA standard if your website gets more than 10 million hits per month. If you violate the GA TOS, then GA may stop tracking your website data any day, any time.

You can lower the data sample rate by using the ‘Sample Rate’ field with the ‘create’ command.

For example:

ga('create', 'UA-12345-21', {'sampleRate': 50});

Here I have set the data sample rate to 50%. This means that website usage data will be collected for only 50% of your website visitors. The website usage data for the other 50% of your visitors won’t be tracked or reported. Of course, you will get muddier analytics insight by lowering your data sample rate. But then getting some analytics data is better than getting nothing at all. I do not recommend lowering the data sample rate.

Note: The default value of sample rate is 100.
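If you do have to lower the sample rate, you can work out a starting value from your monthly hit volume. The sketch below is a rough estimate only: it assumes that hits fall roughly in proportion to the share of visitors tracked, which will not hold exactly on a real site. The function name is my own and the 10 million figure is the hit limit mentioned above.

```javascript
// Largest whole-number sample rate (in %) expected to keep hits under the limit.
function sampleRateForHitLimit(monthlyHits, hitLimit) {
  if (monthlyHits <= hitLimit) {
    return 100; // already within the limit, track everyone
  }
  return Math.floor((hitLimit * 100) / monthlyHits);
}

const suggestedRate = sampleRateForHitLimit(25000000, 10000000);
console.log(suggestedRate); // prints 40

// The value would then go into the 'create' command on your pages, e.g.:
// ga('create', 'UA-12345-21', {'sampleRate': 40});
```

I would leave some headroom below the calculated value, since traffic varies from month to month.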

#7 Use Piwik. It is a free open source analytics platform.

Following are the key advantages of using Piwik over Google Analytics:

  1. You actually own your analytics data.
  2. Piwik allows tracking of PII (Personally Identifiable Information) which can be very useful for some businesses and projects.
  3. It has got a plugins architecture and an open marketplace: this brings almost infinite possibilities of innovation by the community.
  4. Piwik does not sample the data.
  5. Piwik does not have any data sampling limit. Yes, you heard it right. This is because all of the data resides on your server.
  6. There is no limit to the number of goals you can track.
  7. There is no limit on data storage and data collection.

#8 Analytics Canvas can help you reduce or eliminate data sampling programmatically by using query partitioning. In query partitioning, a user’s query is broken into multiple queries in such a way that each individual query does not trigger data sampling.
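To illustrate the idea of query partitioning, here is a minimal sketch that splits one long date range into several shorter ones, so that each range can be sent as its own query. I don’t know how Analytics Canvas implements this internally; the function name, the date format and the chunk size are my own assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Format a Date as YYYY-MM-DD (UTC).
function toIsoDate(date) {
  return date.toISOString().slice(0, 10);
}

// Split [startDate, endDate] (inclusive, YYYY-MM-DD) into chunks of at most
// `daysPerQuery` days. Dates are handled in UTC to avoid daylight saving gaps.
function partitionDateRange(startDate, endDate, daysPerQuery) {
  const ranges = [];
  const end = Date.parse(endDate + "T00:00:00Z");
  let cursor = Date.parse(startDate + "T00:00:00Z");
  while (cursor <= end) {
    const chunkEnd = Math.min(cursor + (daysPerQuery - 1) * DAY_MS, end);
    ranges.push({
      start: toIsoDate(new Date(cursor)),
      end: toIsoDate(new Date(chunkEnd)),
    });
    cursor = chunkEnd + DAY_MS;
  }
  return ranges;
}

console.log(partitionDateRange("2015-01-01", "2015-01-10", 4));
// three ranges: 01-01 to 01-04, 01-05 to 01-08, 01-09 to 01-10
```

You would then run one query per range and add the results together. This works for metrics that can be summed, such as sessions, pageviews or revenue. Metrics like users or conversion rate cannot simply be added up across ranges, so they need to be recalculated from the summed totals or treated with care.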

Google Analytics Premium and Data Sampling

If you have got access to GA premium, you can follow the steps below to get unsampled reports:

Step-1: Go to the report for which you want unsampled data.

Step-2: Select ‘Unsampled Report’ from the ‘Export’ drop down menu:

export unsampled report 1

Note: Unsampled Reports are available only in GA premium.

Step-3: Name the report, select the frequency and click on the ‘Request Unsampled’ button. The frequency is how often you want the unsampled report: once, daily, weekly, monthly or quarterly:

export unsampled report 2

Step-4: Now click on the ‘customization’ tab > “Unsampled Reports” to see your requested report and the availability status of the report (pending, completed). Once the report is available for download, click ‘csv’ to download the report.

Himanshu Sharma

Certified web analyst and founder of OptimizeSmart.com

My name is Himanshu Sharma and I help businesses find and fix their Google Analytics and conversion issues. If you have any questions or comments please contact me.

  • Over eleven years' experience in SEO, PPC and web analytics
  • Google Analytics certified
  • Google AdWords certified
  • Nominated for Digital Analytics Association Award for Excellence
  • Bachelors degree in Internet Science
  • Founder of OptimizeSmart.com and EventEducation.com
