Data Scraping for SEO and Analytics – Tutorial

Last Updated: August 26, 2021

Web scraping or web data scraping is a technique used to extract data from web documents like HTML and XML files.

Data scraping helps with competitive analysis and with pulling data out of your client’s website, such as titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, and what kind of resources it will take to rank your site. The list goes on.

Scraping organic search results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term.

You can determine the title tags and the keywords they are targeting.

The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For example, if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see the following results:

[Screenshot: SERPs scraping]

You can easily copy and paste the website URLs and title tags from the text boxes into your spreadsheet.

Pro Tip by Tahir Fayyaz:

Just wanted to add a tip for people using the SERPs Redux bookmarklet.

If you have data spread over multiple pages that you want to scrape, you can use AutoPager for Firefox to load any number of pages into one page and then scrape it all using the bookmarklet.

Scraping on-page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

  1. Title tag
  2. Meta description tag
  3. Meta keywords tag
  4. Meta robots tag
  5. H1 tag
  6. H2 tag
  7. HTTP Header
  8. Backlinks
  9. Facebook likes etc.
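
If you would rather see what fetching these elements involves, here is a minimal Python sketch. It is not how the Excel plugin works internally; it only uses Python's standard html.parser module to pull the title, a few meta tags and the H1/H2 headings out of an HTML string. The class name, function name and sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class OnPageParser(HTMLParser):
    """Collects the title, selected meta tags, and H1/H2 text from HTML."""

    CAPTURED_TAGS = {"title", "h1", "h2"}
    CAPTURED_META = {"description", "keywords", "robots"}

    def __init__(self):
        super().__init__()
        self.elements = {"title": [], "h1": [], "h2": [], "meta": {}}
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in self.CAPTURED_META:
                self.elements["meta"][name] = attrs.get("content") or ""
        elif tag in self.CAPTURED_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the collected pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            self.elements[tag].append(text)
            self._current = None


def extract_on_page_elements(html):
    parser = OnPageParser()
    parser.feed(html)
    return parser.elements


# Hypothetical sample page, used only to demonstrate the parser.
SAMPLE_HTML = """
<html><head>
<title>Killer SEO Tools</title>
<meta name="description" content="A list of SEO tools.">
<meta name="robots" content="index, follow">
</head><body>
<h1>SEO Tools</h1>
<h2>Analytics</h2>
<h2>Link <em>Building</em></h2>
</body></html>
"""

elements = extract_on_page_elements(SAMPLE_HTML)
print(elements["title"])
print(elements["meta"])
print(elements["h1"], elements["h2"])
```

HTTP headers, backlinks and Facebook likes are not covered by this sketch, since they do not come from the HTML of the page itself.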

Scraping data through Google Docs

Google Docs provides a function known as importXML, through which you can import data from web documents directly into a Google Docs spreadsheet. However, to use this function you must be familiar with XPath expressions.

Syntax: =importXML(url, xpath-query)

url => The URL of the web page from which you want to import the data.

xpath-query => The XPath expression to run against that page. XPath is a query language used to select nodes in XML and HTML documents.

To use the importXML function you need to understand the following things about XPath:

  1. XPath terminology – What nodes are, and the kinds of nodes, such as element nodes and attribute nodes.
  2. Relationships between nodes – How different nodes are related to each other, such as parent, child and sibling nodes.
  3. Selecting nodes – A node is selected by following a path known as a path expression.
  4. Predicates – They are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.
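
The four ideas above can be tried out in a few lines of Python. This is only a sketch: it uses the standard xml.etree.ElementTree module, which supports a limited subset of XPath and needs well-formed markup, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
DOC = """
<html>
  <body>
    <div id="main">
      <ol>
        <li><a href="/ga">Google Analytics</a></li>
        <li><a href="/help">Google Analytics Help Center</a></li>
      </ol>
    </div>
    <div id="sidebar">
      <a href="/about">About</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element node

# Selecting nodes: a path expression walks from parent to child.
items = root.findall("./body/div/ol/li")

# Relationships: each <li> is a child of <ol>; the two <li> are siblings.
first_link = items[0].find("a")

# Attribute nodes are read from the element that carries them.
href = first_link.get("href")

# Predicates, in square brackets, narrow the match to specific nodes.
main_links = root.findall(".//div[@id='main']//a")
second_item = root.find(".//ol/li[2]/a")

print(len(items), first_link.text, href)
print([a.text for a in main_links])
print(second_item.text)
```

The attribute predicate keeps the sidebar link out of the results, and the position predicate li[2] picks only the second list item.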

If you follow the XPath tutorial, it should not take you more than an hour to understand how XPath expressions work.

Understanding path expressions is easy but building them is not. That is why I use a Firebug extension named ‘X-Pather’ to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a Firebug extension, you first need to install Firebug in order to use it.

How to scrape data using importXML()

Step-1: Install Firebug – Through this add-on you can edit and monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-Pather – Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For example, if you want to scrape anchor text, then select one anchor text.

Step-4: Right-click on the selected text and then select ‘show in Xpather’ from the drop-down menu.

You will then see the X-Pather browser, from where you can copy the XPath.

Here I have selected the text ‘Google Analytics’, which is why the X-Pather browser shows ‘Google Analytics’ in the content section. This is my XPath:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary, huh? It can be even scarier if you try to build it manually. I want to scrape the names of all the analytics tools from this page: killer seo tools. For this I need to turn the above path expression into a general formula.

This is possible only if I can determine the static and variable nodes between two or more path expressions. So I determined the path expression of another element, ‘Google Analytics Help center’ (second in the list), through X-Pather:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

Now we can see that the only node that has changed between the original and the new path expression is the final ‘li’ element: li[1] became li[2]. Dropping the index makes the expression match every ‘li’ in the list, so I can come up with the following final path expression:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

All I have to do now is copy and paste this final path expression as an argument to the importXML function in a Google Docs spreadsheet. The function will then extract the names of all the Google Analytics tools from my killer SEO tools page.
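
To see why dropping the index works, here is a rough Python sketch of the same comparison. It does not call importXML; it runs a shortened version of the path against a hypothetical, heavily simplified stand-in for the page, using the standard xml.etree.ElementTree module (which handles only a subset of XPath). The function name select_text and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the structure of the page.
PAGE = """
<html><body>
  <div id="primary-content">
    <div id="post-58">
      <ol><li><a href="#">Unrelated list</a></li></ol>
      <ol>
        <li><a href="#">Google Analytics</a></li>
        <li><a href="#">Google Analytics Help center</a></li>
      </ol>
    </div>
  </div>
</body></html>
"""

# Static part of the path, shared by every item in the second list.
BASE = "./body/div[@id='primary-content']/div[@id='post-58']/ol[2]"


def select_text(document, path):
    """Return the text of every node that the path expression matches."""
    root = ET.fromstring(document)
    return [node.text for node in root.findall(path)]


first = select_text(PAGE, BASE + "/li[1]/a")      # indexed: one item
second = select_text(PAGE, BASE + "/li[2]/a")     # indexed: one item
all_items = select_text(PAGE, BASE + "//li/a")    # no index: every item

print(first)
print(second)
print(all_items)
```

The two indexed paths return one name each, while the path without the index returns both names from the second list and nothing from the first.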

This is how you can scrape data using importXML.

Pro Tip by Niels Bosma:

“Anything you can do with importXML in Google Docs you can do with XPathOnUrl directly in Excel.”

To use the XPathOnUrl function you first need to install Niels Bosma’s Excel plugin. It is not a built-in function in Excel.

Note: You can also use a free tool named Scrapy for data scraping. It is an open source web scraping framework used to extract structured data from web pages and APIs. You need to know Python (a programming language) in order to use Scrapy.

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc.) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What makes these tools amazing is that you can scrape the data of an entire website and download it into Excel. So if you want to know the keywords used in the title tags on all the web pages of your competitor’s website, then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.
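
If you prefer a script to Excel, a tab separated file can also be read with Python's csv module. The sketch below is only an illustration: the column names ‘Address’ and ‘Title’ and the sample rows are assumptions, so check the header row of your own export and adjust them. It counts how often each word appears across all the titles.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a tab separated export; column names are assumed.
EXPORT = (
    "Address\tTitle\n"
    "http://example.com/\tSEO Tools and Analytics Tools\n"
    "http://example.com/blog\tAnalytics Tutorials\n"
)


def title_keyword_counts(tsv_file, title_column="Title"):
    """Count lower-cased words across the title column of a TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        title = row.get(title_column) or ""
        counts.update(word.lower() for word in title.split())
    return counts


counts = title_keyword_counts(io.StringIO(EXPORT))
print(counts.most_common(3))
```

For a real export you would pass an opened file instead of the in-memory sample, for example open(path, newline=""), possibly with an encoding argument that matches the file.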

Scraping organic and paid keywords of an entire website

The tool that I use for scraping keywords is SEMRush. Through this awesome tool I can determine which organic and paid keywords are driving traffic to my competitor’s website and then download the whole list into Excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

Scraping keywords from a webpage

Through this Excel macro spreadsheet from seogadget you can fetch keywords from the text of one or more URLs. However, you need an Alchemy API key to use this macro.
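
To get a rough feel for keyword extraction without an API key, here is a naive Python sketch. It is a plain word frequency count, not what the Alchemy API or the seogadget macro does, and the stop word list, function name and sample text are all hypothetical.

```python
import re
from collections import Counter

# Small, hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "it", "with"}


def top_keywords(text, limit=5):
    """Return the most frequent words, ignoring stop words and short words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(limit)


SAMPLE_TEXT = (
    "Data scraping is a technique for extracting data from web documents. "
    "Scraping helps with competitive analysis."
)

print(top_keywords(SAMPLE_TEXT, limit=2))
```

A frequency count like this misses phrases and has no notion of relevance, which is the kind of work a dedicated keyword extraction service is meant to handle.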

You can get the Alchemy API key here

Scraping Google Adwords Ad copies of any website

I use SEMRush to scrape and download the Google Adwords ad copies of my competitors into Excel, and then mine them for keywords or ad copy ideas. Go to SEMRush, type the competitor’s website URL and then click on the ‘Adwords Ad texts’ link in the left-hand side menu. Once you see the report you can download it into Excel.

Scraping back links of an entire website

The tool that you can use to scrape and download the back links of an entire website is Open Site Explorer.

Scraping outbound links from web pages

Garrett French of Citation Labs has shared an excellent tool, OBL Scraper+Contact Finder, which can scrape outbound links and contact details from a URL or a list of URLs. This tool can help you a lot in link building. Check out this video to learn more about this awesome tool.
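
For readers who want to see the basic idea of outbound link scraping in code, here is a minimal Python sketch. It is not how the Citation Labs tool works and it does not look for contact details; it only collects the links in an HTML string and keeps those that point to a different host. The class name, function name and sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outbound_links(html, page_url):
    """Return absolute http(s) links that point away from the page's host."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() != page_host:
            links.append(absolute)
    return links


# Hypothetical sample markup.
SAMPLE = """
<a href="/about">About</a>
<a href="https://example.org/tool">A tool</a>
<a href="mailto:me@example.com">Mail</a>
<a href="http://example.com/contact">Contact</a>
"""

links = outbound_links(SAMPLE, "http://example.com/page")
print(links)
```

Note that this simple host comparison treats subdomains of the same site as outbound, so you may want to loosen it for real use.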

Scraper – Google chrome extension

This Chrome extension can scrape data from web pages and export it to Google Docs. The tool is simple to use: select the web page element/node you want to scrape, then right-click on the selected element and select ‘scrape similar’.

Any element/node that is similar to what you have selected will be scraped by the tool, and you can later export the results to Google Docs. One big advantage of this tool is that it reduces our dependency on building XPath expressions and makes scraping easier.

[Screenshots: Google Chrome Scraper extension]

See how easy it is to scrape the names and URLs of all the analytics tools without using XPath expressions.

Note: You may need to edit the XPath if the results are not what you were expecting.

This post is very much a work in progress. If you know more cool ways to scrape data then please share in the comments below.

About the Author

Himanshu Sharma

  • Founder, OptimizeSmart.com
  • Over 15 years of experience in digital analytics and marketing
  • Author of four best-selling books on digital analytics and conversion optimization
  • Nominated for Digital Analytics Association Awards for Excellence
  • Runs one of the most popular blogs in the world on digital analytics
  • Consultant to countless small and big businesses over the decade
