Data Scraping Guide for SEO & Analytics

 

Web scraping or web data scraping is a technique used to extract data from web documents like HTML and XML files.

Data scraping can help you a lot in competitive analysis as well as pulling out data from your client’s website like extracting the titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, what kind of resources will it take to rank your site…………and the list goes on…

 

Scraping Organic Search Results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting.

The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For e.g if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see the following results:

You can copy paste the websites URLs and title tags easily into your spreadsheet from the text boxes.

Pro Tip by Tahir Fayyaz:

Just wanted to add a tip for people using the SERPs Redux bookmarklet.

If you have a data separated over multiple pages that you want to scrape you can use AutoPager for Firefox or Chrome to loads x amount of pages all on one page and then scrape it all using the bookmarklet.

 

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

  1. Title tag
  2. Meta description tag
  3. Meta keywords tag
  4. Meta robots tag
  5. H1 tag
  6. H2 tag
  7. HTTP Header
  8. Backlinks
  9. Facebook likes etc.

 

Scraping data through Google Docs

Google docs provide a function known as importXML through which you can import data from web documents directly into Google Docs spreadsheet. However to use this function you must be familiar with X-path expressions.

Syntax: =importXML(URL,X-path-query)

url=> URL of the web page from which you want to import the data.

x-path-query => A query language used to extract data from web pages.

You need to understand following things about X-path in order to use importXML function:

1. Xpath terminology- What are nodes and kind of nodes like element nodes, attribute nodes etc.

2. Relationship between nodes- How different nodes are related to each other. Like parent node, child node, siblings etc.

3. Selecting nodes- The node is selected by following a path known as the path expression.

4. Predicates – They are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.

If you follow the x-path tutorial then it should not take you more than an hour to understand how X path expressions works.

Understanding path expressions is easy but building them is not. That’s is why i use a firefbug extension named ‘X-Pather to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a firebug extension, it means you first need to install firebug in order to use it.

 

How to scrape data using importXML()

Step-1: Install firebug – Through this add on you can edit & monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-pather
– Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For e.g. if you want to scrape anchor text, then select one anchor text.

Step-4: Right click on the selected text and then select ‘show in Xpather’ from the drop down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here i have selected the text ‘Google Analytics’, that is why the xpath browser is showing ‘Google Analytics’ in the content section. This is my xpath:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary huh. It can be even more scary if you try to build it manually. I want to scrape the name of all the analytic tools from this page: killer seo tools. For this i need to modify the aforesaid path expression into a formula.

This is possible only if i can determine static and variable nodes between two or more path expressions. So i determined the path expression of another element ‘Google Analytics Help center’ (second in the list) through X-pather:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

 

Now we can see that the node which has changed between the original and new path expression is the final ‘li’ element: li[1] to li[2]. So i can come up with following final path expression:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

 

Now all i have to do is copy-paste this final path expression as an argument to the importXML function in Google Docs spreadsheet. Then the function will extract all the names of Google Analytics tool from my killer SEO tools page.

This is how you can scrape data using importXML.

Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

To use XPathOnUrl function you first need to install the Niels Bosma’s Excel plugin. It is not a built in function in excel.

Note:You can also use a free tool named Scrapy for data scraping. It is an an open source web scraping framework and is used to extract structured data from web pages & APIs. You need to know Python (a programming language) in order to use scrapy.

 

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What make these tools amazing is that you can scrape the data of entire website and download it into excel. So if you want to know the keywords used in the title tag on all the web pages of your competitor’s website then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

 

Scraping organic and paid keywords of an entire website

The tool that i use for scraping keywords is SEMRush. Through this awesome tool i can determine which organic and paid keyword are driving traffic to my competitor’s website and then can download the whole list into excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

 

Scraping keywords from a webpage

Through this excel macro spreadsheet from seogadget you can fetch keywords from the text of a URL(s). However you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here

 

Scraping keywords data from Google Adwords API

If you have access to Google Adwords API then you can install this plugin from seogadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API like:

getAdWordAvg()- returns average search volume from the adwords API.

getAdWordStats() – returns local search volume and previous 12 months separated by commas

getAdWordIdeas() – returns keyword suggestions based on API suggest service.

Check out this video to know how this plug-in works

 

Scraping Google Adwords Ad copies of any website

I use the tool SEMRush to scrape and download the Google Adwords ad copies of my competitors into excel and then mine keywords or just get ad copy ideas.  Go to semrush, type the competitor website URL and then click on ‘Adwords Ad texts’ link on the left hand side menu. Once you see the report you can download it into excel.

 

Scraping back links of an entire website

The tool that you can use to scrape and download the back links of an entire website is: open site explorer

 

Scraping Outbound links from web pages

Garrett French of citation Labs has shared an excellent tool: OBL Scraper+Contact Finder which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to know more about this awesome tool:

 

Scraper – Google chrome extension

This chrome extension can scrape data from web pages and export it to Google docs. This tool is simple to use. Select the web page element/node you want to scrape. Then right click on the selected element and select ‘scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool which you can later export to Google Docs. One big advantage of this tool is that it reduces our dependency on building Xpath expressions and make scraping easier.

See how easy it is to scrape name and URLs of all the Analytics tools without using Xpath expressions.

Note: You may need to edit the XPath if the results are not what you were expecting.

This post is very much a work in progress. If you know more cool ways to scrape data then please share in the comments below.

 

About the Author:

My business thrives on referrals, so I really appreciate recommendations to people who would benefit from my help.Please feel free to endorse/forward my LinkedIn Profile to your clients, colleagues, friends and others you feel would benefit from SEO, PPC or Web Analytics.