Data Scraping Guide for SEO & Analytics

 

Web scraping or web data scraping is a technique used to extract data from web documents like HTML and XML files. Data scraping can help you a lot in competitive analysis like determining the titles, keywords, content categories and ad copies used by your competitors.

You can quickly get an idea of which keywords are driving traffic to your competitors’ website, which content categories are attracting links and user engagement. What kind of resources will it take to rank your site?

You can then replicate all the good strategies used by your competitors.  The idea is to do what your competitors are doing and do it even better to outrank them. Through competitive analysis you can get a head start in any SEO project.

 

Scraping Organic Search Results

Scrape organic search results to quickly find out your SEO competitors for a particular search term. Determine the title tags and keywords they are using. The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For e.g if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you see the following results:

You can copy paste the websites URLs and title tags easily into your spreadsheet from the text boxes.

Pro Tip by Tahir Fayyaz:

Just wanted to add a tip for people using the SERPs Redux bookmarklet.

If you have a data separated over multiple pages that you want to scrape you can use AutoPager for Firefox or Chrome to loads x amount of pages all on one page and then scrape it all using the bookmarklet.

Another cool way of scraping SERP is through Keyword difficulty tool of SEOmoz. Through this tool you can scrape organic search results along with all the cool metrics offered by SEOmoz like PA, DA, Linking root domains etc.

You can download this report into excel by clicking on ‘Export to CSV’.

 

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like: Title tag, Meta description tag, Meta keywords tag, Meta robots tag, H1 tag, H2 tag, HTTP Header, Backlinks, Facebook likes etc.

 

Scraping data through Google Docs

Google docs provide a function known as importXML through which you can import data from web documents directly into Google Docs spreadsheet. However to use this function you must be familiar with X-path expressions.

Syntax: =importXML(URL,X-path-query)

url=> URL of the web page from which you want to import the data.

x-path-query => A query language used to extract data from web pages.

You need to understand following things about X-path in order to use importXML function:

1. Xpath terminology- What are nodes and kind of nodes like element nodes, attribute nodes etc.

2. Relationship between nodes- How different nodes are related to each other. Like parent node, child node, siblings etc.

3. Selecting nodes- The node is selected by following a path known as the path expression.

4. Predicates – They are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.

If you follow the x-path tutorial then it should not take you more than an hour to understand how X path expressions works. Understanding path expressions is easy but building them is not. That’s is why i use a firefbug extension named ‘X-Pather to quickly generate path expressions while browsing HTML and XML documents.  Since X-Pather is a firebug extension, it means you first need to install firebug in order to use it.

 

How to scrape data using importXML()

Step-1: Install firebug – Through this add on you can edit & monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-pather
– Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For e.g. if you want to scrape anchor text, then select one anchor text.

Step-4: Right click on the selected text and then select ‘show in Xpather’ from the drop down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here i have selected the text ‘Google Analytics’, that is why the xpath browser is showing ‘Google Analytics’ in the content section. This is my xpath:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary huh. It can be even more scary if you try to build it manually. I want to scrape the name of all the analytic tools from this page: killer seo tools. For this i need to modify the aforesaid path expression into a formula.

This is possible only if i can determine static and variable nodes between two or more path expressions. So i determined the path expression of another element ‘Google Analytics Help center’ (second in the list) through X-pather:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

 

Now we can see that the node which has changed between the original and new path expression is the final ‘li’ element: li[1] to li[2]. So i can come up with following final path expression:

/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

 

Now all i have to do is copy-paste this final path expression as an argument to the importXML function in Google Docs spreadsheet. Then the function will extract all the names of Google Analytics tool from my killer SEO tools page.

This is how you can scrape data using importXML.

Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

To use XPathOnUrl function you first need to install the Niels Bosma’s Excel plugin. It is not a built in function in excel.

Note:You can also use a free tool named Scrapy for data scraping. It is an an open source web scraping framework and is used to extract structured data from web pages & APIs. You need to know Python (a programming language) in order to use scrapy.

 

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What make these tools amazing is that you can scrape the data of entire website and download it into excel. So if you want to know the keywords used in the title tag on all the web pages of your competitor’s website then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

 

Scraping organic and paid keywords of an entire website

The tool that i use for scraping keywords is SEMRush. Through this awesome tool i can determine which organic and paid keyword are driving traffic to my competitor’s website and then can download the whole list into excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

 

Scraping keywords from a webpage

Through this excel macro spreadsheet from seogadget you can fetch keywords from the text of a URL(s). However you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here

 

Scraping keywords data from Google Adwords API

If you have access to Google Adwords API then you can install this plugin from seogadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API like:

getAdWordAvg()- returns average search volume from the adwords API.

getAdWordStats() – returns local search volume and previous 12 months separated by commas

getAdWordIdeas() – returns keyword suggestions based on API suggest service.

Check out this video to know how this plug-in works

 

Scraping Google Adwords Ad copies of any website

I use the tool SEMRush to scrape and download the Google Adwords ad copies of my competitors into excel and then mine keywords or just get ad copy ideas.  Go to semrush, type the competitor website URL and then click on ‘Adwords Ad texts’ link on the left hand side menu. Once you see the report you can download it into excel.

 

Scraping back links of an entire website

The tool that you can use to scrape and download the back links of an entire website is: open site explorer

 

Scraping Outbound links from web pages

Garrett French of citation Labs has shared an excellent tool: OBL Scraper+Contact Finder which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to know more about this awesome tool:

 

Scraper – Google chrome extension

This chrome extension can scrape data from web pages and export it to Google docs. This tool is simple to use. Select the web page element/node you want to scrape. Then right click on the selected element and select ‘scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool which you can later export to Google Docs. One big advantage of this tool is that it reduces our dependency on building Xpath expressions and make scraping easier.

See how easy it is to scrape name and URLs of all the Analytics tools without using Xpath expressions.

Note: You may need to edit the XPath if the results are not what you were expecting.

This post is very much a work in progress. If you know more cool ways to scrape data then please share in the comments below.

Join over 5000 subscribers!
Receive an update straight to your inbox every time I publish a new article.

 

About the Author:



My business thrives on referrals, so I really appreciate recommendations to people who would benefit from my help. Please feel free to endorse/forward my LinkedIn Profile to your clients, colleagues, friends and others you feel would benefit from SEO, PPC or Web Analytics.

 

 

  • http://nielsbosma.se Niels Bosma

    Thanks for mentioning SeoTools.

    Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.

    Read more about this on:
    http://nielsbosma.se/projects/seotools/functions/xpathonurl/

    Regards
    /Niels

    • http://seohimanshu.com/about-himanshu/ seo himanshu

      Thanks for the link Niels. Will check it out.

      • Vince

        Se

        • Vince

          Please contact be I would like to purchase some bulk services from SEO himanshu

    • David Sottimano

      They both have their pros and cons, but overall IMO Gdocs is better as it’s not using your IP and you don’t have to bother with proxy servers.

      I also find that having my Gdocs accessible in the “cloud” makes it easy to take my tools around anywhere. I’m also used to the Javascript functions in Gdocs which makes it easy for me to make scripts.

      No offence to your tool Niels, actually it’s amazing – but I just prefer Gdocs for the above reasons.

  • https://twitter.com/#!/SEOptimism Sandeep

    Awesome post..thank you for briefing all of them in a single post which is very useful..this post will definitely keep me busy this weekend..thanks a lot!

  • http://www.11-internet.nl Jan-Willem Bobbink

    What do you think about the differences in numbers of links scraped by Opensiteexplorer.org or the Yahoo search engine? Which one is the most complete and reliable source?

    • http://seohimanshu.com/about-himanshu/ seo himanshu

      Yahoo site explorer can scrape only 1000 back links. Whereas OSE can scrape 10000 back links.

  • http://www.uptodateseoservices.com/seo-services.html SEO services

    This is really Nice Information you provided i appreciate your work thanks for the Post

  • http://www.loopbeyond.com Tahir Fayyaz

    Great tips and I have found SeoTools by Niels Bosma to be very useful.

    Just wanted to add a tip for people using the SERPs Redux bookmarklet. If you have a data separated over multiple pages that you want to scrape you can use AutoPager for Firefox or Chrome to loads x amount of pages all on one page and then scrape it all using the bookmarklet.

    • http://seohimanshu.com/about-himanshu/ seo himanshu

      Thanks for the tip Tahir.

  • Don

    Great article.
    Step-2: Install X-pather –
    X-pather doesn’t work with Firefox 4+. Any recommendations for a tool to use in place of it.

    • himanshu

      There are other tools but then they are not free. Like ‘Xpath expression builder’ by liquid technologies http://www.liquid-technologies.com/xpath-viewer.aspx. I am still using firefox 3.6 and would suggest you not to upgrade your firefox anymore. You may loose the functionality of some plug-ins with every update. Moreover i use firefox only for SEO purpose. So outdated versions work well for me.

  • http://www.webguru-india.com/ Chris @ Website Design

    Thanks a LOT for the post Himanshu. With growing competition, SEO is not just about keywords and links anymore. Its a complete set of strategies aiming at top of the niche. Thanks again for providing these tools.

  • http://www.theindia.org.uk/ Justin

    Thanks for the detailed info Himanshu. This is a serious stuff on data scraping tools and processes. I agree scraping data has so many uses for a SEO. It is specially usefuly in keyword research. Can you please suggest some screen scraping tools and scraping text data from flash videos.

  • http://www.trucosseosem.com Trucos Seo Sem

    This is a very very useful information for me. WOW, god i need more time for test everything. Thank you a lot, is very useful for SEO

  • http://www.intenseblog.com Jenni

    Wow, there’re some SEO tools make me amazing.

    Thank you for your post.
    Jen

  • http://www.springboardseo.com Matthew Edward

    Great post Himanshu,

    Can you explain how Google charges for use of the Adwords API tool if I would use it to import keyword search volume into Google docs? I sifted through a fair bit of documentation, but didn’t get their system for charging for data use.

    For example:

    – Is there a specific number of keyword search volume queries that can be done for free?
    – If I use code to import data into Google docs, will each time the doc is loaded count as new queries? I.e., if I have 50 queries to Google for search volume data (50 terms), and I load the page 10 times, will that count as 500 queries?

    Thanks for the great post!

  • http://twitter.com/solsonIII Stephen Olson

    Thanks for the post, I have bookmarked many of the sites you have mentioned. I just tried OutWit Hub. It is a free software. I was able to expert a list of members from a LinkedIn group using the application.

    • himanshu

      I will look into it. Thanks

  • Thoufeeq

    Tried with Xenu to scrape only the pages and their titles of an entire website? It spits out way too much. Tried Neils Bosma’s Excel Tool for SEO but its only one page/run. Any help?

    • http://anothermarketer.com/ Philip Tomlinson

      Limit your depth and exclude URLs like archives and comment replies. That should already provide cleaner data.

  • kushal

    how to scrap website data from internet to excel ? tell me if any intelligent in IT other wise all dull have a mind any man accept challange from my website http://www.thaigemjewelry.or.th/ from directry data pick in 1 time complete data