Prevent Search Engines from Caching Content

Posted on June 22, 2014


At one point or another you will need to keep a website or a webpage out of a search engine’s cache, or want to make sure it never gets cached in the first place. I’m going to show you how to prevent Google and a few other search engines and caching services from caching your website. Please note that having a cached version of your site is often a good thing: if your website is under heavy load or no longer loads, users can still see your content.

The first thing we have to do is modify the <head> section of our website. I always keep the head contents in a separate file so I only need to change it once. If you have the <head> tag in multiple locations, you will need to modify them all, or only the pages you don’t want cached by Google.

<head>
    <!-- All other head tag contents -->
    <META NAME="ROBOTS" CONTENT="NOARCHIVE">
</head>

This tells robots not to keep an archived copy of your webpage. Of course, crawlers and spiders don’t have to follow this setting, but most search engines will. To my knowledge, Google, MSN, Bing, and Yahoo all abide by the NOARCHIVE tag.
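
If you edit the head in several templates, it can be handy to confirm that the tag actually made it into the rendered page. The following is a minimal sketch using only Python’s standard library; the names `RobotsMetaParser` and `has_noarchive` are made up for this example, and it only inspects the HTML you hand it. It says nothing about whether a given crawler obeys the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values,
        # so the values are normalized here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for directive in (attr.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noarchive(html):
    """Return True if the HTML contains a robots meta tag with NOARCHIVE."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


page = """<head>
    <META NAME="ROBOTS" CONTENT="NOARCHIVE">
</head>"""
print(has_noarchive(page))  # True
```

The check is case-insensitive and also handles combined values such as "noindex, noarchive", since the content attribute is a comma-separated list.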

Internet Archive - Archive.org

To prevent the Internet Archive from keeping a history of your website, you will need to modify the “robots.txt” file for your website. Since the Internet Archive isn’t a search engine, preventing it from crawling all pages is perfectly fine and won’t affect your search rankings. But if you want to, you can leave some pages available for it to archive.

The below code will block the Internet Archive on all pages of your website.

User-agent: ia_archiver
Disallow: /

To block only certain pages from the Internet Archive, use something similar to the following code. The robots file below allows the Internet Archive to cache all pages except “page1”, “contact”, and all the files inside of “folder”.

User-agent: ia_archiver
Disallow: /page1
Disallow: /contact
Disallow: /folder/

This will hopefully prevent some sites from archiving your content. Personally, I have most of my sites opted out of being archived by the Internet Archive, but I keep search engines on the allowed list.
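
Before uploading a robots file, you can test how its rules are interpreted with Python’s built-in `urllib.robotparser`. This is only a sketch: the helper name `is_allowed` and the sample paths are made up for illustration, and the result reflects how Python’s parser reads the rules, not what any particular crawler will actually do.

```python
from urllib.robotparser import RobotFileParser

# The partial-block rules from above, held in a string for local testing.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /page1
Disallow: /contact
Disallow: /folder/
"""


def is_allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the rules let user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/about", "/page1", "/folder/file.html"):
    print(path, is_allowed("ia_archiver", path))

# Other crawlers are unaffected because no rule group names them.
print(is_allowed("Googlebot", "/page1"))
```

Keep in mind that Disallow rules match by prefix, so “/page1” also covers paths such as “/page10”, while “/folder/” with the trailing slash only covers files inside that folder.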
