Scraping Title and Meta Data with Scrapebox

Checking every page on your site to verify that the title and description tags are set properly can seem like a daunting task. Luckily, if you own Scrapebox, scraping meta information becomes much easier. If you don’t already own Scrapebox, I wouldn’t recommend buying it just to scrape meta information, since there are cheaper options. But if you also like the other tools included in Scrapebox, give it a shot.

I seldom use Scrapebox, so this feature came as a surprise when I ran across it. More experienced Scrapebox users probably already know about it and use it regularly. Once you have your URLs inside the harvester, there is an option to grab the meta information (title, description, keywords) for each URL. I used it the other day to make sure I had set description tags on all of my posts, but it can also be used to get an idea of what kinds of posts a competitor publishes.

  1. Import the list of URLs you want to check the meta information for into the harvester. Do this by clicking on the button “Import URL List.”
  2. Click on the button “Grab / Check” followed by the option “Grab META Info from harvested URL list.”
  3. A window for the “Grab Meta Data” tool will open. Press the button “Start” to start scraping meta information from the list of URLs.

Once the meta information has been scraped, you can export it as an Excel or CSV spreadsheet. Then, in your favorite spreadsheet application, you can see which pages are missing titles or descriptions, or quickly research your competition.
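If you don’t own Scrapebox, or just want to sanity-check its output, the same idea is easy to script. Below is a minimal Python sketch (the URL list, output filename, and the third-party requests library are my assumptions, not part of Scrapebox) that pulls the title, description, and keywords tags from each URL and writes them to a CSV you can open in a spreadsheet.

```python
# Minimal sketch (not Scrapebox itself): fetch the title, description, and
# keywords for a list of URLs and write them to a CSV for review.
# The URL list and output filename are placeholders; `requests` is assumed
# to be installed.
import csv
from html.parser import HTMLParser

import requests


class MetaParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


urls = ["https://example.com/", "https://example.com/about/"]  # placeholder list

with open("meta-report.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "description", "keywords"])
    for url in urls:
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException as exc:
            writer.writerow([url, f"ERROR: {exc}", "", ""])
            continue
        parser = MetaParser()
        parser.feed(html)
        writer.writerow([
            url,
            parser.title.strip(),
            parser.meta.get("description", ""),
            parser.meta.get("keywords", ""),
        ])
```

Empty cells in the description column are the pages that still need a meta description.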

Configuring Meta Grabber Settings

Scrapebox isn’t the greatest at organizing or showing the available settings for each tool. For the Meta Scraper settings, go to the menu “Settings” followed by “Connections, Timeouts, and Other Settings.” At the bottom of the “Connections” tab is an option to adjust how many concurrent connections the “Meta Grabber” will use. By default it’s set to 100 connections, which is usually fine. If you are performing meta scraping on your own site, you could most likely increase the number of connections if you have a lot of pages to scrape.

Under the “Timeouts” tab you can configure how long the “Meta Grabber” waits before timing out. By default this value is set to 10 seconds. Again, if you are running this tool against your own site, increasing the timeout could be beneficial since you know the site is alive and will eventually respond.
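If you script the job yourself, those two settings have direct equivalents: the connection count maps to the size of a worker pool and the timeout to a per-request limit. Here is a rough sketch (the URL list is a placeholder and the requests library is assumed) using Scrapebox’s defaults of 100 connections and a 10 second timeout.

```python
# Sketch of the same two knobs outside of Scrapebox: the connection count
# becomes a thread-pool size and the timeout a per-request limit.
# The URL list is a placeholder; `requests` is assumed to be installed.
from concurrent.futures import ThreadPoolExecutor

import requests

CONNECTIONS = 100  # Scrapebox's default Meta Grabber connection count
TIMEOUT = 10       # seconds, Scrapebox's default timeout


def fetch(url):
    """Return (url, status code or error) so slow or dead pages stand out."""
    try:
        return url, requests.get(url, timeout=TIMEOUT).status_code
    except requests.RequestException as exc:
        return url, f"ERROR: {exc}"


urls = [f"https://example.com/page-{i}/" for i in range(1, 201)]  # placeholder

with ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
    for url, result in pool.map(fetch, urls):
        print(url, result)
```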

Getting URLs from a Sitemap

To get the list of URLs I want to grab meta information from, I always use the Sitemap Scraper add-on inside of Scrapebox. Below are the steps to extract all of the URLs from a website’s sitemap; a script-based alternative follows the list.

  1. Open the add-on “Scrapebox Sitemap Scraper” under the “Addons” menu. If it’s not yet installed, go to “Show Available Addons” and install the add-on.
  2. Inside of the Sitemap Scraper tool, click on the “Load urls” button to import your sitemap URL. Since you usually only have a single URL pointing to a sitemap, the easiest way is to import the URL from your clipboard. You could also use Scrapebox’s text editor to quickly import the URL into the harvester and then import the URL from the harvester.
  3. Configure the settings for the add-on from the menu option “Settings.”
  4. Click on “Start” and wait for the tool to finish. Once finished, the URLs by default are stored in a folder called Addon_Sessions/SitemapScraperData inside of the Scrapebox directory with the filename being a date of when it was performed. There is also a button to show the download folder for quick access.
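If you want the same URL list without the add-on, a sitemap is just XML, so it can be parsed with a few lines of Python. The sketch below uses only the standard library; the sitemap URL is a placeholder, and nested sitemap index files are not followed, which the Sitemap Scraper add-on may handle for you.

```python
# Minimal sketch: extract every <loc> entry from a sitemap and save the URLs
# to a text file that can be imported into the harvester.
# Uses only the standard library; the sitemap URL is a placeholder and
# nested sitemap index files are not followed.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as resp:
    tree = ET.parse(resp)

# Sitemaps use the sitemaps.org namespace, so match <loc> with a namespace wildcard.
urls = [loc.text.strip() for loc in tree.iter("{*}loc") if loc.text]

with open("sitemap-urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")

print(f"Extracted {len(urls)} URLs")
```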

Scraping Localhost Websites

If you are scraping your own site for meta tags, nothing is quicker than running the site locally and scraping it directly. All of the steps are exactly the same as above; just replace your normal domain name with your local hostname, which is usually localhost. A small sketch of that substitution follows.
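As a rough illustration of that host swap, the sketch below rewrites a harvested URL list to point at a local copy of the site. The filenames, hostname, and scheme are placeholders; adjust them to match your local setup.

```python
# Sketch: point a harvested URL list at a local copy of the site by swapping
# the host. Filenames, hostname, and scheme are placeholders.
from urllib.parse import urlsplit, urlunsplit

LOCAL_HOST = "localhost"  # add a port (e.g. "localhost:8080") if needed

with open("urls.txt", encoding="utf-8") as src, \
        open("urls-local.txt", "w", encoding="utf-8") as dst:
    for line in src:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        local = urlunsplit(("http", LOCAL_HOST, parts.path, parts.query, parts.fragment))
        dst.write(local + "\n")
```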
