Parsing AWStats with PHP

Posted on November 11, 2013


Parsing AWStats files is a simple way to get key statistics without adding another log parser or tracking system to your server or website. I picked AWStats because nearly every shared hosting server already has it installed, and installing it on your own server is simple, which makes it one of the more popular local tracking systems. It is also fast and doesn’t use resources on each visit, since it parses the log file once a day instead of inserting a new record every time a user visits.

This post only covers the PHP code for parsing these files. Working from that code, you should be able to build your own parser in less than an hour.

Getting the AWStats Log Files

Hopefully your hosting company made these files accessible to you. If not, you can contact them and ask for access; they may agree, but they don’t have to. On my shared cPanel hosting account, the AWStats files are in /tmp/awstats/

Look around for your AWStats files; they could be pretty much anywhere. The file names will look similar to the following.

awstats012013.geekthis.net.txt
awstats.geekthis.net.conf

The ones we are working with are the txt files that have awstats followed by the date (two-digit month, then four-digit year) in the filename.
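If you don’t know in advance which months or domains have stats files, a small helper can list them for you. The following is a minimal sketch, assuming the awstatsMMYYYY.domain.txt naming shown above; both function names are made up for this example.

```php
<?php
/* Hypothetical helper: splits an AWStats data file name into its parts.
   Assumes the awstatsMMYYYY.domain.txt pattern shown above. */
function awstats_parse_filename($filename) {
	$pattern = '/^awstats(\d{2})(\d{4})\.(.+)\.txt$/';
	if(!preg_match($pattern, basename($filename), $m)) {
		return false;
	}
	return array('month' => $m[1], 'year' => $m[2], 'domain' => $m[3]);
}

/* Hypothetical helper: lists every data file in a directory.
   The .conf files do not match the pattern, so they are skipped. */
function awstats_find_files($dir) {
	$found = array();
	$matches = glob(rtrim($dir, '/').'/awstats*.txt');
	if($matches === false) {
		return $found;
	}
	foreach($matches as $file) {
		$info = awstats_parse_filename($file);
		if($info !== false) {
			$info['file'] = $file;
			$found[] = $info;
		}
	}
	return $found;
}
```

Each entry returned by awstats_find_files() carries the month, year and domain as strings, which is the same form the AWStats class further down expects in its constructor.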

AWStats Stats File Structure

The first line of the file is not a comment; it states the file type in plain text. The whole file is plain text, so there is no need to worry about binary or hex values. The identifier has the form “AWSTATS DATA FILE (Major Version).(Minor Version) (build (build number))”. It can most likely be ignored. In my example code, I skip over this line and look for the sections, which are the most important part of the file.
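If you would rather check the identifier than skip it, for example to reject files that are not AWStats data at all, a regular expression is enough. This is only a sketch based on the format described above; the function name is made up, and I am assuming the build number never contains a closing parenthesis.

```php
<?php
/* Hypothetical helper: pulls the version parts out of the identifier line.
   Returns false when the line does not match the format described above. */
function awstats_file_version($line) {
	$pattern = '/^AWSTATS DATA FILE (\d+)\.(\d+) \(build ([^)]+)\)/';
	if(!preg_match($pattern, trim($line), $m)) {
		return false;
	}
	/* The build number is kept as a string because it may contain a dot */
	return array('major' => (int)$m[1], 'minor' => (int)$m[2], 'build' => $m[3]);
}
```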

Next, we have comments. A comment line starts with the pound or hash (#) symbol before any other character. We skip every comment line and move on to the next line. By default these files contain a lot of comments, so assume they will always be there. There are no block comments, only single-line ones, which makes them easy to detect.
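As a quick illustration of the comment rule, here is a sketch that loads a stats file and drops the comment and blank lines in one pass. The function name is made up, and reading the whole file with file() is my own shortcut for small files; the class further down reads line by line instead.

```php
<?php
/* Hypothetical helper: returns the non-comment, non-empty lines of a stats file */
function awstats_data_lines($path) {
	$lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
	if($lines === false) {
		return false;
	}
	$kept = array();
	foreach($lines as $line) {
		$line = trim($line);
		/* A comment has # as its very first character */
		if($line === '' || $line[0] === '#') {
			continue;
		}
		$kept[] = $line;
	}
	return $kept;
}
```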

A section starts with BEGIN_ followed by its name. After the name comes a single space and then the number of rows before the section’s END_ line. There is one odd case where the row count is wrong: for BEGIN_MAP the count is one lower than the actual number of rows, so it needs to be increased by one.
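The header rule, including the MAP quirk, can be sketched as a small function. The function name and the returned array keys are made up for this example.

```php
<?php
/* Hypothetical helper: reads a "BEGIN_NAME count" line.
   Returns false when the line does not start a section. */
function awstats_section_header($line) {
	$line = trim($line);
	if(strpos($line, 'BEGIN_') !== 0) {
		return false;
	}
	$parts = explode(' ', $line);
	$name = substr($parts[0], 6); /* drop the 6 characters of "BEGIN_" */
	$rows = isset($parts[1]) ? (int)$parts[1] : 0;
	/* The MAP section reports one row fewer than it really holds */
	if($name === 'MAP') {
		$rows++;
	}
	return array('name' => $name, 'rows' => $rows);
}
```

Correcting the count here keeps the MAP special case in one place, so the code that reads the rows can use a plain comparison against the row count.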

Below I have listed all of the section types that AWStats writes to its files by default. Don’t hard-code these names in the parsing step; only use them when reading data back out of your own stored format. Within a section, the fields on each line are separated by a single space. Comments in the AWStats file describe the field order for each section. Unless you expect to use every section, don’t split the lines into fields until the end, and then only for the sections you need.

MAP: Lists the position of every section as a byte offset from the beginning of the file. The offsets are usually not exact and tend to land a line before the BEGIN_ line, but they can still be used to seek to a section more quickly.
GENERAL: General information about the log file and visits to the site, such as last update time, first visit time, total visits, unique visits, and the position in the log file where AWStats stopped, so it doesn’t parse the same data again and skew the stats.
MISC: Miscellaneous data about visitor support for various media types, such as Java, Flash, RealPlayer and others. I have found this data to be not all that accurate, and it shouldn’t be relied on.
TIME: Has 24 lines, one for each hour of the day. Lists how many visits occurred in that hour across the whole month. This gives you a good idea of when to perform website updates and when to release new posts or content.
DOMAIN: Lists the country code and how many visits and hits.
ROBOT: Lists all the robots that visited your site. Gives the bot’s name, hits, bandwidth usage, last visit, and how many times it requested the robots.txt file.
WORMS: Lists any worms or malicious bots that have visited your site, identified by user agent, along with hits, bandwidth usage, and the time and date of the last visit.
FILETYPES: Lists how many hits went to specific file types (by extension). This won’t tell you much if your site uses URL rewriting, like WordPress with its custom URLs. Shows the file type, hits, and bandwidth with and without compression.
DOWNLOADS: Lists files which were downloaded. Shows the path, number of downloads, visits, and the bandwidth.
OS: Lists the operating systems that visit your site. Shows the operating system’s short name and how many hits. A few examples of names are linuxandroid, blackberry, winnt, winxp, win2000, winvista, linux, macosx, winme, win2003, win7, Unknown.
BROWSER: Takes its best guess at the browser from the user agent. Shows the browser’s short name and the number of hits from that browser.
UNKNOWNREFERER: Lists user agents for which AWStats couldn’t figure out the operating system. The only other information provided is the last visit date.
UNKNOWNREFERERBROWSER: Lists user agents for which AWStats couldn’t determine the browser, as opposed to the operating system. Otherwise identical to UNKNOWNREFERER.
SEREFERRALS: Shows the search engine’s name and how many pages and hits came from that search engine. A few example names of search engines are google, baidu, bing, yandex, yahoo.
PAGEREFS: Lists the external web pages from which users reached your website (external link referrals). These URLs do not include parameters, which makes it difficult to find the true origin of some visits. Lists the URL, pages and hits.
SEARCHWORDS: Lists the search phrases (unedited) that brought users to your website and how many times each phrase did so. Google and various other search engines remove the search terms from their referer headers to protect their users’ privacy, but users who are not logged in or are not on the HTTPS site will still have the query in the referer header.
KEYWORDS: Like SEARCHWORDS, but instead of full search phrases it lists the individual keywords and the number of times each appears across all the search phrases.
ERRORS: Lists the HTTP status code, hits and bandwidth. Status 200 will not appear, but redirects, errors, and various other status codes will.
VISITOR: Shows all the IP addresses, or “visitors”, that have visited your site. Lists the pages they viewed, hits, bandwidth and last visit date.
DAY: Lists each date (YYYYMMDD) with its pages, hits, bandwidth and visits.
SESSION: Session duration ranges for visitors to the site. Shows each time range and the number of sessions that lasted that long.
SIDER: Shows a page URL with its number of pages, bandwidth, entries and exits. Great for seeing what users are interested in, and what content makes them go running.
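When you do get to the point of splitting a section’s lines, the job is the same for every section: break each line on spaces and attach your own field names. Below is a minimal sketch of that step. The function name is made up, the field names are whatever you copy from the comments in your own stats file, and the sample rows are invented to resemble the status code, hits and bandwidth layout described in the list above.

```php
<?php
/* Hypothetical helper: turns raw section rows into associative arrays.
   $fields is supplied by the caller, in the order documented in the
   comments of the stats file. Missing values become null. */
function awstats_split_rows($rows, $fields) {
	$out = array();
	foreach($rows as $row) {
		$values = explode(' ', trim($row));
		$record = array();
		foreach($fields as $i => $field) {
			$record[$field] = isset($values[$i]) ? $values[$i] : null;
		}
		$out[] = $record;
	}
	return $out;
}

/* Invented sample rows: status code, hits, bandwidth */
$rows = array('404 12 3456', '301 4 0');
$records = awstats_split_rows($rows, array('status', 'hits', 'bandwidth'));
```

All values stay strings here; cast them with (int) where you need numbers.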

Now that we have a huge list of sections, we can finally take a look at how to parse these files with some PHP code. This code was thrown together fairly quickly and should not be used in production. It is for educational purposes, to assist you in making your own library, in PHP or another language, to parse AWStats files.

<?php
	/* Sample Setup to test the code */
	$p = new AWStats('01','2013','geekthis.net','./');
	print_r($p->data);


	class AWStats {
		private $fh = false;
		public $lastError = false;
		public $data = array();

		function __construct($month,$year,$domain,$path='/tmp/awstats/') {
			$filename = $path.'awstats'.$month.$year.'.'.$domain.'.txt';
			if(!file_exists($filename)) {
				$this->lastError = 'File does not exist.';
				return; /* a constructor cannot return a value; check $lastError instead */
			}

			$this->fh = fopen($filename,'r');
			if($this->fh === false) {
				$this->lastError = 'File cannot be opened.';
				return;
			}

			$this->parse();

			/* Everything is in $data now, so release the file handle */
			fclose($this->fh);
			$this->fh = false;
		}

		/* Checks if line is a comment */
		private function comment($line) {
			if(isset($line[0]) && $line[0] == '#') {
				return true;
			}
			return false;
		}

		/* Builds an array based on a section */
		private function section() {
			$in_section = false;
			$section_name = '';
			$section_lines = 0;
			$on_line = 0;
			$section_content = array();

			if($this->fh === false) {
				return false;
			}

			while(($line = fgets($this->fh)) !== false) {
				$line = trim($line);
				if($this->comment($line)) {
					continue;
				}

				if($in_section) {
					if(strpos($line,'END_'.$section_name) === 0) {
						return array(
							'name' => $section_name,
							'lines' => $section_lines,
							'content' => $section_content
						);
					}else if($on_line <= $section_lines) {
						/* <= rather than < tolerates one extra row, which covers the low BEGIN_MAP count */
						array_push($section_content,$line);
						$on_line++;
						continue;
					}else {
						$this->lastError = 'Section Can Not Find Ending';
						return false;
					}
				}

				if(strpos($line,'BEGIN_') === 0) {
					$in_section = true;
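					$section_info = explode(' ',$line);
					$section_name = substr($section_info[0],6);
					/* Cast the row count to an integer; default to 0 if it is missing */
					$section_lines = isset($section_info[1]) ? (int)$section_info[1] : 0;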
					$on_line = 0;
					$section_content = array();
					continue;
				}
			}
			return false;
		}

		/* Parses the sections array and uses that data for whatever it needs it for */
		private function parse() {
			if($this->fh === false) {
				return false;
			}

			while($section = $this->section()) {
				/*
					Here you would place extra parsing code based on what you want
					to do with the data. But since this is only an example, the
					data is placed into an array with just the section name and
					the data for each line (untouched). Will have to split by [space]
				*/
				array_push($this->data,$section);


				/* You can add specific rules based on the section here */
				switch($section['name']) {
					case 'GENERAL':

						break;
					case 'ROBOT':

						break;
					/* Add the rest of the section cases */
				}
			}
		}
	}

The above code will output a huge array of all the data in the AWStats file. Each array inside the $data property contains “name”, “lines” and “content”. The “name” value is the section name, as listed above, and “lines” is the row count declared on the BEGIN_ line. The “content” value is another array in which each element is one line from the AWStats file. In the parse function, you should set up your own rules inside the switch for the sections you care about. Since I don’t know what you want the stats file for, I cannot code exactly for your purpose, so the above code is just an outline.
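One small step that tends to help before writing section-specific rules is re-keying the output by section name, so you can grab a section directly instead of looping over the list. The sketch below assumes the “name” and “content” keys produced by the class above; the function name is made up and the sample data is a placeholder, not real AWStats output.

```php
<?php
/* Hypothetical helper: re-keys the parser output by section name.
   If a name appears twice, the later section wins. */
function awstats_index_sections($data) {
	$indexed = array();
	foreach($data as $section) {
		$indexed[$section['name']] = $section['content'];
	}
	return $indexed;
}

/* Placeholder data in the same shape as the $data property */
$data = array(
	array('name' => 'GENERAL', 'lines' => 2, 'content' => array('first raw line', 'second raw line')),
	array('name' => 'ROBOT', 'lines' => 1, 'content' => array('another raw line'))
);
$sections = awstats_index_sections($data);
/* $sections['ROBOT'] now holds only the raw lines of the ROBOT section */
```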
