Basic Analytics - Server Logs

March 07, 2008, Dave Taillefer

Web Analytics 101

It's one thing to collect website metrics, but making heads or tails of them is another matter altogether. There are a few web stat programs out there that decipher server logfiles in a quasi-coherent fashion, but they all output metrics in a different way. Understanding the differences between them can be useful - especially when your boss is asking you for specifics on the company traffic report. How do you interpret the information? What do all the discrepancies mean? How do these programs read logfiles? There are some great 'free' (open source) server tools out there to help us wade through data and extract measurable results. I'll go over what I know about how logfiles are interpreted and I'll briefly cover 3 popular logfile statistics programs - Webalizer, AWStats, and Analog.

How Server Logs are Collected - the CLF (Common Log Format)

Web servers keep a log of what they are doing, usually by recording events in plain text files. Each time someone or something asks for a web page, or any component on that page (like a graphic), the server writes another line in the logfile to represent that request. Errors and unsuccessful requests are also logged, so details pile up. Raw logs are ugly and a little tricky for humans to read... thankfully, logfile analysis programs such as Webalizer, AWStats, and Analog (there are many more) have been written to help us interpret the results.
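To make the "one line per request" idea concrete, here is a minimal Python sketch that splits a single CLF line into its fields. The sample line, the `CLF_PATTERN` name and the `parse_clf_line` helper are made up for illustration; real logs contain edge cases (escaped quotes in the request, for example) that this pattern does not try to handle.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Return a dict of fields for one CLF line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means the server sent no response body
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Made-up sample line, for illustration only
sample = '192.0.2.10 - - [07/Mar/2008:10:15:32 -0500] "GET /index.html HTTP/1.1" 200 5120'
entry = parse_clf_line(sample)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```

Everything a logfile analyzer reports is derived from fields like these: the remote host, the time, the request line, the status code and the response size.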

Understanding the Limitations

Servers are limited by the fact that they cannot distinguish between a human being and a robot (search spider, email harvester, or other). The server sees remote IPs making individual requests. The process is simple - a remote address connects, sends a request, receives a response and then disconnects. The server records this exchange in the logfile. The thing to understand here is that each individual request looks the 'same' to the server... it won't tell a human from a robot. The results output by the stats program are therefore open to interpretation...

Another limitation is that a unique IP doesn't always represent a unique individual (or entity) looking at your website. An IP address can represent an array of possibilities: a robot scanning your site, a human, or an entire group of people behind a single IP. The opposite can also be true, where an individual revisits your website later in the day, but the Internet Service Provider has assigned this surfer a 'new' IP... now the server logs show 2 IPs for 1 user, so the logfile analyzer registers 2 unique visitors. Ultimately, 'unique IP' metrics are really only 'best guesses', where a 'visit' is based on the assumption that a single IP address represents a single user.
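The 'best guess' nature of these numbers is easier to see in code. The sketch below counts unique IPs and 'visits' from a list of (IP, timestamp) pairs, assuming a visit ends after 30 minutes of silence from that IP. The 30-minute cutoff, the `count_visits` name and the sample data are all assumptions for illustration; each analyzer makes its own choices here.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # assumed cutoff, not a standard

def count_visits(requests):
    """Count unique IPs and visits from (ip, timestamp) pairs.

    A new visit starts when an IP has been silent longer than VISIT_TIMEOUT.
    """
    last_seen = {}
    visits = 0
    for ip, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous > VISIT_TIMEOUT:
            visits += 1
        last_seen[ip] = when
    return len(last_seen), visits

# Made-up requests: one IP comes back three hours later, another appears once
t = datetime(2008, 3, 7, 9, 0)
requests = [
    ("192.0.2.1", t),
    ("192.0.2.1", t + timedelta(minutes=5)),
    ("192.0.2.1", t + timedelta(hours=3)),
    ("192.0.2.2", t + timedelta(minutes=1)),
]
unique_ips, visits = count_visits(requests)
print(unique_ips, visits)  # 2 unique IPs, 3 visits
```

Nothing in this logic can tell whether 192.0.2.1 was one person returning, two people sharing an office connection, or a robot. The code only sees addresses and timestamps, which is exactly the limitation described above.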

Lastly, web browser caching also presents a problem for logfile analysis. If a person revisits a page, the second request will often be served from the browser's cache, so no request reaches the web server. Therefore, that visitor's path through the site is lost. Web servers can be configured to prevent this (by sending headers that disable caching), but at the expense of performance (for the visitor).

Alternative formats used to make better assumptions

There are non-standard log formats that logfile analyzers can also read. One popular choice is the 'combined' log format, where the basics of the CLF are used in conjunction with the 'User-Agent' and 'Referrer' fields (the HTTP header itself is spelled 'Referer'). A 'User-Agent' could be a browser like Firefox, Internet Explorer, Safari, etc... The 'Referrer' represents the web page that directed the user to your website - it could be a Google search results page, a blog or another website. Unfortunately, these extra fields can also be misleading, since both User-Agent and Referrer values can be modified and/or spoofed, resulting in erroneous logfile stat reports. But in general, they clean up a little bit of the mess.
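As a rough illustration of what the two extra fields buy you, the following Python sketch parses combined-format lines and makes a crude robot guess from the User-Agent string. The sample lines, the `BOT_HINTS` list and both function names are hypothetical, and since the User-Agent can be spoofed, the guess is only as honest as the client.

```python
import re

# Combined format = CLF fields followed by "referer" and "user-agent"
COMBINED_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Tiny illustrative list of substrings; real analyzers ship much longer ones
BOT_HINTS = ("bot", "crawler", "spider", "slurp")

def parse_combined_line(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED_PATTERN.match(line)
    return match.groupdict() if match else None

def looks_like_robot(agent):
    """Guess from the User-Agent alone; a spoofed agent defeats this."""
    agent = agent.lower()
    return any(hint in agent for hint in BOT_HINTS)

# Made-up sample lines, for illustration only
samples = [
    '192.0.2.10 - - [07/Mar/2008:10:15:32 -0500] "GET /index.html HTTP/1.1" 200 5120 '
    '"http://www.example.com/search?q=analytics" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12"',
    '192.0.2.77 - - [07/Mar/2008:10:16:01 -0500] "GET /robots.txt HTTP/1.1" 200 68 '
    '"-" "ExampleBot/1.0 (+http://www.example.com/bot.html)"',
]
parsed = [parse_combined_line(line) for line in samples]
for entry in parsed:
    print(entry["host"], entry["referer"], looks_like_robot(entry["agent"]))
```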

For more info on how Apache logs in either the Common Log Format or the Combined Log Format, see the Apache HTTP Server documentation on log files.

Popular 'open source' logfile analyzers

Webalizer is a free web server logfile analysis program, distributed under the GNU General Public License. It comes standard with most Linux (or Apache) based hosting accounts and basically provides a detailed report in HTML format. It was designed to be run from the command line or as a cron job. Most web hosting companies assume you'll just be looking at the HTML report via a provided URL, so they often default to running daily cron jobs. Webalizer supports the CLF and Combined log formats. Incidentally, the last Webalizer release seems to have been back in 2002, so it appears that not much has been done on Webalizer since then.

AWStats is also a free web server logfile analysis program, distributed under the GNU General Public License. AWStats is actually a Perl script (awstats.pl), which parses your server's logfiles and generates reports. AWStats also supports the CLF and Combined log formats (and more). The development of this program seems to be alive and kicking, with the latest AWStats 6.8 release in November 2007.

AWStats compiles statistics for unique visitors by looking at 'pages' (not IPs). This is significant because many visitors are behind a proxy server when they surf (e.g. AOL users). When an AOL user hits your site, it's possible that several hosts (several IP addresses) are used to reach your web site (e.g. one proxy server to download the webpage and 2 other servers to download all the images). An IP-based count would therefore log 3 unique visitors when really only 1 visitor explored your site. So AWStats only considers requests for HTML pages when counting unique visitors. This decreases the margin of error... but a level of error still exists, since some websites use 'frames', which are in effect a combination of pages.
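The AOL example can be sketched in a few lines of Python. This is not AWStats's actual code, just an illustration of the idea under stated assumptions: the `PAGE_EXTENSIONS` set, the function names and the three sample requests are all made up, and which file types count as a 'page' would need adjusting per site.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of extensions that count as a 'page'; "" covers URLs like "/"
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php"}

def is_page(url_path):
    """True if the requested path looks like a page rather than an asset."""
    ext = posixpath.splitext(urlsplit(url_path).path)[1].lower()
    return ext in PAGE_EXTENSIONS

def unique_visitors(requests, pages_only):
    """Count distinct hosts, optionally ignoring non-page requests."""
    return len({host for host, path in requests if not pages_only or is_page(path)})

# One visitor behind a proxy farm: three different hosts fetch one page view
requests = [
    ("192.0.2.1", "/index.html"),
    ("192.0.2.2", "/images/logo.gif"),
    ("192.0.2.3", "/images/banner.jpg"),
]
print(unique_visitors(requests, pages_only=False))  # 3 when every hit counts
print(unique_visitors(requests, pages_only=True))   # 1 when only pages count
```

The same filter would still overcount a framed site, where a single view requests several HTML documents, possibly through different proxy hosts.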

AWStats also uses a .txt list of known search engine spiders, together with requests for your robots.txt file (if you have one in place), to separate search engine activity from human visits. Note, however, that many crawlers ignore the robots.txt file and go unnoticed; they are then counted as human visitors... more error.
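Here is a small sketch of that kind of separation, again as an assumption-laden illustration rather than AWStats's real logic. A host is flagged as a robot if its User-Agent matches a short list of known spiders or if it requested /robots.txt. The `KNOWN_SPIDERS` list, the `split_robots` name and the sample requests are hypothetical.

```python
# Tiny illustrative list; a real analyzer ships a much longer one
KNOWN_SPIDERS = ("googlebot", "slurp", "msnbot")

def split_robots(requests):
    """Split hosts into (robots, humans) from (host, path, agent) tuples."""
    robots = set()
    hosts = set()
    for host, path, agent in requests:
        hosts.add(host)
        known = any(name in agent.lower() for name in KNOWN_SPIDERS)
        # Asking for robots.txt is treated as a sign of a robot
        if known or path == "/robots.txt":
            robots.add(host)
    return robots, hosts - robots

# Made-up requests; the last host is a crawler posing as a browser
requests = [
    ("192.0.2.1", "/index.html", "Mozilla/5.0 Firefox/2.0"),
    ("192.0.2.50", "/index.html", "Googlebot/2.1"),
    ("192.0.2.60", "/robots.txt", "UnknownCrawler/0.1"),
    ("192.0.2.70", "/contact.html", "Mozilla/4.0 (compatible)"),
]
robots, humans = split_robots(requests)
print(sorted(robots))
print(sorted(humans))
```

The crawler at 192.0.2.70 never asks for robots.txt and sends a browser-like User-Agent, so it lands in the human set, which is the extra error mentioned above.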

Analog is considered by some as the best web server logfile analysis tool around, although it has not undergone further development since 2004. It may therefore appeal mostly to die-hards who are inclined to configure it to their liking. Analog tries to output almost every metric possible, which can be a bit overwhelming as it is all (by default) on a single page. It is, however, quite informative.

The results of logfile analysis are open to interpretation. They can be used to look for trends and to find out which pages are being viewed - filtered logfiles are a powerful source of metrics, but they must be interpreted with care. This raises a few questions - how can we better interpret our findings? How can we minimize the margin for error? What are the alternatives to logfile analysis?

Enter Google Analytics stage right. While this tool also has its limitations, it provides us with a wealth of knowledge beyond the scope of logfile analysis. I'll be sure to post more on this as soon as I can find the time. ;)


SEM and SEO - icona.ca


