Web Site Statistics
Jaqui Lynch
Boston College
This session will provide information that will allow the webmaster to analyze traffic on their web site. It will cover the basics including an analysis of the logs as well as information about some of the noncommercial utilities available to help the webmaster.
In the past few years the web server has become more important at most sites, and thus it has become critical to monitor performance on that server. In the internet world, however, monitoring performance involves more than traditional metrics such as CPU utilization and memory; it also becomes critical to analyze traffic and track hits against the servers. Webmasters and their employers are interested in where people go online, how long people stay on the site or page and what software they are using. Advertisers at the site may also be interested in demographic data on the people accessing it. Another important reason for reviewing the logs is tracking attempts to hack into your site. The purpose of this paper is to discuss the steps in tracking web statistics and to provide information about some of the public domain packages available.
Let's start with some terminology:
Visit: All requests made by a specific user to the site during a set period of time. The visit is deemed to have ended if a set period of time (say 30 minutes) goes by with no further accesses. Users are identified by cookies, usernames or hostnames/IP addresses.
Session: All of the activities of a specific user during a single visit. This is difficult to track as it can be hard to know when a session has ended. This is because the client does not send an end session command.
Hit: A hit is normally viewed as the number of times a page or file is accessed. However, a single request to a server can generate several transactions: one for the page, one for each of its images and one for any redirections that are necessary. Many pages also have counters that supposedly record hits to the page. It should also be noted that, because of caching at commercial sites, a page can receive many more reads than the hits reported in the logs. If a site has a page in its cache, all it does is check whether the page has changed at the original server. If the page has not changed it is not downloaded; instead the version in the cache is used. In particular, sites such as Compuserve, AOL, etc. use large caching servers, so hits reported from these sites tend to be lower than actual page reads.
It should also be noted that when a page is requested without a trailing slash, such as http://www.cmg.org/jaqui, the server will generate a redirect to http://www.cmg.org/jaqui/. This may end up being counted as two hits instead of one.
Cookies and tokens: Cookies are magic headers that a server (or CGI program) sends to a client browser, providing the system with an identifier using the Set-Cookie header. Whenever that system accesses the site again, the browser sends a similar magic header back to the server. The file saved on the local system is usually called cookies.txt. Cookies can be used to provide continuity or state information about a client's accesses. There is a great deal of resistance among users on the web to the use of cookies; because of their intrusive nature, users object strenuously to a server writing to their hard disk. For some people, the mere presence of cookies at your site will discourage them from ever accessing it again.
Counters: Counters are heavily used on web sites; it is very common to go to a page and see something like "You are visitor number 12345 to this page". These numbers cannot be trusted, as the page designer has the ability to seed the base number or to alter the counter so that it adds more than 1 each time.
In order to better understand the statistics, it is necessary to understand the log formats provided by most servers. There are four basic logs: the transfer log, the error log, the referrer log and the agent log. It is also possible to create and process other logs to obtain additional statistics. We will look at the details of each and then at analyzing each log.
i. Transfer Log
This log contains information about each request including the date and time, where the request came from and what the request was as well as other information.
Most servers store their transfer logs in a format called the Common Log Format. In this format each log entry consists of a host field, an identification field (RFC 931), an authentication field, a time stamp, the http request itself, a status code field and the transfer volume (bytes transferred to the client as a result of the request). Most web server analysis tools rely on log files being provided in this format.
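As an illustration, a typical Common Log Format entry might look like the following (the hostname, username, file name and byte count are made up for this example):
remote.example.edu - jsmith [01/Jan/1997:22:18:12 -0500] "GET /jaqui/index.html HTTP/1.0" 200 4523
Here the identification field is a hyphen, jsmith is the authenticated username, the request was a Get for /jaqui/index.html, the status code was 200 (OK) and 4523 bytes were returned to the client.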
a) Host Field
The host field contains the name or IP address of the remote system the request is coming from. In order to provide the system name the server has to do a DNS lookup based on the IP address in the incoming packet. This places an additional load on the network and is complicated by the fact that many of the IP addresses will fail the DNS lookup anyway, often because ISPs do not assign names to their IP addresses. If the server is exceptionally busy it can be helpful to turn off domain name logging to reduce this load. This is done in the httpd.conf file: on the CERN server DNSLookup is set to Off, while the Apache and NCSA servers provide a DNSMode setting that controls how much lookup is done.
Many sites use the fullest lookup mode (DNSMode Maximum) for security reasons. Be aware that turning DNS logging off means that only IP addresses will be reported, so it will be impossible to tell from the log which domains accesses are coming from.
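If hostname lookups are turned off to save load, the names can still be recovered later by resolving the logged addresses offline, outside peak hours. The short shell sketch below shows the idea; the log path and the use of the standard host command are assumptions, not part of any particular analysis package.
#!/bin/sh
# Resolve the unique client addresses in the access log offline, so the
# web server never has to do DNS lookups at request time.
# Assumes a Common Log Format access log in /var/log/httpd (adjust to suit).
accessfile="/var/log/httpd/access_log"
awk '{print $1}' "$accessfile" | sort -u |
while read addr
do
    # Print the address followed by whatever the resolver reports for it
    echo "== $addr"
    host "$addr"
done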
b) Identification Field (RFC 931)
In general this field contains only a hyphen (-), as it is rarely used. The RFC defines a protocol that sends a request back to the client system asking who the client really is. It requires ident software to be running on the client. Ident servers don't run on Windows 95 or Macintosh systems today, so this field is almost never filled in.
c) The Authentication Field
This field will contain either a hyphen (-) or the name of the user if authentication is used for this page. The authentication method referred to here is the default basic (base64-encoded username and password) method, not SSL. If a directory is protected with this default security then the username that was used will show here.
d) The Time Stamp
The time stamp consists of three fields: the date, the time and the offset from GMT. The format of the date is DD/MMM/YYYY where:
DD is the 2 digit day of the month
MMM is the first 3 letters of the month
YYYY is the 4 digit year
The format for the time component is HH:MM:SS where:
HH is the hour (00-23)
MM is the minutes (00-59)
SS is the seconds (00-59)
This is followed by the offset from GMT in the format -HHMM (e.g. -0500). So a typical example would be:
[01/Jan/1997:22:18:12 -0500]
e) The http request field
There are three types of http request from the client: Get, Post and Head. The request starts with the method and is followed by the name and version number of the protocol (typically HTTP/1.0).
The Get request is the standard way of requesting a document or a program execution. The browser will first check its cache for the page. If the page is found in the cache then a conditional (If-Modified-Since) Get is sent to the server. If the server reports that the page has not been modified since it was cached, the browser uses its current copy.
The Post request tells the server that data is following. This is normally used by programs or URLs that make use of data such as forms.
The Head request is not normally generated by browsers; all it does is return the response headers for the page rather than the page itself. It is useful for testing the validity of hypertext links.
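As an illustration, the request field for a simple page fetch, a form submission and a link check might look like the following (the file and program names are made up):
"GET /jaqui/index.html HTTP/1.0"
"POST /cgi-bin/register.pl HTTP/1.0"
"HEAD /jaqui/index.html HTTP/1.0"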
f) The status code field
This code is written to the log by the server to record the success or failure of the transaction. Status codes come in four classes: 2xx (success), 3xx (redirection), 4xx (client error) and 5xx (server error).
Within each class there are several codes. Typical codes seen are:
200 OK (the request succeeded)
302 Found (the request was redirected)
304 Not Modified (i.e. the image or page was cached)
404 Not Found
500 Internal Server Error
g) The transfer volume field
This field is only applicable if the status was 200 (OK), as no other request returns data. The value is either a hyphen (-), a 0 or an ASCII representation of the number of bytes transferred to the client as a result of the request. A hyphen (or sometimes a 0) is shown for all codes other than 200. A value of 0 will also show up when no data was transferred because the image or page was already cached (i.e. code 304 Not Modified).
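Because the status code and the transfer volume are the last two fields of each Common Log Format line, a quick picture of how requests and bytes break down by status code can be had with a single awk command. This is only a sketch; the log path is an assumption:
# Count requests and total the bytes transferred for each status code.
# In Common Log Format the status code is the next-to-last field and the
# transfer volume is the last field; a "-" is treated as zero bytes.
awk '{ b = ($NF == "-") ? 0 : $NF; hits[$(NF-1)]++; bytes[$(NF-1)] += b }
     END { for (c in hits) printf "%s %8d requests %12d bytes\n", c, hits[c], bytes[c] }' /var/log/httpd/access_log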
ii. Error Log
This log contains information about errors and failures for requests. In some servers it is combined with the access or transfer log. Information recorded includes the date and time and any error codes and messages. Typically, if you have problems with any Perl or other CGI scripts, this is where you will find out what the problem is. The error log also contains information about the startup of the server, for example:
[Mon Feb 28 22:18:15 1997] HTTPd: Starting as ./httpd -d /usr/local/httpd
The time stamp is, for reasons unknown, in a different format from the transfer log. The format is:
[Dow Mon D HH:MM:SS YYYY] where:
Dow is the day of the week (i.e. Mon)
Mon is the month (i.e. Feb)
D is the day of the month (i.e. 28)
Error lines are generated in the error log for any of the failure codes mentioned earlier. However, errors are reported using a description rather than the error code. There are also additional error log entries for server startup, shutdown, receives, access failures, lost connections and timeouts.
iii. Referrer Log
This log records the URL from which the reader linked to the page requested, along with the URL of the page itself. The primary value of this log is that it shows the webmaster where accesses are coming from, and thus it can be useful for marketing and advertisement placement.
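As a quick illustration, the most common referring URLs can be pulled straight from this log with standard UNIX tools. The sketch below assumes an NCSA-style referer log in which each line begins with the referring URL; the path is also an assumption:
# List the ten most frequent referring URLs.
awk '{print $1}' /var/log/httpd/referer_log | sort | uniq -c | sort -rn | head -10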
iv. Agent Log
The agent log is used to store information about the browsers that are being used to access the server. Every browser identifies itself in some way, and this information can be used to determine which browsers pages should be designed for.
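A similar one-liner gives a rough browser breakdown, assuming an NCSA-style agent log that stores one browser identification string per line (again, the path is an assumption):
# Count the distinct browser identification strings and show the ten most common.
sort /var/log/httpd/agent_log | uniq -c | sort -rn | head -10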
v. Other Logs
These keep track of information about the various protocols and can also be analyzed to provide valuable information.
In general, the only two logs that are of major importance in analyzing a web site are the error log and the transfer log. From these two logs it is possible to obtain statistics such as total and average requests and bytes transferred, accesses by day and by hour, accesses by domain, the most frequently requested pages and a summary of the errors encountered. There are many other statistics available, but these are typical of the type of statistics that will be needed.
Given the size of the logs involved, it is necessary to use some sort of reporting software in order to get meaningful reports. There are three methods of accomplishing this: use freeware/shareware, buy a commercial package or use a traffic tracking service. Things that need to be considered when looking at these include the platform you want to run on, whether you want to do your analysis on the server or elsewhere, ease of installation, ease of maintenance and use, customization options and cost.
In the freeware/shareware arena there are many options, of which only a few are discussed here. There are also many commercial products available, but these are not discussed here; an analysis of the commercial products is available in the "Web Site Stats" book by Rick Stout (published by Osborne). Depending on your site's requirements it may be necessary to move to a commercial product, although the recommendation would be to look at the freeware/shareware options first. Traffic tracking services are companies to whom the server's logs are sent for analysis and reporting.
Freeware/Shareware Software:
Freeware refers to software that you can use and copy for no charge. Shareware usually involves a small to moderate sum of money. The software below falls into one of these two categories.
wwwstat is a Perl program that goes through the log file and produces an html page of statistics. wwwstat offers only cursory log analysis and does not cover visit-based reporting or top 10 style reporting. It provides no graphs but is very detailed. Examples of the output can be seen at http://www.cmg.org/stats/wwwstats.html. The report has six sections:
1) Summary statistics - total files and bytes transmitted during the summary period, and average files and bytes transmitted daily. This is useful for getting an idea of the average load.
2) Daily statistics - for each day, the percentage of all requests for the summary period that the day represents, the percentage of all bytes sent, the number of bytes sent during the day and the number of requests handled, followed by the date concerned.
3) Hourly statistics - the same data as 2) above, but reported by hour over the whole summary period rather than by day.
4) Totals by domain - the number of requests and bytes for each domain (i.e. .edu).
5) Totals by reversed subdomain - the same data as 4) above, but by reversed domain names.
6) Totals by file - the number of requests and bytes sent during the summary period for every file accessed at the web site, including images and cgi-bin programs.
wwwstat should always be called from a cron job that does some file copying and renaming so that monthly or weekly accumulation files can be kept. A sample script is provided in Figure 4.
wusage is shareware rather than freeware, and a license costs less than $100. It consists of two programs: a configuration program (makeconf) and the wusage program itself. wusage produces reports on:
1) Accesses or bytes by the hour of the day
2) Top 10 documents by access count or bytes
3) Top 10 referring sites by access count or bytes
4) Number of hits per result code
Reports include both a numerical and a graphical view of the data.
vbstats is a 16-bit Windows program that requires Visual Basic and Windows. It consists of five programs and provides a range of access and usage reports.
3dstats generates a 3D VRML model of the server load, but requires a VRML viewer.
Getstats is a C program that produces detailed and concise reports as follows: summaries of monthly, weekly, daily and hourly data; domain, request, directory tree and errors. It processes logs from httpd and gopher servers and also allows for incremental runs. There are several frontends available for getstats including CreateStats and Getgraph. CreateStats takes weekly log files and formats them to html files. Getgraph produces graphical versions of the reports. There is also a Csh script called getstats_plot that produces access plots.
FTPWeblog is a freeware integrated www and ftp log analyzer. The report produced is broken down into the following areas:
Summary - period covered, total files and bytes transferred and the number of unique domains.
Graphical Report - graphical view of the data.
Daily Number of Hits - accesses, bytes and date.
Hourly Statistics - accesses and bytes per hour.
Top Level Domains - accesses and bytes by domain.
Top 40 Archive Sections - top 40 sections by number of hits and then by volume (bytes) transferred.
Top 40 Files - top 40 files accessed by number of hits and then by volume (bytes) transferred.
Top 40 Domains - top 40 domains accessing the site by number of hits and by volume (bytes) transferred.
Complete Archive Section Statistics - statistics for the whole archive section.
Complete File Statistics - statistics for all file accesses.
Complete Domain Statistics - statistics for all accesses by domain.
Administrative Issues
The logs produced by the web server can become very large over time, depending on the number of accesses. Thus it becomes necessary to cycle the logs so that a history can be built while the logs stay a usable size. An example of how to do this is in Figure 4 (wwwstat.cron), which creates a monthly cycle for the logs. On the first of the month the script stops the web server, copies the access_log and the error_log to new files called type_log.monyy, where type is access or error, mon is the month (i.e. Jun) and yy is the year (i.e. 97), clears out the live web logs, restarts the server and then runs wwwstat against the month's archived access log.
This script basically automates the archival of the web logs, and the wwwstat program it calls creates the analysis data in the form of a web page called wwwstats.html. It creates a history of those pages by saving last month's web page under a name that includes the month and year (Monyy) and keeping a pointer back to it in the current statistics page.
Since the web logs can become very large, it may also be necessary to add commands to the script to compress or zip the old log files in order to reduce the disk space they take up. The script needs to be scheduled to run using the UNIX cron command or its equivalent on other platforms. Cron uses a format that allows you to specify the minute, the hour, the day of the month, the month of the year and the day of the week on which the command is to be executed.
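For example, the crontab entry suggested in Figure 4 runs the script a minute after midnight every day, and the script itself decides whether it is the first of the month:
# minute  hour  day-of-month  month  day-of-week  command
1 0 * * * /var/www/cgi-bin/wwwstat.cron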
Because the log cycling is handled by a standard script, it can also be useful to add automated scanning features to the cron job. This is useful for identifying hackers and/or the trends that lead up to hacker attacks. Useful things to scan for include phf, rshd, xterm, passwd and site names that are known to have been used to hack into other systems. The script can be modified to create an output file containing the results of the scans, and a second script can take that file and automatically email it to a list of pre-chosen people, as sketched below.
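A minimal version of such a scan, assuming the log locations used in Figure 4 and a made-up recipient address, might look like this:
#!/bin/sh
# Scan the current access and error logs for strings that often show up in
# probe attempts and mail any matches to the webmaster.
# The log paths and the recipient address below are assumptions for this sketch.
logs="/var/log/httpd/access_log /var/log/httpd/error_log"
recipients="webmaster@example.edu"
scanfile="/tmp/webscan.$$"
egrep -i 'phf|rshd|xterm|passwd' $logs > "$scanfile"
if [ -s "$scanfile" ]; then
    mail -s "Suspicious web log entries" $recipients < "$scanfile"
fi
rm -f "$scanfile"
exit 0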
Other administrative considerations include the performance of the web analyzers, disk space, the manageability of the log files and how much of a history you want to keep of the data. There are several papers available on the web about the performance of web analyzers. It is important to note that, unless your logs are huge, performance of the analyzers will not be a problem. However, at a site where the logs need to be cycled weekly because of their size, it is important to review the performance of the analyzers. It is not atypical to see a monthly log of 500,000 lines (about 48 MB).
Summary
There are a huge number of freeware and shareware statistical analyzers for the logs associated with web servers and other network applications. It is important to determine in advance exactly what information is important to your site before looking at tools and testing them out.
Figures 1 and 2 contain a partial list of log analysis tools, and Figure 3 contains a list of URLs for additional software that many of these tools require. Figure 4 contains a copy of the wwwstat.cron program that does log cycling for the wwwstat program. Figure 5 contains a list of tools by platform.
Recommended Reading:
1. http://www.uu.se/Software/Getstats/Performance.html
2. http://www.piperinfo.com/pl01/usage.html
3. http://gopher.nara.gov:70/0h/what/stats/webanal.html
4. Web Site Stats, Rick Stout, Osborne, ISBN 0-07-882236-X
5. Yahoo search on "log analysis software"
Figure 1 - Freeware/Shareware Log Analysis Tools
The author of this paper does not necessarily endorse any of the software listed below and takes no responsibility for the code. Readers should check the web for the latest information and updates to the notes below. There are also many other tools well worth reviewing; this is just a partial list.
3dstats - 3D access statistics generator using a VRML viewer. Analyzes web server logfiles and creates a VRML model of the average load by hour.
http-analyze - Freeware web server access log analyzer.
accesspanda - Web log graphing Perl 5 CGI script.
accesswatch - Perl-based access accounting tool for web servers. Creates text data and bar graphs.
analog - Similar to getstats but is highly configurable and can produce output in multiple languages.
browsercounter - Perl script to analyze the agent log by browser, version and platform.
createstats - Perl frontend to getstats. Analyzes weekly log files.
getstats - C source program.
wwwstat - Processes access and error logs, producing a log summary in html format.
gwstat - Processes wwwstat output to generate gif graphs.
getgraph.pl - Perl script to generate graphs from getstats output.
ftpweblog - Freeware Perl 5 www and ftp log reporting tool.
fwgstat - Perl script that parses ftp, gopher, wais and http logs and creates a usage report from them.
httpd-log - Python script.
iisstat - Perl 4 script to generate statistics for FTP, Gopher and httpd.
getstats_plot - Csh script to generate access plots. Does accesses by month, week, day and top 25.
IIstats - Free program that reads MS Internet Information Server logfiles.
country-codes - Translates the suffixes (i.e. .nz) to country names.
mkstats - Shareware Perl scripts that provide user and visitor info, summary pages, time trends, etc.
metasummary - Perl 5 script that analyzes wwwstat output into summaries.
mw3s - Perl script that analyzes logs from multiple web servers and produces a list of the top 20 pages.
pressview - Windows NT logfile database and reporting for HTTP servers.
refstats - Perl script to analyze the referer log.
statbot - Creates instant snapshot graphs.
vbstats - Windows 3.1 and 3.11 tool to analyze common log format logs.
webstat - Python tool that produces services & domain usage, country, by-page, date and summary reports.
wusage - Shareware producing reports on overall hits and page accesses.
Figure 2 - URLs for Freeware/Shareware Log Analysis Tools
The author of this paper makes no claims about and takes no responsibility for the content of any of the URLs listed below.
3dstats
http-analyze
accesspanda
accesswatch
analog
browsercounter - http://www.netimages.com/~snowhare/utilities/browsercounter.html
createstats
getstats
wwwstat
gwstat
getgraph.pl
ftpweblog
fwgstat
httpd-log
iisstat - http://ftp.support.lotus.com/pub/utils/InternetServices/iisstat/iisstat.html
getstats_plot - http://infopad.eecs.berkeley.edu/~burd/software/getstats_plot/source
iistats
country-codes - http://www.ics.uci.edu/pub/websoft/wwwstat/country-codes.txt
mkstats
metasummary
mw3s
pressview
refstats
statbot
vbstats
webstat - http://www.webstat.com and http://www.huygeno.org/~sijben/statistics/advertisement.html
wusage
Figure 3 - Additional Software URLs
The author of this paper makes no claims about and takes no responsibility for the content of any of the URLs listed below.
perl - Interpreted language used by many of the tools. http://prep.ai.mit.edu/pub/gnu (v4 and v5)
gnuplot - Plotting tool
gcc - Freeware C compiler
gd.pm
Giftrans
gzip - Tool needed to zip and unzip the files for these tools
xmgr
imagemagick
ghostscript
gd graphics
pbm
webspace
python
povray
gifconv
Figure 4 - Sample UNIX script to call wwwstat
#!/bin/sh -fh
#
# wwwstat.cron
# Run this script just after midnight on every day of the month.
# Example crontab entries:
# --------------------------------------------------
# 1 0 * * * /var/www/cgi-bin/wwwstat.cron
# --------------------------------------------------
#
{
program="/var/www/cgi-bin/wwwstat"
httpd="/usr/contrib/bin/httpd"
statdir="/usr/local/htdocs/stats"
statfile="wwwstats.html"
tmpfile="/tmp/wwwstats.$$"
accessfile="/var/log/httpd/access_log"
errorfile="/var/log/httpd/error_log"
pidfile="/var/run/httpd.pid"
umask 077
day="`/bin/date +'%d'`"
month="`/bin/date +'%m'`"
year="`/bin/date +'%y'`"
# Work out the name of the month that has just ended
set -- Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
shift `expr "$month" - 1`
lmonth="$1"
if [ "$day" -eq 1 ]; then
#
# First kill HTTP daemon to avoid interference
#
httpdpid=`/bin/cat "$pidfile"`
if [ -n "$httpdpid" ]; then
kill "$httpdpid"
fi
#
# Copy Access and Error logfiles
#
cp -p "$accessfile" "$accessfile.$lmonth$year"
cp -p "$errorfile" "$errorfile.$lmonth$year"
chown www.www "$accessfile.$lmonth$year"
chown www.www "$errorfile.$lmonth$year"
#
# Empty Access and Error logfiles
#
echo >"$accessfile"
echo >"$errorfile"
#
# Restart HTTP daemon
#
(cd / ; "$httpd")
#
# Run stats program
#
$program -d "$lmonth" "$accessfile.$lmonth$year" >"$tmpfile" &&
mv "$tmpfile" "$statdir/$lmonth$year.$statfile" &&
chown www.www "$statdir/$lmonth$year.$statfile"
#
# Copy this as the current stats file
cp -p "$statdir/$lmonth$year.$statfile" "$statdir/$statfile"
else
#
# Run stats program
$program >"$tmpfile" &&
cp "$tmpfile" "$statdir/$statfile" &&
chown www.www "$statdir/$statfile"
fi
# Add scanning command here
}
exit
Figure 5 - Freeware/Shareware Log Analysis Tools by Platform
The author of this paper does not necessarily endorse any of the software listed below and takes no responsibility for the code. Readers should check the web for the latest information and updates to the notes below. There are also many other tools well worth reviewing; this is just a partial list.
UNIX tools:
FTPWeblog - analyzes www and ftp logs
http-analyze
BrowserCounter - analyzes the agent log
RefStats - analyzes the referer log
CookieStats - analyzes the cookie log
AccountStats - extracts AuthUser activity summaries
Windows/NT tools:
Hitcount for NT - freeware web page hit counter
DNS Workshop - shareware; does IP-to-name conversions when processing log files
Site*Sleuth - shareware; analyzes where traffic is coming from, who visits your site and how well your site works
Hitlist Std - freeware; 11 reports, 162 tables, graphs and calculations; autogenerates ascii or html reports
Netstats Pro - shareware; tracks and analyzes usage of your site
Webcount - shareware hit counter with multiple styles
OnTheRoad - freeware; converts Unix logs to Windows-based logs
The Counter - shareware hit counter for Win 95 or NT servers
Macintosh tools (see http://arpp.carleton.ca/mac/tool/log.html); I am not certain which of these are freeware or shareware:
Analog - logfile analysis
Bolero - website activity tracking
FTPd2MacHTTP - converts gopher, http and anonymous FTP log entries to MacHTTP format
Getstats - server log analysis
LogDoor - multi-site realtime log processor
LogRoller - timed rollover of log files
MacHTTP Logger - generates stats for reading in Hypercard
Serverstat - analyses MacHTTP, Webstar & gophersurfer logs
Webstat - analyses MacHTTP logs
Wusage - produces pie and line graphs
Wwwstat - server log reports