By Rob Tougher |
I use the Apache HTTP Server to run my web site. When a visitor requests a page from the site, Apache records the following information in a file named "access_log":
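An entry in Apache's combined log format looks something like this (an invented example, not a line from my real log):

12.34.56.78 - - [25/Dec/2002:10:24:05 -0500] "GET /index.html HTTP/1.1" 200 5678 "http://www.google.com/search?q=linux" "Mozilla/5.0 (X11; Linux i686)"

Each line records, among other things, the visitor's IP address, the date of the request, the page requested, and the referrer.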
Until recently I used a combination of command line utilities (grep, tail, sort, wc, less, awk) to extract this information from the access log. But some complex calculations were difficult and time-consuming to perform using these tools. I needed a more powerful solution - a programming language to crunch the data.
Enter Python. Python is fast becoming my favorite language, and was the perfect tool for solving this problem. I created a framework in Python for performing generic text file analysis, and then utilized this framework to glean information from my Apache access log.
This article first explains the framework, and then describes two examples that use it. My hope is that by the end of this article you will be able to use this framework for analyzing your own text files.
When trying to solve this problem I initially turned to Gawk, an implementation of the Awk language. Awk is primarily used to search text files for certain pieces of data. The following is a basic Awk script:
Listing 1: count_lines.awk
#!/usr/bin/awk -f

BEGIN {
    count = 0
}

{
    count++
}

END {
    print count
}
This script prints the number of lines in a file. You can run it by typing the following at a command prompt:
prompt$ ./count_lines.awk access_log
Awk reads in the script, and does the following:

1. Runs the BEGIN block once, before reading any input.
2. Runs the middle block once for every line of the file.
3. Runs the END block once, after the last line has been read.
I liked this processing model. It made sense to me - first run some initialization code, next process the file line by line, and finally run some cleanup code. It seemed perfectly suited to the task of analyzing text files.
Awk gave me trouble, though. It was very difficult to create complex data structures - I was jumping through hoops for tasks that should have been much more straightforward. So after some time I started looking for an alternative.
My situation was this: I liked the Awk processing model, but I didn't like the language itself. And I liked Python, but it didn't have Awk's processing model. So I decided to combine the two, and came up with the current framework.
The framework resides in awk.py. This module contains one class, controller, which implements the following methods (a sketch of one possible implementation appears after the handler interface below):
__init__(file) - the constructor, which takes a file object to process.
subscribe(handler) - subscribes a handler to the controller.
run() - processes the file.
print_results() - prints the results of the process.
A handler is a class that implements a defined set of methods. Multiple handlers can be subscribed to the controller at any given time. Every handler must implement the following methods:
begin() - gets called once before the file is processed.
process_line(line) - gets called for each line of the file.
end() - gets called after the file is processed.
description() - gets called from controller.print_results(). It should return a description of the handler.
result() - also called from controller.print_results(). It should return the results of the class' calculations.
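To make the division of labor concrete, here is a minimal sketch of a controller that satisfies this interface. This is only a sketch - the awk.py listed at the end of the article may differ in its details:

class controller:
    def __init__(self, file):
        # The file object whose lines will be fed to the handlers.
        self.m_file = file
        self.m_handlers = []

    def subscribe(self, handler):
        # Register a handler; any number can be subscribed.
        self.m_handlers.append(handler)

    def run(self):
        # The Awk processing model: begin() once, process_line()
        # for every line of the file, then end() once.
        for h in self.m_handlers:
            h.begin()
        for line in self.m_file:
            for h in self.m_handlers:
                h.process_line(line)
        for h in self.m_handlers:
            h.end()

    def print_results(self):
        # Ask each handler for its description and its result.
        for h in self.m_handlers:
            print("%s: %s" % (h.description(), h.result()))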
You create handlers, subscribe them to the controller, and then run the controller. The following is a simple example with one handler:
Listing 2: count_lines.py
# Standard sys module
import sys
# Custom awk.py module
import awk

class count_lines:
    def begin(self):
        self.m_count = 0
    def process_line(self, s):
        self.m_count += 1
    def end(self):
        pass
    def description(self):
        return "# of lines in the file"
    def result(self):
        return self.m_count

#
# Step 1: Create the Awk controller
#
ac = awk.controller(sys.stdin)

#
# Step 2: Subscribe the handler
#
ac.subscribe(count_lines())

#
# Step 3: Run
#
ac.run()

#
# Step 4: Print the results
#
ac.print_results()
You can run this script using the following command:
prompt$ cat access_log | python count_lines.py
The results of the script should be printed to the console.
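With a print_results() along the lines of the sketch shown earlier, running the script against a log of, say, 1500 requests would print something like:

# of lines in the file: 1500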
Now that the framework was in place, I had to figure out how I was going to use it. I came up with many ideas, but the following two were the top priorities.
The first question that I wanted to answer using my new framework was the following: how many visitors keep coming back to the site?
My thinking was this: if people return often, they must enjoy the site, right? The following script answers the above question:
Listing 3: return_visitors (can be found in handlers.py)
class return_visitors:
    def __init__(self, n):
        self.m_n = n
        self.m_ip_days = {}
    def begin(self):
        pass
    def process_line(self, s):
        try:
            array = s.split()
            # Field 0 is the client IP address; field 3 is the date,
            # e.g. "[25/Dec/2002:10:24:05" - chars 1 to 6 give "25/Dec".
            ip = array[0]
            day = array[3][1:7]
            if self.m_ip_days.has_key(ip):
                if day not in self.m_ip_days[ip]:
                    self.m_ip_days[ip].append(day)
            else:
                self.m_ip_days[ip] = []
                self.m_ip_days[ip].append(day)
        except IndexError:
            pass
    def end(self):
        # Count the IP addresses seen on more than m_n distinct days.
        ips = self.m_ip_days.keys()
        count = 0
        for ip in ips:
            if len(self.m_ip_days[ip]) > self.m_n:
                count += 1
        self.m_count = count
    def description(self):
        return "# of IP addresses that visited more than %s days" % self.m_n
    def result(self):
        return self.m_count
The handler records the set of days on which each IP address visited the site. When the file has been processed, it reports how many IP addresses visited on more than n days, where n is the threshold passed to the constructor.
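To run this handler you need a small driver script along the lines of Listing 2. Here is one possibility, assuming the handler is imported from handlers.py; the threshold of 5 days is an arbitrary choice for illustration:

import sys
import awk
from handlers import return_visitors

# Count IP addresses that visited on more than 5 distinct days.
ac = awk.controller(sys.stdin)
ac.subscribe(return_visitors(5))
ac.run()
ac.print_results()

Save it under any name you like and run it the same way as Listing 2, piping the access log to standard input.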
Another thing I wanted to know was how people found out about the site. I was getting a decent amount of traffic, and I wasn't sure why. I kept asking myself: where is all this traffic coming from?

I suppose you shouldn't argue with a popular site. But I was curious to know how people were learning about mine, so I wrote the following script:
Listing 4: referring_domains (can be found in handlers.py)
import re

class referring_domains:
    def __init__(self):
        self.m_domains = {}
    def begin(self):
        pass
    def process_line(self, line):
        try:
            array = line.split()
            # Field 10 is the referrer in the combined log format.
            referrer = array[10]
            # Pull the domain out of the referrer URL.
            m = re.search('//[a-zA-Z0-9\-\.]*\.[a-zA-Z]{2,3}/', referrer)
            length = len(m.group(0))
            domain = m.group(0)[2:length-1]
            if self.m_domains.has_key(domain):
                self.m_domains[domain] += 1
            else:
                self.m_domains[domain] = 1
        except AttributeError:
            # re.search found no match and returned None.
            pass
        except IndexError:
            pass
    def end(self):
        pass
    def description(self):
        return "Referring domains"
    def sort(self, key1, key2):
        # Comparison function for sorting domains by descending count.
        if self.m_domains[key1] > self.m_domains[key2]:
            return -1
        elif self.m_domains[key1] == self.m_domains[key2]:
            return 0
        else:
            return 1
    def result(self):
        s = ""
        keys = self.m_domains.keys()
        keys.sort(self.sort)
        for domain in keys:
            s += domain
            s += " "
            s += str(self.m_domains[domain])
            s += "\n"
        s += "\n\n"
        return s
This script extracts the referrer field from each request, uses a regular expression to pull out the referring domain, and tallies the requests per domain. For example, a referrer of "http://www.google.com/search?q=linux" matches as "//www.google.com/", and the handler counts one more visit referred by www.google.com. The result is a list of referring domains, sorted by frequency.
I ran the script and found that most of the referrals came from my own site. This makes sense - when a visitor moves from one page to another on the site, the referring domain for the page is my web site's domain. But I did find some interesting entries in the referral list, and my question about site traffic was answered.
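Since the controller accepts any number of handlers, both of these questions can be answered in a single pass over the log. A sketch, again assuming the handlers are imported from handlers.py:

import sys
import awk
from handlers import return_visitors, referring_domains

ac = awk.controller(sys.stdin)
# Every subscribed handler sees every line during run().
ac.subscribe(return_visitors(5))
ac.subscribe(referring_domains())
ac.run()
ac.print_results()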
The following files contain the code from this article:

awk.py - the controller class
count_lines.py - the count_lines handler
handlers.py - the return_visitors and referring_domains handlers
In this article I described how I use Python to process my Apache HTTP Server access log. Hopefully I explained my techniques clearly enough so that you can use them for your text files.