By Rob Tougher |
I use the Apache HTTP Server to run my web site. When a visitor requests a page from the site, Apache records the following information in a file named "access_log":
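An entry in Apache's combined log format looks something like this (an invented example, not a line from my real log):

12.34.56.78 - - [25/Dec/2002:10:24:05 -0500] "GET /index.html HTTP/1.1" 200 5678 "http://www.google.com/search?q=linux" "Mozilla/5.0 (X11; Linux i686)"

Each line records, among other things, the visitor's IP address, the date of the request, the page requested, and the referrer.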
Until recently I used a combination of command line utilities (grep, tail, sort, wc, less, awk) to extract this information from the access log. But some complex calculations were difficult and time-consuming to perform using these tools. I needed a more powerful solution - a programming language to crunch the data.
Enter Python. Python is fast becoming my favorite language, and was the perfect tool for solving this problem. I created a framework in Python for performing generic text file analysis, and then utilized this framework to glean information from my Apache access log.
This article first explains the framework, and then describes two examples that use it. My hope is that by the end of this article you will be able to use this framework for analyzing your own text files.
When trying to solve this problem I initially turned to Gawk, an implementation of the Awk language. Awk is primarily used to search text files for certain pieces of data. The following is a basic Awk script:
Listing 1: count_lines.awk
#!/usr/bin/awk -f

BEGIN {
    count = 0
}

{
    count++
}

END {
    print count
}
This script prints the number of lines in a file. You can run it by typing the following at a command prompt:
prompt$ ./count_lines.awk access_log
Awk reads in the script, and does the following:

1. Runs the BEGIN block once, before reading any input.
2. Runs the middle block once for every line of the file.
3. Runs the END block once, after the last line has been read.
I liked this processing model. It made sense to me - first run some initialization code, next process the file line by line, and finally run some cleanup code. It seemed perfectly suited to the task of analyzing text files.
Awk gave me trouble, though. It was very difficult to create complex data structures - I was jumping through hoops for tasks that should have been much more straightforward. So after some time I started looking for an alternative.
My situation was this: I liked the Awk processing model, but I didn't like the language itself. And I liked Python, but it didn't have Awk's processing model. So I decided to combine the two, and came up with the current framework.
The framework resides in awk.py. This module contains one class, controller, which implements the following methods (a sketch of one possible implementation appears after the handler interface below):
__init__(file) - the constructor, which takes a file object to process.
subscribe(handler) - subscribes a handler to the controller.
run() - processes the file.
print_results() - prints the results of the process.
A handler is a class that implements a defined set of methods. Multiple handlers can be subscribed to the controller at any given time. Every handler must implement the following methods:
begin() - gets called once before the file is processed.
process_line(line) - gets called for each line of the file.
end() - gets called after the file is processed.
description() - gets called from controller.print_results(). It should return a description of the handler.
result() - also called from controller.print_results(). It should return the results of the class' calculations.
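To make the division of labor concrete, here is a minimal sketch of a controller that satisfies this interface. This is only a sketch - the awk.py listed at the end of the article may differ in its details:

class controller:
    def __init__(self, file):
        # The file object whose lines will be fed to the handlers.
        self.m_file = file
        self.m_handlers = []

    def subscribe(self, handler):
        # Register a handler; any number can be subscribed.
        self.m_handlers.append(handler)

    def run(self):
        # The Awk processing model: begin() once, process_line()
        # for every line of the file, then end() once.
        for h in self.m_handlers:
            h.begin()
        for line in self.m_file:
            for h in self.m_handlers:
                h.process_line(line)
        for h in self.m_handlers:
            h.end()

    def print_results(self):
        # Ask each handler for its description and its result.
        for h in self.m_handlers:
            print("%s: %s" % (h.description(), h.result()))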
You create handlers, subscribe them to the controller, and then run the controller. The following is a simple example with one handler:
Listing 2: count_lines.py
# Standard sys module
import sys
# Custom awk.py module
import awk

class count_lines:
    def begin(self):
        self.m_count = 0
    def process_line(self, s):
        self.m_count += 1
    def end(self):
        pass
    def description(self):
        return "# of lines in the file"
    def result(self):
        return self.m_count

#
# Step 1: Create the Awk controller
#
ac = awk.controller(sys.stdin)

#
# Step 2: Subscribe the handler
#
ac.subscribe(count_lines())

#
# Step 3: Run
#
ac.run()

#
# Step 4: Print the results
#
ac.print_results()
You can run this script using the following command:
prompt$ cat access_log | python count_lines.py
The results of the script should be printed to the console.
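With a print_results() along the lines of the sketch shown earlier, running the script against a log of, say, 1500 requests would print something like:

# of lines in the file: 1500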
Now that the framework was in place, I had to figure out how I was going to use it. I came up with many ideas, but the following two were the top priorities.
The first question that I wanted to answer using my new framework was the following: how many visitors keep coming back to the site?
My thinking was this: if people return often, they must enjoy the site, right? The following script answers the above question:
Listing 3: return_visitors (can be found in handlers.py)
class return_visitors:
    def __init__(self, n):
        self.m_n = n
        self.m_ip_days = {}
    def begin(self):
        pass
    def process_line(self, s):
        try:
            array = s.split()
            # Field 0 is the client IP address; field 3 is the date,
            # e.g. "[25/Dec/2002:10:24:05" - chars 1 to 6 give "25/Dec".
            ip = array[0]
            day = array[3][1:7]
            if self.m_ip_days.has_key(ip):
                if day not in self.m_ip_days[ip]:
                    self.m_ip_days[ip].append(day)
            else:
                self.m_ip_days[ip] = []
                self.m_ip_days[ip].append(day)
        except IndexError:
            pass
    def end(self):
        # Count the IP addresses seen on more than m_n distinct days.
        ips = self.m_ip_days.keys()
        count = 0
        for ip in ips:
            if len(self.m_ip_days[ip]) > self.m_n:
                count += 1
        self.m_count = count
    def description(self):
        return "# of IP addresses that visited more than %s days" % self.m_n
    def result(self):
        return self.m_count
The handler records the set of days on which each IP address visited the site. When the file has been processed, it reports how many IP addresses visited on more than n days, where n is the threshold passed to the constructor.
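To run this handler you need a small driver script along the lines of Listing 2. Here is one possibility, assuming the handler is imported from handlers.py; the threshold of 5 days is an arbitrary choice for illustration:

import sys
import awk
from handlers import return_visitors

# Count IP addresses that visited on more than 5 distinct days.
ac = awk.controller(sys.stdin)
ac.subscribe(return_visitors(5))
ac.run()
ac.print_results()

Save it under any name you like and run it the same way as Listing 2, piping the access log to standard input.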
Another thing I wanted to know was how people found out about the site. I was getting a decent amount of traffic, and I wasn't sure why. I kept asking myself: where is all this traffic coming from?

I suppose you shouldn't argue with a popular site. But I was curious to know how people were learning about mine, so I wrote the following script:
Listing 4: referring_domains (can be found in handlers.py)
import re

class referring_domains:
    def __init__(self):
        self.m_domains = {}
    def begin(self):
        pass
    def process_line(self, line):
        try:
            array = line.split()
            # Field 10 is the referrer in the combined log format.
            referrer = array[10]
            # Pull the domain out of the referrer URL.
            m = re.search('//[a-zA-Z0-9\-\.]*\.[a-zA-Z]{2,3}/', referrer)
            length = len(m.group(0))
            domain = m.group(0)[2:length-1]
            if self.m_domains.has_key(domain):
                self.m_domains[domain] += 1
            else:
                self.m_domains[domain] = 1
        except AttributeError:
            # re.search found no match and returned None.
            pass
        except IndexError:
            pass
    def end(self):
        pass
    def description(self):
        return "Referring domains"
    def sort(self, key1, key2):
        # Comparison function for sorting domains by descending count.
        if self.m_domains[key1] > self.m_domains[key2]:
            return -1
        elif self.m_domains[key1] == self.m_domains[key2]:
            return 0
        else:
            return 1
    def result(self):
        s = ""
        keys = self.m_domains.keys()
        keys.sort(self.sort)
        for domain in keys:
            s += domain
            s += " "
            s += str(self.m_domains[domain])
            s += "\n"
        s += "\n\n"
        return s
This script extracts the referrer field from each request, uses a regular expression to pull out the referring domain, and tallies the requests per domain. For example, a referrer of "http://www.google.com/search?q=linux" matches as "//www.google.com/", and the handler counts one more visit referred by www.google.com. The result is a list of referring domains, sorted by frequency.
I ran the script and found that most of the referrals came from my own site. This makes sense - when a visitor moves from one page to another on the site, the referring domain for the page is my web site's domain. But I did find some interesting entries in the referral list, and my question about site traffic was answered.
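Since the controller accepts any number of handlers, both of these questions can be answered in a single pass over the log. A sketch, again assuming the handlers are imported from handlers.py:

import sys
import awk
from handlers import return_visitors, referring_domains

ac = awk.controller(sys.stdin)
# Every subscribed handler sees every line during run().
ac.subscribe(return_visitors(5))
ac.subscribe(referring_domains())
ac.run()
ac.print_results()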
The following files contain the code from this article:

awk.py - the controller class
count_lines.py - the count_lines handler
handlers.py - the return_visitors and referring_domains handlers
In this article I described how I use Python to process my Apache HTTP Server access log. Hopefully I explained my techniques clearly enough so that you can use them for your text files.