Parse Apache log files with Python

Last updated July 22, 2019

We run our website on Apache servers, and they generate a huge amount of log files every day. Looking through these files by hand would be a full-time job, so we decided to parse them automatically to extract data about possible 404 and 403 errors.

Monitoring your logs is also important for your website's SEO. You get a good idea of the number of pages that are unavailable and can fix the problems quickly. Knowing the problem is half of the solution.

The following Python script parses the log file, finds the strings we are looking for, groups them together per web page, and sends the result to our email. It is much easier to read the resulting email once or twice a week than to look through thousands of lines of logging.

  1. First we define how often we would like the script to run. In this example, it is one week, expressed in seconds; the same constant is used later to skip log entries older than the reporting window. Import datetime as well, since we need it for the timestamp arithmetic.
    import datetime
    oneweek = 604800  # 7 * 24 * 3600 seconds
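
    If you prefer not to hard-code the number of seconds, datetime can do the arithmetic for you; this one-liner is equivalent:

    oneweek = datetime.timedelta(weeks=1).total_seconds()  # 604800.0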
    
  2. Specify the log file to parse. In this example, it is placed in the same folder:
    logfile = "mywebsite.log"
    
  3. Import pprint. This will help format the statistics so the output is readable instead of messy. Create a pretty printer with an indent of 4:
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    
  4. Create a line parser for the Apache log file. We used the apache-log-parser library; the format string below is Apache's standard combined log format:
    import apache_log_parser
    line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")
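
    As a quick sanity check, you can feed the parser a single sample line. The line below is made up; the key names are the ones apache-log-parser produces:

    sample = '203.0.113.7 - - [22/Jul/2019:10:15:32 +0200] "GET /missing-page/ HTTP/1.1" 404 512 "-" "Mozilla/5.0"'
    data = line_parser(sample)
    print(data["status"])                     # '404'
    print(data["request_url"])                # '/missing-page/'
    print(data["request_header_user_agent"])  # 'Mozilla/5.0'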
    
  5. Open the log file and parse it line by line. Search for the 404 or 403 status code surrounded by spaces:
    re.search(" 404 ", line)
    

    You may exclude some results as not relevant, for example requests for an admin CSS file (note the escaped dots, since the pattern is a regular expression):

    not re.search(r"/admin\.min\.css", line)
    

    Here is the whole loop; count and nomatch are simple counters for progress reporting:

    count = 0    # matching lines processed
    nomatch = 0  # lines that did not match

    with open(logfile, "r") as f:
        for line in f:
            if re.search(" 404 ", line) and not re.search(r"/admin\.min\.css", line):
                try:
                    log_line_data = line_parser(line)
                    if count % 1000 == 0:
                        print(count, nomatch)
                    count += 1
                    handle_line_data(log_line_data)
                except Exception:
                    print("Unexpected error:", sys.exc_info()[0])
                    raise
            else:
                nomatch += 1
    # the with statement closes the file for us
    

    Use the pretty printer to dump the parsed line if you encounter errors:

    pp.pprint(log_line_data)
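
    For the sample line above, the output looks roughly like this (abbreviated; the exact set of keys comes from apache-log-parser):

    {   'remote_host': '203.0.113.7',
        'request_header_user_agent': 'Mozilla/5.0',
        'request_url': '/missing-page/',
        'status': '404',
        'time_received_datetimeobj': datetime.datetime(2019, 7, 22, 10, 15, 32),
        ...}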
    
  6. The next step is to group similar occurrences together, as we don't want to wade through thousands of similar 404 or 403 errors. Grouping the errors also gives us an idea of which ones are most important to fix.

    Here is an example from one of our runs:

    [Screenshot: 404 error statistics email]

    Grouping of similar rows happens in the function handle_line_data, which gathers lines with the same "request_url" and "request_header_user_agent". This means that Googlebot hitting your page and getting a 404 error is grouped separately from users on Firefox or any other browser. Googlebot getting a 404 several times within one week is a serious matter that should be addressed at once; one or two Firefox users getting a 404 a couple of times is probably nothing.

    We store the parsed data for each matching line in a dictionary called lines, and use a second dictionary, hits, to count the number of occurrences per group.

    lines = {}  # parsed data for the most recent line in each group
    hits = {}   # number of occurrences per group

    now = datetime.datetime.now()

    def handle_line_data(line_data):
        global hits
        global lines
        dateobj = line_data["time_received_datetimeobj"]
        datediff = now - dateobj
        diffsec = datediff.total_seconds()

        # Skip entries older than the one-week reporting window
        if diffsec > oneweek:
            return

        url = line_data["request_url"]
        header = line_data["request_header_user_agent"]
        ind = url + " - " + header
        hits[ind] = hits.get(ind, 0) + 1
        lines[ind] = line_data
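
    After a run over a week of logs, hits maps each group key to its count, something like this (values made up for illustration):

    {
        '/old-page/ - Googlebot/2.1 (+http://www.google.com/bot.html)': 37,
        '/old-page/ - Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0': 2,
    }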
    
  7. Sort the resulting data in descending order. This ensures that when you open your mail, the problems with the highest number of occurrences in the log file come first.
    Here is how to sort the dictionary with Python:

    import operator
    sortedhits = sorted(hits.items(), key=operator.itemgetter(1), reverse=True)
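
    sortedhits is a list of (group key, count) tuples. If the report gets long, you could print only the top entries, for example:

    for key, count in sortedhits[:20]:
        print(count, key)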
    

The result of the code is the list sortedhits, which holds all occurrences of 404 errors grouped by request URL and user agent. You can either print it to your console or send it by email.

You may also need to format the result to make it more readable:

result = "<pre>" + pp.pformat(sortedhits) + "</pre>\n\n"
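
The article does not show the mailing step itself, so here is a minimal sketch using Python's standard smtplib and email modules. The subject line, addresses, and SMTP host are placeholders to replace with your own:

import smtplib
from email.mime.text import MIMEText

# Placeholder addresses and SMTP host -- adjust for your setup.
msg = MIMEText(result, "html")
msg["Subject"] = "Weekly 404/403 report for mywebsite"
msg["From"] = "logparser@example.com"
msg["To"] = "webmaster@example.com"

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)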

Here is the whole Python Apache log parser code:

import re
import sys
import datetime
import operator
import pprint

import apache_log_parser

pp = pprint.PrettyPrinter(indent=4)
now = datetime.datetime.now()
oneweek = 604800  # 7 * 24 * 3600 seconds

logfile = "mywebsite.log"

# Apache combined log format
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")

lines = {}  # parsed data for the most recent line in each group
hits = {}   # number of occurrences per group

def handle_line_data(line_data):
    global hits
    global lines
    dateobj = line_data["time_received_datetimeobj"]
    datediff = now - dateobj
    diffsec = datediff.total_seconds()

    # Skip entries older than the one-week reporting window
    if diffsec > oneweek:
        return

    url = line_data["request_url"]
    header = line_data["request_header_user_agent"]
    ind = url + " - " + header
    hits[ind] = hits.get(ind, 0) + 1
    lines[ind] = line_data

count = 0    # matching lines processed
nomatch = 0  # lines that did not match

with open(logfile, "r") as f:
    for line in f:
        if re.search(" 404 ", line) and not re.search(r"/admin\.min\.css", line):
            try:
                log_line_data = line_parser(line)
                if count % 1000 == 0:
                    print(count, nomatch)
                count += 1
                handle_line_data(log_line_data)
            except Exception:
                print("Unexpected error:", sys.exc_info()[0])
                raise
        else:
            nomatch += 1

sortedhits = sorted(hits.items(), key=operator.itemgetter(1), reverse=True)

pp.pprint(sortedhits)