We run our website on Apache servers, and they generate a huge amount of log data every day. Looking through these files is a full-time job, so we decided to parse them automatically to retrieve the data about possible 404 and 403 errors.
Monitoring your logs is also important for your website’s SEO. You get a good idea of the number of pages that are unavailable and can fix the problem quickly. Knowing your problem is half of the solution.
The following Python script parses the log file, finds the lines we are looking for, groups them together for each webpage, and emails us the result. It is much easier to read the resulting email once or twice a week than to look through thousands of lines of logs.
First, the imports and setup:

import re
import apache_log_parser
import pprint
import sys
import datetime

pp = pprint.PrettyPrinter(indent=4)
now = datetime.datetime.now()
oneweek = 604800  # one week in seconds
logfile = "mywebsite.log"
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")
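To see what the parser returns, you can feed it a single line. The sample log line below is made up for illustration; the dictionary keys, however, are the ones the rest of the script relies on:

# A made-up log line in the combined format, for illustration only
sample = ('66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /missing-page HTTP/1.1" 404 503 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

data = line_parser(sample)
print(data["request_url"])                # /missing-page
print(data["status"])                     # 404
print(data["request_header_user_agent"])  # the Googlebot user-agent string
print(data["time_received_datetimeobj"])  # a datetime.datetime object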
re.search(" 404 ", line)
You may exclude some results as not relevant, for example requests for an admin CSS file:
not(re.search("/admin.min.css", line))
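If several paths are irrelevant, one way to keep the filter manageable is a list of patterns. This is a sketch of our own, not part of the original script, and the second pattern is a hypothetical example:

# Hypothetical list of irrelevant paths; extend as needed
exclude_patterns = ["/admin.min.css", "/favicon.ico"]

def is_excluded(line):
    # True if the line matches any irrelevant path
    return any(re.search(p, line) for p in exclude_patterns)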
Here is the whole reading loop:
count = 0
nomatch = 0
with open(logfile, "r") as f:
    for line in f:
        if re.search(" 404 ", line) and not re.search("/admin.min.css", line):
            try:
                log_line_data = line_parser(line)
                # Print progress every 1000 matched lines
                if count % 1000 == 0:
                    print(count, nomatch)
                count += 1
                handle_line_data(log_line_data)
            except Exception:
                print("Unexpected error:", sys.exc_info()[0])
                raise
        else:
            nomatch += 1
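The introduction mentions 403 errors as well, while the loop above only matches 404. A small tweak to the filter catches both status codes; this is a sketch, not part of the original script:

# Match either a 404 or a 403; the surrounding spaces prevent matching
# the same digits inside a URL or a byte count
error_pattern = re.compile(r" 40[34] ")

def is_error_line(line):
    return bool(error_pattern.search(line)) and not re.search("/admin.min.css", line)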
Use pprint to print the debug information if you encounter errors:
pp.pprint(log_line_data)
Grouping of similar rows happens in the handle_line_data function, which gathers together lines with the same request_url and request_header_user_agent. This means that Googlebot hitting your page and getting a 404 error is grouped separately from users with Firefox or any other browser. Googlebot getting a 404 several times in a one-week period is a serious matter that needs to be addressed at once; one or two Firefox users getting a 404 a couple of times is probably nothing.
We store the data for each line in a dictionary called lines. We also use a dictionary called hits to count the number of occurrences.
lines = {}
hits = {}

def handle_line_data(line_data):
    global hits
    global lines
    dateobj = line_data["time_received_datetimeobj"]
    datediff = now - dateobj
    diffsec = datediff.total_seconds()
    url = line_data["request_url"]
    header = line_data["request_header_user_agent"]
    ind = url + " - " + header
    # Count only hits from the last week
    if diffsec < oneweek:
        hits[ind] = hits.get(ind, 0) + 1
    # Ignore lines older than one week
    if diffsec > oneweek:
        return
    lines[ind] = ind
Finally, we sort the grouped hits by the number of occurrences:

import operator

sortedhits = sorted(hits.items(), key=operator.itemgetter(1), reverse=True)
The result of the code is the list sortedhits, which holds all occurrences of 404 errors grouped by request URL and user agent, most frequent first. You can either print it to your console or send it by email.
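Because the grouping key contains the user agent, you can scan sortedhits for the serious cases described above, such as Googlebot hitting a 404 repeatedly. The substring check and the threshold below are our own assumptions, not part of the original script:

# Flag grouped entries where Googlebot saw the error more than a few times
THRESHOLD = 3  # hypothetical cutoff, tune to taste

for key, count in sortedhits:
    if "Googlebot" in key and count >= THRESHOLD:
        print("Needs attention:", key, "-", count, "hits")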
You may also need to format the result to make it more readable:
result = "<pre>" + pp.pformat(sortedhits) + "</pre>\n\n"
Here is the whole Python Apache log parser code:
import re
import operator
import pprint
import sys
import datetime
import apache_log_parser

pp = pprint.PrettyPrinter(indent=4)
now = datetime.datetime.now()
oneweek = 604800  # one week in seconds
logfile = "mywebsite.log"
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")

lines = {}
hits = {}

def handle_line_data(line_data):
    global hits
    global lines
    dateobj = line_data["time_received_datetimeobj"]
    datediff = now - dateobj
    diffsec = datediff.total_seconds()
    url = line_data["request_url"]
    header = line_data["request_header_user_agent"]
    ind = url + " - " + header
    # Count only hits from the last week
    if diffsec < oneweek:
        hits[ind] = hits.get(ind, 0) + 1
    # Ignore lines older than one week
    if diffsec > oneweek:
        return
    lines[ind] = ind

count = 0
nomatch = 0
with open(logfile, "r") as f:
    for line in f:
        if re.search(" 404 ", line) and not re.search("/admin.min.css", line):
            try:
                log_line_data = line_parser(line)
                # Print progress every 1000 matched lines
                if count % 1000 == 0:
                    print(count, nomatch)
                count += 1
                handle_line_data(log_line_data)
            except Exception:
                print("Unexpected error:", sys.exc_info()[0])
                raise
        else:
            nomatch += 1

sortedhits = sorted(hits.items(), key=operator.itemgetter(1), reverse=True)
pp.pprint(sortedhits)