Logtilla and GeoIP: analyze the geolocation of web clients

This article presents a simple Logtilla log analysis module, log_geoip_stats, which computes the top N client countries, by number of hits, from web access log files. The module uses the libgeoip-erlang library to look up geolocations from clients’ IP addresses.

libgeoip-erlang installation

Install the prerequisite software: Mercurial and the GeoIP library. On Debian, these are the packages mercurial, libgeoip1, and libgeoip-dev. Then fetch the libgeoip-erlang source code and compile it:

hg clone http://bitbucket.org/mattsta/libgeoip-erlang/
cd libgeoip-erlang/
make

Then, make sure that the generated libgeoip-1.0.1 directory is in the code load path, e.g. by passing -pz .../libgeoip-1.0.1 to the erl interpreter. I personally prefer to install everything into /usr/local, and to use GNU Stow (Debian package stow) to manage the packages there:

sudo mkdir -p /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
sudo cp -r libgeoip-1.0.1 /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
sudo chown -R root:root /usr/local/stow/libgeoip-1.0.1/
sudo stow -d /usr/local/stow/ libgeoip-1.0.1

This installs libgeoip-1.0.1 into /usr/local/lib/erlang/lib/, while letting you uninstall it like any Stow package with a single command: sudo stow -d /usr/local/stow/ -D libgeoip-1.0.1. After installing additional Erlang libraries into /usr/local/lib/erlang/lib/, they can all be loaded simply by setting the environment variable ERL_LIBS=/usr/local/lib/erlang/lib, as shown below.

Get MaxMind’s free GeoLite City database. On Debian, this can be done by running:

sudo sh /usr/share/doc/libgeoip1/examples/geolitecityupdate.sh

This command installs the database into /usr/share/GeoIP/GeoIPCity.dat.

Test that libgeoip works correctly:

$ ERL_LIBS=/usr/local/lib/erlang/lib erl
> application:start(libgeoip_app).
> libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
> libgeoip:lookup(<<91,121,26,170>>).

This should give you the location of the www.berabera.info server in France.
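As the pattern match in handle_log_entry/2 below relies on, a successful lookup returns a 9-tuple tagged geoip whose second element is the two-letter country code (the only field this module uses). To extract just that field in the shell:

> {geoip, Country, _, _, _, _, _, _, _} = libgeoip:lookup(<<91,121,26,170>>).
> Country.

An address unknown to the database yields an empty list instead of a geoip tuple, which is why the code below also matches on [].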

Using log_geoip_stats to analyze web client locations

Logtilla’s log_geoip_stats module uses the libgeoip library to count the number of parsed log entries per client country. Here is an example of using it to analyze a single Apache access.log file:

$ cd src
$ PATH=../c_src:$PATH ERL_LIBS=/usr/local/lib/erlang/lib erl
> application:start(libgeoip_app).
> libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
> {ok, Pid} = gen_log_analyzer:start_link(log_geoip_stats, [], []).
> ok = gen_log_analyzer:parse(Pid, "/var/log/apache2/access.log").
> log_geoip_stats:get_stats(Pid, 10).

The get_stats/2 function orders the countries by number of hits, converts the hit counts into percentages, and returns the top N countries (here, N=10). For one of my access.log files, this prints out:

[{'US',40.94006639874192},
 {'JP',30.572543537771566},
 {'GB',4.403284990389656},
 {'FR',3.9664511619779836},
 {'BY',3.8266643368862483},
 {'TR',3.6286330013396237},
 {'CH',2.958821131108393},
 {'TH',0.9260877162327451},
 {'PR',0.9202632651872561},
 {'CA',0.9086143630962782}]

Most of my visitors in that period came from the USA (~41%) and Japan (~31%).

Implementation overview

In the log_geoip_stats module, most of the code is boilerplate implementing the gen_log_analyzer behaviour. The most interesting pieces are the functions handle_log_entry/2 and get_stats/2:

% Analyze a parsed log entry:
handle_log_entry(LogEntry, State) ->
    case LogEntry#'LogEntry'.'remote-host' of
        {'ip-address', IPAddress} ->
            case libgeoip:lookup(list_to_binary(IPAddress)) of
                {geoip, Country, _, _, _, _, _, _, _} ->
                    % Address found:
                    State1 = update_country(list_to_atom(Country), State),
                    {ok, State1};
                [] ->
                    % Address unknown to the GeoIP library:
                    State1 = update_country('unknown', State),
                    {ok, State1}
            end;
        _Else ->
            % If the client address is a hostname or an ip6-address,
            % count it as 'unknown':
            State1 = update_country('unknown', State),
            {ok, State1}
    end.
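The helper update_country/2 is not shown above. Assuming the analyzer state is a dict mapping country atoms to hit counts (consistent with the dict:to_list call in get_stats/2), it could be as simple as this hypothetical sketch:

% Hypothetical sketch: increment the hit count for a country,
% assuming State is a dict of Country => Count.
% dict:update_counter/3 initializes absent keys to the increment,
% so no separate "first hit" case is needed.
update_country(Country, State) ->
    dict:update_counter(Country, 1, State).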

% Query the stats from the analysis process, then order and convert the data:
get_stats(Name, Length) ->
    Stats = dict:to_list(gen_log_analyzer:call(Name, get_stats)),
    Total = lists:foldl(fun({_, Count}, Acc) -> Acc + Count end, 0, Stats),
    Stats1 = lists:sort(fun({_, C1}, {_, C2}) -> C1 > C2 end, Stats),
    Stats2 = lists:sublist(Stats1, Length),
    lists:map(fun({Country, Count}) -> {Country, Count*100/Total} end, Stats2).

One possible improvement would be to handle timeouts in calls to libgeoip:lookup/1. That function's implementation implicitly imposes an arbitrary 200 ms timeout, which I have occasionally seen expire in practice. My current implementation does not tolerate such timeouts.
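A hedged sketch of such a timeout-tolerant wrapper, under the assumption that a libgeoip:lookup/1 timeout surfaces as an exit with a timeout reason (as a gen_server call timeout would):

% Hypothetical sketch: treat a lookup timeout like an address
% unknown to the GeoIP library, instead of crashing the analyzer.
safe_lookup(IPAddress) ->
    try
        libgeoip:lookup(IPAddress)
    catch
        exit:{timeout, _} ->
            % The lookup timed out; report "not found".
            []
    end.

handle_log_entry/2 would then call safe_lookup/1 instead of libgeoip:lookup/1, and the existing [] clause would count the entry as 'unknown'.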