This article presents a simple Logtilla log analysis module, log_geoip_stats, which gives the top N client countries, in terms of hits, from web access log files. This module uses the libgeoip-erlang library to get geolocations from clients’ IP addresses.
libgeoip-erlang installation
Install the prerequisite software: Mercurial and the GeoIP library. On Debian, these are the packages mercurial, libgeoip1, and libgeoip-dev. Then get the libgeoip-erlang source code and compile it:
    hg clone http://bitbucket.org/mattsta/libgeoip-erlang/
    cd libgeoip-erlang/
    make
Then, make sure that the generated libgeoip-1.0.1 directory is in the code load path, e.g. by passing -pz .../libgeoip-1.0.1 to the erl interpreter. I personally prefer to install everything into /usr/local, and to use GNU Stow (Debian package stow) to manage packages there:
    sudo mkdir -p /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
    sudo cp -r libgeoip-1.0.1 /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
    sudo chown -R root:root /usr/local/stow/libgeoip-1.0.1/
    sudo stow -d /usr/local/stow/ libgeoip-1.0.1
This installs libgeoip-1.0.1 into /usr/local/lib/erlang/lib/, and it can be uninstalled like any Stow package with a single command: sudo stow -d /usr/local/stow/ -D libgeoip-1.0.1. Once additional Erlang libraries are installed into /usr/local/lib/erlang/lib/, they can be loaded simply by setting the environment variable ERL_LIBS=/usr/local/lib/erlang/lib, as shown below.
Get MaxMind’s free GeoLite City database. On Debian, this can be done by running:
    sudo sh /usr/share/doc/libgeoip1/examples/geolitecityupdate.sh
This command installs the database into /usr/share/GeoIP/GeoIPCity.dat.
Test that libgeoip works correctly:
    $ ERL_LIBS=/usr/local/lib/erlang/lib erl
    > application:start(libgeoip_app).
    > libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
    > libgeoip:lookup(<<91,121,26,170>>).
This should give you the location of the www.berabera.info server in France.
Using log_geoip_stats to analyze web client locations
Logtilla’s log_geoip_stats module uses the libgeoip library to count the number of parsed log entries per client country. Here is a sample usage to analyze a single Apache access.log file:
    $ cd src
    $ PATH=../c_src:$PATH ERL_LIBS=/usr/local/lib/erlang/lib erl
    > application:start(libgeoip_app).
    > libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
    > {ok, Pid} = gen_log_analyzer:start_link(log_geoip_stats, [], []).
    > ok = gen_log_analyzer:parse(Pid, "/var/log/apache2/access.log").
    > log_geoip_stats:get_stats(Pid, 10).
The get_stats/2 function orders the countries by number of hits, converts the numbers of hits into percentages, and returns the top N countries (here, N=10). For one of my access.log files, this prints out:
    [{'US',40.94006639874192},
     {'JP',30.572543537771566},
     {'GB',4.403284990389656},
     {'FR',3.9664511619779836},
     {'BY',3.8266643368862483},
     {'TR',3.6286330013396237},
     {'CH',2.958821131108393},
     {'TH',0.9260877162327451},
     {'PR',0.9202632651872561},
     {'CA',0.9086143630962782}]
Over 70% of the hits in that period came from just two countries: the USA (about 41%) and Japan (about 31%).
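To read such a result at a glance, the raw floats can be rounded. The following is a minimal sketch that assumes the list shape shown above; the module and function names are hypothetical and are not part of Logtilla:

```erlang
-module(geoip_report_sketch).
-export([format_stats/1]).

%% Turn [{Country, Percentage}] into one "CC: NN.N%" string per country.
%% Percentage must be a float, which is what Count*100/Total produces.
format_stats(Stats) ->
    [lists:flatten(io_lib:format("~s: ~.1f%", [Country, Pct]))
     || {Country, Pct} <- Stats].
```

For instance, geoip_report_sketch:format_stats([{'US',40.94006639874192}]) returns ["US: 40.9%"].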
Implementation overview
In module log_geoip_stats, most of the code is boilerplate to implement the gen_log_analyzer behaviour. The most interesting pieces are functions handle_log_entry/2 and get_stats/2:
    % Analyze a parsed log entry:
    handle_log_entry(LogEntry, State) ->
        case LogEntry#'LogEntry'.'remote-host' of
            {'ip-address', IPAddress} ->
                case libgeoip:lookup(list_to_binary(IPAddress)) of
                    {geoip, Country, _, _, _, _, _, _, _} ->
                        % Address found:
                        State1 = update_country(list_to_atom(Country), State),
                        {ok, State1};
                    [] ->
                        % Address unknown to the GeoIP library:
                        State1 = update_country('unknown', State),
                        {ok, State1}
                end;
            _Else ->
                % If the client address is a hostname or an ip6-address,
                % count it as 'unknown':
                State1 = update_country('unknown', State),
                {ok, State1}
        end.

    % Query the stats from the analysis process, and order and convert the data:
    get_stats(Name, Length) ->
        Stats = dict:to_list(gen_log_analyzer:call(Name, get_stats)),
        % Total number of hits over all countries:
        Total = lists:foldl(fun({_, Count}, Acc) -> Acc + Count end, 0, Stats),
        % Sort by descending hit count:
        Stats1 = lists:sort(fun({_, C1}, {_, C2}) -> C1 > C2 end, Stats),
        % Keep the top Length entries of the sorted list:
        Stats2 = lists:sublist(Stats1, Length),
        % Convert hit counts into percentages of the total:
        lists:map(fun({Country, Count}) -> {Country, Count*100/Total} end, Stats2).
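The helper update_country/2 is not shown in the excerpt above. Since get_stats/2 converts the analyzer's reply with dict:to_list/1, one plausible shape is a dict of per-country counters. The sketch below assumes the whole state is such a dict, which the real module may not do; count_countries/1 is a hypothetical helper added only to exercise it:

```erlang
-module(country_counts_sketch).
-export([update_country/2, count_countries/1]).

%% ASSUMPTION: the state is a dict mapping country atoms to hit counts.
%% dict:update_counter/3 stores the increment itself when the key is missing,
%% so a first hit for a country starts its counter at 1.
update_country(Country, Counts) ->
    dict:update_counter(Country, 1, Counts).

%% Hypothetical helper: count a plain list of country atoms.
count_countries(Countries) ->
    lists:foldl(fun update_country/2, dict:new(), Countries).
```

With this shape, dict:to_list(country_counts_sketch:count_countries(['US','FR','US'])) yields the pairs {'US',2} and {'FR',1}, in an unspecified order.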
One possible improvement would be to handle timeouts in calls to libgeoip:lookup/1. The implementation of that function implicitly imposes an arbitrary timeout of 200 ms, which I have occasionally seen expire. My current implementation does not tolerate such timeouts.
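One way to tolerate such timeouts might be to wrap the lookup and count a timed-out address as 'unknown'. The sketch below is not part of Logtilla: the module and function names are hypothetical, and it assumes that a timed-out lookup exits with {timeout, _}, as gen_server:call/3 does. How libgeoip actually signals a timeout is not confirmed here. The lookup function is passed in as an argument so that the sketch runs without the library:

```erlang
-module(geoip_timeout_sketch).
-export([safe_country/2]).

%% LookupFun stands in for fun libgeoip:lookup/1.
%% ASSUMPTION: a timed-out lookup exits with {timeout, _}; libgeoip may
%% signal timeouts differently.
safe_country(LookupFun, IPBinary) ->
    try LookupFun(IPBinary) of
        Result when is_tuple(Result), tuple_size(Result) >= 2,
                    element(1, Result) =:= geoip ->
            %% Address found: the second element is the country code string.
            list_to_atom(element(2, Result));
        _NotFound ->
            unknown
    catch
        exit:{timeout, _} ->
            %% Count a timed-out lookup as 'unknown' instead of crashing.
            unknown
    end.
```

In handle_log_entry/2, the call would then look like safe_country(fun libgeoip:lookup/1, list_to_binary(IPAddress)), with the returned atom passed to update_country/2. Note that a reply arriving after the timeout could still land in the caller's mailbox, which this sketch does not address.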