This article presents a simple Logtilla log analysis module, log_geoip_stats, which gives the top N client countries, in terms of hits, from web access log files. This module uses the libgeoip-erlang library to get geolocations from clients' IP addresses.
Install the prerequisite software: Mercurial and the GeoIP library. On Debian, those are the packages mercurial, libgeoip1, and libgeoip-dev. Then get the libgeoip-erlang source code and compile it:
hg clone http://bitbucket.org/mattsta/libgeoip-erlang/
cd libgeoip-erlang/
make
Then, make sure that the generated libgeoip-1.0.1 directory is in the code load path, e.g. by passing -pz .../libgeoip-1.0.1 to the erl interpreter. I personally prefer to install everything into /usr/local, and to use GNU Stow (Debian package stow) to manage packages there:
sudo mkdir -p /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
sudo cp -r libgeoip-1.0.1 /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
sudo chown -R root:root /usr/local/stow/libgeoip-1.0.1/
sudo stow -d /usr/local/stow/ libgeoip-1.0.1
This installs libgeoip-1.0.1 into /usr/local/lib/erlang/lib/, and it can be uninstalled like any Stow package with one command: sudo stow -d /usr/local/stow/ -D libgeoip-1.0.1. Erlang libraries installed into /usr/local/lib/erlang/lib/ can then be loaded simply by setting the ERL_LIBS=/usr/local/lib/erlang/lib environment variable, as shown below.
Get MaxMind's free GeoLite City database. On Debian, this can be done by running:
sudo sh /usr/share/doc/libgeoip1/examples/geolitecityupdate.sh
This command installs the database into /usr/share/GeoIP/GeoIPCity.dat.
Test that libgeoip works correctly:
$ ERL_LIBS=/usr/local/lib/erlang/lib erl
> application:start(libgeoip_app).
> libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
> libgeoip:lookup(<<91,121,26,170>>).
This should give you the location of the www.berabera.info server in France.
Logtilla's log_geoip_stats module uses the libgeoip library to count the number of parsed log entries per client country. Here is a sample usage to analyze a single Apache access.log file:
$ cd src
$ PATH=../c_src:$PATH ERL_LIBS=/usr/local/lib/erlang/lib erl
> application:start(libgeoip_app).
> libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
> {ok, Pid} = gen_log_analyzer:start_link(log_geoip_stats, [], []).
> ok = gen_log_analyzer:parse(Pid, "/var/log/apache2/access.log").
> log_geoip_stats:get_stats(Pid, 10).
The get_stats/2 function orders the countries by number of hits, converts the numbers of hits into percentages, and returns the top N countries (here, N=10). For one of my access.log files, this prints out:
[{'US',40.94006639874192},
{'JP',30.572543537771566},
{'GB',4.403284990389656},
{'FR',3.9664511619779836},
{'BY',3.8266643368862483},
{'TR',3.6286330013396237},
{'CH',2.958821131108393},
{'TH',0.9260877162327451},
{'PR',0.9202632651872561},
{'CA',0.9086143630962782}]
More than 70% of the hits in that period came from just two countries: the USA (about 41%) and Japan (about 31%).
In module log_geoip_stats, most of the code is boilerplate to implement the gen_log_analyzer behaviour. The most interesting pieces are functions handle_log_entry/2 and get_stats/2:
% Analyze a parsed log entry:
handle_log_entry(LogEntry, State) ->
    case LogEntry#'LogEntry'.'remote-host' of
        {'ip-address', IPAddress} ->
            case libgeoip:lookup(list_to_binary(IPAddress)) of
                {geoip, Country, _, _, _, _, _, _, _} ->
                    % Address found:
                    State1 = update_country(list_to_atom(Country), State),
                    {ok, State1};
                [] ->
                    % Address unknown to the GeoIP library:
                    State1 = update_country('unknown', State),
                    {ok, State1}
            end;
        _Else ->
            % If the client address is a hostname or an ip6-address,
            % count it as 'unknown':
            State1 = update_country('unknown', State),
            {ok, State1}
    end.
% Query the stats from the analysis process, then sort and convert the data:
get_stats(Name, Length) ->
    Stats = dict:to_list(gen_log_analyzer:call(Name, get_stats)),
    % Sum the hit counts over all countries:
    Total = lists:foldl(fun({_, Count}, Acc) -> Acc + Count end,
                        0, Stats),
    % Sort by hit count, highest first:
    Stats1 = lists:sort(fun({_, C1}, {_, C2}) -> C1 >= C2 end, Stats),
    % Keep the top Length countries from the sorted list:
    Stats2 = lists:sublist(Stats1, Length),
    % Convert each hit count into a percentage of the total:
    lists:map(fun({Country, Count}) -> {Country, Count*100/Total} end, Stats2).
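The update_country/2 helper called from handle_log_entry/2 is not shown above. As a minimal sketch, and assuming the analyzer state is nothing more than a dict mapping country atoms to hit counts (the real module may keep more in its state), it could look like the following. The module name country_counter and the functions new/0 and count/2 are hypothetical and exist only to make the example self-contained.

```erlang
-module(country_counter).
-export([new/0, update_country/2, count/2]).

%% Hypothetical state: a dict mapping country atoms to hit counts.
new() ->
    dict:new().

%% Add one hit for Country. dict:update_counter/3 creates the key,
%% with the increment as its initial value, if it is not present yet.
update_country(Country, State) ->
    dict:update_counter(Country, 1, State).

%% Return the number of hits recorded for Country, or 0 if there are none.
count(Country, State) ->
    case dict:find(Country, State) of
        {ok, N} -> N;
        error -> 0
    end.
```

With a state of this shape, dict:to_list/1 yields the {Country, Count} pairs that get_stats/2 sorts and converts into percentages.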
One possible improvement would be to handle timeouts in calls to libgeoip:lookup/1. The implementation of that function implicitly imposes an arbitrary timeout of 200 ms, and I have occasionally seen calls exceed it. My current implementation does not tolerate such timeouts.
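One way to tolerate such timeouts is to wrap the lookup and treat a timeout like any other unresolved address. The sketch below is untested against libgeoip itself and rests on an assumption: that the timeout surfaces the way a gen_server:call/3 timeout does, as an exit with reason {timeout, _}. The module name safe_lookup is hypothetical, and the lookup function is passed in as an argument so that the wrapper can be tried without the library.

```erlang
-module(safe_lookup).
-export([lookup/2]).

%% Call LookupFun(IP) and return its result unchanged. If the call
%% exits with {timeout, _}, return the atom 'timeout' instead of
%% letting the exit crash the calling process.
%% Assumption: the lookup times out like gen_server:call/3 does.
lookup(LookupFun, IP) when is_function(LookupFun, 1) ->
    try
        LookupFun(IP)
    catch
        exit:{timeout, _} ->
            timeout
    end.
```

In handle_log_entry/2, the call would then become safe_lookup:lookup(fun libgeoip:lookup/1, list_to_binary(IPAddress)), with an extra clause that counts a timeout result as 'unknown'.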