<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Web</title>
  <link rel="alternate" type="text/html" href="http://www.berabera.info/en/taxonomy/term/44"/>
  <link rel="self" type="application/atom+xml" href="http://www.berabera.info/en/taxonomy/term/44/atom/feed"/>
  <id>http://www.berabera.info/en/taxonomy/term/44/atom/feed</id>
  <updated>2009-09-13T18:53:22+09:00</updated>
  <entry>
    <title>Logtilla and GeoIP: analyze the geolocation of web clients</title>
    <link rel="alternate" type="text/html" href="http://www.berabera.info/en/node/276" />
    <id>http://www.berabera.info/en/node/276</id>
    <published>2009-09-21T15:06:26+09:00</published>
    <updated>2009-09-21T15:12:40+09:00</updated>
    <author>
      <name>Romain Lenglet</name>
    </author>
    <category term="Erlang/OTP" />
    <category term="Logtilla" />
    <category term="Web" />
    <summary type="html"><![CDATA[<p>This article presents a simple <a href="http://github.com/rlenglet/Logtilla">Logtilla</a> log analysis module, <a href="http://github.com/rlenglet/Logtilla/blob/master/src/log_geoip_stats.erl"><code>log_geoip_stats</code></a>, which gives the top N client countries, in terms of hits, from web access log files. This module uses the <a href="http://bitbucket.org/mattsta/libgeoip-erlang/src/">libgeoip-erlang</a> library to get geolocations from clients' IP addresses.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>This article presents a simple <a href="http://github.com/rlenglet/Logtilla">Logtilla</a> log analysis module, <a href="http://github.com/rlenglet/Logtilla/blob/master/src/log_geoip_stats.erl"><code>log_geoip_stats</code></a>, which gives the top N client countries, in terms of hits, from web access log files. This module uses the <a href="http://bitbucket.org/mattsta/libgeoip-erlang/src/">libgeoip-erlang</a> library to get geolocations from clients' IP addresses.</p>
<h2>libgeoip-erlang installation</h2>
<p>Install the prerequisites: Mercurial and the GeoIP library. On Debian, these are the packages <a href="http://packages.debian.org/sid/mercurial">mercurial</a>, <a href="http://packages.debian.org/sid/libgeoip1">libgeoip1</a>, and <a href="http://packages.debian.org/sid/libgeoip-dev">libgeoip-dev</a>. Then get the libgeoip-erlang source code and compile it:</p>
<pre>hg clone <a href="http://bitbucket.org/mattsta/libgeoip-erlang/" title="http://bitbucket.org/mattsta/libgeoip-erlang/">http://bitbucket.org/mattsta/libgeoip-erlang/</a>
cd libgeoip-erlang/
make
</pre><p>Then make sure that the generated <code>libgeoip-1.0.1</code> directory is in the code load path, e.g. by passing <code>-pz .../libgeoip-1.0.1</code> to the <code>erl</code> interpreter. I personally prefer to install everything into <code>/usr/local</code>, and to use <a href="http://www.gnu.org/software/stow/">GNU Stow</a> (Debian package <a href="http://packages.debian.org/sid/stow">stow</a>) to manage packages there:</p>
<pre>sudo mkdir -p /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
sudo cp -r libgeoip-1.0.1 /usr/local/stow/libgeoip-1.0.1/lib/erlang/lib/
sudo chown -R root:root /usr/local/stow/libgeoip-1.0.1/
sudo stow -d /usr/local/stow/ libgeoip-1.0.1
</pre><p>This installs <code>libgeoip-1.0.1</code> into <code>/usr/local/lib/erlang/lib/</code>, and it can be uninstalled like any Stow package with one command: <code>sudo stow -d /usr/local/stow/ -D libgeoip-1.0.1</code>. Erlang libraries installed into <code>/usr/local/lib/erlang/lib/</code> can then be loaded simply by setting the <code>ERL_LIBS=/usr/local/lib/erlang/lib</code> environment variable, as shown below.</p>
<p>Get <a href="http://www.maxmind.com/">MaxMind</a>'s free GeoLite City database. On Debian, this can be done by running:</p>
<pre>sudo sh /usr/share/doc/libgeoip1/examples/geolitecityupdate.sh
</pre><p>This command installs the database into <code>/usr/share/GeoIP/GeoIPCity.dat</code>.</p>
<p>Test that <code>libgeoip</code> works correctly:</p>
<pre>$ ERL_LIBS=/usr/local/lib/erlang/lib erl
&gt; application:start(libgeoip_app).
&gt; libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
&gt; libgeoip:lookup(&lt;&lt;91,121,26,170&gt;&gt;).
</pre><p>This should give you the location of the <code>www.berabera.info</code> server in France.</p>
<h2>Using log_geoip_stats to analyze web client locations</h2>
<p>Logtilla's <a href="http://github.com/rlenglet/Logtilla/blob/master/src/log_geoip_stats.erl">log_geoip_stats</a> module uses the <code>libgeoip</code> library to count the number of parsed log entries per client country. Here is a sample usage to analyze a single Apache <code>access.log</code> file:</p>
<pre>$ cd src
$ PATH=../c_src:$PATH ERL_LIBS=/usr/local/lib/erlang/lib erl
&gt; application:start(libgeoip_app).
&gt; libgeoip:set_db("/usr/share/GeoIP/GeoIPCity.dat").
&gt; {ok, Pid} = gen_log_analyzer:start_link(log_geoip_stats, [], []).
&gt; ok = gen_log_analyzer:parse(Pid, "/var/log/apache2/access.log").
&gt; log_geoip_stats:get_stats(Pid, 10).
</pre><p>The <code>get_stats/2</code> function orders the countries by number of hits, converts the numbers of hits into percentages, and returns the top N countries (here, N=10). For one of my <code>access.log</code> files, this prints out:</p>
<pre>[{'US',40.94006639874192},
 {'JP',30.572543537771566},
 {'GB',4.403284990389656},
 {'FR',3.9664511619779836},
 {'BY',3.8266643368862483},
 {'TR',3.6286330013396237},
 {'CH',2.958821131108393},
 {'TH',0.9260877162327451},
 {'PR',0.9202632651872561},
 {'CA',0.9086143630962782}]
</pre><p>Most of my visitors in that period came from the USA (about 41%) and Japan (about 31%).</p>
<h2>Implementation overview</h2>
<p>In module <code>log_geoip_stats</code>, most of the code is boilerplate to implement the <code>gen_log_analyzer</code> behaviour. The most interesting pieces are functions <code>handle_log_entry/2</code> and <code>get_stats/2</code>:</p>
<pre><em>% Analyze a parsed log entry:</em>
handle_log_entry(LogEntry, State) -&gt;
    case LogEntry#'LogEntry'.'remote-host' of
        {'ip-address', IPAddress} -&gt;
            case libgeoip:lookup(list_to_binary(IPAddress)) of
                {geoip, Country, _, _, _, _, _, _, _} -&gt;
                    <em>% Address found:</em>
                    State1 = update_country(list_to_atom(Country), State),
                    {ok, State1};
                [] -&gt;
                    <em>% Address unknown to the GeoIP library:</em>
                    State1 = update_country('unknown', State),
                    {ok, State1}
            end;
        _Else -&gt;
            <em>% If the client address is a hostname or an ip6-address,</em>
            <em>% count it as 'unknown':</em>
            State1 = update_country('unknown', State),
            {ok, State1}
    end.

<em>% Query the stats from the analysis process, and order and convert the data:</em>
get_stats(Name, Length) -&gt;
    Stats = dict:to_list(gen_log_analyzer:call(Name, get_stats)),
    Total = lists:foldl(fun({_, Count}, Acc) -&gt; Acc + Count end,
                        0, Stats),
    <em>% Sort by decreasing hit count, then keep the top Length entries:</em>
    Stats1 = lists:sort(fun({_, C1}, {_, C2}) -&gt; C1 &gt;= C2 end, Stats),
    Stats2 = lists:sublist(Stats1, Length),
    lists:map(fun({Country, Count}) -&gt; {Country, Count*100/Total} end, Stats2).
</pre><p>One possible improvement would be to handle timeouts in calls to <code>libgeoip:lookup/1</code>. The implementation of that function implicitly imposes an arbitrary timeout of 200 ms, which I have occasionally seen expire. My current implementation does not tolerate such timeouts.</p>
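<p>A minimal sketch of such timeout handling follows. It assumes that <code>libgeoip:lookup/1</code> is built on <code>gen_server:call</code>, so that an expired timeout makes the caller exit with a <code>{timeout, _}</code> reason; I have not checked this against the libgeoip-erlang source. The <code>geoip_safe</code> module is a hypothetical name and is not part of Logtilla or libgeoip-erlang:</p>
<pre>-module(geoip_safe).
-export([lookup/1]).

<em>% Hypothetical wrapper: turn a lookup timeout into a return value</em>
<em>% instead of letting the exit terminate the calling process.</em>
lookup(Address) -&gt;
    try libgeoip:lookup(Address) of
        Result -&gt; {ok, Result}
    catch
        <em>% Assumption: the timeout surfaces as exit:{timeout, _},</em>
        <em>% as it does for gen_server:call.</em>
        exit:{timeout, _} -&gt; {error, timeout}
    end.
</pre><p>With such a wrapper, <code>handle_log_entry/2</code> could count an <code>{error, timeout}</code> result as <code>'unknown'</code>, or retry the lookup.</p>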
    ]]></content>
  </entry>
  <entry>
    <title>First release of Logtilla, a web access log analyzer in Erlang</title>
    <link rel="alternate" type="text/html" href="http://www.berabera.info/en/node/266" />
    <id>http://www.berabera.info/en/node/266</id>
    <published>2009-09-13T18:11:09+09:00</published>
    <updated>2009-09-13T18:53:22+09:00</updated>
    <author>
      <name>Romain Lenglet</name>
    </author>
    <category term="Erlang/OTP" />
    <category term="Logtilla" />
    <category term="Web" />
    <summary type="html"><![CDATA[<p>I have written a small Erlang framework for parsing web access logs, called <a href="http://github.com/rlenglet/Logtilla">Logtilla</a>, hosted on <a href="http://github.com/rlenglet">GitHub</a>. This framework supports parsing logs in the <a href="http://en.wikipedia.org/wiki/Common_Log_Format">Common Log Format</a>, or in <a href="http://httpd.apache.org/docs/2.0/logs.html">Apache's Combined Log Format</a>. Thanks to the use of a C port program to do the parsing, Logtilla is very efficient: it can parse and analyze 15,000 entries/sec on my 4-year-old laptop.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>I have written a small Erlang framework for parsing web access logs, called <a href="http://github.com/rlenglet/Logtilla">Logtilla</a>, hosted on <a href="http://github.com/rlenglet">GitHub</a>. This framework supports parsing logs in the <a href="http://en.wikipedia.org/wiki/Common_Log_Format">Common Log Format</a>, or in <a href="http://httpd.apache.org/docs/2.0/logs.html">Apache's Combined Log Format</a>. Thanks to the use of a C port program to do the parsing, Logtilla is very efficient: it can parse and analyze 15,000 entries/sec on my 4-year-old laptop.</p>
<h2>Installation</h2>
<p>To build it, clone <a href="http://github.com/rlenglet/Logtilla">the Logtilla Git repository</a>, and then initialize the build system, configure, and build:</p>
<pre>autoreconf -vi
./configure
make
</pre><p>This requires you to install <a href="http://www.gnu.org/software/autoconf/">Autoconf</a>, <a href="http://www.gnu.org/software/automake/">Automake</a>, the <a href="http://lionet.info/asn1c/">asn1c ASN.1-to-C compiler</a> which is used by Logtilla (I have tested that both the released 0.9.21 version and the version in the asn1c SVN repository are usable for Logtilla), and of course any recent version of <a href="http://www.erlang.org/">Erlang/OTP</a>.</p>
<h2>Overview of Logtilla</h2>
<p>Logtilla consists essentially of a single behaviour module: <code>gen_log_analyzer</code>, which defines the following callbacks:</p>
<ul>
<li><code>init/1</code>: Initialize the state.<br />
<pre><strong>init(</strong><span style="color: green;">Args</span>::any()<strong>)</strong> -&gt;
    {'ok', State::any()}
    | 'ignore'
    | {'stop', Reason::any()}.
</pre></li>
<li><code>handle_log_entry/2</code>: Handle a parsed log entry. The <code>LogEntry</code> record type is defined in header file <code>WebAccessLog.hrl</code>.<br />
<pre><strong>handle_log_entry(</strong><span style="color: green;">LogEntry</span>::#'LogEntry'(), <span style="color: green;">State</span>::any()<strong>)</strong> -&gt;
    {'ok', NewState::any()}
    | {'error', Reason::any(), NewState::any()}.
</pre></li>
<li><code>handle_call/3</code>: Handle an application-specific call. This callback is similar to the <code>gen_server:handle_call/3</code> callback.<br />
<pre><strong>handle_call(</strong><span style="color: green;">Msg</span>::any(), {<span style="color: green;">From</span>::pid(), <span style="color: green;">Tag</span>::any()}, <span style="color: green;">State</span>::any()<strong>)</strong> -&gt;
    {'reply', Reply::any(), NewState::any()}
    | {'reply', Reply::any(), NewState::any(), Timeout::timeout()}
    | {'noreply', NewState::any()}
    | {'noreply', NewState::any(), Timeout::timeout()}
    | {'stop', Reason::any(), Reply::any(), NewState::any()}.
</pre></li>
<li><code>handle_cast/2</code>: Handle an application-specific cast. This callback is similar to the <code>gen_server:handle_cast/2</code> callback.<br />
<pre><strong>handle_cast(</strong><span style="color: green;">Msg</span>::any(), <span style="color: green;">State</span>::any()<strong>)</strong> -&gt;
    {'noreply', NewState::any()}
    | {'noreply', NewState::any(), Timeout::timeout()}
    | {'stop', Reason::any(), NewState::any()}.
</pre></li>
<li><code>terminate/2</code>: Cleanup on termination. This callback is similar to the <code>gen_server:terminate/2</code> callback.<br />
<pre><strong>terminate(</strong><span style="color: green;">Reason</span>::any(), <span style="color: green;">State</span>::any()<strong>)</strong> -&gt;
    no_return().
</pre></li>
<li><code>code_change/3</code>: Update the state after a module upgrade. This callback is similar to the <code>gen_server:code_change/3</code> callback.<br />
<pre><strong>code_change(</strong>{'down', <span style="color: green;">OldVsn</span>::any()} | <span style="color: green;">OldVsn</span>::any(), <span style="color: green;">State</span>::any(), <span style="color: green;">Extra</span>::any()<strong>)</strong> -&gt;
    {'ok', NewState::any()}.
</pre></li>
</ul>
<p>The most important callbacks to implement are <code>init/1</code> and <code>handle_log_entry/2</code>.</p>
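<p>To illustrate how little the other callbacks need to do, here is a minimal sketch of a complete callback module that only counts the parsed entries. The module name <code>count_entries</code> and the <code>get_count</code> message are hypothetical, chosen for this illustration; the return values follow the callback specifications listed above:</p>
<pre>-module(count_entries).
-behaviour(gen_log_analyzer).
-export([init/1, handle_log_entry/2, handle_call/3, handle_cast/2,
         terminate/2, code_change/3]).

<em>% The state is simply the number of entries seen so far:</em>
init(_Args) -&gt; {ok, 0}.

<em>% Count every parsed entry, whatever its contents:</em>
handle_log_entry(_LogEntry, Count) -&gt; {ok, Count + 1}.

<em>% Return the current count on request; reject any other call:</em>
handle_call(get_count, _From, Count) -&gt; {reply, Count, Count};
handle_call(_Msg, _From, Count) -&gt; {reply, {error, unknown_call}, Count}.

<em>% The remaining callbacks are trivial boilerplate:</em>
handle_cast(_Msg, Count) -&gt; {noreply, Count}.
terminate(_Reason, _Count) -&gt; ok.
code_change(_OldVsn, Count, _Extra) -&gt; {ok, Count}.
</pre>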
<h2>Running example</h2>
<p>Logtilla contains a basic example module, <code>log/logtilla_test</code>. It counts how many parsed log entries correspond to a query reply for which a length was returned, and how many don't have a length. This module has no practical purpose, but is useful to illustrate the behaviour callbacks. The module's most important parts are:</p>
<pre>-module(logtilla_test).

<em>% Implement Logtilla's gen_log_analyzer behaviour:</em>
-behaviour(gen_log_analyzer).
<em>% Include Logtilla's header for the definition of the LogEntry record:</em>
-include("WebAccessLog.hrl").

<em>% Define and initialize the state:</em>
-record(state, {count_without_length=0, count_with_length=0}).
init([]) -&gt;
  State = #state{},
  {ok, State}.

<em>% Analyze the log entry and update the state:</em>
handle_log_entry(LogEntry, State) -&gt;
  case LogEntry#'LogEntry'.length of
    asn1_NOVALUE -&gt;
      {ok, State#state{
        count_without_length=State#state.count_without_length+1}};
    _Length -&gt;
      {ok, State#state{
        count_with_length=State#state.count_with_length+1}}
  end.

<em>% Implement an application-specific call to return the stats:</em>
handle_call(get_stats, _, State) -&gt;
  {reply, {State#state.count_without_length, State#state.count_with_length},
   State}.
</pre><p>To run this example on a file named <code>/var/log/apache2/access.log</code>:</p>
<pre>$ cd src
$ PATH=../c_src:$PATH erl
&gt; {ok, Pid} = gen_log_analyzer:start_link(logtilla_test, [], []).
&gt; ok = gen_log_analyzer:parse(Pid, "/var/log/apache2/access.log").
&gt; gen_log_analyzer:call(Pid, get_stats).
</pre><p>This prints out a tuple with the count of entries without a length and the count of entries with a length.</p>
<p>The <code>c_src</code> directory must be in the <code>PATH</code>: that is where the <code>logtilla_parser</code> program is built, and <code>gen_log_analyzer</code> runs it as a port program to parse the files.</p>
<p>I will soon write other blog posts on the internals of Logtilla (the most interesting part) and on future work.</p>
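<p>As a quick sanity check before starting an analyzer, the standard <code>os:find_executable/1</code> function can confirm that the parser is reachable through the <code>PATH</code>. This is only a sketch: the <code>check_parser</code> module and the <code>parser_not_in_path</code> error atom are hypothetical names, not part of Logtilla:</p>
<pre>-module(check_parser).
-export([find/0]).

<em>% Look for the logtilla_parser port program in the PATH:</em>
find() -&gt;
    case os:find_executable("logtilla_parser") of
        false -&gt; {error, parser_not_in_path};
        Path -&gt; {ok, Path}
    end.
</pre>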
    ]]></content>
  </entry>
</feed>
