Gathering statistics for outbound mail


Gathering statistics for outbound mail

Josh Hammond
I am managing a small Postfix server for about 10 domains.
Most of the messages are inbound, except that some users use it as
their outgoing server.
With the users' permission, I want to implement a system that gathers
outbound data and shows which from/to combination is the most frequent.

Basically, every time a user uses the Postfix server to relay mail to a
non-local domain, I want to save the sender and recipient addresses in a
MySQL database (e.g. "[hidden email],[hidden email]").
I was thinking of using either bash or Python to code the script (but I'm
wondering whether PHP works with Postfix too?).
What is unclear is how to configure Postfix so that it calls the script
before relaying the message to the recipient's MX. What arguments does the
script need to understand, and how do I pass them from Postfix?
I will probably have more questions in the future, but that's all for now.
Thanks,

Josh


Re: Gathering statistics for outbound mail

Wietse Venema
Josh Hammond:
> I am managing a small postfix server for about 10 domains.
> Most of the messages are inbound except for some users that use it as
> their outgoing server.
> With the users' permission I want to implement a system to gather outbound
> data and see what from/to combination is the most frequent.

Postfix has SMTP-level hooks at the inbound side that give sender
and recipient information. These hooks are described in
http://www.postfix.org/SMTPD_POLICY_README.html.

In fact, a lot of the information you want is already maintained
by the policyd server (http://www.policyd.org/).

There are no equivalent hooks on the outbound side, because there
has not been a need to implement them.
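
For reference, the inbound policy protocol is simple enough to sketch. The following Python program is an untested sketch based on the protocol described in SMTPD_POLICY_README (one name=value attribute per line, an empty line ending each request, an "action=..." reply followed by an empty line); the stderr logging is merely a stand-in for the MySQL insert Josh described:

```python
#!/usr/bin/env python
# Sketch of a minimal Postfix policy service (protocol per
# SMTPD_POLICY_README). In a real setup this would be started from
# master.cf via the spawn daemon and would write to MySQL instead
# of stderr.
import sys

def parse_request(lines):
    """Turn the name=value lines of one policy request into a dict."""
    attrs = {}
    for line in lines:
        name, _, value = line.partition("=")
        attrs[name] = value
    return attrs

def main():
    pending = []
    for raw in sys.stdin:
        line = raw.rstrip("\n")
        if line:
            pending.append(line)
            continue
        # Empty line: one complete request. Record sender/recipient,
        # then tell Postfix we have no opinion ("dunno").
        attrs = parse_request(pending)
        sys.stderr.write("%s -> %s\n" % (attrs.get("sender", ""),
                                         attrs.get("recipient", "")))
        sys.stdout.write("action=dunno\n\n")
        sys.stdout.flush()
        pending = []

if __name__ == "__main__":
    main()
```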

        Wietse



Re: Gathering statistics for outbound mail

mouss-2
In reply to this post by Josh Hammond
Josh Hammond wrote:

> I am managing a small postfix server for about 10 domains.
> Most of the messages are inbound except for some users that use it as
> their outgoing server.
> With the users' permission I want to implement a system to gather outbound
> data and see what from/to combination is the most frequent.
>
> Basically every time a user uses the postfix server for relaying mail to a
> non-local domain, I want to save sender and recipient addresses in a mysql
> database (eg. "[hidden email],[hidden email]").
> I was thinking to use either bash or python to code the script (but I'm
> wondering if php work with postfix?).
> What is unclear is how to configure postfix so that it calls the script
> before relaying the message to the user2's MX. What arguments does the
> script need to understand and how do I send them from postfix?
> I will probably have more questions in the future, but that's for now.
>  


You can write a policy server to do that (or modify one of the
available policy servers).

Alternatively, you can parse the logs. If you use amavisd-new, that's
easy, because it logs multiple pieces of information on one line; otherwise
you'll need to "correlate" multiple lines.


Re: Gathering statistics for outbound mail

Josh Hammond
On Tue, June 10, 2008 8:36 pm, mouss wrote:

> Josh Hammond wrote:
>> [original message snipped]
>
>
> you can write a policy server to do that (you can modify one of the
> available policy servers).
>
> alternatively, you can parse logs. If you use amavisd-new, then it's
> easy because it logs multiple infos in one line. otherwise, you'll need
> to "correlate" multiple lines.
>

Thanks to both Wietse and mouss for the fast answers.
I am now wondering which would be better: a daily log-parsing task or a
policy server. I understand that the latter would give me real-time
statistics while with the former I would have to wait for the scheduled
run, but what about performance?
As I said before, most of the emails are inbound (circa 10-12k per day)
and the relayed ones are at most 500. Wouldn't this make log parsing
unnecessarily expensive?

Josh


Re: Gathering statistics for outbound mail

mouss-2
Josh Hammond wrote:

> On Tue, June 10, 2008 8:36 pm, mouss wrote:
>  
>> Josh Hammond wrote:
>>> [original message snipped]
>> you can write a policy server to do that (you can modify one of the
>> available policy servers).
>>
>> alternatively, you can parse logs. If you use amavisd-new, then it's
>> easy because it logs multiple infos in one line. otherwise, you'll need
>> to "correlate" multiple lines.
>>
>>    
>
> Thanks both to Wietse and mouss for the fast answers.
> I am now wondering what could be better between a daily log parsing task
> and a policy server. I can understand that the latter would give me
> real-time statistics while for the former I should wait for the scheduled
> time,

You can run a parser in real time (tail-like); just make sure to take
log rotation into account. This is probably easier with syslog-ng.

>  but what about performance?
> As I said before, most of the emails are inbound (circa 10-12k per day)
> and the relayed one are at most 500. Wouldn't this make log parsing
> unnecessairly expensive?
>  

If you parse the logs periodically, your script should remember where it
stopped last time to avoid reparsing; use seek() or an equivalent. Or run
the parser daily at an hour when the machine is not too busy (and run it
on the last rotated log file).

The problem with a log parser is that you need to implement the
"correlation" yourself. Here is a quick-and-dirty Perl script to print
from and to. You can try it on yesterday's logs and see whether it uses
a lot of resources.

#!/usr/bin/perl
# Quick-and-dirty correlation of Postfix qmgr (sender) and smtp
# (recipient) log lines by queue ID.

use strict;
use warnings;

my $logfile = $ARGV[0];
my %from;
my %to;

open(IN, '<', $logfile) or die "Cannot open $logfile: $!\n";
while (<IN>) {

    if (m| postfix/smtp\[\d+\]: (\S+): to=<([^>]+)>, relay=(\S+),|) {
        my $qid = $1;
        my $to = $2;
        my $relay = $3;
        if ($relay =~ /\[127\.0\.0\.1\]/) {
            # ignore if relayed to localhost
            next;
        }
        $to{$qid} = $to;
        next;
    }

    if (m| postfix/qmgr\[\d+\]: (\S+): from=<([^>]+)>, size=|) {
        $from{$1} = $2;
        next;
    }

}
close(IN);

foreach my $qid (keys %to) {
    my $to = $to{$qid};
    my $from = $from{$qid};
    print "$qid from=<$from> to=<$to>\n";
}
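
The script's output can then be tallied to answer Josh's original question of which from/to combination is the most frequent. A possible Python follow-up (an untested sketch; the input format assumed here is the Perl script's output lines):

```python
# Tally from/to pairs from lines of the form
#   "<queue-id> from=<a@b> to=<c@d>"
# and print the most frequent combinations.
import re
import sys
from collections import Counter

PAIR_RE = re.compile(r"from=<([^>]*)> to=<([^>]*)>")

def top_pairs(lines, n=10):
    counts = Counter()
    for line in lines:
        m = PAIR_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    for (sender, rcpt), count in top_pairs(sys.stdin):
        print("%6d  %s -> %s" % (count, sender, rcpt))
```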



Re: Gathering statistics for outbound mail

Josh Hammond
On Tue, June 10, 2008 10:21 pm, mouss wrote:

> Josh Hammond wrote:
>> On Tue, June 10, 2008 8:36 pm, mouss wrote:
>>
>>> Josh Hammond wrote:
>>>> [original message snipped]
>>> you can write a policy server to do that (you can modify one of the
>>> available policy servers).
>>>
>>> alternatively, you can parse logs. If you use amavisd-new, then it's
>>> easy because it logs multiple infos in one line. otherwise, you'll need
>>> to "correlate" multiple lines.
>>>
>>>
>>
>> Thanks both to Wietse and mouss for the fast answers.
>> I am now wondering what could be better between a daily log parsing task
>> and a policy server. I can understand that the latter would give me
>> real-time statistics while for the former I should wait for the
>> scheduled
>> time,
>
> you can run a parser in real time (tail like). just make sure to take
> log rotation into account. probably easier with syslog-ng.
>
>>  but what about performance?
>> As I said before, most of the emails are inbound (circa 10-12k per day)
>> and the relayed one are at most 500. Wouldn't this make log parsing
>> unnecessairly expensive?
>>
>
> if you parse the logs periodically, your script should remember where it
> was last time to avoid reparsing. use seek() or equivalent). or run the
> parser daily at an hour where the machine is not too busy (then run it
> on the last rotated log file).
>
> the problem with the log parser is that you need to implement the
> "correlation" yourself. Here is a quick and dirty perl to print from and
> to. you can try it on yesterday logs and see if it uses a lot of
> resources.
>
> [quick-and-dirty Perl script snipped]

I tried your script and it seems to work with an acceptable combination
of time and resources.
I think this might be the most effective solution, since I don't really
need live results but rather data over the long run.
I'll code something myself, and if I see a performance decrease I'll
implement the policyd service instead.

Josh


Re: Gathering statistics for outbound mail

Brent Bice
In reply to this post by Josh Hammond
Josh Hammond wrote:
> Thanks both to Wietse and mouss for the fast answers.
> I am now wondering what could be better between a daily log parsing task
> and a policy server. I can understand that the latter would give me
> real-time statistics while for the former I should wait for the scheduled
> time, but what about performance?

    Well, you could also have the logs fed (live) to a program that
parses them and updates an SQL database. This is easiest with syslog-ng,
but you can also do it with syslogd and a fifo.

    I recently banged out a quick C hack for another postfix-users
reader. It reads log entries from stdin, notes the latencies of
messages removed from the queue in the last 5 minutes, and keeps a
running average of those latencies in a file where
cacti/mrtg/whatever can pick it up. I just defined a new destination
in syslog-ng specifying my program, and then added that destination to
the log entry:

destination d_avgdelay { program("/home/bbice/avgdelays >/dev/null"); };
log { source(net); filter(f_mail); filter(f_postfix);
destination(postfix); destination(d_avgdelay); };

    The "postfix" destination writes to the log file, and the d_avgdelay
destination is my C program. I could have used Perl, except that I figured
C would be a lot faster (in case I started streaming really large
amounts of log data to it).

    The same thing can be done with syslogd and a fifo (syslogd
writing to the fifo and the program/script reading from it), but
it's not nearly as nice. You have to wrap the program in a watchdog
script to restart it whenever syslogd gets HUPed, for instance.

Brent

Re: Gathering statistics for outbound mail

Bill Anderson-2
In reply to this post by Wietse Venema

On Jun 10, 2008, at 12:14 PM, Wietse Venema wrote:

> Josh Hammond:
>> I am managing a small postfix server for about 10 domains.
>> Most of the messages are inbound except for some users that use it as
>> their outgoing server.
>> With the users' permission I want to implement a system to gather  
>> outbound
>> data and see what from/to combination is the most frequent.
>
> Postfix has SMTP-level hooks at the inbound side that give sender
> and recipient information. These hooks are described in
> http://www.postfix.org/SMTPD_POLICY_README.html.
>
> In fact, a lot of the information you want is already maintained
> by the policyd server (http://www.policyd.org/).
>
> There are no equivalent hooks on the outbound side, because there
> has not been a need to implement them.

Though some of us have quite wished for them. :)

Re: Gathering statistics for outbound mail

Bill Anderson-2
In reply to this post by Josh Hammond

Josh Hammond wrote:

> Thanks both to Wietse and mouss for the fast answers.
> I am now wondering what could be better between a daily log parsing  
> task
> and a policy server. I can understand that the latter would give me
> real-time statistics while for the former I should wait for the  
> scheduled
> time, but what about performance?
> As I said before, most of the emails are inbound (circa 10-12k per  
> day)
> and the relayed one are at most 500. Wouldn't this make log parsing
> unnecessairly expensive?
>

My preference is for the policy daemon: it has far less
overhead than parsing, and is less work. To correlate senders and
recipients in the logs you have to track queue IDs (and be sure to
account for the reuse of queue IDs), delivery attempts (to get
recipients) and the sender line (or lines, if you log subjects), and tie
them all together. Further, some mail can come in on one maillog and
leave on another, leading to a loss of data unless you persist the
"partial" data between parser runs, or run a "real-time" parser (for
example through syslog-ng's program destination, or a pipe) that
survives restarts.

In a policy daemon these are already tied together for you, and the
parsing is key=value. Pretty simple.

I've got oodles of Python, Awk, and Qt4 code for parsing the logs to
do this, but the policy-daemon route is much faster and easier to
maintain. Of course, I've got around 16GB of mail logs per day, so
there might be a slight bias there. ;)

Cheers,
Bill