I have my Bacula instances set to email me only if there is a problem. The drawback is that if the Director dies, the email system stops working, or anything else prevents those critical mails from reaching me, the integrity of my backup system might be compromised without my knowing about it.
I prefer monitoring with fine-grained control, so I use a per-job monitor in my Nagios installation. Remote machines are queried via SNMP.
Bacula supports running commands before and after a backup, and these commands can be run on either the Director or the Client. We'll use this functionality to write a status file on each Client's hard drive. Crucially, each status file contains a Unix timestamp, along with some other textual information.
Below is a sample backup job. Notice the Client Run After Job line:
Job {
  Name = "client1.example.com main"
  Type = Backup
  Client = client1-fd
  FileSet = "client1 main set"
  Schedule = "WeeklyCycle"
  Messages = Standard
  Storage = director1-sd
  Priority = 10
  Write Bootstrap = "/export/bacula/db/client1-main.bsr"
  Full Backup Pool = Full-Pool
  Incremental Backup Pool = Inc-Pool
  Differential Backup Pool = Diff-Pool
  Pool = Full-Pool
  Client Run After Job = "/usr/local/bin/postbaculajob.pl -c \"%c\" -d \"%d\" -i \"%i\" -l \"%l\" -n \"%n\" -o /var/log/bacula-main.log"
}
This is pretty simple: if the job was successful, and only if it was successful, the script /usr/local/bin/postbaculajob.pl is run on the Client with the given parameters. The parameters just pass details on to the script: %c is the Client name, %d the Director name, %i the Job ID, %l the Job level and %n the Job name.
Below is a sample script that you can use for the /usr/local/bin/postbaculajob.pl file:
#!/usr/bin/perl
# Record the completion of a Bacula job as a one-line status file.
use strict;
use warnings;
use Getopt::Std;

my %opt;
getopts('d:i:l:n:c:o:h', \%opt);

if ($opt{h}) {
    showHelp();
    exit(0);
}

# Every option except -h is required
unless ($opt{d} && $opt{i} && $opt{l} && $opt{n} && $opt{c} && $opt{o}) {
    showHelp("Not all options specified");
    exit(1);
}

# Overwrite the status file: Unix timestamp first, then some context
open(my $out, '>', $opt{o}) or die("Can't write to output file ($opt{o}): $!");
print {$out} time() . " $opt{l} Job $opt{n} ($opt{i}) by director $opt{d} on client $opt{c}\n";
close($out);

# Print usage information, preceded by an error message if one is given
sub showHelp {
    my $msg = shift;
    print("$0 usage information\n\n");
    if ($msg) {
        print(STDERR "ERROR: $msg\n\n");
    }
    print("Compulsory Options:\n");
    print("\t-d\tName of the Director running the job\n");
    print("\t-i\tJob ID\n");
    print("\t-l\tJob Level\n");
    print("\t-n\tJob Name\n");
    print("\t-c\tClient's Name\n");
    print("\t-o\tFile to create showing the job status\n");
    print("\nOptional Options:\n");
    print("\t-h\tThis help message\n\n");
}
With this in place and the /usr/local/bin/postbaculajob.pl script marked executable, you will now have a file /var/log/bacula-main.log that should look more or less like this after a job has been run:
1153320241 Incremental Job client1.example.com_main (27) by director director1-dir on client client1-fd
The number at the beginning of the line is a Unix timestamp (seconds since the start of 1970); the rest is just there for some context.
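To check a status file by hand, the leading timestamp can be pulled out with standard shell tools. The following is only a sketch: the function names status_timestamp and status_age are made up for this example, it assumes the one-line format shown above, and it relies on `date +%s`, which works with the GNU and BSD versions of date but is not guaranteed everywhere.

```shell
# Sketch only: status_timestamp and status_age are hypothetical helper names.

# Print the leading Unix timestamp from the first line of a status file.
status_timestamp() {
    head -n 1 "$1" | cut -d' ' -f1
}

# Print how many seconds ago the job recorded in the status file finished.
status_age() {
    now=$(date +%s)
    stamp=$(status_timestamp "$1")
    # Give up if the file is missing or empty
    [ -n "$stamp" ] || return 1
    echo $((now - stamp))
}
```

For example, `status_age /var/log/bacula-main.log` prints the age of the last main backup in seconds.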
At this point you have a few options. You can put this file on a web server and write something to monitor it, pull it into Nagios over HTTP, or follow my admittedly more complex example below, which hooks it into snmpd and lets Nagios query it over SNMP.
The basic concept is that a second script loads the file above and calculates the difference between the current time and the time the job was run. If that difference is longer than a day, as in my example, you should be alerted, but you can adjust the interval to whatever suits your system.
Here is the script to calculate this age and print it to STDOUT:
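#!/usr/bin/perl
# Print the age in seconds of each Bacula status file named on the command line.
use strict;
use warnings;

# Age reported when a file is missing or unparsable: one year, so the check alerts
my $notFoundAge = 31536000;
my $now = time();

foreach my $file (@ARGV) {
    my $age = $notFoundAge;
    if (open(my $stamp, '<', $file)) {
        my $line = <$stamp>;
        close($stamp);
        # The status line starts with a Unix timestamp followed by a space
        if (defined $line && $line =~ /^(\d+) /) {
            $age = $now - $1;
        }
    }
    print("$age\n");
}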
I save this file as /usr/local/bin/baculaJobStatsSNMP.pl on the machine client1.example.com and mark it executable. Next I need to hook it into my snmpd. For this you will need your own Enterprise OID, a number assigned to an organization by IANA. You can apply for one by filling in a form on the IANA website; the whole process should take a few days.
The next step is to configure snmpd. I'll assume you have the basics of snmpd working and that you can already query other SNMP values on client1.example.com. If not, you'll need to read the snmpd documentation and come back when you have snmpd working. I use Net-SNMP.
Edit your snmpd.conf and add lines like this:
exec .1.3.6.1.4.1.XXX.1 BaculaMainAge /usr/local/bin/baculaJobStatsSNMP.pl /var/log/bacula-main.log
exec .1.3.6.1.4.1.XXX.2 BaculaDBAge /usr/local/bin/baculaJobStatsSNMP.pl /var/log/bacula-db.log
You will need to substitute XXX with your own OID Number that you got from IANA.
Now restart snmpd and check that it works using snmpwalk, again substituting XXX with your own OID Number:
# snmpwalk -c topsecret -v 1 client1.example.com .1.3.6.1.4.1.XXX.1
SNMPv2-SMI::enterprises.11252.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.11252.1.2.1 = STRING: "BaculaMainAge"
SNMPv2-SMI::enterprises.11252.1.3.1 = STRING: "/usr/local/bin/baculaJobStatsSNMP.pl /var/log/bacula-main.log"
SNMPv2-SMI::enterprises.11252.1.100.1 = INTEGER: 0
SNMPv2-SMI::enterprises.11252.1.101.1 = STRING: "5704"
SNMPv2-SMI::enterprises.11252.1.102.1 = INTEGER: 0
Here you can see that the backup last ran 5704 seconds ago.
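If you want just the number, for example in a quick script of your own, the quoted value can be cut out of the snmpwalk output with sed. This is a minimal sketch, assuming snmpwalk prints one row per line in the format shown above; extract_age is a name invented for this example.

```shell
# Sketch only: extract_age is a hypothetical helper name.
# Reads snmpwalk output on stdin and prints the number from the
# ".101.1" row, which holds the output of the exec'd script.
extract_age() {
    sed -n 's/^.*\.101\.1 = STRING: "\([0-9][0-9]*\)"[[:space:]]*$/\1/p'
}
```

It would be used as `snmpwalk -c topsecret -v 1 client1.example.com .1.3.6.1.4.1.XXX.1 | extract_age`.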
The next step is to configure Nagios. I'll assume you have the basics of Nagios working and that you can already query other SNMP values on client1.example.com. If not, you'll need to read the Nagios documentation and come back when you have Nagios working.
On a FreeBSD system the /usr/ports/net-mgmt/nagios-plugins port is required for this to work. On RedHat Linux you can get the nagios-plugins rpm from RPMForge.
Configure a command in Nagios that does a simple numeric check over SNMP using the following. You can put it into your misccommands.cfg or wherever else you configure local commands. Note that it passes $USER3$ as the SNMP community string, so that macro needs to be set in your Nagios resource file:
define command{
    command_name check_numeric_snmp
    command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C $USER3$ -o $ARG1$ -w $ARG2$ -c $ARG3$
}
Now define a service as follows. Again, I am assuming you already have the host defined in Nagios:
define service {
    host_name client1.example.com
    service_description Bacula Main Backup
    check_command check_numeric_snmp!.1.3.6.1.4.1.XXX.1.101.1!90000!90000
    max_check_attempts 1
    contact_groups sysadmin
    use generic-service-template
}
You will need to modify the above a bit to fit your own environment (your own contact_groups, templates etc.), but this should give you a good idea of how to do it. Again, substitute XXX with your own OID from IANA.
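The two 90000 values in the check_command are the warning and critical thresholds in seconds, so other backup schedules only need a little arithmetic. As a sketch (hours_to_seconds is a made-up helper for this example, not part of Nagios or its plugins):

```shell
# Sketch only: hours_to_seconds is a hypothetical helper name.
# Convert a number of whole hours into seconds for the -w/-c thresholds.
hours_to_seconds() {
    echo $(($1 * 3600))
}

hours_to_seconds 25     # daily job plus an hour of slack: prints 90000
hours_to_seconds 169    # weekly job plus an hour of slack: prints 608400
```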
That's it. Now if you restart Nagios, it will tell you whenever the main backup was last run more than 90000 seconds (25 hours) ago.
