Today we had an experience which demonstrates rather nicely why we bother to take the time with things like Puppet and Nagios, even though they're both a big pain in the arse to retrofit into an existing environment.

First, Nagios.

Our NTP setup is pretty old. Way back in the mists of time, the firewalls at each of our sites were also given the job of providing site-local NTP service. All machines would then sync to the two nearest NTP servers. So, for example, Sydney machines synced to the Sydney and Melbourne NTP servers, Melbourne machines likewise.

This is being phased out.

Last Friday we shut off the Sydney firewall's NTP server. Most of the machines in Sydney were already using the newer arrangement, and everything's talking to at least two NTP servers anyway, right?

I got in this morning to find a lot of services in CRITICAL state: NTP on about a quarter of our machines, in fact. ntpd was unable to satisfactorily sync time. Without having done the legwork to get Nagios clients running on all our UNIX systems this simply wouldn't have shown up; we wouldn't have noticed until there'd been some real clock drift and serious problems as a result.

This led to discovering an even bigger problem: all these NTP daemons were in CRITICAL because both the NTP servers they were talking to -- the Sydney and Melbourne firewalls -- were unresponsive. Digging turned up that the external time sources Melbourne had been using had changed IPs at some point, so the firewall rules weren't letting the queries through any more. And with its remaining peer -- Sydney -- gone, it was no longer authoritative for time.

In the old days we probably wouldn't have spotted this. The monitoring system we were using was only able to query UNIX hosts via SNMP (it'd use WMI for Windows boxes) and this sort of thing isn't typically exposed via SNMP. So, all that time and effort adding Nagios has already paid off.

Fixing this was really easy: modify the site-specific NTP configs being pushed by Puppet and let it do its thing, including restarting ntpd on all the affected machines.
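To give a feel for what "site-specific NTP configs" means, here's a rough Python sketch of the idea. It is not our Puppet code: the real thing is a Puppet template, and the hostnames, the site map and the render_ntp_conf function are all invented for illustration. The only real bits are the ntp.conf directives (driftfile, server, iburst).

```python
# Hypothetical site -> NTP server map. The hostnames are made up for this
# example; they are not the real servers.
SITE_NTP_SERVERS = {
    "sydney": ["ntp1.syd.example.com", "ntp1.mel.example.com"],
    "melbourne": ["ntp1.mel.example.com", "ntp1.syd.example.com"],
}


def render_ntp_conf(site: str) -> str:
    """Render a minimal ntp.conf for one site, nearest server first."""
    servers = SITE_NTP_SERVERS[site]  # raises KeyError for an unknown site
    if len(servers) < 2:
        # Assumption: every site should have at least two time sources.
        raise ValueError(f"site {site!r} needs at least two NTP servers")
    lines = ["driftfile /var/lib/ntp/ntp.drift"]
    lines += [f"server {host} iburst" for host in servers]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ntp_conf("sydney"), end="")
```

The point is that the list of servers lives in one place, so swapping out a dead one is a one-line change that the config management tool then rolls out everywhere.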

Now, we should rig up Nagios to use check_ntp_peer against the new NTP servers. We should've had it watching the old ones too. Otherwise, though, I think we've got it about right.
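For the curious, here's roughly what a server-side check needs to catch, as a small Python sketch. This is not check_ntp_peer and doesn't work the way it does; it's a bare SNTP probe I'm using as a stand-in, and the function names are my own. It sends a standard client request and treats no reply, or a reply flagged as unsynchronized (leap indicator 3, or stratum 0 or 16 and up), as CRITICAL.

```python
import socket
import struct


def classify_ntp_reply(packet: bytes) -> str:
    """Turn a raw NTP reply into a Nagios-style status string."""
    if len(packet) < 48:
        return "CRITICAL: short or missing reply"
    first_byte, stratum = struct.unpack("!BB", packet[:2])
    leap = first_byte >> 6      # top two bits: leap indicator
    mode = first_byte & 0x07    # bottom three bits: mode (4 = server)
    if mode != 4:
        return "CRITICAL: not a server reply"
    # Leap indicator 3 means "clock not synchronized"; stratum 0 or >= 16
    # likewise means the server has no usable time source.
    if leap == 3 or stratum == 0 or stratum >= 16:
        return "CRITICAL: server is unsynchronized"
    return f"OK: stratum {stratum}"


def query_ntp(host: str, timeout: float = 2.0) -> str:
    """Send one client request to host:123 and classify the answer."""
    request = b"\x23" + 47 * b"\0"  # LI=0, version 4, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, (host, 123))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and name resolution failures
            return "CRITICAL: no response"
    return classify_ntp_reply(reply)
```

Both of Friday's failure modes show up here: a server that's been switched off never answers, and one that has lost its upstream sources answers but admits it isn't synchronized.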

Monitoring is something managers are usually okay with you spending time on, up to a point. External service checks and maybe looking to see that the disks haven't failed or filled is often about as much as is considered necessary. We're watching, on average, about 20 services per host.

Automation is a harder sell. It pays off as soon as you need a bulk change, but those don't come up all that often in a stable environment. And integrating something like Puppet into an extant production environment is dangerous if you don't do it carefully. This takes time. Quite a lot of it. It'd be very easy for a manager under time and resource pressure to nix such a project.

We got lucky. My manager is pretty smart, and when I started almost three years ago he took the opportunity to not publicize his new "resource" too widely. I was instead asked to get my head around the environment and look at ways to improve things. And system automation was on his list.


Abort, Rephrase, Ignore?

October 2011
