abortrephrase | This is why we bother

Today we had an experience which demonstrates rather nicely why we bother to take the time with things like Puppet and Nagios, even though they're both a big pain in the arse to retrofit into an existing environment.

First, Nagios.

Our NTP setup is pretty old. Way back in the mists of time, the firewalls at each of our sites were also given the job of providing site-local NTP service. All machines would then sync to the two nearest NTP servers. So, for example, Sydney machines synced to the Sydney and Melbourne NTP servers, Melbourne machines likewise.

This is being phased out.

Last Friday we shut off the Sydney firewall's NTP server. Most of the machines in Sydney were already using the newer arrangement, and everything's talking to at least two NTP servers anyway, right?

I got in this morning to find a lot of services in CRITICAL state: NTP on about a quarter of our machines, in fact. ntpd was unable to satisfactorily sync time. Without having done the legwork to get Nagios clients running on all our UNIX systems this simply wouldn't have shown up; we wouldn't have noticed until there'd been some real clock drift and serious problems as a result.

This led to discovering an even bigger problem: all these NTP daemons were in CRITICAL because both the NTP servers they were talking to -- the Sydney and Melbourne firewalls -- were unresponsive. Digging turned up that the external time sources Melbourne had been using had changed IPs at some point, so the firewall rules weren't letting the queries through any more. And with its remaining peer -- Sydney -- gone, it was no longer authoritative for time.

In the old days we probably wouldn't have spotted this. The monitoring system we were using was only able to query UNIX hosts via SNMP (it'd use WMI for Windows boxes) and this sort of thing isn't typically exposed via SNMP. So, all that time and effort adding Nagios has already paid off.

Fixing this was really easy: modify the site-specific NTP configs being pushed by Puppet and let it do its thing, including restarting ntpd on all the affected machines.
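
To make "site-specific NTP configs" concrete, here's a rough sketch of the idea in Python. Puppet would do this with its own manifests and templates, not Python, and the hostnames, the site-to-server mapping and the driftfile path below are all made up for illustration rather than taken from our setup.

```python
# Hypothetical site -> NTP server mapping; every hostname here is invented.
SITE_NTP_SERVERS = {
    "sydney": ["ntp1.syd.example.com", "ntp1.mel.example.com"],
    "melbourne": ["ntp1.mel.example.com", "ntp1.syd.example.com"],
}


def render_ntp_conf(site):
    """Return the text of an ntp.conf for the given site."""
    servers = SITE_NTP_SERVERS[site]  # raises KeyError for an unknown site
    lines = [
        "# Managed centrally; local edits will be overwritten.",
        "driftfile /var/lib/ntp/ntp.drift",  # assumed path, varies by OS
    ]
    # One "server" line per upstream, nearest site first.
    lines += [f"server {host} iburst" for host in servers]
    return "\n".join(lines) + "\n"
```

The point is that the server list lives in exactly one place, so a change like retiring a firewall's NTP service becomes a one-line edit that gets pushed everywhere.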

Now, we should rig up Nagios to use check_ntp_peer against the new NTP servers. We should've had it watching the old ones too. Otherwise, though, I think we've got it about right.
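
For a feel of what such a check does, here's a minimal sketch of a Nagios-style plugin in Python. It is not how check_ntp_peer is implemented; it just parses the text printed by `ntpq -pn` and maps it onto the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the offset thresholds are arbitrary choices for illustration.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(billboard, warn_ms=50.0, crit_ms=500.0):
    """Return (status, message) for the output of `ntpq -pn`."""
    peers = 0
    for line in billboard.splitlines():
        if not line.strip() or line.startswith("="):
            continue  # blank line or the ===== separator
        tally, fields = line[0], line[1:].split()
        if fields[0] == "remote" or len(fields) < 10:
            continue  # column header or something unparseable
        peers += 1
        if tally in "*o":  # '*' = system peer, 'o' = PPS peer
            offset = float(fields[8])  # offset column, in milliseconds
            if abs(offset) >= crit_ms:
                return CRITICAL, f"NTP CRITICAL - offset {offset} ms from {fields[0]}"
            if abs(offset) >= warn_ms:
                return WARNING, f"NTP WARNING - offset {offset} ms from {fields[0]}"
            return OK, f"NTP OK - offset {offset} ms from {fields[0]}"
    return CRITICAL, f"NTP CRITICAL - no sync peer among {peers} configured"


def main():
    """Run ntpq, print one status line, return the plugin exit code."""
    try:
        result = subprocess.run(["ntpq", "-pn"], capture_output=True,
                                text=True, timeout=10, check=True)
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"NTP UNKNOWN - could not run ntpq: {exc}")
        return UNKNOWN
    status, message = classify(result.stdout)
    print(message)
    return status
```

To use it as a plugin, a wrapper script would call `sys.exit(main())`. The "no sync peer" branch is the one that matters here: it's what fires when every upstream has gone away, whether or not the clock has drifted yet.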

Monitoring is something managers are usually okay with you spending time on, up to a point. External service checks, and maybe looking to see that the disks haven't failed or filled, are often about as much as is considered necessary. We're watching, on average, about 20 services per host.

Automation is a harder sell. It pays off as soon as you need a bulk change, but those don't come up all that often in a stable environment. And integrating something like Puppet into an extant production environment is dangerous if you don't do it carefully. This takes time. Quite a lot of it. It'd be very easy for a manager under time and resource pressure to nix such a project.

We got lucky. My manager is pretty smart, and when I started almost three years ago he took the opportunity to not publicize his new "resource" too widely. I was instead asked to get my head around the environment and look at ways to improve things. And system automation was on his list.



rone

"It pays off as soon as you need a bulk change, but those don't come up all that often in a stable environment."

All the more reason to do them; if you don't make bulk changes often, doing them manually is a great way to fuck things up. Plus, you'll find that setting up automation means you aren't as reticent to make bulk changes that "ought" to be done, even if they fall short of "needing" to be done.

ideological_cuddle

Sure. But if you're a manager, you've got lots of pressure to get client-visible stuff done, and this thing is going to take serious time and if it's not done right create a shitload of trouble, maybe you're going to decide it's nice but not worth the bother because it'll only save your arse once a year.

In our case the value has already been demonstrated a bunch of times. But a less-bright manager might've not gone with it, and a less-trusting one might have decided it was just too dangerous given the sort of idiots you see doing sysadmin work...

vatine

"In the old days we probably wouldn't have spotted this."

Or at least not spotted it until clocks were de-synchronised enough that it caused other issues. I find that to be depressingly frequent.

Yeah. It would've shown up once the clocks were badly de-synced, or worse, when something happened where people actively looked at stuff and the timestamps really mattered. Given that some parts of this are trading and market-data environments, that could've been Bad(TM).

check_ntp_peer is very useful. And it's not in the standard set of checks that our lords and masters insist are run...

Edited (typo!) Date: 2010-07-12 12:31 pm (UTC)

heliumbreath

Not sure I like an ntp monitor that doesn't complain until ntpd is completely off the rails; certainly it merits a red alert when all peers go away, but a yellow alert when any one of the peers is down might have been appropriate, and maybe even an orange alert when more than half of them are gone. (I've been playing a little bit with nagios myself, and in a former life wrote a bespoke monitoring system for a former ISP.)

I'm not sure how it reacts to just losing one of the peers, because the systems effectively lost both at the same time -- one shut down, the other no longer able to provide sync because the one that went down was unintentionally the last remaining contactable source.

If we'd been watching the old NTP servers that second part would've been spotted a while back. We're not, because we're not putting the legacy FreeBSD boxes (with no compiler or ports tree, and a really old userland) into Nagios as they're being retired as fast as we can.

Yep, de-synced clocks when you need to reconcile logs from more than one machine are Pretty Bad. And that Badness probably comes with a (large) dollar figure when you're talking trading and market data.
