Floating Fag End Bins Spell Disaster

5 08 2007

Been busy for a while, lots to catch up on. Cast your mind back a few weeks to those balmy days in June when the Monsoon season first paid a visit to the UK. Needless to say, it got more than a tad moist around the data centre. After a few days of torrential downpour, the woodland at the back of the building had enough of the deluge and decided not to hold on to the rainwater any more. We happen to be located in a bit of a dip. We noticed with initial amusement the water levels rising around the back of our building. As it steadily rose over a few minutes, great mirth was had as the fag end bins from the smoking shelter decided now was their chance to make a break for freedom, and floated nonchalantly past the window.

Then an alarm rang. The bung in a conduit into one of our computer rooms had given out and water was gushing into it under the false floor. Proved that the water sensors worked! So we decided to start cleanly shutting down the affected applications as quickly as possible and fail them over to the remote DR site. This was soon followed by numerous identical queries: ‘has your session hung?’

Smoke had been smelt in the affected computer room, so the EPO had been hit. I was soooo jealous. I’ve always wanted to hit the EPO! Always imagined a scene like the one in ‘Total Recall’ where air is gushing out of the terminal on Mars after Arnie’s fake head exploded, and the brave soldier hanging on for dear life hovers his hand dramatically over the big red button for precarious seconds until he slams it and the emergency air lock doors come down. Instead, it was “<sniff>… I smell burning” followed by a quick poke of a small red button. Real life is so much more boring than the movies!

To cut a long story short, nobody had much sleep for the next 3 days. Some of the techies pulled 24+ hour shifts to restore service. We were plied with pizza – catnip to a techie – and heroism ensued. Got my first real-life experience of TSM backupsets as well. Never felt the urge to use them, but someone thought it would be a bright idea to take some remote site backupsets to speed up recovery at our stricken site, and I was asleep when it was agreed to use them.

So, if I’ve learnt one thing from this, it’s to make sure that if anybody ever suggests using TSM backupsets for a speedier recovery again, I shall staple their mouth shut. They are a righteous pain in the arse, especially when you’re sleep deprived!

Since then I’ve also had more fun with Brocade’s answer to the MG Montego, the 12000 director. More to come on that soon……





Is it wrong for a SAN to wear make-up?

12 05 2007

Thanks for spotting my typo, I’ve had a good chuckle. About the only chuckle this week, as this has been my week on the Operations Bridge and on call, where I get the pleasure of sitting in uncomfortable chairs during the day, having to respond to inane flapping and bleating, and then get woken up in the middle of the night for more flapping. Plus I’ve been in pay negotiations, and to thank us for all our hard work my employer wants to offer us a pay cut in real terms. Nice. Well, if they want a fight, they’re going to get one. This is a good advertisement for union membership – in parts of our business where there is a high percentage of union members, pay offers are high, and where there isn’t, they’re low. Hopefully I’ll be persuading more non-union members to join over the next few weeks; it’s time for their free ride to come to an end if they want to get a decent pay deal.

As you might be able to tell, I have a pretty low opinion of those who won’t join, but are happy to take the benefits that are gained from the subs paid by members, and complain that they haven’t had their pay rise yet. It’s like people who say “I didn’t vote – I couldn’t be bothered” or “it doesn’t make a difference”. Democracy is a privilege, not a right, and if you don’t vote, you have no right to complain. If you don’t like the choice, spoil your ballot paper, express your displeasure. I have a lot of respect for those who’ll do that – it’s democracy in action. People are dying around the world every day in places like Burma, China and Zimbabwe to have what we have.

Now I’ve finished venting, on the SAN side, we’ve worked around the ISL hit for now, though we’ve rigged up another server to try various load tests. I’ve tried the suggestion about changing trunk masters etc., but I still think there’s a problem. The loads across these ISLs are just not balancing, i.e. if I’ve got 100mb throughput, I’d expect to see close to 25mb on each of the four trunks, yet it seems to load mostly onto one connection. Strange. Next week shall see me pulling cables (orange ones 😉 ), adding more, changing trunks and generally hiding away in the server room.
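To put a number on "not balancing", here is a minimal sketch in Python of the expectation described above: with four links sharing the load, each should carry roughly a quarter of the total. It assumes the per-link throughput figures have already been read off by hand; the link names, the sample numbers and the 10-point tolerance are all made up for illustration, and nothing here talks to a switch.

```python
def link_shares(throughputs):
    """Return each link's fraction of the total throughput."""
    total = sum(throughputs.values())
    if total == 0:
        # No traffic at all: report every link as carrying nothing.
        return {name: 0.0 for name in throughputs}
    return {name: value / total for name, value in throughputs.items()}


def unbalanced_links(throughputs, tolerance=0.10):
    """List links whose share differs from an even split by more than `tolerance`.

    `tolerance` is an arbitrary illustrative threshold (0.10 = ten
    percentage points), not a vendor recommendation.
    """
    if not throughputs:
        return []
    fair_share = 1.0 / len(throughputs)
    shares = link_shares(throughputs)
    return sorted(
        name for name, share in shares.items()
        if abs(share - fair_share) > tolerance
    )


# Hypothetical readings, in whatever unit the monitoring tool reports.
observed = {"isl0": 85, "isl1": 5, "isl2": 5, "isl3": 5}
expected = {"isl0": 25, "isl1": 25, "isl2": 25, "isl3": 25}

print(unbalanced_links(observed))
print(unbalanced_links(expected))
```

With the lopsided sample readings every link gets flagged, the busy one for carrying too much and the other three for carrying too little; with an even 25 apiece nothing is flagged.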

On the plus side, I’ve got Brocade Fabric Manager to play with for the moment, which seems useful, though like any Java app it sucks resource like an Electrolux (for those of you not British or of my generation, there was an ad in the 80s for a hoover that said ‘Nothing sucks like an Electrolux’ – naturally in my youth we would use that as a challenge for the ladies my friends and I would date). Anyhoo, I digress, so returning from the dirt track to the highway: next week we’ll also be trying out some tuning of AIX server fibre cards, as we seem to be maxing out the throughput of our new TSM servers prematurely when duplicating LTO3 tapes, and we believe we’ve got the TSM settings optimised.

So, thanks for the input. I’m off to bed soon (after the end of the Eurovision Song Contest) as my wife’s away this weekend so I’m looking after the children (5 and 2) on my own, and they’ll be wanting their breakfast bright and early. At least it puts my storage problems into perspective!