Curing the long distance blues

23 05 2007

Now is the time to rejoice! I enabled ‘ISL R_RDY’ mode on the four ISLs between sites, enabled long distance with sufficient buffer credits for 30km at full packet size (my links are either 16km or 18km, depending on who I’m talking to) and sat back to watch the portperfshow. I thought I was going to wet my pants when I saw the speed on each link leap from 60mb/s to 150! I reset the port stats, and noticed that I was still getting a number of ‘out of buffer credits’ errors. I’m not sure if I can get stats on the average packet size going through (can anyone advise?), so I took a stab and increased to 40km and reset the stats again. My self pleasure went into overload as the speed topped 170mb/s!
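
For anyone wondering why bumping the setting from 30km to 40km helped on an 18km link, here is a rough sketch of the sums. It assumes the rule of thumb from my 26 03 2007 post (1 credit per km on a 2gb link for full-size frames) and a full-size Fibre Channel frame of 2148 bytes including headers. The function names are my own invention, not anything from Fabric OS, and real links will behave less tidily than this.

```python
import math

# Assumption: a maximum-size Fibre Channel frame, headers included.
FULL_FRAME_BYTES = 2148


def credits_needed(distance_km, avg_frame_bytes, speed_gbps=2):
    """Rule-of-thumb credits: 1 per km at 2 Gb/s for full-size frames,
    scaled up in proportion when the average frame is smaller."""
    full_frame_credits = distance_km * speed_gbps / 2
    return math.ceil(full_frame_credits * FULL_FRAME_BYTES / avg_frame_bytes)


def smallest_frame_covered(credits, distance_km, speed_gbps=2):
    """Average frame size (bytes) below which `credits` stop keeping the link full."""
    full_frame_credits = distance_km * speed_gbps / 2
    return FULL_FRAME_BYTES * full_frame_credits / credits


if __name__ == "__main__":
    # Treat the 30km and 40km settings as roughly 30 and 40 credits at 2 Gb/s.
    for credits in (30, 40):
        size = smallest_frame_covered(credits, 18)
        print(f"{credits} credits on an 18km link cover average frames down to ~{size:.0f} bytes")
```

If the average frame is a lot smaller than full size, even the 40km setting may not be enough, which would fit with the remaining 'out of buffer credits' errors.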

To get a tape copy to run from the remote site to production, we deleted a couple of copy pool volumes in TSM and re-ran a copy storage pool with multiple drives, so I had tape traffic going in both directions, and soon the throughput was topping 200mb/s!

For once, I am happy, and needed a good hosing down to calm myself. I just need to get some long distance licences for the 4100s I have spare so I can move the other site links (we have a second remote site linked to another machine room) from the creaking 12000s and I will be truly joyful.

On the home front, so far there have been no further steaming gifts left in the living room, just the odd puddle!


R_RDY, Potty, Go!

18 05 2007

Had an interesting meeting today (don’t say that often) with a man from Nortel who summed up my long distance ISL problem nicely – I need to use ISL R_RDY mode with the long distance settings on the ISLs across our DWDM links (fronted by Nortel MOTRs). Apparently the Brocade VC Link Init protocol is too sensitive to work with their kit, so I shall be trying that out early next week. It may even cure my lack of throughput, which I suspect is because I’m not getting a huge amount of full size packets going through the links, hence exhausting the buffer credits.

Also found out that with Condor ASIC equipped Brocade switches (4100, 48000 etc) I should not expect to see an even spread of throughput across ISL trunks, as their algorithm loads a single link up to around 70% first before starting to spread out further, to cut out unnecessary frame splits and re-joins. Makes sense I suppose.
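
To convince myself, here is a toy model of that behaviour as it was described to me. It is emphatically not Brocade's actual algorithm, just a sketch of "fill one link to about 70% before touching the next". The function name and the 200 MB/s per-link capacity are my own assumptions.

```python
def fill_first(total_mb_s, links, link_capacity_mb_s, threshold_pct=70):
    """Toy model: fill each link to threshold_pct of its capacity before using the next."""
    per_link_cap = link_capacity_mb_s * threshold_pct / 100
    loads = []
    remaining = total_mb_s
    for _ in range(links):
        share = min(remaining, per_link_cap)
        loads.append(share)
        remaining -= share
    # Anything left once every link sits at the threshold is spread evenly.
    if remaining > 0:
        loads = [load + remaining / links for load in loads]
    return loads


if __name__ == "__main__":
    # 100 MB/s over four links, each assumed good for about 200 MB/s.
    print(fill_first(100, 4, 200))
    # A heavier load starts to spill onto the second and third links.
    print(fill_first(300, 4, 200))
```

Under this model 100mb of traffic never leaves the first link, which is pretty much the lopsided picture I was complaining about in the previous post.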

There was a hint of a beige tint to the air when I got home tonight. My daughter has started potty training, and not long before I came home had decided that she couldn’t get to the potty in time, so left what the Big Yin (that’s Billy Connolly, one of the funniest people on the planet) may describe as a ‘wee beige jobby’ in the middle of the living room! At least she went and washed her hands afterwards!

Which leads nicely to the fact that I’m on call this weekend and already had my first call within half an hour of getting home 😦

Is it wrong for a SAN to wear make-up?

12 05 2007

Thanks for spotting my typo, I’ve had a good chuckle. About the only chuckle this week, as this has been my week on the Operations Bridge and on call, where I get the pleasure of sitting in uncomfortable chairs during the day, having to respond to inane flapping and bleating, and then get woken up in the middle of the night for more flapping. Plus I’ve been in pay negotiations, and to thank us for all our hard work my employer wants to offer us a pay cut in real terms. Nice. Well, if they want a fight, they’re going to get one. This is a good advertisement for union membership – in parts of our business where there is a high percentage of union members, pay offers are high, and where it’s not, they’re low. Hopefully I’ll be persuading more non-union members to join over the next few weeks, it’s time for their free ride to come to an end if they want to get a decent pay deal.

As you might be able to tell, I have a pretty low opinion of those who won’t join, but are happy to take the benefits that are gained from the subs paid by members, and complain that they haven’t had their pay rise yet. It’s like people who say “I didn’t vote – I couldn’t be bothered” or “it doesn’t make a difference”. Democracy is a privilege, not a right, and if you don’t vote, you have no right to complain. If you don’t like the choice, spoil your ballot paper, express your displeasure. I have a lot of respect for those who’ll do that – it’s democracy in action. People are dying around the world every day in places like Burma, China and Zimbabwe to have what we have.

Now I’ve finished venting, on the SAN side, we’ve worked around the ISL hit for now, though we’ve rigged up another server to try various load tests. I’ve tried the suggestion about changing trunk masters etc., but I still think there’s a problem. The loads across these ISLs are just not balancing, i.e. if I’ve got 100mb throughput, I’d expect to see close to 25mb on each of the four trunks, yet it seems to load mostly onto one connection. Strange. Next week shall see me pulling cables (orange ones 😉 ), adding more, changing trunks and generally hiding away in the server room.

On the plus side, I’ve got Brocade Fabric Manager to play with for the moment, which seems useful, though like any Java app it sucks resource like an Electrolux (for those of you not British or of my generation, there was an ad in the 80s for a hoover that said ‘Nothing sucks like an Electrolux’ – naturally in my youth we would use that as a challenge for the ladies my friends and I would date). Anyhoo, I digress, so returning from the dirt track to the highway, next week we’ll also be trying out some tuning of AIX server fibre cards, as we seem to be maxing out the throughput of our new TSM servers prematurely when duplicating LTO3 tapes, and we believe we’ve got the TSM settings optimised.

So, thanks for the input. I’m off to bed soon (after the end of the Eurovision Song Contest) as my wife’s away this weekend so I’m looking after the children (5 and 2) on my own, and they’ll be wanting their breakfast bright and early. At least it puts my storage problems into perspective!

Of ISLs and Men

29 04 2007

My shower is working again. It started working again the next day. Maybe a dodgy connection in the switch? Who cares, showers all round again.

I have a cold. My wife had it for two days, but took a day off work sick to aid recuperation, which did the trick. Then I caught it. Two weeks later I still have it. I’m too busy to be sick, same old story, so I just carry on at work as normal. At least I can share my misery. When I’m miserable, I want to share it. When I’m happy, I’m selfish, it’s all mine!

Over the past few weeks lots of new AIX servers have been going in. Almost without fail, for each one I’ll hear the whinge from the Unix guys that one of the partitions won’t log onto the SAN. “Must be a SAN problem” they cry. So each time I’ve gone into the server room, have seen no lights, so checked the cables, flipped the ends, tried other ports and even thrown 30m cables across the room to check whether or not it’s the patch panels, and every time I’ve turned round and said “it’s your server – check the fibre card”. And each time I’ve been right. Shit like this just never happened when I worked with mainframes. Looks to me like we’re getting more and more shoddy server fibre cards emerging from wherever they come from in the Far East. IBM need to get their act together.

However, a curious performance problem hit recently. Chronic performance from a new server/app on one of our shiny new USPs. We did the usual checks, dispersed allocation across RAID groups, port contention etc. Noticed that the filesystem was not striped. So that was sorted. But the problem was still there, seeing 30ms responses. Head scratches all round. It was noted that the server was connected to one 48000, the storage on another, both linked by an under-utilised 8gb trunk. I beefed the trunk to 16gb, still bad. To cut a long story short, we proved the point that the ISL was the bottleneck: server and storage on same switch = 3ms response. Now that has really baffled me. Our normal standards are to try and host servers and their storage as best we can on the same switch, though it’s not always possible, but I’ve never seen latency like this. The switches are close to each other, linked by 9m cables between ISLs, and I’d expect no more than 20 microseconds each way latency, not milliseconds! Been hitting the books again, but baffled once again. If any of you kind techies out there who might be taking your valuable time to enjoy my rantings have any thoughts, I’d appreciate any insight, as I have appreciated the comments I’ve received to date.
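
Just to show how far off the cable itself is from explaining this, here is a back-of-envelope sketch. It assumes light travels through fibre at roughly 200,000 km per second (about two thirds of its speed in a vacuum) and ignores whatever the switches add, so treat it as an illustration rather than a measurement. The function name is mine.

```python
# Assumption: propagation speed in glass fibre, roughly two thirds of c.
FIBRE_SPEED_M_PER_S = 2.0e8


def one_way_delay_us(cable_length_m):
    """Propagation delay along a fibre of the given length, in microseconds."""
    return cable_length_m / FIBRE_SPEED_M_PER_S * 1e6


if __name__ == "__main__":
    cable_us = one_way_delay_us(9)
    extra_us = (30 - 3) * 1000  # observed gap: 30ms across the ISL vs 3ms on one switch
    print(f"9m of fibre, one way: {cable_us:.3f} microseconds")
    print(f"Observed extra response time: {extra_us} microseconds")
    print(f"Ratio: roughly {extra_us / cable_us:,.0f} to 1")
```

Whatever is adding the extra time, on these assumptions it cannot be the 9m of glass.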

Showers, SANs and Insurgency

9 04 2007

I’m doomed not to be able to have a decent hot electric shower. Came home from an Easter weekend away to find it’s the second time it has stopped working. I expect it’s the electrics again rather than the shower. Looks like I’ll have to buy another, less powerful, shower and try and sell this one on ebay. Anyone want to buy a 10kw shower? In the meantime, bath time each morning. Just one more thing to irritate me.

Spent the weekend with my wife and children with her brother and his family and extended family. A great weekend, only spoiled by the damn shower when we got home! On Sunday we went to a first birthday party, which was actually more of an adult affair. There I met a neighbour, an American gentleman called Michael and his wife and child. He told me he was a journalist, and had been a foreign correspondent and was now writing his second book. I finally got the chance to sit down for a chat just when we had to leave for the long journey back home. So I promised to look up his book when I got home.

After a quick google, I found out some more about this interesting gentleman. His name is Michael Goldfarb, his book is called “Ahmad’s War, Ahmad’s Peace”. Michael is a very, very accomplished journalist. After reading bios, I am sad that I didn’t have more time to talk with him. He has created a documentary for the “Inside Out” program on WBUR Boston on the subject, which, in my opinion, is a moving account of the start of the current war in Iraq from a personal perspective. Click here to visit that documentary, which includes the radio program. It’s an hour long. Take my advice: get yourself a clear hour, turn up the speakers on your PC, get comfy and listen to it.

At work, I’ve finally sorted out my SAN problem, no thanks to the hopeless vendor (two letters, known for printers). Brocade were on site earlier last week for another matter, and their technical guru sat down with me, looked at the problem for a few minutes, and said “hafailover”. I’d suggested that to my vendor over a week previously – perhaps it got lost in translation between here and India. So, I spent an hour raising the change ticket, another hour talking to our customers, then the following morning I hit the return key, crossed my fingers and hoped the 12000 wouldn’t die. It was as momentous as when the clocks ticked over to 1st January 2000. No planes dropped out of the sky, I still owed the bank several body parts for a mortgage, and the problem went away.

Take note Brocade customers – Brocade are offering a supplementary support service, obviously at a cost. If you’ve bought your switches from HP, IBM or EMC, get them in now. Don’t delay!

Dark porcelain

1 04 2007

Considering the grief I’ve had recently over dark fibre, Google’s April Fool really made me chuckle. Enjoy.

Wood, Trees, D’oh!

26 03 2007

Well, good news and bad. Good news, the DWDM links are working properly. Bad news, my inability to see the wood for the trees. Got so tied up looking at the quality of the links, someone said to me today “You’ve allocated too many buffer credits”. So, I turned off the long distance, hey presto, no more time outs. D’oh, back to the dummy’s book for me. Too many credits meant the destination port was being flooded. Now I need to test and tweak. Without LD, I’ve not got enough credits. We’ll have a diverse 22km route soon, maths says “1 credit per km” for a 2gb link. I thought dynamic allocation (LD) mode on the Brocades would handle everything nicely for me, but obviously not. Need to find out why it thought the link was 30km. So, time for testing with static allocations. Nothing like sucking and seeing.
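
For the curious, here is roughly where the “1 credit per km” figure comes from, sketched out. The assumptions are mine: a 2gb link signalling at 2.125 Gbaud with 8b/10b encoding, full-size 2148 byte frames, about 5 microseconds per km through fibre, and no allowance for anything the switches or DWDM kit add. A credit is tied up for the full round trip (frame out, R_RDY back), so you need enough credits to keep frames flowing for that long.

```python
import math

# All of these are assumptions for the sketch, not figures from a Brocade manual.
LINE_RATE_BAUD = 2.125e9      # 2gb Fibre Channel signalling rate
BITS_PER_BYTE_ON_WIRE = 10    # 8b/10b encoding: 10 line bits per data byte
FULL_FRAME_BYTES = 2148       # maximum-size frame including headers
FIBRE_DELAY_S_PER_KM = 5e-6   # roughly 5 microseconds per km of fibre


def credits_for_distance(distance_km, frame_bytes=FULL_FRAME_BYTES):
    """Credits needed to keep a link busy for one round trip at full-size frames."""
    frame_time_s = frame_bytes * BITS_PER_BYTE_ON_WIRE / LINE_RATE_BAUD
    round_trip_s = 2 * distance_km * FIBRE_DELAY_S_PER_KM
    return math.ceil(round_trip_s / frame_time_s)


if __name__ == "__main__":
    for km in (16, 18, 22, 30):
        print(f"{km}km link: about {credits_for_distance(km)} credits")
```

A full-size frame takes about 10 microseconds to send, so it stretches over about 2km of fibre, and the round trip is twice the distance, which is how it nets out at one credit per km.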

Bad news, still waiting for my 12000 vendor to come back to me and explain why one half still thinks the fabric is busy and stops me propagating fabric changes. Not impressed. I fear I’ll have to reboot, which means an emergency change. Perfect timing as our Change Management team has just introduced a new template, which means half a day just to fill out the change request for the outage. And of course, they’ve not publicised the change. I noted in the new template there’s not a section relating to hypocrisy.