Of ISLs and Men

29 04 2007

My shower is working again. It started working again the next day. Maybe a dodgy connection in the switch? Who cares, showers all round again.

I have a cold. My wife had it for two days, but took a day off work sick to aid recuperation, which did the trick. Then I caught it. Two weeks later I still have it. I’m too busy to be sick, same old story, so I just carry on at work as normal. At least I can share my misery. When I’m miserable, I want to share it. When I’m happy, I’m selfish, it’s all mine!

Over the past few weeks lots of new AIX servers have been going in. Almost without fail, for each one I’ll hear the whinge from the Unix guys that one of the partitions won’t log onto the SAN. “Must be a SAN problem” they cry. So each time I’ve gone into the server room, seen no lights, checked the cables, flipped the ends, tried other ports and even thrown 30m cables across the room to check whether or not it’s the patch panels. And every time I’ve turned round and said “it’s your server – check the fibre card”. And each time I’ve been right. Shit like this just never happened when I worked with mainframes. Looks to me like we’re getting more and more shoddy server fibre cards emerging from wherever they come from in the Far East. IBM need to get their act together.

However, a curious performance problem hit recently: chronic performance from a new server/app on one of our shiny new USPs. We did the usual checks: dispersed allocation across RAID groups, port contention etc. We noticed that the filesystem was not striped, so that was sorted. But the problem was still there, and we were seeing 30ms responses. Head scratches all round. It was noted that the server was connected to one 48000 and the storage to another, both linked by an under-utilised 8Gb trunk. I beefed the trunk up to 16Gb; still bad. To cut a long story short, we proved that the ISL was the bottleneck: with server and storage on the same switch, we got 3ms responses. Now that has really baffled me. Our normal standards are to try and host servers and their storage as best we can on the same switch, though it’s not always possible, but I’ve never seen latency like this. The switches are close to each other, linked by 9m cables between ISLs, and I’d expect no more than 20 microseconds of latency each way, not milliseconds! I’ve been hitting the books again, but I’m baffled once again. If any of you kind techies out there who might be taking your valuable time to enjoy my rantings have any thoughts, I’d appreciate any insight, as I have appreciated the comments I’ve received to date.
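For what it’s worth, here is the back-of-envelope sum behind that “microseconds, not milliseconds” expectation, as a rough Python sketch. The refractive index and the per-switch forwarding latency are assumed round numbers for illustration (not vendor figures, and not anything measured on these switches), and the function names are made up for this example.

```python
# Back-of-envelope latency budget for a short inter-switch link (ISL).
# Every constant below is an assumption for illustration, not a measurement.

SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBRE_REFRACTIVE_INDEX = 1.47   # assumed typical value for glass fibre
SWITCH_HOP_LATENCY_US = 5.0     # hypothetical forwarding latency per switch


def propagation_delay_us(cable_m: float) -> float:
    """One-way propagation delay through a fibre of this length, in microseconds."""
    speed_in_fibre = SPEED_OF_LIGHT_M_PER_S / FIBRE_REFRACTIVE_INDEX
    return cable_m / speed_in_fibre * 1e6


def one_way_latency_us(cable_m: float, switch_hops: int) -> float:
    """Cable delay plus the assumed forwarding latency of each switch crossed."""
    return propagation_delay_us(cable_m) + switch_hops * SWITCH_HOP_LATENCY_US


if __name__ == "__main__":
    expected_us = one_way_latency_us(9, switch_hops=2)
    # Observed: 30ms across the ISL versus 3ms on the same switch.
    unexplained_us = (30 - 3) * 1000
    print(f"9m of fibre, one way: {propagation_delay_us(9):.3f} us")
    print(f"expected one-way total: {expected_us:.1f} us")
    print(f"unexplained extra response time: {unexplained_us} us")
    print(f"ratio: {unexplained_us / expected_us:.0f}x")
```

Under those assumptions the 9m cable itself contributes only about 0.04 microseconds, so whatever is adding 27ms, it is not distance. The gap is three orders of magnitude, which points at something like congestion, buffer credit starvation or a faulty link rather than plain propagation delay.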


3 responses

5 05 2007
Nigel

OK, before I start I know that none of this is rocket science and, more importantly, I also know that this is probably not much help in a large production environment. So I don’t know why I’m bothering actually, but here goes anyway……

I’d be tempted to pull the cable that is the trunkmaster, forcing a new trunkmaster to be chosen, and see if the problem goes away.

If it persists I’d also be tempted to systematically lower the number of ISLs in the trunk group, just in case you have a rouge connection.

I’d also be suspicious of the firmware. Have you got the latest and greatest version of firmware? Have you upgraded it recently?

I’d be interested to find out what the problem is once you’ve got it sorted.

10 05 2007
RacaSAN

“rouge connection”

I thought most of these optical cable thingies were orange….
(gentle joshing 🙂

12 05 2007
nigel

that will be rogue then 😉
