I've had this issue stewing for some time now and am only now posting it.  I'll start by saying I like Sun.  On the whole, I think Solaris is a GREAT OS when coupled with SPARC processors.  Great things happen when you can develop an OS around your own hardware; Sun and Apple are two examples.
A b!tch to set up and equally painful to troubleshoot; however, once you get a system dialed in, you're golden for a long time, or until hardware fails or someone mucks it up.  I also REALLY like ZFS.  If I could butter my bread with it, wear it under my pants, or kiss it good night, I would.  ZFS is a GREAT piece of software that no one knows about.
Now, here is my issue.  A few years ago, I set up 2 T2000s to run as Oracle DB servers.  These servers front-ended a Sun-branded StorEdge 6130.  I chose to go with ZFS rather than Veritas or UFS.  ZFS had only been out for about a year, but I felt the benefits were worth it and it wasn't exactly bleeding edge.  So, for this setup, 3 of the 4 variables are Sun owned.  I turned off caching within the kernel for ZFS and let the RAID handle it, due to known issues.  I followed every best-practice article I could find to make this setup solid.  And it was: it lasted almost 2 years and came through 2 SAN switch failures without issue.
Fast forward to January.  We run an average-to-small Oracle DB as far as enterprise DBs go.  It's about 100 GB.  Not crazy, right?  So, our DBA adds a small 5 GB DB onto the server and BOOM GOES THE DYNAMITE!  We get tons of CPU bottlenecking and throttling.  Long story short, Oracle doesn't see an issue with the setup.  Sun doesn't see any issue with the setup either, yet acknowledges this could be a ZFS kernel issue.  Bugs happen and I'm OK with that.
The real problem is Sun's response to our issue.  I basically pleaded with them to attempt to reproduce our errors.  I questioned what is different about our setup from anyone else's in the world.  Surely we can't be the only ones running an Oracle 10g DB on ZFS over Solaris 10u4.  Little old me?  I'm the only one with the gumption, guile, stupidity, or whatever to run an Oracle DB on ZFS?  C'mon.  Is Sun hurting that bad?  Maybe I'm the only customer left.  Sun tech guy says there is no patch and it won't even be addressed in Solaris 11.  WTF?!?!  So they know of it, but I must be the only one with an issue.  Sun tech guy says, "We'll write you a patch, but you need to sign something stating you'll actually deploy it.  And test it out.  On production machines."  HA!  Good one, Sun tech guy.  After literally days of arguing and pleading to get them to do something, I get a recommendation from him to move to UFS and enable directIO.  Woohoo, beaten by your own OS.  Way to stay with it.  Prior to this incident I had nothing but great experiences with Sun support.  This took the cake.  We are so soured that RAC on Red Hat is the next implementation.
Condemnation for ZFS?  No way.  I still love it.  I think this is an isolated incident, though I won't be deploying it in the same fashion.  I'll be hard pressed to be allowed to purchase any more SPARC systems, though.  I'll have to settle for x86.  Could be worse, I guess.  Could be dealing with HP-UX.
 
I want my Sun server, Swede.