Tuesday, October 3, 2006

CEC: RAS Trends From Andromeda to ZFS

Richard Elling, PAE


Performance, Availability & Architecture Engineering


He remembered to make the session-recording start announcement...only the second session I've attended where the presenter did so.


(We forgot to do it, bad presenters)


Internal changes to the process in the past few years have
made a big difference in both pre-release quality and post-release
support.


MTBF, Heat, Moving Parts


Starting with some definitions:

MTBSI == Mean Time Between System Interruptions


Similar to the previous usage of MTBF, but now a failure does not have to mean
a system/service interruption


More Parts --> lower MTBF (Bad)


[1] Integration == fewer parts == Higher MTBF
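(A rough back-of-the-envelope on the parts-count point: for independent parts with roughly constant failure rates, the rates add, so the system MTBF is the reciprocal of the summed component rates. A quick Python sketch; the part counts and per-part MTBF figures are made up for illustration, not Sun numbers.)

def series_mtbf(component_mtbfs_hours):
    # Independent components with constant (exponential) failure rates:
    # the rates add, so system MTBF = 1 / sum(1 / MTBF_i).
    return 1.0 / sum(1.0 / m for m in component_mtbfs_hours)

# Hypothetical box: 1 power supply, 4 fans, 2 disks, 8 DIMMs.
box = [1_000_000] + [300_000] * 4 + [500_000] * 2 + [4_000_000] * 8
print(round(series_mtbf(box)))          # more parts -> lower MTBF (~49K hours)

# Integrate the two disks away (e.g. on-board flash): fewer parts, higher MTBF.
integrated = [1_000_000] + [300_000] * 4 + [4_000_000] * 8
print(round(series_mtbf(integrated)))   # ~61K hours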


Most common failures: fans, power supplies, disks, memory


(glad he said that, I say it too...woot validation)


Power supplies -- built-in surge suppressor


Designed for 1M hours MTBF (Niagara and Galaxy vs PC at ~83K hours)


Use fewer, better (more expensive) power supplies
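(To put those MTBF figures in perspective, the annualized failure rate is roughly hours-per-year divided by MTBF while the rate stays small. A quick sketch using the numbers from the talk:)

HOURS_PER_YEAR = 8760

def afr_percent(mtbf_hours):
    # Approximate annualized failure rate for a constant-failure-rate part.
    return 100.0 * HOURS_PER_YEAR / mtbf_hours

print(f"1M-hour supply:  {afr_percent(1_000_000):.2f}% per year")   # ~0.88%
print(f"83K-hour supply: {afr_percent(83_000):.2f}% per year")      # ~10.55%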


Fans -- fail too often (moving parts)


Hard to implement fans in small systems


System policies can affect MTBSI...e.g. T1000 firmware starts
a shutdown on fan failure, not on system temperature


This may not be the preferred method but it is in the spec.


Heat kills: run cool, run for a long time


Disk numbers from Seagate, rule of thumb


+15 °C == MTBF / 2
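(That rule of thumb can be written as a simple derating curve: MTBF halves for every 15 °C above the rated temperature. A sketch; the 1M-hour/25 °C baseline is made up for illustration.)

def derated_mtbf(base_mtbf_hours, base_temp_c, actual_temp_c):
    # Rule of thumb from the talk: MTBF halves for every +15 deg C.
    return base_mtbf_hours * 2 ** (-(actual_temp_c - base_temp_c) / 15.0)

print(round(derated_mtbf(1_000_000, 25, 40)))   # 500000 -- one halving
print(round(derated_mtbf(1_000_000, 25, 55)))   # 250000 -- two halvings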


Smaller disks run cooler and have faster seeks (Seagate
website)


Boot Disk Trends


Nearly 100% of Sun's customers mirror boot disks in the data center


Software RAID is popular, but built-in HW RAID is becoming more
popular; ZFS boot disks are coming


Long term: solid state; CF slot in the CP3010 and CP3020; also
Moore's law


SAS vs. parallel SCSI...the system now knows if a disk disappears
before it tries to access it.


In case you weren't sure: more parts == lower MTBF

See [1]


Memory


Memory Page Retirement -- FMA


What about mirrored memory?


Will anyone use it? Hey double memory sales!


(Probably only in the highest RAS demand environments)


See [1]


Processors


More transistors -- more self-test circuits -- more redundancy


See [1]


OPL


Mainframe-class reliability in a Solaris system


Software


Solaris FMA (Fault Management Architecture)


(Really cool: useful messages and diagnosis codes. Get a system
and break things to see it; you could pull a disk to see an example,
although that's not Richard's preferred demonstration.)


SMF


Before: Panic


SMF: Restart


Ability to define relationships

(See my previous posts and/or Liane Praza's weblog.)


ZFS


Data reliability...we know the data is corrupted before
we try to use it.


(This is also cool, I have been running ZFS at home for at
least 9 months, although I don't own any huge storage hardware for the
complex case)
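(The "know it before you use it" property comes from keeping a checksum with each block pointer and verifying it on every read. A toy Python illustration of the idea; this is not ZFS's actual on-disk format or checksum choice, just the concept.)

import hashlib

class Block:
    """Toy block: data plus the checksum recorded when it was written."""
    def __init__(self, data: bytes):
        self.data = data
        self.checksum = hashlib.sha256(data).hexdigest()

    def read(self) -> bytes:
        # Verify before handing data back, so corruption is caught on
        # access instead of after the application has consumed bad data.
        if hashlib.sha256(self.data).hexdigest() != self.checksum:
            raise IOError("checksum mismatch -- go read the mirror/parity copy")
        return self.data

blk = Block(b"important payload")
blk.data = b"important pay1oad"    # simulate silent corruption on disk
try:
    blk.read()
except IOError as err:
    print(err)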




Data recovery time is based on the data space used, not the size of the FS.


Not yet fully integrated with FMA; depends on the driver, disk, and system
implementation.


Virtualization




Good



Really fast boot! == lower MTTR


Guest OS portability


Hide faults from OS


Bad


Hides faults from the OS (if the OS knows there is a fault,
it can respond appropriately)


Scheduling...real-time apps or time-sensitive devices


Fail in Place


Possibly a better term: "Deferred Repair Strategy"


(We use this in Grid...with 1000s of nodes, do you really need
a service call for a single failure? We say no)
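(A quick sanity check on the deferred-repair math: with a roughly constant per-node failure rate, the expected number of failed nodes over a service interval is just nodes x interval / MTBF, which tells you how much spare capacity to hold and how big each repair batch will be. Sketch with made-up numbers; only the 1000-node count comes from the note above.)

def expected_failures(nodes, node_mtbf_hours, interval_hours):
    # Expected failures over the interval, assuming a constant failure
    # rate per node and an interval short relative to the MTBF.
    return nodes * interval_hours / node_mtbf_hours

# Hypothetical: 1000-node grid, 200K-hour node MTBF, quarterly service visit.
quarter_hours = 91 * 24
print(f"{expected_failures(1000, 200_000, quarter_hours):.1f} nodes per quarter")  # ~10.9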



From here we go from MTBF to MTBSI: non-service-interrupting failures


Scrubbing


We scrub just about everything, looking for
faults and corruption before we try to use bad data.


Reduction in Parts



X4100 rack -- 78 power supplies


E25K -- 6 power supplies


Blade 8000 -- 6 power supplies


(Similar numbers for fans)


Why, then, would you deploy a rack of X4100s instead of blades,
unless you have a specific need?


Get More Information


OpenSolaris Forums...Participate


b.s.c/relling (blogs.sun.com/relling)
