Shawn Ferry: November 2007

Tuesday, November 27, 2007

How to get experimental rw ZFS support AFTER upgrading to 10.5.1

The alternate title of this entry is "Force install of ZFS Beta Seed v1.1 on Leopard"

This is performed at your own risk. The steps described remove any logical restriction for installation of the package.

There is more than one way to implement this particular hack. This method uses the package installer (a cleaner, friendlier hack than manually copying files). Other recommendations included installing 10.5 on a different partition, then installing the patch and copying the package files.

A simple alternative, left to the reader, would be to extract the files from the package Payload and manually copy them into place (cat /tmp/ZFSseed1/ZFSBetaSeed1.pkg/Payload | pax -z -r -v).

On to the actual implementation:

Download the dmg from developer.apple.com

Mount the DMG:

open ~/Desktop/Inbox/leopard_9a559_zfsbetaseed1_0613523123.dmg

Expand the package:

pkgutil --expand /Volumes/ZFS\ 1/ZFSBetaSeed1.pkg /tmp/ZFSseed1

Edit the Distribution file and comment out the line that actually checks the requirements (and "causes" the failure):

vi /tmp/ZFSseed1/Distribution
//

Flatten the edited expanded package directory back into package format:

pkgutil --flatten /tmp/ZFSseed1 /tmp/ZFSrw.pkg

Open the package installer:

open /tmp/ZFSrw.pkg

Install the package and reboot.

Again, this is performed at your own risk. The steps described remove any logical restriction for installation of the package and may cause your system to explode or your cat to catch fire.

References:

ZFS Beta Seed v1.1 will not install on Leopard (10.5.1)

Edit: I guess I should mention that it does actually appear to work :) Next I'm going to try and switch access between Leopard and Solaris under Parallels

Edit1: Changed comment marks from // to ; // works if you are commenting in the embedded script part not the XML part. Thanks to Colin Seymor for catching that.

Monday, November 26, 2007

SGE quick and dirty how to find jobs on 'bad' slots

I occasionally need to find queues in Sun Grid Engine that are in one of the possibly problematic states and have an occupied slot. It is just infrequent enough that I don't remember exactly how I did it the last time.

qstat -f | awk '$6~/[cdsuE]/ && $3!~/^[0]/'
queuename                      qtype used/tot. load_avg arch          states
zone.q@r130c24z0.network.com   BIP   1/1       -NA-     sol-amd64     adu
zone.q@r130c24z1.network.com   BIP   1/1       -NA-     sol-amd64     adu

An alternative is "qstat -f | awk '$6~/[cdsuE]/ && $3~/^[1-9]/'", which also avoids printing the header line. In the example above the header line is printed because the word 'states' in $6 matches 's' and 'used/tot.' in $3 does not begin with '0'.

The possibly more elegant 'qstat -f -qs cdsuE' still requires a second comparison in awk of '$0!~/--/' to filter out the queue separator lines. (qstat -f -qs cdsuE | awk '$0!~/--/ && $3!~/^[0]/')
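
The filter can be tried without a live grid by feeding it captured output. This is a minimal sketch: the first two queue lines are copied from the example above, while the separator lines and the two `example-*` hosts are invented here to exercise the filter. `sample_qstat` and `bad_slots` are illustrative names, not SGE commands.

```shell
#!/bin/sh
# Canned `qstat -f` style output. The r130c24z* lines come from the post;
# the separator lines and the example-* hosts are made up for this sketch.
sample_qstat() {
cat <<'EOF'
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
zone.q@r130c24z0.network.com   BIP   1/1       -NA-     sol-amd64     adu
----------------------------------------------------------------------------
zone.q@r130c24z1.network.com   BIP   1/1       -NA-     sol-amd64     adu
----------------------------------------------------------------------------
zone.q@example-idle.network.com BIP  0/1       -NA-     sol-amd64     au
----------------------------------------------------------------------------
zone.q@example-ok.network.com  BIP   1/1       0.02     sol-amd64
EOF
}

# Keep lines whose sixth field contains c, d, s, u or E and whose
# used/tot field starts with a non-zero digit (at least one slot in use).
bad_slots() {
    awk '$6 ~ /[cdsuE]/ && $3 ~ /^[1-9]/'
}

sample_qstat | bad_slots
```

Against a real cluster the same function would be used as `qstat -f | bad_slots`.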

Finally because I can never remember what exactly all the queue states are and the qstat man page doesn't have the nice table:

aoACD - Number of queue instances that are in at least one of the following states:
a - Load threshold alarm
o - Orphaned
A - Suspend threshold alarm
C - Suspended by calendar
D - Disabled by calendar

cdsuE - Number of queue instances that are in at least one of the following states:
c - Configuration ambiguous
d - Disabled
s - Suspended
u - Unknown
E - Error
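
Since the table is easy to forget, it can also be kept as a small shell function. This is only a sketch: `decode_queue_state` is a made-up helper name, and it knows only the ten letters listed in the table above.

```shell
#!/bin/sh
# Expand each letter of a queue state string (for example "adu")
# into the description from the table above.
decode_queue_state() {
    state=$1
    while [ -n "$state" ]; do
        rest=${state#?}          # everything after the first character
        letter=${state%"$rest"}  # the first character on its own
        case $letter in
            a) echo "a: Load threshold alarm" ;;
            o) echo "o: Orphaned" ;;
            A) echo "A: Suspend threshold alarm" ;;
            C) echo "C: Suspended by calendar" ;;
            D) echo "D: Disabled by calendar" ;;
            c) echo "c: Configuration ambiguous" ;;
            d) echo "d: Disabled" ;;
            s) echo "s: Suspended" ;;
            u) echo "u: Unknown" ;;
            E) echo "E: Error" ;;
            *) echo "$letter: not in the table above" ;;
        esac
        state=$rest
    done
}

decode_queue_state adu
```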

Job State/Status:
d(eletion), E(rror), h(old), r(unning), R(estarted), s(uspended), S(uspended), t(ransferring), T(hreshold) or w(aiting).

References: SGE (N1GE 6.0) -- Monitoring and Controlling Queues

Edit: Added Job Status. I literally couldn't find that in any of the online docs (it is about 40% of the way through the qstat(1) man page, but targeted Google searches do a poor job of finding the link).

Wednesday, November 21, 2007

OS X Leopard, Tiger X11 and SGD (How I downgraded to Tiger's X11 and got SGD working again)

We use SGD to provide access to a few applications. After installing Leopard I could focus and click with a mouse, but all keyboard input was ignored for SGD applications.

Searching online I found some indications that X11 in Leopard has some application interaction issues. The solution presented in a number of different forums for various applications was to downgrade to Tiger's X11.app. I tried methods from a couple of posts without success. Instead I mixed and matched the steps from a couple of suggestions and found a solution that worked for me.

I have not tried to recover from this change. You can PROBABLY re-install X from the Leopard DVD.

When this is complete you will probably have two X icons in the Dock when X11.app is running.

The steps can be summarized as:

Download X11 Update 2006 1.1.3: "http://www.apple.com/support/downloads/x11update2006113.html"
Destroy your current X11 installation
Install X11 update 2006
Change the path to your window manager in xinitrc
reboot

wget 'http://wsidecar.apple.com/cgi-bin/nph-reg3rdpty2.pl/product=12045&cat=60&platform=osx&method=sa/X11Update2006.dmg'
open X11Update2006.dmg
sudo launchctl unload -w /System/Library/LaunchAgents/org.x.X11.plist
sudo rm -R /usr/X11R6
sudo ditto -Vx --noqtn /Volumes/X11\ Update\ 2006/X11Update2006.pkg/Contents/Archive.pax.gz /
sudo perl -i -p -e 's:exec quartz-wm:exec /usr/X11R6/bin/quartz-wm:g'

The instructions I found online indicate that a log out/log in should do it. I found that it didn't seem to start working until after I rebooted.

The instructions I based the above steps on:
Bring Back Tiger's X11 to Leopard in 3 Steps
easier instructions to install Tiger's X11.app

Friday, November 16, 2007

Two Storage Commands I Don't Know How I Lived Without

I am working on an issue which involves a 3510 and two dual-connected hosts (home-grown Active/Active configuration). The customer's equipment has just been moved within the cage. When the systems were rebooted, one of them reported multipath failures and both reported SCSI errors.

While investigating the problems I used cfgadm, luxadm and sccli. The first two are common to Solaris; the last is an additional package for managing arrays, including the 3510. These commands are not new to me, but while searching for alternative solutions I found fcinfo and mpathadm, two fairly new (and definitely new to me) commands.

Using luxadm to display the state of the ports:
(One of the ports was not connected but I wasn't thinking about blogging it so I missed my chance to capture the output.)

luxadm -e port
/devices/pci@1d,0/pci1022,7450@1/pci1077,100@1/fp@0,0:devctl       CONNECTED
/devices/pci@1d,0/pci1022,7450@2/pci1077,100@1/fp@0,0:devctl       CONNECTED

Using luxadm to show link errors:

luxadm -e rdls /dev/es/ses0 

Link Error Status information for loop:/dev/es/ses0
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
9e      0           1           1             0              2794           0           
9f      0           0           0             0              243            0           
1       0           0           0             0              0              0           

Link Error Status information for loop:/dev/es/ses0
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
a3      0           2           2             0              65535          0           
a5      0           0           0             0              28481          0           
1       0           0           0             0              0              0

Using cfgadm to see the configuration of the devices:
(In the original investigation c2 was displaying type fc and Occupant unconfigured)

cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
c0                             scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0                 disk         connected    configured   unknown
c0::dsk/c0t2d0                 disk         connected    configured   unknown
c0::dsk/c0t3d0                 disk         connected    configured   unknown
c0::es/ses1                    processor    connected    configured   unknown
c1                             fc-private   connected    configured   unknown
c1::256000c0ffc86cfb           disk         connected    configured   unknown
c1::256000c0ffd86cfb           ESI          connected    configured   unknown
c2                             fc-private   connected    configured   unknown
c2::226000c0ffa86cfb           ESI          connected    configured   unknown
c2::226000c0ffb86cfb           ESI          connected    configured   unknown

I tried 'cfgadm -c configure c2', then 'cfgadm -f -c configure c2' and finally 'cfgadm -o force_update -c configure c2', none of which let me recover the path. I have only just now found a bug report for a path that shows NOT CONNECTED. It appears that I might have been able to recover using 'luxadm -e forcelip'. Since I needed to clear the 3510 error counters, it was decided to take the systems down and power cycle the 3510.

Now on to the hook for this post (fcinfo and mpathadm). While looking at various documentation I found fcinfo and mpathadm!
fcinfo was added in S10u1. Using fcinfo I saw some of the same information that I got from 'luxadm -e rdls', and more.
Using fcinfo to see local hba-port info:

fcinfo hba-port -l
HBA Port WWN: 210000e08b1cdb34
        OS Device Name: /dev/cfg/c1
        Manufacturer: QLogic Corp.
        Model: QLA2340
        Firmware Version: 3.3.117
        FCode/BIOS Version: N/A
        Type: L-port
        State: online
        Supported Speeds: 1Gb 2Gb 
        Current Speed: 2Gb 
        Node WWN: 200000e08b1cdb34
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
HBA Port WWN: 210000e08b1124bf
        OS Device Name: /dev/cfg/c2
        Manufacturer: QLogic Corp.
        Model: QLA2340
        Firmware Version: 3.3.117
        FCode/BIOS Version: N/A
        Type: L-port
        State: online
        Supported Speeds: 1Gb 2Gb 
        Current Speed: 2Gb 
        Node WWN: 200000e08b1124bf
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0

Nothing extremely interesting on the hba-port side, however fcinfo also shows remote-port information.
Using fcinfo to see remote-port info: (the -p option shows information visible from the hba-port WWNs seen above)

fcinfo remote-port -l -p 210000e08b1124bf
Remote Port WWN: 226000c0ffb86cfb
        Active FC4 Types: 
        SCSI Target: yes
        Node WWN: 206000c0ff086cfb
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 2
                Loss of Signal Count: 2
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 65535
                Invalid CRC Count: 0
Remote Port WWN: 226000c0ffa86cfb
        Active FC4 Types: 
        SCSI Target: yes
        Node WWN: 206000c0ff086cfb
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 28481
                Invalid CRC Count: 0

fcinfo remote-port -l -p 210000e08b1cdb34
Remote Port WWN: 256000c0ffd86cfb
        Active FC4 Types: 
        SCSI Target: yes
        Node WWN: 206000c0ff086cfb
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 1
                Loss of Signal Count: 1
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 2794
                Invalid CRC Count: 0
Remote Port WWN: 256000c0ffc86cfb
        Active FC4 Types: 
        SCSI Target: yes
        Node WWN: 206000c0ff086cfb
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 243
                Invalid CRC Count: 0

fcinfo and luxadm are clearly showing me that problems are reported for the remote ports in the 'Invalid Tx Word Count'.
The primary recommendation is to reseat the cables and SFPs and blow out the ports. We are moving along with that process now, having replaced one of the cables, reseated everything and blown out the ports.
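
With more than a handful of remote ports, picking the non-zero counters out by eye gets tedious. The following is a minimal sketch that reduces fcinfo output to one line per affected port. The sample is a trimmed copy of the output above, and `sample_fcinfo` and `tx_word_errors` are illustrative names, not part of fcinfo.

```shell
#!/bin/sh
# Trimmed `fcinfo remote-port -l` output; WWNs and counts are copied from the post.
sample_fcinfo() {
cat <<'EOF'
Remote Port WWN: 226000c0ffb86cfb
        Link Error Statistics:
                Link Failure Count: 0
                Invalid Tx Word Count: 65535
                Invalid CRC Count: 0
Remote Port WWN: 226000c0ffa86cfb
        Link Error Statistics:
                Link Failure Count: 0
                Invalid Tx Word Count: 28481
                Invalid CRC Count: 0
EOF
}

# Print "WWN count" for every remote port with a non-zero Invalid Tx Word Count.
tx_word_errors() {
    awk '
        /^Remote Port WWN:/      { wwn = $NF }
        /Invalid Tx Word Count:/ { if ($NF + 0 > 0) print wwn, $NF }
    '
}

sample_fcinfo | tx_word_errors
```

On a live host the equivalent would be `fcinfo remote-port -l -p 210000e08b1124bf | tx_word_errors`.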

On to mpathadm: it was added in S10u3 and lets you discover and manage multipathing (shocking given its name).
I am using mpathadm to display information about the current configuration. Prior to the reboot the system with only one link in CONNECTED state showed only one path to all devices.
Output from 'mpathadm list lu':

mpathadm list lu
        /scsi_vhci/enclosure@g600c0ff000000000086cfb0000000000
                Total Path Count: 3
                Operational Path Count: 3
        /dev/rdsk/c3t600C0FF000000000086CFB359771241Bd0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c3t600C0FF000000000086CFB359771241Ad0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c3t600C0FF000000000086CFB3597712419d0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c3t600C0FF000000000086CFB3597712418d0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c3t600C0FF000000000086CFB3597712417d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c3t600C0FF000000000086CFB3597712416d0s2
                Total Path Count: 2
                Operational Path Count: 2
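
A short filter makes it easier to spot logical units that are down to a single operational path. This is a sketch only: the sample is a trimmed copy of the listing above, `single_path_lus` is a made-up name, and the default threshold of 2 is an assumption. Whether one path is actually a fault depends on how the array is presented to the host.

```shell
#!/bin/sh
# Two entries copied from the `mpathadm list lu` output above.
sample_mpath() {
cat <<'EOF'
        /dev/rdsk/c3t600C0FF000000000086CFB359771241Bd0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c3t600C0FF000000000086CFB3597712416d0s2
                Total Path Count: 2
                Operational Path Count: 2
EOF
}

# Print logical units whose operational path count is below a minimum (default 2).
single_path_lus() {
    awk -v min="${1:-2}" '
        $1 ~ /^\//                { lu = $1 }
        /Operational Path Count:/ { if ($NF + 0 < min) print lu, $NF }
    '
}

sample_mpath | single_path_lus 2
```

On a live host this would be run as `mpathadm list lu | single_path_lus`.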

Specific detail from 'mpathadm show lu' on path:

mpathadm show lu /dev/rdsk/c3t600C0FF000000000086CFB3597712416d0s2
Logical Unit:  /dev/rdsk/c3t600C0FF000000000086CFB3597712416d0s2
        mpath-support:  libmpscsi_vhci.so
        Vendor:  SUN     
        Product:  StorEdge 3510   
        Revision:  415G
        Name Type:  unknown type
        Name:  600c0ff000000000086cfb3597712416
        Asymmetric:  no
        Current Load Balance:  round-robin
        Logical Unit Group ID:  NA
        Auto Failback:  on
        Auto Probing:  NA

        Paths:  
                Initiator Port Name:  210000e08b1cdb34
                Target Port Name:  256000c0ffc86cfb
                Override Path:  NA
                Path State:  OK
                Disabled:  no

                Initiator Port Name:  210000e08b1124bf
                Target Port Name:  226000c0ffb86cfb
                Override Path:  NA
                Path State:  OK
                Disabled:  no

        Target Ports:
                Name:  256000c0ffc86cfb
                Relative ID:  0

                Name:  226000c0ffb86cfb
                Relative ID:  0

This is in no way a full exploration of the capabilities of fcinfo and mpathadm.
I hope that the next time you are (or I am) looking at FC or multipath issues these commands will be helpful.
Please see the links to the manual pages below for more specific information and examples from the fcinfo and mpathadm commands.

References:
fcinfo - Fibre Channel HBA Port Command Line Interface
mpathadm - multipath discovery and administration

EDIT: Fixed some strange formatting issues
EDIT1: A bit more touch up

Monday, November 12, 2007

Hockey: We won

We are now officially the "Rink Rats"; we couldn't get the "Farging Ice Holes". I was also partial to "Pylons".

We won 2-1, nearly shocking. Next game is a nice early 11:00pm.

My foot didn't bother me, and thankfully the refs were a little late so I had time to stretch out on the ice. I was a little short on stamina, but I haven't been on the ice for at least a month and I really slacked off on the running and other generally healthy things in Vegas and China.

One Laptop Per Child: G1G1 (Give One Get One) Ordered

Donated to One Laptop Per Child at Laptop Giving so I will be getting an XO Laptop which I think should stand up nicely to having friends with kids over. Although I will surely have to take it to the office or out on the town. I wonder how functional I could be using it for work.

The mission of the One Laptop per Child (OLPC) movement is to ensure that all school-aged children in the developing world are able to engage effectively with their own personal laptop, networked to the world, so that they, their families and their communities can openly learn and learn about learning.

The OLPC Association focuses on designing, manufacturing, and distributing laptops to children in lesser developed countries, initially concentrating on those governments that have made commitments for the funding and program support required to ensure that all of their children own and can effectively use a laptop.

Initial focus is on the launch of the One Laptop per Child program. In the future, the OLPC Foundation will focus on the grassroots, "bottom-up" aspects of the OLPC mission.

Give One Get One: 15 days left to Order/Donate one yourself.

If you live in the USA or Canada, and during a brief period of time, you will be able to pay USD 399 for two XO laptops. The first laptop is yours to keep (the get 1 part) and the second one is donated to the program to be distributed in one of OLPC's partner countries (the give 1 part).

The G1G1 program also includes 1 year of T-Mobile Hot Spot access. If you don't give away the 1 you get you can set up shop at any T-Mobile HotSpot with your distinctive laptop and be a living advertisement for the OLPC project.

More information at the OLPC Wiki

Back on the health wagon

Went for a run Saturday morning in part because I need to start getting more exercise again and in part as a dry run for the Leesburg, VA 5th annual Freeze Your Gizzard race.

6mi in 1:09:00. I overdid it a little, but I was running to the office and back to pick up my laptop power supply that I had left on Friday. I should have slowed down on the return leg instead of speeding up. I am a little sore, and although I stretched the crap out of my soleus my right 5th metatarsal is aching.

Some heat to keep it loose and some ibuprofen and I am good for my nice early 10PM hockey game tonight. First game of the Winter season; glad I had that early meeting this morning.

Monday, November 5, 2007

First try with Sun Service Tags and SXDE 09/07

I just installed SXDE 09/07 and decided to give Service Tags another shot. The installation of the Service Tags packages doesn't take a special effort.

Unfortunately I can't get the product registration agent to find anything. I checked with one of my co-workers to make sure that he had success before spending any time on the issue.

He confirmed that he was getting discovery for OS installs on systems without servicetag supported products.

Service Tag Discovery: No Products Found

So clearly I have just installed the packages. This is obnoxious.

So under preferences I enabled FINEST logging and tried again.

FINE: Checking ip addresses: 10.211.55.10
Nov 5, 2007 11:57:17 AM com.sun.scn.client.ui.RegClient getSystems
FINE: Getting ip addresses: 10.211.55.10
Nov 5, 2007 11:57:17 AM com.sun.scn.client.ui.RegClient getSystems
FINE: Checking: 10.211.55.10
Nov 5, 2007 11:57:17 AM com.sun.scn.client.ui.RegClient checkIPAddress
FINE: Checking if valid ip address: 10.211.55.10
Nov 5, 2007 11:57:17 AM com.sun.scn.client.comm.TCPProbe run
FINER: sending message to: 10.211.55.10
Nov 5, 2007 11:57:17 AM com.sun.scn.client.comm.Communicator$1 run
FINER: communicating with: /10.211.55.10:6481
Nov 5, 2007 11:57:17 AM com.sun.scn.client.comm.Communicator getFromAgent
FINE: Getting agent: http://10.211.55.10:6481/stv1/agent/

That looks like communication to me, but wait, a URI... trying it in a browser returns:

ld.so.1: in.stlisten: fatal: libcrypto.so.0.9.7: open failed: No such file or directory

When I look on my system I see libcrypto.so.0.9.8. I know an easy first try to "fix" that.

ln -s /usr/sfw/lib/libcrypto.so.0.9.8 /usr/sfw/lib/libcrypto.so.0.9.7

It seems to me that we could do with some sort of error detection or a host level smf service failure.
e.g. We got a response from the polled host but it wasn't anything that we were expecting.

Trying again in a browser returns a result that looks a lot like I would expect given my understanding of Service Tags.




  urn:st:28e87b4a-a625-c5ec-b64e-a5c64a1a9f65
  1.1
  1.0
  
    SunOS
    splat
    5.11
    i386
    i86pc::snv_70b
    Parallels Software International Inc.
    GenuineIntel
    Parallels-18 F2 11 FF 3E 85 43 C5 B4 CC D7 85 04 84 A9 AD
    39721385

Now everything is working as expected. That is what I am expecting to see!
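
The kind of error detection wished for above could be as simple as checking that the agent's reply looks like service tag data at all. The sketch below is hypothetical: `classify_agent_response` is not part of Service Tags, and it assumes that a healthy reply contains a `urn:st:` identifier, as the working response above does.

```shell
#!/bin/sh
# classify_agent_response is a hypothetical helper, not part of Service Tags.
# It reads a response body on stdin and reports whether it looks like
# service tag data (assumed here to mean it contains a "urn:st:" identifier).
classify_agent_response() {
    body=$(cat)
    case $body in
        *urn:st:*) echo "ok" ;;
        "")        echo "empty response" ;;
        *)         echo "unexpected response: $(printf '%s\n' "$body" | head -n 1)" ;;
    esac
}

# The broken reply seen earlier, then a fragment of the working one.
printf '%s\n' 'ld.so.1: in.stlisten: fatal: libcrypto.so.0.9.7: open failed: No such file or directory' \
    | classify_agent_response
printf '%s\n' 'urn:st:28e87b4a-a625-c5ec-b64e-a5c64a1a9f65' | classify_agent_response
```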

Taking a look at the Sun Connection Inventory Channel I can see the host I just registered!
Sun Connection: Viewing Registered Hosts

Now everyone should go and install the Service Tags agent and register their devices (It's tied to an extra bonus for us next year :) ).

References: Sun Connection on BigAdmin

Thursday, November 1, 2007

Indiana IPS (Image Packaging System)

Boldly forward with minimal reading of the docs.

I can't find any obvious place to download additional packages from. I feel like I am running in circles. It would seem that nearly the first thing I should be able to find would be "download the rest from HERE". The single CD installer rocks, but we are up to 1 DVD for the normal full install.

I want the Firefox default home to prominently show me:
"Now that you have completed the Slim install get the rest of the packages. Use pkg list/status/something someargs"

The Preview includes the Image Packaging System. With IPS, you can select versioned builds of components to manage or create your own custom OpenSolaris distribution.

IPS packages that are not included in the Slim Install installation image, such as developer tools, can be downloaded after the installation. This prototype uses new IPS commands to access packages from the network repositories. Both IPS packages and SVR4 packages are supported.

The OpenSolaris Project: Image Packaging System project page contains man pages for the new IPS commands and a link to the IPS download site.

OK, pkg(5) has some indications of a repository (or authority) pkg://opensolaris.org. No examples, is pkg://opensolaris.org actually a valid and running authority? OK http://opensolaris.org/os/project/pkg/documents/ says pkg.opensolaris.org. Looking at the list of available packages I now see that they are basically ALL installed. (maybe pkg(5) should reference pkg.opensolaris.org)

I can also see that when I pkg uninstall SUNWbind and pkg install SUNWbind the counters on the site increment. When I drop ni0 the install fails and with it enabled and snoop running there is http traffic downloading the package. Clearly I am hitting the remote authority. Somewhere I think I should be able to see what the default authority is or where to find it (without digging in /var/pkg and guessing that cfg_cache should be the source, or looking at snoop to see where my traffic is going)

pkg install/uninstall is fast and also easy (somewhat dependent on network bandwidth, I would guess). It wouldn't suffer from some sort of optional feedback that work is progressing.

Errors from pkg are quite ugly and straight out of python.

At this point I call the install a success: simple, easy, fast. The post-install experience is still missing something. How did the packages that are not included on the live CD get installed on my instance when I had no network during the install?

So we are down to:

Is pkg.opensolaris.org the only authority?
Are all the packages that are available installed by default?
If pkgadd can't add packages (or is this a bug?), how do I add standard SYSV packages now?

In the end this was fun and interesting. I have a much better understanding of where we are going, but it doesn't look like Indiana is going to be my everyday system for a while yet.

OpenSolaris/Indiana first thoughts

(See Update 1 for how to get functional networking in Indiana under Parallels)

As ThinGuy just mentioned on twitter, no registration or login to download the image. Woot!

I am running the installer ISO on top of Parallels on Mac OS X 10.5 (Leopard).

Boot speed is fabulous
Installer is straightforward
- so few steps it seems like I must be forgetting something
Base install is FAST

On the down side:

I can't seem to get an external network interface to plumb, but Parallels seems a bit flaky on Leopard
Single-disk contention is the long pole; the install is not as fast as it could be. I should have burned a CD.

So now as long as I can get a network interface working one way or another I am set. Fingers crossed. I have already nuked my other local install so I had space to play. I would rather not have to recover it.

Update 1:

Networking: It helps if you remember to install the Parallels provided network interface driver.

See the comment about so few steps I must be forgetting something (like installing the driver)

Parallels: Installing the Beta got rid of some VM related network error messages and appears to have fixed shared networking (my default for my simple test instance).

I guess I could have provided a link: http://opensolaris.org/os/project/indiana/resources/getit/

Issues with the base slim install: no /usr/ccs/bin/make (or any make, as far as I can tell, in the default install). To install the required driver the following manual steps are required.

untar the ni*.tgz into /tmp
in the /tmp/ni*/i386 directory
- cp ni dp8390 to /kernel/drv
- cd ..
- addni.sh
  (or you can change /etc/path_to_inst yourself)
- modload /kernel/drv/ni
Assuming you are using dhcp wait a moment and get the popup telling you that you have an address.

Appendix C: is way out of date although it would have reminded me to install the network driver.
