Monday, November 26, 2007

SGE quick and dirty: how to find jobs on 'bad' slots

I occasionally need to find queue instances in Sun Grid Engine that are in one of the possibly problematic states and still have an occupied slot. It happens just infrequently enough that I never remember exactly how I did it the last time.

qstat -f | awk '$6~/[cdsuE]/ && $3!~/^[0]/'
queuename qtype used/tot. load_avg arch states
BIP 1/1 -NA- sol-amd64 adu
BIP 1/1 -NA- sol-amd64 adu

An alternative is "qstat -f | awk '$6~/[cdsuE]/ && $3~/^[1-9]/'", which also avoids printing the header line. In the example above the header is printed because 'states' in $6 matches 's' and 'used/tot.' in $3 does not begin with '0'.
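To compare the two filters without a live cluster, here is a minimal sketch that runs both against a canned sample. The queue instance names (all.q@node01 and so on) and the sample rows are invented for illustration; only the column layout follows the qstat -f header shown above.

```shell
#!/bin/sh
# Hypothetical sample of `qstat -f` output; queue names and rows are invented.
sample_qstat() {
cat <<'EOF'
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node01                   BIP   1/1       -NA-     sol-amd64     adu
----------------------------------------------------------------------------
all.q@node02                   BIP   0/1       -NA-     sol-amd64     au
----------------------------------------------------------------------------
all.q@node03                   BIP   1/1       0.52     sol-amd64
EOF
}

# First form: the header survives, because "states" contains an 's'
# and "used/tot." does not start with '0'.
with_header=$(sample_qstat | awk '$6~/[cdsuE]/ && $3!~/^[0]/')

# Alternate form: used/tot. must start with a digit 1-9, so the header drops out.
without_header=$(sample_qstat | awk '$6~/[cdsuE]/ && $3~/^[1-9]/')

printf '%s\n' "$with_header"
printf '%s\n' "$without_header"
```

In this sample only all.q@node01 is both in a flagged state and occupied; node02 is flagged but empty, and node03 is occupied but has no state letters.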

The possibly more elegant 'qstat -f -qs cdsuE' still requires a second comparison in awk, '$0!~/--/', to filter out the queue separator lines. (qstat -f -qs cdsuE | awk '$0!~/--/ && $3!~/^[0]/')
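As a rough sketch of what that pipeline leaves behind, the snippet below applies the same awk filter to a made-up sample of 'qstat -f -qs cdsuE' output. The second awk program is a variation that is not part of the original one-liner: it pairs each job line with the queue instance printed above it. The sample rows, the job, and the user name are all invented, and the job line layout (job id, priority, name, owner, ...) is an assumption about the usual qstat -f format.

```shell
#!/bin/sh
# Hypothetical sample of `qstat -f -qs cdsuE` output; all names are invented.
sample_qstat_qs() {
cat <<'EOF'
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node01                   BIP   1/1       -NA-     sol-amd64     adu
   4711 0.55500 render     alice        r     11/26/2007 09:15:02     1
----------------------------------------------------------------------------
all.q@node04                   BIP   0/1       -NA-     sol-amd64     d
EOF
}

# Same filter as in the text: drop separator lines and empty queue instances.
filtered=$(sample_qstat_qs | awk '$0!~/--/ && $3!~/^[0]/')

# Variation (an assumption, not from the post): remember the last queue
# instance line and print it next to every job line that follows.
pairs=$(sample_qstat_qs | awk '
    $0 ~ /--/               { next }                 # separator line
    $3 ~ /^[0-9]+\/[0-9]+$/ { queue = $1; next }     # queue line: used/tot. in column 3
    $1 ~ /^[0-9]+$/         { print queue, $1, $4 }  # job line: queue, job id, owner
')

printf '%s\n' "$filtered"
printf '%s\n' "$pairs"
```

The plain filter keeps the header, the occupied queue instance, and its job line; the variation reduces that to one line per job, which is easier to feed into another command.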

Finally, because I can never remember exactly what all the queue states are, and the qstat man page doesn't have this nice table:

aoACD - Number of queue instances that are in at least one of the following states:
a - Load threshold alarm
o - Orphaned
A - Suspend threshold alarm
C - Suspended by calendar
D - Disabled by calendar


cdsuE - Number of queue instances that are in at least one of the following states:
c - Configuration ambiguous
d - Disabled
s - Suspended
u - Unknown
E - Error
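Since the whole point is not having to remember the letters, the two tables can be wrapped in a small lookup. This is only a sketch: decode_queue_states is a hypothetical helper name, and it knows only the ten letters listed above.

```shell
#!/bin/sh
# Expand a queue state string such as "adu" into the names from the tables above.
# decode_queue_states is a hypothetical helper, not part of SGE.
decode_queue_states() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        # awk array keys are case-sensitive, so "a" and "A" stay distinct.
        name["a"] = "Load threshold alarm"
        name["o"] = "Orphaned"
        name["A"] = "Suspend threshold alarm"
        name["C"] = "Suspended by calendar"
        name["D"] = "Disabled by calendar"
        name["c"] = "Configuration ambiguous"
        name["d"] = "Disabled"
        name["s"] = "Suspended"
        name["u"] = "Unknown"
        name["E"] = "Error"
    }
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            printf "%s - %s\n", c, ((c in name) ? name[c] : "not in the table above")
        }
    }'
}

decoded=$(decode_queue_states adu)
printf '%s\n' "$decoded"
```

Calling it with the states column from the earlier example (adu) prints one line per letter, so a queue instance in that state reads as load threshold alarm, disabled, and unknown.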


Job State/Status:

d(eletion), E(rror), h(old), r(unning), R(estarted), s(uspended), S(uspended), t(ransferring), T(hreshold) or w(aiting).

References: SGE (N1GE 6.0) -- Monitoring and Controlling Queues

Edit: Added the job status codes. I couldn't find them in any of the online docs; they are listed about 40% of the way through the qstat(1) man page, but targeted Google searches do a poor job of finding it.
