Rocks Grid Engine error states
Posted on June 27, 2017 • 3 minutes • 579 words
I have not found an easy explanation of the error states in Rocks / Grid Engine, so I am collecting them here as I find them.
Show states
First, let’s get an overview of the nodes; this can be done using qstat -f:
qstat -f
The result should look something like this:
# qstat -f
queuename            qtype resv/used/tot. load_avg arch       states
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.02     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
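On a larger cluster it can help to show only the queue instances that carry a state flag. The following is a minimal sketch, not part of Grid Engine: flagged_queues is a hypothetical helper name, and it assumes the column layout shown above, where queue-instance lines have resv/used/tot. in the third column and the states in an optional sixth column.

```shell
# Hypothetical helper: reads `qstat -f` output on stdin and prints
# "queue_instance states" for every queue instance with a state flag.
flagged_queues() {
    awk '
        # Queue-instance lines are recognised by the resv/used/tot. column
        # (third field, e.g. 0/0/24). A sixth field means a state is set.
        $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && NF >= 6 { print $1, $6 }
    '
}
```

It would be used as `qstat -f | flagged_queues`; healthy queue instances produce no output.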
Error state : E
E stands for (hard) error, which means something bad is going on. The headnode decides not to use this node anymore until someone intervenes manually. This prevents a job sinkhole: a broken node that fails every job immediately would otherwise keep pulling new jobs from the queue and failing them too.
Example :
qstat -f
queuename            qtype resv/used/tot. load_avg arch       states
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.01     linux-x64  E
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.02     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64  E
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64  E
--------------------------------------------------------------------
The error states here were probably due to a full root disk on these nodes. To find out what was happening at the time, qstat can list the jobs whose failure put each queue into the error state (-explain E):
qstat -f -explain E
queuename                qtype resv/used/tot. load_avg arch       states
------------------------------------------------------------------------
all.q@compute-0-0.local  BIP   0/0/24         0.00     linux-x64  E
    queue all.q marked QERROR as result of job 542032's failure at host compute-0-0.local
    queue all.q marked QERROR as result of job 542033's failure at host compute-0-0.local
    queue all.q marked QERROR as result of job 542036's failure at host compute-0-0.local
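To follow up on the failed jobs, the job IDs and hosts can be pulled out of the explain lines. This is a rough sketch that assumes the exact message wording shown above; failed_jobs is a hypothetical helper name.

```shell
# Hypothetical helper: reads `qstat -f -explain E` output on stdin and
# prints "job_id host" for every "marked QERROR" explanation line.
failed_jobs() {
    sed -n "s/.*marked QERROR as result of job \([0-9][0-9]*\)'s failure at host \([^ ]*\).*/\1 \2/p"
}
```

Used as `qstat -f -explain E | failed_jobs`, this prints lines such as `542032 compute-0-0.local`. The job IDs could then be looked up in the accounting data, for example with `qacct -j 542032`, assuming accounting is enabled on the cluster.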
More importantly, the error state survives a reboot, so once the underlying issue has been resolved we have to clear it ourselves. The following clears the error state on all queues:
qmod -c '*'
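Clearing everything with '*' also resets nodes whose problem may not be fixed yet. As a more cautious variant, the sketch below only prints one qmod -c command per queue instance that is currently in state E, so the list can be reviewed first. clear_error_commands is a hypothetical helper name, and the parsing assumes the qstat -f column layout shown earlier.

```shell
# Hypothetical helper: reads `qstat -f` output on stdin and prints (but does
# not run) a `qmod -c` command for every queue instance in error state.
clear_error_commands() {
    awk '
        # Third field is resv/used/tot.; sixth field holds the state flags.
        $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && NF >= 6 && $6 ~ /E/ {
            print "qmod -c " $1
        }
    '
}
```

Run `qstat -f | clear_error_commands` to see the commands, and pipe the output into `sh` once the list looks right.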
Disabled state : d
d means the node has been disabled; this should normally not happen automatically. A disabled node does not get any new jobs, but jobs that are already running on it continue to run.
You can disable a node from further jobs using the qmod command.
qmod -d [email protected]
You can re-enable the node using
qmod -e [email protected]
Example :
[[email protected] ~]# qmod -d [email protected]
[email protected] changed state of "[email protected]" (disabled)
[[email protected] ~]# qmod -e [email protected]
[email protected] changed state of "[email protected]" (enabled)
au : Alarm, Unreachable
The state au combines two flags. u means unreachable: sge_execd on the node did not respond to sge_qmaster on the headnode within the configured timeout window. a means alarm: this happens when the node does not report its load, in which case a load of 99.99 is assumed. As a result, the scheduler does not assign more work to the node. The au state can occur when an NFS server is being hammered and the whole node is waiting on the “slow” network disk (with hard-mounted NFS). This state can clear by itself once the underlying problem is resolved.
[[email protected] ~]# qstat -f
queuename            qtype resv/used/tot. load_avg arch       states
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64  d
--------------------------------------------------------------------
[email protected]    BIP   0/3/24         3.25     linux-x64  d
--------------------------------------------------------------------
[email protected]    BIP   0/6/24         6.32     linux-x64  d
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64  dE
--------------------------------------------------------------------
[email protected]    BIP   0/5/24         5.28     linux-x64  d
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         0.00     linux-x64  dE
--------------------------------------------------------------------
[email protected]    BIP   0/0/24         -NA-     linux-x64  adu
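As the output shows, the states column can hold several flags at once (dE, adu). A small sketch for spelling them out is given below; describe_states is a hypothetical helper name, and it only knows the four flags covered in this post, so any other letter is reported as not covered.

```shell
# Hypothetical helper: prints one line per flag in a states string
# such as "dE" or "adu". Only the flags from this post are decoded.
describe_states() {
    states=$1
    while [ -n "$states" ]; do
        rest=${states#?}            # everything after the first character
        flag=${states%"$rest"}      # the first character itself
        case $flag in
            E) echo "E: error" ;;
            d) echo "d: disabled" ;;
            a) echo "a: alarm" ;;
            u) echo "u: unreachable" ;;
            *) echo "$flag: not covered here" ;;
        esac
        states=$rest
    done
}
```

For example, `describe_states adu` prints the alarm, disabled and unreachable lines in that order. The function uses plain POSIX parameter expansion, so its variables are global to the calling shell.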
Useful link :
- Grid Engine Troubleshooting (pdf) (dead :( )