Rocks Grid Engine error states
Posted on June 27, 2017
I have not found an easy explanation of the error states in Rocks / Grid Engine, so I am collecting them here as I run into them.
Show states
First, let’s get an overview of the nodes; this can be done using qstat -f
qstat -f
The result should look something like this:
# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
----------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/0/24         0.02     linux-x64
----------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
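On a larger cluster it can be handy to list only the queues that are in a problem state. qstat supports a -qs filter for this; the state letters follow the same codes described below:

# show only queue instances that are in error (E) state
qstat -f -qs E
# show only queue instances that have been disabled (d)
qstat -f -qs d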
Error state : E
E stands for (hard) error, which means something bad is going on. As a result, the headnode decides not to send any more work to this node until there is manual intervention. This is done to make sure the node does not turn into a job sinkhole that silently fails every job sent to it.
Example :
qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.01     linux-x64     E
---------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/0/24         0.02     linux-x64
---------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     E
---------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64     E
---------------------------------------------------------------
The error states here were probably due to a full root disk on these nodes. There is an option to show which jobs failed, so you can find out what was happening at the time: -explain E
qstat -f -explain E
queuename                      qtype resv/used/tot. load_avg arch          states
-------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     E
	queue all.q marked QERROR as result of job 542032's failure at host compute-0-0.local
	queue all.q marked QERROR as result of job 542033's failure at host compute-0-0.local
	queue all.q marked QERROR as result of job 542036's failure at host compute-0-0.local
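To dig into why one of those jobs failed, the accounting records are the next stop. A minimal sketch using qacct, with job id 542032 taken from the output above; the exact fields you get depend on your accounting setup:

# show the accounting record for one of the failed jobs;
# the "failed" and "exit_status" fields usually tell you what went wrong
qacct -j 542032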
More importantly, the error state survives a reboot, so once the underlying issue has been resolved we have to clear it manually. The following clears the error state on all queues:
qmod -c '*'
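If only one node had a problem, you don’t have to clear everything; qmod -c also accepts a specific queue instance (hostname taken from the examples above):

# clear the error state on a single queue instance
qmod -c all.q@compute-0-0.local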
Disabled state : d
d means the node has been disabled; this normally does not happen automatically. We can disable a node so it receives no more jobs, while the jobs already running on it continue to run.
You can block a node from receiving further jobs using the qmod command.
qmod -d all.q@compute-0-5.local
You can re-enable the node using
qmod -e all.q@compute-0-5.local
Example :
[root@server ~]# qmod -d all.q@compute-0-5.local
root@server.local changed state of "all.q@compute-0-5.local" (disabled)
[root@server ~]# qmod -e all.q@compute-0-5.local
root@server.local changed state of "all.q@compute-0-5.local" (enabled)
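A typical use of the disabled state is draining a node before maintenance: disable it, wait for the running jobs to finish, then do the work. A minimal sketch, assuming the queue instance from the example above and a 60 second poll interval:

# stop new jobs from landing on compute-0-5
qmod -d all.q@compute-0-5.local
# the third column of qstat -f is resv/used/tot.; wait until "used" drops to 0
while [ "$(qstat -f | awk '/all.q@compute-0-5.local/ {split($3, s, "/"); print s[2]}')" != "0" ]; do
    sleep 60
done
echo "all.q@compute-0-5.local is drained"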
au : Alarm, Unreachable
In the state au, u means unreachable: this happens when sge_execd on the node does not respond to the sge_qmaster on the headnode within a configured timeout window. The a state means alarm: this happens when the node does not report its load, in which case a load of 99.99 is assumed, and the scheduler stops assigning work to the node. The au state can show up when an NFS server is being hammered and the whole node is stuck waiting on the “slow” network disk (with hard-mounted NFS). This state can clear itself once the underlying problem is resolved.
[root@server ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/3/24         3.25     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/6/24         6.32     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     dE
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/5/24         5.28     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64     dE
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BIP   0/0/24         -NA-     linux-x64     adu
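When a node is stuck in au, a quick check is whether sge_execd is still alive on it. A sketch, assuming passwordless ssh to the compute node and the usual init script name (both assumptions, adjust to your installation):

# is sge_execd still running on the node?
ssh compute-0-7 'pgrep -l sge_execd'
# if it is gone (and hanging NFS mounts are ruled out), restart it;
# the init script name can differ between installations
ssh compute-0-7 '/etc/init.d/sgeexecd restart'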
Useful link :
- Grid Engine Troubleshooting (pdf) (dead :( )