SvennD
June 27, 2017

Rocks Grid Engine error states


I have not found an easy explanation of the error states in Rocks / Grid Engine, so I am collecting them here as I find them.

Show states

First let’s get an overview of the nodes; this can be done using qstat -f

qstat -f

The result should look something like :

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
----------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/0/24         0.02     linux-x64
----------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
Error state : E

E stands for (hard) error, which means something bad is going on. The headnode decides to stop scheduling jobs on this node until there is manual intervention. This happens to make sure a job sinkhole is not created: a broken node that keeps accepting and failing jobs.

Example :

qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.01     linux-x64     E
---------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/0/24         0.02     linux-x64
---------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     E
---------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64     E
---------------------------------------------------------------

The error states here were probably due to a full root disk on those nodes. qstat can tell you which job failures triggered the error state, so you can find out what was happening at the time (-explain E) :

qstat -f -explain E
queuename                      qtype resv/used/tot. load_avg arch          states
-------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     E
        queue all.q marked QERROR as result of job 542032's failure at host compute-0-0.local
        queue all.q marked QERROR as result of job 542033's failure at host compute-0-0.local
        queue all.q marked QERROR as result of job 542036's failure at host compute-0-0.local

More importantly, the error state survives a reboot. So once the underlying issue has been resolved, we should clean it up manually (this will clear all errors) :

qmod -c '*'
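qmod -c '*' resets every queue instance at once. A narrower approach (my own sketch, not from the original post) is to extract only the instances flagged E from qstat -f and clear just those; the awk field positions and the error_queues helper name assume the column layout shown above:

```shell
#!/bin/sh
# Print queue instances whose states column contains E (hard error).
# Assumes the qstat -f layout above: queue instance in field 1,
# states (when present) in field 6; data rows are the ones with an '@'.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# On a live cluster you would run (commented out here):
#   qstat -f | error_queues | xargs -r -n1 qmod -c

# Demonstration on two captured sample rows:
printf '%s\n' \
    'all.q@compute-0-0.local        BIP   0/0/24         0.01     linux-x64     E' \
    'all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64' \
    | error_queues
```

Only all.q@compute-0-0.local is printed, since it is the only sample row with an E in its states column.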
Disabled state : d

d means the node has been disabled; this normally does not happen automatically. We can disable a node so it receives no new jobs, while already-running jobs continue to run.

You can disable a node from further jobs using the qmod command.

qmod -d all.q@compute-0-0.local

You can re-enable a node again using

qmod -e all.q@compute-0-0.local

Example :

# qmod -d all.q@compute-0-0.local
root@headnode changed state of "all.q@compute-0-0.local" (disabled)
# qmod -e all.q@compute-0-0.local
root@headnode changed state of "all.q@compute-0-0.local" (enabled)

 

au : Alarm, Unreachable

In the state au, u means unreachable; this happens when sge_execd on the node does not respond to sge_qmaster on the headnode within a configured timeout window. The a state is alarm; it appears when the node does not report its load, in which case a load of 99.99 is assumed, so the scheduler stops assigning more work to the node. The au state can happen when an NFS server is being hammered and the whole node is blocked waiting on the “slow” network disk (with hard-mounted NFS). This state can resolve itself once the underlying problem goes away.

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/3/24         3.25     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/6/24         6.32     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     dE
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/5/24         5.28     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64     dE
---------------------------------------------------------------------------------
all.q@compute-0-6.local        BIP   0/0/24         -NA-     linux-x64     adu
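On a large cluster it helps to tally the states column instead of scanning the table by eye. A minimal sketch, again assuming the qstat -f column layout used throughout this post (healthy rows simply have no sixth field); state_summary is my own helper name, not an SGE command:

```shell
#!/bin/sh
# Count queue instances per state string; rows without a states column
# are counted as "ok". Data rows are recognised by the '@' in field 1.
state_summary() {
    awk '$1 ~ /@/ { s = (NF >= 6 ? $6 : "ok"); c[s]++ }
         END { for (k in c) print k, c[k] }'
}

# Live use: qstat -f | state_summary
# Demonstration on captured sample rows:
printf '%s\n' \
    'all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     d' \
    'all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     dE' \
    'all.q@compute-0-6.local        BIP   0/0/24         -NA-     linux-x64     adu' \
    'all.q@compute-0-1.local        BIP   0/3/24         3.25     linux-x64' \
    | state_summary
```

The order of awk's for (k in c) loop is unspecified, so pipe the result through sort if you want stable output.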


Support

If you enjoyed this website, consider buying me a Dr. Pepper
