SvennD
March 28, 2022

SGE on Ubuntu 20.04 LTS

Posted on March 28, 2022  •  4 minutes  • 670 words

Setting up SGE on Ubuntu 20.04 LTS should be easy, right? Well, thanks to the internet it was doable, but it was not a fun experience at all. Rating: 0/5 ⭐!

Before I complain too much: there are some very valuable resources available online. The one I relied on most is the Sun Grid Engine installation guide from peteris.rocks. Read that document before anything else.

After that, perhaps continue here for some more debugging / troubleshooting 😉!

Installation issues

Install all required packages. The setup will ask a few questions (unless you followed the peteris.rocks guide for a headless install). The command I used:

apt-get install gridengine-master gridengine-client gridengine-exec gridengine-qmon cpp

The first issue was at step one: the gridengine-master package will not actually install on a clean installation. This has been a known bug since 2018, sadly unresolved.

This error shows up in the output:

dpkg: error processing package gridengine-master (--configure):
installed gridengine-master package post-installation script subprocess returned error exit status 139
[...]
Errors were encountered while processing:
 gridengine-master
E: Sub-process /usr/bin/dpkg returned an error code (1)

However, near the end of this bug report you can see that a single "bad" file is the cause. After a few hours of my own debugging, I figured out what was breaking the setup. Lucky for me, cloudsatoz described a method to work around this bug. Their method works, but I don't think it's necessary (any longer?) to manually compile jemalloc.

We can reproduce the error by rerunning the init_cluster script:

sudo -u sgeadmin /usr/share/gridengine/scripts/init_cluster /var/lib/gridengine default /var/spool/gridengine/spooldb sgeadmin
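
The exit status 139 in the dpkg output hints at what kind of failure this is: shells report a process killed by a signal as 128 plus the signal number, so 139 means signal 11, a segmentation fault. A small sketch to decode such a status (the function name `signal_from_status` is my own, not part of gridengine or dpkg):

```shell
#!/bin/sh
# Decode a shell exit status above 128 into the name of the signal
# that killed the process.
# signal_from_status is a hypothetical helper name used for illustration.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l <number> prints the signal name for that number.
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

signal_from_status 139
```

On Linux this prints SEGV, which fits a crashing binary rather than a configuration mistake.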

The ultimate cause is in spooldefaults.bin, so we need to replace it. That can be done by downloading an older package and copying its files over the installed ones.

wget http://ftp.debian.org/debian/pool/main/g/gridengine/gridengine-client_8.1.9+dfsg-9_amd64.deb
dpkg -x gridengine-client_8.1.9+dfsg-9_amd64.deb a
cd a/usr/lib/gridengine/
cp spooldefaults.bin /usr/lib/gridengine/
cp libspool*.so /usr/lib/gridengine/
systemctl restart gridengine-master
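
Since this overwrites files owned by the installed package, it may be worth keeping the originals around. A minimal sketch of a backup-then-copy step; `replace_with_backup` is a hypothetical helper name and the `.orig` suffix is just my convention:

```shell
#!/bin/sh
# replace_with_backup SRC DEST_DIR
# Copies SRC into DEST_DIR, first saving any existing file of the same
# name as <name>.orig (only if no backup exists yet, so reruns are safe).
replace_with_backup() {
    src=$1
    dest_dir=$2
    name=$(basename "$src")
    if [ -e "$dest_dir/$name" ] && [ ! -e "$dest_dir/$name.orig" ]; then
        cp -p "$dest_dir/$name" "$dest_dir/$name.orig"
    fi
    cp "$src" "$dest_dir/$name"
}

# Example usage, run as root from the extracted a/usr/lib/gridengine/ directory:
# replace_with_backup spooldefaults.bin /usr/lib/gridengine
# for f in libspool*.so; do replace_with_backup "$f" /usr/lib/gridengine; done
```

Reverting is then a matter of copying the `.orig` files back and restarting gridengine-master.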

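The cloudsatoz method then runs ./install_qmaster from /var/lib/gridengine. This requires an extra binary: qmake. I'm not sure whether this step is required, but I installed it and symlinked the current qmake binary into the gridengine folder. (Note that qt5-qmake is Qt's build tool, not Grid Engine's own qmake, which is a parallel make; the symlink presumably just satisfies the installer's check for the file.)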

apt-get install qt5-qmake

ln -s /usr/bin/qmake /usr/lib/gridengine/qmake

cd /var/lib/gridengine
./install_qmaster

qhost

After the master installation, I couldn't connect to or query the master.

error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "master")
error: unable to send message to qmaster using port 6444 on host "master": got send error

The fix is to remove this line from /etc/hosts:

127.0.1.1 server  # <-- remove
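
A sketch of the same change done non-interactively, commenting the line out instead of deleting it so it is easy to revert. The function name `comment_out_loopback_alias` is hypothetical, and it assumes the offending entry starts with 127.0.1.1 as in my file:

```shell
#!/bin/sh
# comment_out_loopback_alias FILE
# Prints FILE with any line starting with 127.0.1.1 commented out.
# It writes to stdout only; the caller decides whether to replace the file.
comment_out_loopback_alias() {
    sed -E 's/^(127\.0\.1\.1[[:space:]].*)$/# \1/' "$1"
}

# Example usage (as root), keeping a backup of the original:
# cp /etc/hosts /etc/hosts.bak
# comment_out_loopback_alias /etc/hosts.bak > /etc/hosts
# getent hosts "$(hostname)"   # check what the hostname resolves to now
```

The hostname still needs to resolve to the machine's real address afterwards, either through another /etc/hosts entry or through DNS.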

After this, I could pretty much follow the guide.

AD users connected

After installing the cluster and making it a submit host, I connected it via Samba to our AD environment. After logging in to one of the AD accounts I could submit jobs, but they would fail immediately. Closer inspection showed this:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     37 0.50000 STDIN      svennd       Eqw   03/24/2022 16:30:21                                    1        
     42 0.50000 hostname   svennd       Eqw   03/28/2022 09:25:26                                    1        

$ qstat -explain c -j 42
==============================================================
job_number:                 42
exec_file:                  job_scripts/42
submission_time:            Mon Mar 28 09:25:26 2022
owner:                      svennd
uid:                        100172585
group:                      domain
gid:                        100000514
sge_o_home:                 /home/AD/svennd
sge_o_log_name:             svennd
sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/AD/svennd
sge_o_host:                 server
account:                    sge
mail_list:                  [email protected]
notify:                     FALSE
job_name:                   hostname
jobshare:                   0
env_list:                   TERM=NONE
script_file:                hostname
binding:                    NONE
job_type:                   binary
error reason          1:      can't get password entry for user "svennd". Either user does not exist or error with NIS/LDAP etc.
scheduling info:            Job is in error state

The key here is: can't get password entry for user "svennd". Either user does not exist or error with NIS/LDAP etc.
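
With more than a couple of jobs it helps to pull the failed ones out of the qstat listing. A sketch that assumes the default column layout shown above, where the state is the fifth column; `list_error_jobs` is a hypothetical helper name:

```shell
#!/bin/sh
# list_error_jobs: read `qstat` output on stdin and print the IDs of jobs
# whose state column contains "E" (for example Eqw).
# Assumes the default qstat layout: two header lines, state in column 5.
list_error_jobs() {
    awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Example usage:
# qstat | list_error_jobs
# qstat -u '*' | list_error_jobs    # all users instead of only your own
```

Each ID can then be fed to qstat -explain c -j to see the error reason, as done above for job 42.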

This could be resolved by restarting gridengine-exec. How lasting the fix will be, I don't know. Source: a Google web cache copy.

service gridengine-exec restart
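
As far as I know, jobs that already landed in Eqw stay there until their error state is cleared, so a restart alone does not rerun them. A sketch of what I would check next, using job 42 from above: first confirm the account resolves on the execution host, then clear the job's error state with qmod -cj so the scheduler retries it. `user_resolves` is a hypothetical helper name:

```shell
#!/bin/sh
# user_resolves USER: succeed if the system can look up USER.
# id goes through NSS, so this covers AD/LDAP accounts as well as /etc/passwd.
user_resolves() {
    id "$1" >/dev/null 2>&1
}

# Example usage on the execution host, with the user and job ID from above:
# if user_resolves svennd; then
#     qmod -cj 42        # clear the error state so the job is rescheduled
# else
#     echo "svennd still does not resolve; check the AD/Samba setup first" >&2
# fi
```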

