SGE on Ubuntu 20.04 LTS
Posted on March 23, 2022 • 4 minutes • 670 words • Suggest Changes
Setting up SGE on Ubuntu 20.04 LTS, should be easy right? Well, thanks to the internet it was do-able. Not a fun experience at all. Rate 0/5 ⭐ !
Before I complain too much, there are some very valuable resources available online; The one I relied on most is from peteris.rocks, sun grid engine installation before anything you should read that document.
After that perhaps continue here for some more debugging / troubleshooting 😉!
Installing issues
Install all required packages, the setup will ask a few items (if you didn’t follow peteris.rocks guide to install headless). The values I used :
- local mail only,
- cluster name (free to pick),
- default cell : default (change only if you know why),
- master hostname : (a dns resolvable hostname)
apt-get install gridengine-master gridengine-client gridengine-exec gridengine-qmon cpp
The first issue was at step one, installing gridengine-master package will actually not install on a clean installation. This is a known bug since 2018, sadly unresolved.
This error can be in the output :
dpkg: error processing package gridengine-master (--configure):
installed gridengine-master package post-installation script subprocess returned error exit status 139
[...]
Errors were encountered while processing:
gridengine-master
E: Sub-process /usr/bin/dpkg returned an error code (1)
However near the end of this bug report, you can see that a single “bad” file is the cause of it, after some of my own debugging. (a few hours) I figured out what was breaking the setup. Lucky for me cloudsatoz described the method in which they could work around this bug. His method works but I don’t think its needed (anylonger?) to manually compile jemalloc.
We can find the error, if we rerun the init_cluster script :
sudo -u sgeadmin /usr/share/gridengine/scripts/init_cluster /var/lib/gridengine default /var/spool/gridengine/spooldb sgeadmin
The ultimate cause is in spooldefaults.bin
; So we need to replace that one.
That can be done by downloading an older package and replacing the file.
wget http://ftp.debian.org/debian/pool/main/g/gridengine/gridengine-client_8.1.9+dfsg-9_amd64.deb
dpkg -x gridengine-client_8.1.9+dfsg-9_amd64.deb a
cd a/usr/lib/gridengine/
cp spooldefaults.bin /usr/lib/gridengine/
cp libspool*.so /usr/lib/gridengine/
systemctl restart gridengine-master
The cloudsatoz method goes and runs ./install_qmaster from /var/lib/gridengine
. This requires a extra binary : qmake. I’m not sure if this is required or not, but I installed it and linked the current qmake binary in the folder of gridengine.
apt-get install qt5-qmake
ln -s /usr/bin/qmake /usr/lib/gridengine/qmake
cd /var/lib/gridengine
./install_qmaster
qhost
After the master installation, I couldn’t connect/query on the master.
error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "master")
error: unable to send message to qmaster using port 6444 on host "master": got send error
Fix by removing from /etc/hosts
to :
127.0.1.1 server # <-- remove
After this, I pretty much could follow the guide.
AD users connected
After installing the cluster and making it a submit host; I connected via samba to our AD environment. After logging in to one of the accounts over AD I could submit jobs, but they would directly fail. Upon closer inspection this came out :
svennd@server:~$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
37 0.50000 STDIN svennd Eqw 03/24/2022 16:30:21 1
42 0.50000 hostname svennd Eqw 03/28/2022 09:25:26 1
svennd@server:~$ qstat -explain c -j 42
==============================================================
job_number: 42
exec_file: job_scripts/42
submission_time: Mon Mar 28 09:25:26 2022
owner: svennd
uid: 100172585
group: domain
gid: 100000514
sge_o_home: /home/AD/svennd
sge_o_log_name: svennd
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
sge_o_shell: /bin/bash
sge_o_workdir: /home/AD/svennd
sge_o_host: server
account: sge
mail_list: [email protected]
notify: FALSE
job_name: hostname
jobshare: 0
env_list: TERM=NONE
script_file: hostname
binding: NONE
job_type: binary
error reason 1: can't get password entry for user "svennd". Either user does not exist or error with NIS/LDAP etc.
scheduling info: Job is in error state
The key here is : can't get password entry for user. Either user does not exist or error with NIS/LDAP etc.
This could be resolved by restarting gridengine-exec; However how lasting this issue will be, I don’t know. Source, webcache of google
service gridengine-exec restart
Valuable sources during the debugging :
Image by tvick