SGE on Ubuntu 20.04 LTS
Posted on March 28, 2022
Setting up SGE on Ubuntu 20.04 LTS should be easy, right? Well, thanks to the internet it was doable, but it was not a fun experience at all. Rating: 0/5 ⭐!
Before I complain too much: there are some very valuable resources available online. The one I relied on most is the Sun Grid Engine installation guide from peteris.rocks; read that document before anything else.
After that perhaps continue here for some more debugging / troubleshooting 😉!
Install all required packages. The setup will ask a few questions (unless you followed the peteris.rocks guide for a headless install). The values I used:
- local mail only,
- cluster name (free to pick),
- default cell : default (change only if you know why),
- master hostname : (a dns resolvable hostname)
apt-get install gridengine-master gridengine-client gridengine-exec gridengine-qmon cpp
The first issue came at step one: the gridengine-master package does not actually install on a clean system. This is a known bug, open since 2018 and sadly still unresolved.
This error shows up in the output:
dpkg: error processing package gridengine-master (--configure):
 installed gridengine-master package post-installation script subprocess returned error exit status 139
[...]
Errors were encountered while processing:
 gridengine-master
E: Sub-process /usr/bin/dpkg returned an error code (1)
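A side note on that "exit status 139": shells report a process killed by a signal as 128 plus the signal number, so 139 means signal 11, a segmentation fault. The helper below is only an illustrative sketch (describe_status is a made-up name, not part of dpkg or Grid Engine) for decoding such a status:

```shell
# Decode a shell exit status.
# Values above 128 conventionally mean "killed by signal (status - 128)".
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

describe_status 139   # the status from the dpkg error above
```

So the post-installation script is not failing on a configuration problem; one of the binaries it calls is crashing.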
Near the end of that bug report you can see that a single “bad” file is the cause. After a few hours of my own debugging I figured out what was breaking the setup. Lucky for me, cloudsatoz described a method to work around this bug. Their method works, but I don’t think it is necessary (any longer?) to manually compile jemalloc.
We can find the error by rerunning the init_cluster script:
sudo -u sgeadmin /usr/share/gridengine/scripts/init_cluster /var/lib/gridengine default /var/spool/gridengine/spooldb sgeadmin
The ultimate cause is in spooldefaults.bin, so we need to replace that one.
That can be done by downloading an older package and replacing the file.
wget http://ftp.debian.org/debian/pool/main/g/gridengine/gridengine-client_8.1.9+dfsg-9_amd64.deb
dpkg -x gridengine-client_8.1.9+dfsg-9_amd64.deb a
cd a/usr/lib/gridengine/
cp spooldefaults.bin /usr/lib/gridengine/
cp libspool*.so /usr/lib/gridengine/
systemctl restart gridengine-master
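Since this overwrites files that belong to an installed package, it may be worth keeping the originals around. The function below is a sketch of the same copy step with a backup added; replace_with_backup and the .orig suffix are my own invention, not something from the bug report or the cloudsatoz write-up:

```shell
# Copy the replacement files into a target directory, keeping a .orig copy
# of every file that gets overwritten (only the first time, so a re-run
# does not clobber the real original).
replace_with_backup() {
    src_dir=$1
    dest_dir=$2
    for f in "$src_dir"/spooldefaults.bin "$src_dir"/libspool*.so; do
        [ -e "$f" ] || continue          # skip patterns that matched nothing
        name=$(basename "$f")
        if [ -e "$dest_dir/$name" ] && [ ! -e "$dest_dir/$name.orig" ]; then
            cp -p "$dest_dir/$name" "$dest_dir/$name.orig"
        fi
        cp "$f" "$dest_dir/$name"
    done
}

# Example (as root), from the directory where the .deb was unpacked:
#   replace_with_backup a/usr/lib/gridengine /usr/lib/gridengine
```

Keep in mind that a later package upgrade or reinstall would presumably put the broken files back.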
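The cloudsatoz method then runs ./install_qmaster from /var/lib/gridengine. This requires an extra binary: qmake. I’m not sure whether it is really required, but I installed it and linked the qmake binary into the gridengine folder. (Note that qt5-qmake is Qt's build tool, not Grid Engine's own qmake, which is a parallel make; the link just provides a file with the expected name.)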
apt-get install qt5-qmake
ln -s /usr/bin/qmake /usr/lib/gridengine/qmake
cd /var/lib/gridengine
./install_qmaster
After the master installation, I couldn’t connect/query on the master.
error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "master")
error: unable to send message to qmaster using port 6444 on host "master": got send error
The master's hostname has to resolve to its real IP address, not to a loopback address. Fix it by removing this line from /etc/hosts:
127.0.1.1 server # <-- remove
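To check whether a hosts file still maps a real hostname to a loopback address, something along these lines could help. It is a rough sketch: loopback_names is a hypothetical helper, and it assumes the usual "address name [alias ...]" layout of a hosts file.

```shell
# Print every name (other than the localhost aliases) that a hosts file
# maps to a 127.x.x.x address. Defaults to /etc/hosts.
loopback_names() {
    awk '
        { sub(/#.*/, "") }               # strip comments
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^localhost/) print $i
        }
    ' "${1:-/etc/hosts}"
}

# Example:
#   loopback_names            # check /etc/hosts
#   loopback_names ./hosts    # check some other file
```

If this prints the name of the master (or of an exec host), that entry is a likely candidate for the "access denied" error above.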
After this, I pretty much could follow the guide.
AD users connected
After installing the cluster and making the machine a submit host, I joined it to our AD environment via Samba. After logging in with one of the AD accounts I could submit jobs, but they would fail immediately. A closer look showed this:
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     37 0.50000 STDIN      svennd       Eqw   03/24/2022 16:30:21                                    1
     42 0.50000 hostname   svennd       Eqw   03/28/2022 09:25:26                                    1
$ qstat -explain c -j 42
==============================================================
job_number:                 42
exec_file:                  job_scripts/42
submission_time:            Mon Mar 28 09:25:26 2022
owner:                      svennd
uid:                        100172585
group:                      domain
gid:                        100000514
sge_o_home:                 /home/AD/svennd
sge_o_log_name:             svennd
sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/AD/svennd
sge_o_host:                 server
account:                    sge
mail_list:                  [email protected]
notify:                     FALSE
job_name:                   hostname
jobshare:                   0
env_list:                   TERM=NONE
script_file:                hostname
binding:                    NONE
job_type:                   binary
error reason    1:          can't get password entry for user "svennd". Either user does not exist or error with NIS/LDAP etc.
scheduling info:            Job is in error state
The key here is :
can't get password entry for user. Either user does not exist or error with NIS/LDAP etc.
Restarting gridengine-exec resolved this; how long the fix will last, I don’t know. (Source: a Google web cache copy.)
service gridengine-exec restart
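Jobs that already ended up in Eqw stay there after the restart until their error state is cleared (qmod -cj does that). The two helpers below are an untested-on-a-cluster sketch of how that could be scripted; error_jobs and user_resolves are names I made up, and the column position assumes the default qstat layout shown above.

```shell
# List job IDs whose state column contains E (error).
# Reads plain `qstat` output on stdin; the first two lines are the header.
error_jobs() {
    awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Succeed if the account can be looked up on this host
# (local users, LDAP, winbind, ... whatever NSS is configured for).
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

# Possible usage on the exec host, after the restart above:
#   user_resolves svennd && qstat -u svennd | error_jobs | xargs -r -n1 qmod -cj
```

If user_resolves fails for an AD account, the problem sits in the AD/NSS setup rather than in Grid Engine, and restarting gridengine-exec is unlikely to help.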
Image by tvick