FAQ¶
General¶
Restoring files
If you accidentally delete a file from the compute servers, you may try retrieving it from our twice-a-week ZFS snapshots.
For details, see Snapshots.
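What language is this documentation written in?
This documentation was originally written in English. We later found that 99% of students cannot read English as fluently as they read Chinese, so we have gradually been switching the documentation back to Chinese (runs away).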
MPI¶
Specify backend compiler for mpicc and mpic++
For MPICH, set the environment variables MPICH_CC and MPICH_CXX. For example, to use GCC 9 as the backend:
export MPICH_CC=gcc-9 MPICH_CXX=g++-9
mpicc [...]
For Open MPI, the corresponding environment variables are OMPI_CC and OMPI_CXX:
export OMPI_CC=gcc-9 OMPI_CXX=g++-9
mpicc [...]
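The two cases above differ only in the variable prefix. As a minimal sketch, the hypothetical helper below (the name mpi_backend_vars and the flavor labels mpich/openmpi are illustrative, not part of either MPI distribution) prints the matching assignments so they can be applied to a single command instead of being exported for the whole session.

```shell
# Hypothetical helper: print the variable assignments that select the
# backend compilers for a given MPI implementation.
# Usage: mpi_backend_vars <mpich|openmpi> <cc> <cxx>
mpi_backend_vars() {
    case "$1" in
        mpich)   printf 'MPICH_CC=%s MPICH_CXX=%s\n' "$2" "$3" ;;
        openmpi) printf 'OMPI_CC=%s OMPI_CXX=%s\n' "$2" "$3" ;;
        *)       echo "unknown MPI flavor: $1" >&2; return 1 ;;
    esac
}

# Example: apply the settings to one command only (no export needed).
# env $(mpi_backend_vars mpich gcc-9 g++-9) mpicc [...]
```

To check which compiler the wrapper would actually call, MPICH accepts mpicc -show and Open MPI accepts mpicc --showme.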
CUDA¶
Specify exact libcudart version for linking
On systems with multiple CUDA versions installed, the linker selects the latest minor version by default (e.g. 11.8 for libcudart.so.11).
This version selection is done at runtime by the dynamic linker. To specify an exact version, use LD_LIBRARY_PATH. For example, to specify CUDA 11.3:
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64
./my-app
Make sure to change the path to point to the actual CUDA path.
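The export above replaces any existing LD_LIBRARY_PATH. If other libraries already rely on that variable, one option is to prepend the CUDA directory instead. The sketch below assumes the toolkit is installed under /usr/local/cuda-<version>; the helper name use_cuda_libs is made up for illustration.

```shell
# Hypothetical helper: put one CUDA version's library directory first in
# LD_LIBRARY_PATH while keeping whatever was already there.
# Assumes the toolkit lives under /usr/local/cuda-<version>.
use_cuda_libs() {
    cuda_lib_dir="/usr/local/cuda-$1/lib64"
    export LD_LIBRARY_PATH="${cuda_lib_dir}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# use_cuda_libs 11.3
# ldd ./my-app | grep libcudart   # shows which libcudart would be loaded
```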
Login and user management¶
How to change my password?
Log in to any host and run passwd. You will be prompted for your current password, then your new password twice.
- For LDAP users, your account normally has no password when it is created, since lab servers now accept only SSH key login. If you need one (e.g. to change your default shell or to use the OpenVPN service), ask an admin to set an initial password for you, then change it yourself afterwards.
  Your new password is effective on ALL hosts, including our Synology DiskStation.
  The change may take up to 10 minutes to propagate to other hosts due to nscd caching. This delay does not apply to the Synology DiskStation or our OpenVPN service (which runs on the Synology DS).
- For non-LDAP users, your new password is specific to the host on which you changed it.
- Note that our Kunpeng 920 nodes (brainiac1 and brainiac2) are not currently enrolled in LDAP.
If you forget your password, an administrator can reset it for you.
How to change my login shell?
LDAP users: Use chsh.ldap on any Ubuntu 22.04 host. This command is broken on Ubuntu 20.04 hosts (because of upstream bugs).
Non-LDAP users: Use chsh.
Public key is registered in authorized_keys but still can't log in
Ask an administrator to inspect /var/log/auth.log. One of the lines should contain this:
userauth_pubkey: key type ssh-rsa not in PubkeyAcceptedAlgorithms [preauth]
Client fix: Update your SSH client software.
- Known incompatible clients include OpenSSH (8.2 and older) and Xshell (all versions).
Server fix: Add PubkeyAcceptedAlgorithms +ssh-rsa to /etc/ssh/sshd_config.d/acsa.conf and reload the SSH service.
- This is because, starting with version 8.8, OpenSSH disables the old, insecure ssh-rsa signature algorithm (which uses the SHA-1 hash) by default. This is not the same thing as the ssh-rsa key type. (Confusingly, the keyword ssh-rsa refers to multiple similar things in SSH.)
  Other key types (ECDSA and Ed25519 keys) are not affected.
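Since only RSA keys can run into this rejection, a quick first check is the key type of the public key in use. The sketch below relies on the fact that the first field of an OpenSSH public key line is its key type; the function name is hypothetical, and it assumes a plain .pub file rather than an authorized_keys line with option prefixes.

```shell
# Hypothetical helper: succeed if the given public key file holds an RSA
# key, i.e. the only key type that can hit the ssh-rsa rejection.
# Assumes a plain .pub file whose first field is the key type.
pubkey_is_rsa() {
    key_type=$(awk 'NR == 1 { print $1 }' "$1")
    [ "$key_type" = "ssh-rsa" ]
}

# if pubkey_is_rsa ~/.ssh/id_rsa.pub; then
#     echo "RSA key: update the SSH client or switch to another key type"
# fi
```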
The scp command does not copy any file and gives no output
scp uses SSH only as a transport layer and transmits data in its own representation. If your .bashrc or .zshrc is invoked in non-interactive sessions (which it shouldn't be) and produces output, that output interferes with scp's data stream, causing scp to fail.
Fix: Check your .bashrc. Make sure it starts with [[ $- == *i* ]] || return or something similar. If it doesn't, add that line at the top of your .bashrc.
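To verify the fix without involving scp, the rc file can be sourced from a non-interactive bash and checked for output. This is a minimal sketch: the function name rc_is_quiet is made up, and it only approximates what happens during a real SSH session.

```shell
# Hypothetical helper: succeed if sourcing the given rc file from a
# non-interactive bash prints nothing (stdout and stderr combined).
rc_is_quiet() {
    rc_output=$(bash --norc --noprofile -c '. "$1"' _ "$1" 2>&1)
    [ -z "$rc_output" ]
}

# rc_is_quiet ~/.bashrc && echo "safe for scp" || echo "prints output"
```

Against a real server, ssh <host> true should likewise print nothing.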
NFS Mounting¶
Just create an appropriate /etc/fstab entry. Example:
192.0.2.1:/home /home nfs4 soft,nofail 0 0
The soft and nofail mount options are important: they prevent the system from hanging when an NFS mount fails.
Do not try to work around this with a custom unit. Systemd automatically handles the ordering and dependencies of NFS mounts and, in most cases, it is smarter than a hand-crafted mount-acsa-nfs.service or the like.
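Because a missing soft or nofail option only shows up when the NFS server is unreachable, it can be worth auditing the file ahead of time. The sketch below is one way to list NFS entries that lack either option; the function name check_nfs_opts is hypothetical, and it assumes the usual whitespace-separated fstab layout.

```shell
# Hypothetical helper: print the mountpoint of every NFS entry that lacks
# the "soft" or "nofail" option. Defaults to /etc/fstab.
check_nfs_opts() {
    awk '$1 !~ /^#/ && $3 ~ /^nfs/ {
        opts = "," $4 ","                 # wrap so each option is comma-delimited
        if (opts !~ /,soft,/ || opts !~ /,nofail,/) print $2
    }' "${1:-/etc/fstab}"
}

# check_nfs_opts            # audit /etc/fstab
# check_nfs_opts ./fstab    # audit a draft before installing it
```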
NFS mount appears laggy on one host but not others¶
If the mountpoint works normally on other hosts, the problem most likely lies with that particular host.
Try dropping the system's local directory cache (as root):
echo 3 > /proc/sys/vm/drop_caches
NFS-over-RDMA kernel module¶
Install the latest MLNX_OFED package. Sometimes the mlnx-nfsrdma-dkms package isn't included in a downloaded ISO or TGZ file, but it can still be retrieved from https://linux.mellanox.com/public/repo/mlnx_ofed/latest/(version)/(arch)/mlnx-nfsrdma-dkms_*.deb. Just download the file and install it. Make sure the version of mlnx-nfsrdma-dkms matches that of mlnx-ofed-kernel-dkms, or the DKMS build will fail.
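The version check can be scripted. The sketch below assumes that "matches" means the two packages report identical version strings, and that the input is the default name<TAB>version output of dpkg-query -W; the function name is hypothetical.

```shell
# Hypothetical helper: succeed only if both DKMS packages are listed with
# the same version. Reads "name<TAB>version" lines on stdin, which is the
# default output format of `dpkg-query -W`.
ofed_versions_match() {
    awk '
        $1 == "mlnx-nfsrdma-dkms"     { nfs = $2 }
        $1 == "mlnx-ofed-kernel-dkms" { ofed = $2 }
        END { exit !(nfs != "" && nfs == ofed) }
    '
}

# dpkg-query -W mlnx-nfsrdma-dkms mlnx-ofed-kernel-dkms | ofed_versions_match \
#     || echo "version mismatch: expect the DKMS build to fail"
```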
After successful installation, try modprobe svcrdma and modprobe xprtrdma. Both should produce no output and an exit code of 0 (success).
If the installation fails, follow the logs as directed by DKMS, e.g. /var/lib/dkms/mlnx-ofed-kernel/5.7/build/make.log. Identify where the build failed and try to fix it.
Log keyword Module*.symvers
On Ubuntu and other Debian-derived distributions, the Module.symvers file is available at
/usr/src/linux-headers-$(uname -r)/Module.symvers
Just copy this file into /usr/src/<mlnx-ofed directory>.
Log keyword __PEDIT_CMD_MAX
In file included from /var/lib/dkms/mlnx-ofed-kernel/5.7/build/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c:7:
/var/lib/dkms/mlnx-ofed-kernel/5.7/build/drivers/net/ethernet/mellanox/mlx5/core/en/tc_priv.h:41:35: error: ‘__PEDIT_CMD_MAX’ undeclared here (not in a function); did you mean ‘__DEVLINK_CMD_MAX’?
41 | struct pedit_headers_action hdrs[__PEDIT_CMD_MAX];
| ^~~~~~~~~~~~~~~
| __DEVLINK_CMD_MAX
make[3]: *** [scripts/Makefile.build:297: /var/lib/dkms/mlnx-ofed-kernel/5.7/build/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.o] Error 1
Diagnosis: A web search shows that __PEDIT_CMD_MAX is defined in <kernel header>/include/uapi/linux/tc_act/tc_pedit.h.
Solution: Add #include <linux/tc_act/tc_pedit.h> to <mlnx-ofed source>/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h.
MLNX OFED suite download
URL format:
https://content.mellanox.com/ofed/MLNX_OFED-<version>/MLNX_OFED_LINUX-<version>-<distro version>-<arch>.tgz
For example, https://content.mellanox.com/ofed/MLNX_OFED-23.10-1.1.9.0/MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64.tgz
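The URL format can be filled in mechanically. The sketch below builds the download URL from its three parts; the function name mlnx_ofed_url is made up, and whether a given combination actually exists on the server is not checked.

```shell
# Hypothetical helper: build the MLNX_OFED download URL.
# Usage: mlnx_ofed_url <version> <distro version> <arch>
mlnx_ofed_url() {
    printf 'https://content.mellanox.com/ofed/MLNX_OFED-%s/MLNX_OFED_LINUX-%s-%s-%s.tgz\n' \
        "$1" "$1" "$2" "$3"
}

# mlnx_ofed_url 23.10-1.1.9.0 ubuntu22.04 x86_64
```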
NFS over RDMA: Protocol error¶
mount.nfs4: Protocol error
Try adding -v to the mount command and look for the lines containing "trying text-based options". Double-check that the values make sense.
Also compare /etc/netplan/*.yaml with the output of ip a. Fixing any discrepancy should help.
NFS mount.nfs4: an incorrect mount option was specified¶
The NFS mount fails with the following log:
root@icarus0:~# mount -a -v
mount.nfs4: timeout set for Tue Jan 23 14:46:45 2024
mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4.2,addr=10.1.13.1,clientaddr=10.1.13.59'
mount.nfs4: mount(2): Invalid argument
mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4,minorversion=1,addr=10.1.13.1,clientaddr=10.1.13.59'
mount.nfs4: mount(2): Invalid argument
mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4,addr=10.1.13.1,clientaddr=10.1.13.59'
mount.nfs4: mount(2): Invalid argument
mount.nfs4: an incorrect mount option was specified
The Mellanox driver is probably not compatible with the new kernel. Try installing a newer Mellanox driver; see Install InfiniBand drivers.
SSH Troubleshooting¶
Client side¶
You can try ssh -v to learn what SSH is doing. The output includes which public key files are available and which are rejected and why, and this is usually enough to identify the problem.
One -v is enough
It's very rare that debug1 doesn't provide enough information. Extra verbosity from -vv and -vvv is only really useful to OpenSSH developers, or if you have modified the OpenSSH source code.
If that isn't enough (or you believe something's wrong on the other side), ask an admin to investigate the server.
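The debug output is long, and the lines about public keys are the ones that usually matter. As a rough sketch, the filter below keeps only a few authentication-related debug1 messages; the function name is hypothetical, and the exact message wording may differ between OpenSSH versions.

```shell
# Hypothetical helper: keep only the public-key related lines from
# `ssh -v` output read on stdin. The message texts are those commonly
# printed by the OpenSSH client and may vary between versions.
ssh_auth_summary() {
    grep -E 'Offering public key|Server accepts key|Authentications that can continue'
}

# ssh -v user@host true 2>&1 | ssh_auth_summary
```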
Server side¶
Edit /etc/ssh/sshd_config, change LogLevel to DEBUG1 (default INFO), and reload the SSH service. Ask the user to make another login attempt, then check either journalctl -eu ssh or /var/log/auth.log.
Don't forget to restore LogLevel afterwards, as DEBUG1 tends to bloat the system log.
Common problems¶
- Permission denied (publickey)
  If you've confirmed that your public key is valid but still get this error, it's likely that the NFS mount on the server is broken. Ask an administrator to fix it.
InfiniBand cards¶
On the NFS server, the IB interface may show NO-CARRIER even if it's otherwise fine. This should be solved by starting the opensmd service somewhere (not necessarily on the host with the problem).
Permanent fix
OpenSM only needs to run once somewhere in the network, so we're running it on the NFS server. No other servers need to run OpenSM.
Miscellaneous quirks¶
NVIDIA drivers
The NVIDIA driver should preferably be installed from the Ubuntu repository, i.e. apt install linux-modules-nvidia-xxx-generic (or, on older OS versions, nvidia-driver-xxx). This is both easier and more reliable than installing from NVIDIA's official installer, particularly across kernel and OS upgrades.
Mellanox InfiniBand drivers
On upgrading kernel to 5.15.0-83-generic, the InfiniBand DKMS driver failed to build. This is because the Mellanox driver is not compatible with the new kernel.
The DKMS build log indicated an unknown field xpo_release_rqst for struct svc_xprt_ops. We inspected the source and found a breaking change in the upstream kernel (compare svc_xprt.h between v5.15.112 and v5.15.113).
The solution is to revert to the old kernel 5.15.0-79-generic and abstain from upgrading until the Mellanox driver is updated.
Server going unresponsive after a while
If the server has a desktop environment installed, chances are that "Automatic Suspend" is still enabled (it is turned on by default). The desktop manager handles these ACPI triggers; disabling them there can fix the problem, but they may be re-enabled automatically after a software upgrade.
Fix: Log in to the graphical interface via IPMI/KVM, open the Settings app, select Power on the left, and turn off Automatic Suspend.
Also disable sleep-related systemd targets:
systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
Permanent fix: Change the system boot target. Use systemctl get-default to see the current default, then change graphical.target to multi-user.target by executing systemctl set-default multi-user.target. The system should no longer start the desktop manager after a reboot.
CUDA nvcc: unsupported GNU version! gcc [*] and up are not supported!
Install a compatible GCC version (e.g. for CUDA 11.3, install GCC 9 or 10).
Then go to the CUDA directory (e.g. /usr/local/cuda-11.3/bin) and symlink the desired GCC version to gcc and g++:
cd /usr/local/cuda-11.3/bin
sudo ln -s /usr/bin/gcc-9 gcc
sudo ln -s /usr/bin/g++-9 g++
Ref: Stack Overflow