Setup Guide¶
Install CUDA without a full desktop environment¶
Newer Ubuntu provides pre-compiled NVIDIA drivers
For Ubuntu 22.04 LTS or newer, you can replace nvidia-headless-470 with linux-modules-nvidia-470-generic (or, if the server is running an HWE kernel, replace the suffix -generic with -generic-hwe-22.04).
This saves the time and CPU otherwise spent compiling the driver.
Follow the official installation guide, but stop before apt install cuda, and do this instead:
apt install --no-install-recommends nvidia-cuda-toolkit nvidia-headless-470 nvidia-utils-470
Replace 470 with the version you want to install.
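Because the version number appears in two package names, it is easy to change one and miss the other. A minimal sketch that builds the command from a single argument follows; the function name cuda_install_cmd is made up for this example, and it only prints the command so you can review it before running it as root.

```shell
# Hypothetical helper: print the install command for one driver branch.
# It only prints the command; nothing is installed.
cuda_install_cmd() {
    version="$1"    # driver branch, e.g. 470
    echo "apt install --no-install-recommends nvidia-cuda-toolkit nvidia-headless-${version} nvidia-utils-${version}"
}

# Example: the 470 branch used throughout this guide.
cuda_install_cmd 470
```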
Note
R470 is the last version to support Kepler GPUs (GTX 600, GTX 700, Quadro K, Tesla K).
Install NVIDIA container runtime¶
Add repository key:
wget -O /etc/apt/trusted.gpg.d/nvidia-container-runtime.asc https://nvidia.github.io/nvidia-docker/gpgkey
Add repository:
cat > /etc/apt/sources.list.d/nvidia-container-runtime.list << 'EOF'
deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
EOF
Note
While the URL indicates ubuntu18.04, the same repository is in fact shared across Ubuntu 20.04 and 22.04.
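Note that $(ARCH) in the repository lines is not a shell variable: it is a placeholder that APT itself replaces with the machine's architecture when it reads the list file, so it has to reach the file literally. To preview what a line resolves to, something like the following should work; expand_arch is a made-up name, and on a real machine the architecture would come from dpkg --print-architecture rather than being hard-coded.

```shell
# Hypothetical helper: show a sources.list line with $(ARCH) filled in.
# Reads lines on stdin, writes the expanded lines to stdout.
expand_arch() {
    arch="$1"    # e.g. the output of `dpkg --print-architecture`
    sed 's|[$](ARCH)|'"$arch"'|g'
}

# amd64 is used here purely as an example architecture.
echo 'deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /' | expand_arch amd64
```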
Update and install:
apt update
apt install nvidia-container-runtime
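Configure Docker to use the NVIDIA container runtime by adding the following to /etc/docker/daemon.json (merge it with any settings already in that file):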
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart the Docker service and we're done:
systemctl restart docker
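After the restart, it is worth confirming that Docker actually picked up the new runtime. The sketch below assumes that docker info prints a "Runtimes:" line listing the registered runtimes; has_nvidia_runtime is a hypothetical helper name, and the sample line is only an illustration of that output.

```shell
# Hypothetical helper: read `docker info` output on stdin and succeed
# if the "Runtimes:" line lists nvidia.
has_nvidia_runtime() {
    grep -Eq '^ *Runtimes:.* nvidia( |$)'
}

# Sample line for illustration only. On a real host, use:
#   docker info | has_nvidia_runtime
if echo ' Runtimes: io.containerd.runc.v2 nvidia runc' | has_nvidia_runtime; then
    echo "nvidia runtime registered"
fi
```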
Install InfiniBand drivers¶
We recommend sticking to these steps to ensure consistency across different servers.
- Download the latest LTS drivers (either ISO or TGZ is fine, since you'll extract them anyway) and extract them.
  Do not run the install script; it WILL break your environment.
  During ASC '24, we found that running the installer script mlnxofedinstall would remove (and purge) some of Intel oneAPI's packages. Notably, this includes the entire intel-hpckit and intel-basekit. If that happens, inspect /var/log/apt/history.log to see what was removed and reinstall those packages.
- Move mlnx-ofed-keyring.gpg to /usr/share/keyrings/ and move the DEBS directory to /usr/local/share/mellanox.
  You should have /usr/local/share/mellanox/Release among other files, NOT /usr/local/share/mellanox/DEBS/Release.
- Edit APT sources, in /etc/apt/sources.list.d/mellanox.list:
  deb [signed-by=/usr/share/keyrings/mlnx-ofed-keyring.gpg] file:/usr/local/share/mellanox ./
- Run apt update and then install mlnx-ofed-kernel-dkms and mlnx-nfsrdma-dkms.
- Reboot.
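Two of the steps above lend themselves to a quick scripted check: confirming that the repository directory has the expected layout, and listing what APT removed if the installer script was run by mistake. The following is a rough sketch, not a tested tool; both function names are made up, and the second one assumes the usual history.log format in which removals appear on lines starting with Remove: or Purge: as comma-separated "name:arch (version)" entries.

```shell
# Hypothetical helpers for the steps above; the function names are made up.

# Check that the repository was moved to the expected place.
# Prints "ok", "nested" (DEBS ended up inside the directory) or "missing".
check_mlnx_repo() {
    repo="${1:-/usr/local/share/mellanox}"
    if [ -f "$repo/Release" ]; then
        echo "ok"
    elif [ -f "$repo/DEBS/Release" ]; then
        echo "nested"
        return 1
    else
        echo "missing"
        return 1
    fi
}

# List package names found on the Remove:/Purge: lines of an APT history log.
removed_packages() {
    log="${1:-/var/log/apt/history.log}"
    grep -E '^(Remove|Purge): ' "$log" |
        sed -E 's/^(Remove|Purge): //; s/\), /)|/g' |
        tr '|' '\n' |
        sed -E 's/(:[a-z0-9]+)? \(.*$//' |
        sort -u
}
```

Called without arguments, check_mlnx_repo inspects /usr/local/share/mellanox and removed_packages reads /var/log/apt/history.log. The output of the latter covers the whole log, so review it before passing anything to apt install.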
Mellanox Firmware
There's one important thing that the installer script mlnxofedinstall does: updating the NIC firmware if an update is available.
During ASC '24 we ran into an arcane issue with the IB cards, which was eventually resolved by resorting to the installer script, only to discover that updating the firmware was the key.
In that case, installing mlnx-fw-updater and running /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl should do the trick.
Network¶
Since we use VM Gateway to access the Internet, we need to bridge the local interface to that VLAN using VXLAN. This is easy to do with netplan on Ubuntu 22.04 or newer.
network:
  tunnels:
    vxlan0:
      mode: vxlan
      id: 1
      link: ibs21
      remote: 239.1.1.1
      addresses:
        - 192.0.0.xx/24 # For easier management, the host address should be the same as ib
      nameservers:
        addresses:
          - 192.0.0.2
      routes:
        - to: 0.0.0.0/0
          via: 192.0.0.2
          table: 253
          metric: 50
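The xx placeholder in the address has to be filled in per host. Reading the comment as "reuse the host part of the IB address", a small sketch for deriving it could look like this; vxlan_addr is a hypothetical helper, and 10.0.0.17 is a made-up IB address used only for illustration.

```shell
# Hypothetical helper: reuse the last octet of the IB address for vxlan0.
vxlan_addr() {
    ib_addr="${1%%/*}"        # IPv4 address of the IB interface; drop any /prefix
    host="${ib_addr##*.}"     # keep everything after the last dot
    echo "192.0.0.${host}/24"
}

# 10.0.0.17 is a made-up IB address.
vxlan_addr 10.0.0.17
```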
On Ubuntu 20.04, however, we have to use systemd-networkd to implement the same configuration. There are 3 files: vxlan0.netdev, vxlan0.network, and 10-netplan-ibs1.network.d/vxlan.conf.
vxlan0.netdev:
[NetDev]
Name=vxlan0
Kind=vxlan
[VXLAN]
VNI=1
Group=239.1.1.1

vxlan0.network:
[Match]
Name=vxlan0
[Network]
LinkLocalAddressing=ipv6
Address=192.0.0.xx/24
ConfigureWithoutCarrier=yes
[Route]
Destination=0.0.0.0/0
Gateway=192.0.0.2
Metric=50
Table=253

10-netplan-ibs1.network.d/vxlan.conf:
[Network]
VXLAN=vxlan0