Setup Guide

Install CUDA without a full desktop environment

Newer Ubuntu provides pre-compiled NVIDIA drivers

For Ubuntu 22.04 LTS or newer, you can replace nvidia-headless-470 with linux-modules-nvidia-470-generic (or, if the server is running an HWE kernel, linux-modules-nvidia-470-generic-hwe-22.04). This saves the time and CPU otherwise spent compiling the driver.

Follow the official installation guide, but stop before apt install cuda, and do this instead:

apt install --no-install-recommends nvidia-cuda-toolkit nvidia-headless-470 nvidia-utils-470

Replace 470 with the version you want to install.
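
The package list depends only on the driver branch and on whether the precompiled modules are used, so it can be generated instead of edited by hand. A minimal sketch: the function name cuda_headless_packages and the dkms/precompiled keywords are illustrative and not part of any NVIDIA or Ubuntu tooling.

```shell
# Print the apt package list for a headless CUDA install.
#   $1: driver branch, e.g. 470
#   $2: "dkms" (default, compiles the module locally) or
#       "precompiled" (Ubuntu 22.04+ running the stock -generic kernel)
cuda_headless_packages() {
    version="$1"
    mode="${2:-dkms}"
    case "$mode" in
        dkms)        driver="nvidia-headless-$version" ;;
        precompiled) driver="linux-modules-nvidia-$version-generic" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "nvidia-cuda-toolkit $driver nvidia-utils-$version"
}

# Example (run as root):
#   apt install --no-install-recommends $(cuda_headless_packages 470)
```

Keeping the branch number in one place avoids mixing driver and utility packages from different branches.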

Note

R470 is the last version to support Kepler GPUs (GTX 600, GTX 700, Quadro K, Tesla K).

Install NVIDIA container runtime

Add repository key:

wget -O /etc/apt/trusted.gpg.d/nvidia-container-runtime.asc https://nvidia.github.io/nvidia-docker/gpgkey

Add repository:

cat > /etc/apt/sources.list.d/nvidia-container-runtime.list << 'EOF'
deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
EOF

Note

While the URL indicates ubuntu18.04, the same repository is in fact shared across Ubuntu 20.04 and 22.04.

Update and install:

apt update
apt install nvidia-container-runtime

Configure Docker to use NVIDIA container runtime:

/etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Restart Docker service and we're done:

systemctl restart docker
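
To confirm that Docker picked up the new runtime after the restart, the "Runtimes:" line of the plain docker info output can be checked. A minimal sketch: the helper name has_nvidia_runtime is illustrative, and it assumes the human-readable docker info format.

```shell
# Read `docker info` output on stdin and succeed if the "Runtimes:" line
# lists the nvidia runtime.
has_nvidia_runtime() {
    grep -E '^[[:space:]]*Runtimes:' | grep -qw nvidia
}

# Example:
#   docker info | has_nvidia_runtime && echo "nvidia runtime registered"
```

If the check passes, a container started with docker run --runtime=nvidia should use the NVIDIA runtime.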

Install InfiniBand drivers

We recommend sticking to these steps to ensure consistency across different servers.

  1. Download the latest LTS drivers (either ISO or TGZ is fine, since you'll extract them anyway) and extract them.

    Do not run the install script. It WILL break your environment.

    During ASC '24, we found that running the installer script mlnxofedinstall removes (and purges) some of Intel oneAPI's packages. Notably, this includes the entire intel-hpckit and intel-basekit.

    If that happens, inspect /var/log/apt/history.log to see what was removed and reinstall those packages.

  2. Move mlnx-ofed-keyring.gpg to /usr/share/keyrings/ and move the DEBS directory to /usr/local/share/mellanox.

    You should have /usr/local/share/mellanox/Release among other files, NOT /usr/local/share/mellanox/DEBS/Release.

  3. Edit APT sources:

    /etc/apt/sources.list.d/mellanox.list
    deb [signed-by=/usr/share/keyrings/mlnx-ofed-keyring.gpg] file:/usr/local/share/mellanox ./
    
  4. Run apt update and then install mlnx-ofed-kernel-dkms and mlnx-nfsrdma-dkms.

  5. Reboot.
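
Two of the checks in the steps above can be scripted: listing what apt removed (step 1) and verifying the repository layout (step 2). A minimal sketch, assuming the default paths used in this guide. The function names removed_packages and check_mlnx_repo are illustrative.

```shell
# List the package names that apt removed or purged, one per line.
#   $1: apt history log (default: /var/log/apt/history.log)
removed_packages() {
    awk '/^(Remove|Purge): / {
        sub(/^[A-Za-z]+: /, "")                 # drop the "Remove: " prefix
        n = split($0, entries, /\), /)          # one entry per package
        for (i = 1; i <= n; i++) {
            split(entries[i], fields, /[: ]/)   # name ends at ":" or " "
            print fields[1]
        }
    }' "${1:-/var/log/apt/history.log}"
}

# Succeed only if the Release file sits at the top of the repository
# directory and not inside a nested DEBS directory.
#   $1: repository directory (default: /usr/local/share/mellanox)
check_mlnx_repo() {
    repo="${1:-/usr/local/share/mellanox}"
    [ -f "$repo/Release" ] && [ ! -e "$repo/DEBS/Release" ]
}
```

The output of removed_packages also covers removals from unrelated apt runs, so review the list before passing it to apt install.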

Mellanox Firmware

There's one important thing that the installer script mlnxofedinstall does: it updates the NIC firmware when an update is available.

During ASC '24 we ran into an arcane issue with IB cards, which was eventually resolved by resorting to the installer script, only to discover that updating the firmware was the key.

If you suspect outdated firmware, installing mlnx-fw-updater and running /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl should do the trick.
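
To see whether the updater changed anything, compare the firmware version reported by ibv_devinfo before and after the update. A minimal sketch: the helper name fw_versions is illustrative, and it relies on the hca_id: and fw_ver: fields of the ibv_devinfo output.

```shell
# Read `ibv_devinfo` output on stdin and print "<device> <firmware version>"
# for every adapter.
fw_versions() {
    awk '$1 == "hca_id:" { dev = $2 }
         $1 == "fw_ver:" { print dev, $2 }'
}

# Example:
#   ibv_devinfo | fw_versions
```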

Network

Since we use a VM gateway to access the Internet, we need to bridge the local interface to that VLAN using VXLAN. This can be done easily with netplan on Ubuntu 22.04 or newer.

/etc/netplan/acsa.yaml
network:
  tunnels:
    vxlan0:
      mode: vxlan
      id: 1
      link: ibs21
      remote: 239.1.1.1
      addresses:
        - 192.0.0.xx/24 # For easier management, the host address should be the same as ib
      nameservers:
        addresses:
          - 192.0.0.2
      routes:
        - to: 0.0.0.0/0
          via: 192.0.0.2
          table: 253
          metric: 50
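
The xx placeholder in the address can be derived from the host's InfiniBand address. A minimal sketch, assuming "the same as ib" means reusing the last octet of the IB address. The function name vxlan_address and the sample address 10.0.0.57 are hypothetical.

```shell
# Derive the VXLAN address by reusing the last octet of the IB address
# inside 192.0.0.0/24.
#   $1: IPv4 address of the IB interface, with or without a /prefix
vxlan_address() {
    ip="${1%%/*}"                  # drop an optional /prefix
    echo "192.0.0.${ip##*.}/24"    # keep only the last octet
}

# Example:
#   vxlan_address 10.0.0.57/24
```

After editing the file, netplan try applies the configuration with an automatic rollback, and ip route show table 253 should list the default route via 192.0.0.2.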

On 20.04, we have to configure systemd-networkd directly instead. There are three files: vxlan0.netdev, vxlan0.network, and 10-netplan-ibs1.network.d/vxlan.conf.

vxlan0.netdev
[NetDev]
Name=vxlan0
Kind=vxlan

[VXLAN]
VNI=1
Group=239.1.1.1

vxlan0.network
[Match]
Name=vxlan0

[Network]
LinkLocalAddressing=ipv6
Address=192.0.0.xx/24
ConfigureWithoutCarrier=yes

[Route]
Destination=0.0.0.0/0
Gateway=192.0.0.2
Metric=50
Table=253

10-netplan-ibs1.network.d/vxlan.conf
[Network]
VXLAN=vxlan0
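
The three files differ between servers only in the host address, so they can be generated. A minimal sketch: the function name write_vxlan_units is illustrative, and /etc/systemd/network as the target directory is an assumption, since this guide does not say where the files live.

```shell
# Write the three systemd-networkd files for the VXLAN setup.
#   $1: host octet (the xx in 192.0.0.xx)
#   $2: target directory (assumed default: /etc/systemd/network)
write_vxlan_units() {
    host="$1"
    dir="${2:-/etc/systemd/network}"
    mkdir -p "$dir/10-netplan-ibs1.network.d" || return 1

    cat > "$dir/vxlan0.netdev" << 'EOF'
[NetDev]
Name=vxlan0
Kind=vxlan

[VXLAN]
VNI=1
Group=239.1.1.1
EOF

    # Unquoted delimiter: $host is expanded into the address.
    cat > "$dir/vxlan0.network" << EOF
[Match]
Name=vxlan0

[Network]
LinkLocalAddressing=ipv6
Address=192.0.0.$host/24
ConfigureWithoutCarrier=yes

[Route]
Destination=0.0.0.0/0
Gateway=192.0.0.2
Metric=50
Table=253
EOF

    # Drop-in that attaches vxlan0 to the netplan-generated ibs1 network.
    cat > "$dir/10-netplan-ibs1.network.d/vxlan.conf" << 'EOF'
[Network]
VXLAN=vxlan0
EOF
}

# Example (run as root):
#   write_vxlan_units 57 && systemctl restart systemd-networkd
```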