NFS Server

  • DNS: nfs.acsalab.com
  • Operating System: Proxmox VE 7 (Debian 11)

Server notes

The perpetual iDRAC 9 Enterprise License was purchased from Taobao on March 13, 2024 for ¥60, so we can use the remote console feature (finally!). A copy of the license file is included in this documentation in case of future needs.

To allow Telegraf to collect sensor data via IPMI, the telegraf user is added to the sys group, and a corresponding udev rule is created following the Telegraf documentation:

/etc/udev/rules.d/52-telegraf-ipmi.rules
KERNEL=="ipmi*", MODE="660", GROUP="sys"
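
The corresponding setup commands look roughly like this (adding the group membership and reloading udev rules):

usermod -aG sys telegraf
udevadm control --reload-rules && udevadm trigger
systemctl restart telegraf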

ZFS

We previously used the disk controller's built-in RAID capabilities, but they offer only very basic features and are not a good fit for our use case. We have since switched the controller to HBA mode and use ZFS as a feature-rich, all-in-one solution for RAID + LVM + caching.

RAID

We chose RAID-10 over RAID-6 for performance reasons (see this article). In ZFS this is accomplished with a pool of mirror vdevs:

zpool create rpool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh

This way /dev/sdc and /dev/sdd are mirrored as one vdev (RAID-1), and three such vdevs are striped together to form one large pool (RAID-0, with minor differences).
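
Should more disks arrive later, the pool can be grown the same way, one mirror pair at a time (the device names below are placeholders):

zpool add rpool mirror /dev/sdi /dev/sdj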

Caching

By default, ZFS uses up to half of the system's total memory as an Adaptive Replacement Cache (ARC). ARC statistics may be viewed with the arc_summary command.

ARC memory usage can be configured through the ZFS module parameters zfs_arc_min and zfs_arc_max. Since the NFS server has plenty of unused memory, we set the ARC range from 4 GB to 80 GB. To persist these settings we use a modprobe config file.

/etc/modprobe.d/zfs.conf
options zfs zfs_dmu_offset_next_sync=1

options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=85899345920

options zfs l2arc_noprefetch=0
options zfs l2arc_headroom=8
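
The modprobe file only takes effect when the module is (re)loaded; the same values can also be applied at runtime via sysfs (both values are in bytes: 4 GiB = 4294967296, 80 GiB = 85899345920):

echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min
echo 85899345920 > /sys/module/zfs/parameters/zfs_arc_max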

ZFS also supports adding extra fast disks (SSDs) as a read cache, called the Level 2 ARC (L2ARC); write caching (write buffering) is a separate topic, covered next. To add a cache device, use zpool add rpool cache /dev/sdb.

The closest thing to a write cache is a separate log device (SLOG), which holds the ZFS Intent Log (ZIL) for synchronous writes. It doesn't need to be large, so we use part of the system disk's LVM space for this purpose.
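
As a sketch, such a SLOG could be carved out of the pve volume group like this (the LV name and size are illustrative, not necessarily what is configured here):

lvcreate -L 16G -n slog pve
zpool add rpool log /dev/pve/slog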

Compression

ZFS supports transparent compression on volumes and datasets. We apply a moderate level of compression on the NFS-shared home directories.

zfs set compression=zstd rpool/home

According to a preliminary test, Zstd level 6 provides a decent compression ratio while remaining I/O-bound. Higher levels like Zstd level 8 can become CPU-bound under particularly heavy writes for only a slightly better ratio, so we chose Zstd-6 for the home directories.
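
An explicit level is selected with the zstd-N syntax, and the achieved ratio can be checked per dataset afterwards:

zfs set compression=zstd-6 rpool/home
zfs get compression,compressratio rpool/home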

According to page 7 of this slide deck, low Zstd levels make little difference in compression ratio but differ vastly in performance, so we use the default compression level (compression=zstd, equivalent to level 3) for our data.

The common compression=on setting uses LZ4, an older algorithm that compresses less effectively than Zstd but is still very fast.

Snapshots

ZFS supports instant snapshots. We use this feature to protect against accidental data loss. Snapshots are read-only and can be accessed through a hidden .zfs directory at the root of each dataset. For example:

ibug@snode0:~$ ls /staff/ibug/.zfs/snapshot/
20240408  20240412  20240415  20240419  20240422  20240426  20240429

Files under these snapshot directories are read-only and can be copied back to their original location.
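
Restoring is therefore just a copy, for example (the file name is made up):

cp -a /staff/ibug/.zfs/snapshot/20240429/report.pdf /staff/ibug/report.pdf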

We take a snapshot of the entire home directory twice a week (at 5:17 AM every Monday and Friday) and keep the latest 7 snapshots. This is done with cron and a custom script.

#!/bin/sh

DATASET=rpool/home
DATE=$(date +%Y%m%d)

SNAPSHOT="$DATASET@$DATE"
if zfs list "$SNAPSHOT" >/dev/null 2>&1; then
  echo "Already taken snapshot today"
  exit 0
fi

zfs snapshot -r "$SNAPSHOT"

# retention: drop the header line, then destroy everything except the
# newest $RETENTION snapshots (the YYYYMMDD names sort oldest first)
RETENTION=7
zfs list -t snapshot "$DATASET" |
  tail -n +2 | head -n -$RETENTION | awk '{print $1}' |
  xargs -rn 1 zfs destroy -rv
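
The schedule lives in a cron entry along these lines (the script path is an assumption):

/etc/cron.d/zfs-snapshot
17 5 * * 1,5 root /usr/local/sbin/zfs-snapshot-home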

Extra: MegaCli

Outdated information

MegaCli is most useful when using the disk controller in RAID mode. In HBA mode, we can access individual disks directly, and smartctl provides more details on disks.

Download the ZIP from here. There's an RPM package under the Linux directory. Use rpm2cpio to convert it to a CPIO archive, then cpio -idv < package.cpio to extract it.

The package contains 3 files under /opt/MegaRAID/MegaCli/. Besides the bundled files, libncurses5 is required (install it from apt). You can symlink MegaCli64 into /usr/local/sbin to save typing.
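
Putting the above together, the procedure is roughly as follows (the RPM file name and location are placeholders):

# extraction is relative to the current directory, so run it from /
cd /
rpm2cpio /root/MegaCli-8.07.14-1.noarch.rpm > MegaCli.cpio
cpio -idv < MegaCli.cpio
apt install libncurses5
ln -s /opt/MegaRAID/MegaCli/MegaCli64 /usr/local/sbin/MegaCli64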

Usage:

  • All information: MegaCli64 -AdpAllInfo -aAll
  • Physical disk information: MegaCli64 -PdList -aAll
  • Logical disk and physical disk information: MegaCli64 -LdPdInfo -aAll

NFS Tuning

To allow better read/write concurrency, we set the NFS server to use 256 total threads (default 16).

/etc/default/nfs-kernel-server
RPCNFSDCOUNT=256
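
After restarting nfs-kernel-server, the live thread count can be read back from the nfsd procfs interface:

systemctl restart nfs-kernel-server
cat /proc/fs/nfsd/threads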

Networking

Proxmox VE uses Debian's ifupdown config system, but ships with an improved version: ifupdown2.

We also use this server as a "gateway" to provide internet access to other hosts, so we need to move the rule for the main routing table from priority 32766 to priority 2, hence the settings on interface lo.

auto lo
iface lo inet loopback
        up ip rule add table main pref 2 || true
        up ip rule delete table main pref 32766 || true

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto vmbr0
iface vmbr0 inet static
        address 222.195.72.127/24
        gateway 222.195.72.254
        bridge-ports eno1 eno2
        bridge-stp off
        bridge-fd 0
iface vmbr0 inet6 static
        address 2001:da8:d800:112::127/64
        gateway 2001:da8:d800:112::1

Because InfiniBand interfaces cannot be bridged, we use the IB interface directly. IB also needs a subnet manager (opensmd) to work correctly, so we add a pre-up line to start the service.

auto ibp175s0
iface ibp175s0 inet static
        pre-up systemctl start opensmd
        address 10.1.13.1/24
        mtu 2044
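
After an ifreload -a, the interface and the IB link state can be checked (ibstat comes from the infiniband-diags package):

ip -br addr show ibp175s0
ibstat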

Mounting NFS over RDMA

nfs-kernel-server does not serve NFS over RDMA by default, and RDMA connections cannot share a port with the existing TCP/UDP listeners, so we use port 2050 for RDMA.

To enable NFS over RDMA, we edit /etc/nfs.conf and uncomment the two lines containing rdma:

rdma=y
rdma-port=2050

On NFS clients, load the xprtrdma kernel module and add the mount options proto=rdma,port=2050 to the NFS mountpoint.
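
For example, on a client (the export path and mountpoint are placeholders):

modprobe xprtrdma
mount -t nfs -o proto=rdma,port=2050 nfs.acsalab.com:/rpool/home /home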

Error: could not insert 'rpcrdma': invalid argument

Just install the mlnx-nfsrdma-dkms package. No reboot needed.

Routing service for other hosts

See /etc/wireguard/wg0.conf for details.

iptables fails randomly

For unknown reasons, Proxmox VE switches back to iptables-legacy randomly, which causes the routing service to fail. A temporary fix is to switch back to iptables-nft and restart the service.

update-alternatives --set iptables /usr/sbin/iptables-nft
update-alternatives --set ip6tables /usr/sbin/ip6tables-nft
systemctl restart iptables

We've deployed a permanent fix

By running the update-alternatives commands before the iptables service starts, we ensure that the correct iptables variant is used.

systemctl edit iptables.service
[Service]
ExecStartPre=-/usr/bin/update-alternatives --set iptables /usr/sbin/iptables-nft
ExecStartPre=-/usr/bin/update-alternatives --set ip6tables /usr/sbin/ip6tables-nft
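
Whether the correct variant is active can be checked at any time:

iptables -V   # should print "(nf_tables)", not "(legacy)"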

We also mask pve-firewall to prevent it from interfering with our setup.

systemctl mask pve-firewall.service

Records

2024-03-24 SSD replacement

  • Power off the server
  • Insert the new Intel Optane 900P 280G, with its half-height bracket pre-installed
  • Power on the server, but boot PXE instead. Select Arch Linux to get a shell
    • e2fsck -f /dev/pve/root
    • resize2fs -p /dev/pve/root 14G
    • lvreduce -L 16G pve/root
    • resize2fs -p /dev/pve/root
  • Now reboot into the OS and prepare the new disk
    • fdisk /dev/nvme0n1
      • g to create a new GPT partition table
      • n, Enter, Enter, +100M to create a 100M EFI partition
      • t, uefi to set the partition type
      • n, Enter, Enter, Enter to create a new partition with the rest of the space
      • t, Enter, lvm to set the partition type to LVM
      • w save and exit
    • Configure the new EFI system partition:
      • mkfs.vfat /dev/nvme0n1p1
      • vim /etc/fstab and change the UUID of /boot/efi to the new one
      • umount /boot/efi, mount /boot/efi
      • grub-install --target=x86_64-efi to install GRUB
    • Migrate the root partition to the new disk:
      • pvcreate /dev/nvme0n1p2
      • vgextend pve /dev/nvme0n1p2
      • pvmove /dev/sdb2 /dev/nvme0n1p2 - this one takes some time, but with SSDs on both ends it should be fast
      • vgreduce pve /dev/sdb2
      • (Optional) pvremove /dev/sdb2
      • (Optional) blkdiscard -f /dev/sdb
  • Pull out the old disk from the server
  • Insert the two 18 TB WD Gold disks
  • Add them to the ZFS pool:
    • zpool add rpool mirror sda sdb