NFS Server¶
- DNS: nfs.acsalab.com
- Operating System: Proxmox VE 7 (Debian 11)
Server notes¶
The perpetual iDRAC 9 Enterprise License was purchased from Taobao on March 13, 2024 for ¥60, so we can use the remote console feature (finally!). A copy of the license file is included in this documentation in case of future needs.
In order to allow Telegraf to collect sensor data from IPMI, the telegraf user is added to group sys, and a corresponding udev rule is created following the Telegraf documentation:
KERNEL=="ipmi*", MODE="660", GROUP="sys"
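A minimal sketch of applying both changes without a reboot (the rule file name is an assumption):
usermod -aG sys telegraf                                    # add telegraf to the sys group
echo 'KERNEL=="ipmi*", MODE="660", GROUP="sys"' > /etc/udev/rules.d/52-telegraf-ipmi.rules
udevadm control --reload-rules && udevadm trigger           # re-apply udev rules to existing devices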
ZFS¶
While we previously used the disk controller's built-in RAID capabilities, it offers very basic features and is not a good fit for our use case. We have now changed that controller into HBA mode and use ZFS as a feature-rich, all-in-one solution for RAID + LVM + caching.
RAID¶
We choose RAID-10 over RAID-6 for performance reasons (see this article). In ZFS this is accomplished with a set of mirror vdevs:
zpool create rpool mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh
This way /dev/sdc and /dev/sdd are mirrored together as one vdev (RAID-1), and three vdevs like this are added together to form a large pool (RAID-0, with minor differences).
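The resulting layout can be verified with zpool status, which lists each mirror vdev together with its member disks:
zpool status rpool     # shows mirror-0, mirror-1, mirror-2 and their member disks
zpool list -v rpool    # capacity and usage per vdev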
Caching¶
By default, ZFS uses up to half the system's total memory as an Adaptive Replacement Cache (ARC). ARC statistics may be viewed using the arc_summary command.
ARC memory usage can be configured through the ZFS module parameters zfs_arc_min and zfs_arc_max. Since the NFS server has a lot of unused memory, we set the ARC range from 4 GB to 80 GB. To persist the settings we use a modprobe config file:
options zfs zfs_dmu_offset_next_sync=1
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=85899345920
options zfs l2arc_noprefetch=0
options zfs l2arc_headroom=8
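The same values can also be applied at runtime through the module's parameters in sysfs; if ZFS is loaded from the initramfs, the initramfs should be rebuilt so the options take effect at boot as well.
echo 4294967296  > /sys/module/zfs/parameters/zfs_arc_min   # 4 GiB, applied immediately
echo 85899345920 > /sys/module/zfs/parameters/zfs_arc_max   # 80 GiB, applied immediately
update-initramfs -u                                         # include the modprobe options at boot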
ZFS also supports adding extra fast disks (SSDs) as read caches (write cache / write buffer is another topic), called Level-2 ARC (L2ARC). To add a cache device, use zpool add rpool cache /dev/sdb.
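Once added, the cache device and its usage show up in zpool iostat -v and in the L2ARC section of arc_summary:
zpool iostat -v rpool 5          # per-vdev I/O, including the cache device, every 5 seconds
arc_summary | grep -A 10 L2ARC   # L2ARC size and hit statistics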
Write buffering is handled by a "separate log" (SLOG) device, which absorbs synchronous writes. It doesn't need to be large in size, so we use part of the system disk's LVM space for this purpose.
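As a rough sketch, assuming a 16 GB LV named slog in the pve volume group (both the name and the size are assumptions):
lvcreate -L 16G -n slog pve          # small LV on the system disk (name and size assumed)
zpool add rpool log /dev/pve/slog    # attach it to the pool as the separate log device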
Compression¶
ZFS supports transparent compression on volumes and datasets. We apply a moderate level of compression on the NFS-shared home directories.
zfs set compression=zstd rpool/home
According to a preliminary test, Zstd level 6 provides a decent compression ratio while still being I/O-bound. Higher levels like Zstd level 8 may become CPU-bound under particularly heavy writes, with only a slightly better compression ratio, so we choose Zstd-6 for our home directories.
According to page 7 of this slide deck, low Zstd levels make little difference in terms of compression ratio, but differ vastly in performance. So we use the default compression level (compression=zstd, equivalent to level 3) for our data.
The common compression=on setting uses an old algorithm, LZ4, which is no longer considered modern, though still fast.
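An explicit level is set the same way, and the achieved ratio can be checked per dataset afterwards:
zfs set compression=zstd-6 rpool/home          # explicit Zstd level 6 for the home directories
zfs get compression,compressratio rpool/home   # inspect the achieved compression ratio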
Snapshots¶
ZFS supports instant snapshots. We use this feature to protect against accidental data loss. Snapshots are read-only and can be accessed through a hidden .zfs directory at the root of each dataset. For example:
ibug@snode0:~$ ls /staff/ibug/.zfs/snapshot/
20240408 20240412 20240415 20240419 20240422 20240426 20240429
Files under these snapshot directories are read-only, but can be copied back to their original location.
We take a snapshot of the entire home directory twice a week (at 5:17 AM every Monday and Friday) and keep 7 snapshots. This is done with cron and a custom script.
#!/bin/sh
DATASET=rpool/home
DATE=$(date +%Y%m%d)
SNAPSHOT="$DATASET@$DATE"
if zfs list "$SNAPSHOT" >/dev/null 2>&1; then
echo "Already taken snapshot today"
exit 0
fi
zfs snapshot -r "$SNAPSHOT"
# retain latest snapshots
RETENTION=7
zfs list -t snapshot "$DATASET" |
tail -n +2 | head -n -$RETENTION | awk '{print $1}' |
xargs -rn 1 zfs destroy -rv
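A matching cron entry could look like the following (the file and script paths are assumptions):
# /etc/cron.d/zfs-snapshot -- 5:17 AM every Monday and Friday (paths assumed)
17 5 * * 1,5  root  /usr/local/sbin/zfs-snapshot.sh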
Extra: MegaCli¶
Outdated information
MegaCli is most useful when using the disk controller in RAID mode. In HBA mode, we can access individual disks directly, and smartctl provides more details on disks.
Download the ZIP from here. There's an RPM package under the Linux directory. Use rpm2cpio to convert it to a CPIO archive, then cpio -idv < package.cpio to extract.
The package contains 3 files under /opt/MegaRAID/MegaCli/. Other than the bundled files, libncurses5 is required (install from apt). You can symlink MegaCli64 into /usr/local/sbin to save some typing.
Usage:
- All information: MegaCli64 -AdpAllInfo -aAll
- Physical disk information: MegaCli64 -PdList -aAll
- Logical disk and physical disk information: MegaCli64 -LdPdInfo -aAll
NFS Tuning¶
To allow better read/write concurrency, we set the NFS server to use 256 total threads (default 16).
RPCNFSDCOUNT=256
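On Debian this setting lives in /etc/default/nfs-kernel-server. After restarting the service, the active thread count can be checked through procfs:
grep RPCNFSDCOUNT /etc/default/nfs-kernel-server
systemctl restart nfs-kernel-server
cat /proc/fs/nfsd/threads    # should print 256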
Networking¶
Proxmox VE uses Debian's ifupdown config system, but ships with an improved version: ifupdown2.
We also use the NFS server as a "gateway" to provide internet access to other hosts, so we need to move the default main routing rule from priority 32766 to priority 2, hence the settings on interface lo.
auto lo
iface lo inet loopback
up ip rule add table main pref 2 || true
up ip rule delete table main pref 32766 || true
auto eno1
iface eno1 inet manual
auto eno2
iface eno2 inet manual
auto vmbr0
iface vmbr0 inet static
address 222.195.72.127/24
gateway 222.195.72.254
bridge-ports eno1 eno2
bridge-stp off
bridge-fd 0
iface vmbr0 inet6 static
address 2001:da8:d800:112::127/64
gateway 2001:da8:d800:112::1
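After re-applying the configuration, ip rule should list the main table at preference 2 instead of 32766:
ifreload -a    # ifupdown2's command to re-apply /etc/network/interfaces
ip rule show   # the main table should now appear at pref 2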
Because InfiniBand cannot be bridged, we use it directly. IB also needs opensmd to work correctly, so we add a pre-up line to start the service.
auto ibp175s0
iface ibp175s0 inet static
pre-up systemctl start opensmd
address 10.1.13.1/24
mtu 2044
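With opensmd running, the link can be checked with ibstat from the infiniband-diags package (assuming it is installed):
systemctl status opensmd    # the subnet manager must be active
ibstat                      # port state should report Active once the subnet manager is up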
Mounting NFS over RDMA¶
The nfs-kernel-server by default doesn't export NFS over RDMA, and RDMA connections cannot live on the same port as the existing TCP/UDP service, so we use port 2050 for RDMA.
To enable NFS over RDMA, we edit /etc/nfs.conf and uncomment the two lines containing rdma:
rdma=y
rdma-port=2050
On NFS clients, load the xprtrdma kernel module and add the mount options proto=rdma,port=2050 to the NFS mountpoint.
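A client-side mount could then look like the following; the export path and mount point here are assumptions.
modprobe xprtrdma
mount -t nfs -o proto=rdma,port=2050 10.1.13.1:/rpool/home /home    # export path and mount point assumed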
Error: could not insert 'rpcrdma': invalid argument
Just install the mlnx-nfsrdma-dkms package. No reboot needed.
Routing service for other hosts¶
See /etc/wireguard/wg0.conf
for details.
iptables fails randomly
For unknown reasons, Proxmox VE randomly switches back to iptables-legacy, which causes the routing service to fail. A temporary fix is to switch back to iptables-nft and restart the service.
update-alternatives --set iptables /usr/sbin/iptables-nft
update-alternatives --set ip6tables /usr/sbin/ip6tables-nft
systemctl restart iptables
We've deployed a permanent fix
By inserting the update-alternatives commands before the iptables service starts, we can ensure that the correct iptables is used.
[Service]
ExecStartPre=-/usr/bin/update-alternatives --set iptables /usr/sbin/iptables-nft
ExecStartPre=-/usr/bin/update-alternatives --set ip6tables /usr/sbin/ip6tables-nft
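The drop-in can be installed with systemctl edit against the unit that restores the rules (iptables.service here, matching the restart command above):
systemctl edit iptables.service      # paste the [Service] snippet above into the override
systemctl restart iptables.service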
We also mask pve-firewall to prevent it from interfering with our setup.
systemctl mask pve-firewall.service
Records¶
2024-03-24 SSD replacement¶
- Power off the server
- Insert the new Intel Optane 900P 280G, with its half-height bracket pre-installed
- Power on the server, but boot via PXE instead. Select Arch Linux to get a shell and shrink the root filesystem:
e2fsck -f /dev/pve/root
resize2fs -p /dev/pve/root 14G
lvreduce -L 16G pve/root
resize2fs -p /dev/pve/root
- Now reboot into the OS and prepare the new disk
fdisk /dev/nvme0n1
g to create a new GPT partition table
n, Enter, Enter, +100M to create a 100M EFI partition
t, uefi to set the partition type
n, Enter, Enter, Enter to create a new partition with the rest of the space
t, Enter, lvm to set the partition type to LVM
w to save and exit
- Configure the new EFI system partition:
mkfs.vfat /dev/nvme0n1p1
vim /etc/fstab and change the UUID of /boot/efi to the new one
umount /boot/efi, mount /boot/efi
grub-install --target=x86_64-efi to install GRUB
- Migrate the root partition to the new disk:
pvcreate /dev/nvme0n1p2
vgextend pve /dev/nvme0n1p2
pvmove /dev/sdb2 /dev/nvme0n1p2 - this one takes some time, but with SSD it should be fast
vgreduce pve /dev/sdb2
- (Optional) pvremove /dev/sdb2
- (Optional) blkdiscard -f /dev/sdb
- Pull out the old disk from the server
- Insert the two 18 TB WD Gold disks
- Add them to the ZFS pool:
zpool add rpool mirror sda sdb