ASC Resources

Power monitoring

Because ASC typically sets a power limit of 3,000 W, we need to monitor our servers' power draw during training and testing. We do this with the components described below.

Dashboard: https://monitor.acsalab.com/d/asc

Database and Visualization

We reuse our existing monitoring infrastructure, namely InfluxDB and Grafana, for this part.

We create a separate user asc and a separate database asc in InfluxDB, and a dedicated Grafana dashboard named ASC.

IB switch adjustment

In addition to the five compute servers, the APC PDU also powers an InfiniBand switch. To account for the switch, the "Power" graph has an extra line that adds an adjustment (135 W by default for ASC 2023) to the raw power reading.
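As a rough illustration of the adjustment, the sketch below adds a fixed offset to a raw power reading. The function name adjusted_power and the 2800 W sample reading are hypothetical; only the 135 W default comes from this page, and the dashboard's actual query is not reproduced here.

```shell
# Hypothetical helper: add the IB switch adjustment to a raw power reading.
# $1 = raw power in watts, $2 = adjustment in watts (defaults to 135).
adjusted_power() {
  awk -v raw="$1" -v adj="${2:-135}" 'BEGIN { printf "%.1f\n", raw + adj }'
}

adjusted_power 2800       # made-up reading, default adjustment
adjusted_power 2800 150   # made-up reading, custom adjustment
```

Keeping the offset as an optional argument mirrors the idea of a default that can be changed when the switch or the competition setup differs.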

Whole-server power monitoring

We collect whole-server power data through IPMI using ipmi-sdr, iBug's wrapper daemon for ipmitool.

This daemon can collect IPMI sensor readings from multiple servers at once, so we run a single instance on our monitor VM, inside a Docker container named ipmi-sdr.

/root/ipmi-sdr/Dockerfile
FROM alpine:latest
RUN apk add --no-cache ipmitool

/root/ipmi-sdr/run.sh
#!/bin/sh

NAME=ipmi-sdr
SRC="$(realpath "$(dirname "$0")")"

docker build -t ipmitool --network=host "$SRC"
docker rm -f "$NAME"
docker run -itd --name="$NAME" --restart=always \
  --net=host \
  -w /app \
  -v "$SRC":/app:ro \
  ipmitool \
  /app/ipmi-sdr

For details on how this tool works, consult its README.

APC PDU current reading

Our enterprise-grade server PDU, produced by APC, provides current readings through SNMP. iBug wrote another Go daemon to continuously pull data into InfluxDB: apc-monitor.

Similarly, we run this daemon on our monitor VM, inside a Docker container named apc-monitor, but without a dedicated image (since it doesn't need ipmitool):

/root/apc-monitor/run.sh
#!/bin/sh

NAME=apc-monitor
SRC="$(realpath "$(dirname "$0")")"

docker rm -f "$NAME"
docker run -itd --name="$NAME" --restart=always \
  --net=host \
  -w /app \
  -v "$SRC":/app:ro \
  alpine \
  /app/apc-monitor

The "input current" reading has an SNMP OID of 1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1. To read it one-off from the shell, use snmpget:

$ snmpget -v2c -c public <IPaddress> 1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1
iso.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1 = Gauge32: 137

Note that the value is a 32-bit integer in units of 0.1 A (100 mA), so the above example represents a reading of 13.7 A.
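The unit conversion can be sketched in shell as follows. This is a minimal example that assumes the snmpget output has the same "Gauge32: <value>" shape as the sample above; the function name snmp_line_to_amps is hypothetical and is not part of apc-monitor.

```shell
# Hypothetical helper: read one line of snmpget output on stdin, extract the
# Gauge32 value, and convert it from tenths of an ampere to amperes.
snmp_line_to_amps() {
  awk -F': ' '/Gauge32/ { printf "%.1f\n", $NF / 10 }'
}

# Feed it the sample output line from above.
echo 'iso.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1 = Gauge32: 137' | snmp_line_to_amps
```

In practice the output of the snmpget command shown above could be piped into the function instead of the echoed sample line.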

Voltage

The datacenter's supply voltage is around 235 V. The "APC power" in the Grafana dashboard is calculated by multiplying the PDU's current reading by this voltage.
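A minimal sketch of that calculation, assuming "APC power" is simply the current reading times the nominal voltage. The function name apc_power_watts is hypothetical, and the 235 V default is only the approximate figure quoted above, not a measured value.

```shell
# Hypothetical helper: estimate PDU power in watts from the raw SNMP value.
# $1 = raw reading in tenths of an ampere, $2 = voltage in volts (defaults to 235).
apc_power_watts() {
  awk -v raw="$1" -v volts="${2:-235}" 'BEGIN { printf "%.1f\n", raw * volts / 10 }'
}

apc_power_watts 137   # a raw reading of 137 means 13.7 A
```

For the raw reading of 137 (13.7 A) this prints 3219.5, so a reading like that would already sit above a 3,000 W limit.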

NVIDIA GPU power monitoring

iBug wrote yet another wrapper for the command nvidia-smi dmon -s pm to collect power data from NVIDIA GPUs: nvidia-dmon.

Unlike the other two daemons, this one runs on each compute host, so it is currently deployed on icarus0-4 as the systemd service ibug-nvidia-dmon.service.
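To get a feel for the data that nvidia-smi dmon -s pm produces, the sketch below sums the power column over all GPU rows. It is not how nvidia-dmon itself works; the function name sum_gpu_power and the sample numbers are made up, and the column layout is an assumption that may differ between driver versions.

```shell
# Hypothetical helper: sum the "pwr" column (watts) over GPU rows read from stdin.
# Header lines start with "#"; rows whose power is not numeric (e.g. "-") are skipped.
sum_gpu_power() {
  awk '!/^#/ && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Made-up sample in the layout assumed for `nvidia-smi dmon -s pm`.
sum_gpu_power <<'EOF'
# gpu    pwr  gtemp  mtemp     fb   bar1
# Idx      W      C      C     MB     MB
    0     62     35      -    898      8
    1    250     61      -  30512     12
EOF
```

On a GPU host, a single sample could be fed in with something like nvidia-smi dmon -s pm -c 1 | sum_gpu_power.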

Troubleshooting

Chinese characters under Linux VT

Install fbterm and a suitable font:

apt install fbterm fonts-noto-cjk

Then log in as root and start fbterm:

fbterm -s 16 # 16 is font size, change if you want

Chinese characters are now displayed correctly in any program, e.g. Vim.

Note that fbterm is itself a terminal emulator, so it must be started again after every TTY login.

Debug info

For applications using Intel MPI, set these environment variables to produce debug output:

I_MPI_PLATFORM=auto
I_MPI_DEBUG=10

You may increase I_MPI_DEBUG to 20 if you need more detailed output.
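One way to apply these settings is to export them in the shell that launches the job, so that child processes inherit them. In the sketch below the mpirun command line is a hypothetical placeholder (binary name and rank count are made up) and is left commented out.

```shell
# Export the debug settings so that processes started from this shell inherit them.
export I_MPI_PLATFORM=auto
export I_MPI_DEBUG=10

# Hypothetical launch command; replace with your own binary and rank count:
# mpirun -n 4 ./my_app

# Confirm what the launched job would see.
echo "I_MPI_DEBUG=$I_MPI_DEBUG I_MPI_PLATFORM=$I_MPI_PLATFORM"
```

Exporting the variables affects every later command in the same shell, so unset them or open a fresh shell when the extra output is no longer needed.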