Monitoring infrastructure

Tip

If you're looking for the dashboard, it's at monitor.acsalab.com.

Our monitoring infrastructure uses a typical suite consisting of:

Component | Function              | Where
InfluxDB  | Storing metrics       | monitor VM
Telegraf  | Collecting metrics    | Each monitored machine
Grafana   | Visualization         | monitor VM, inside a Docker container
MariaDB   | Database for Grafana  | monitor VM

Monitor VM

The monitor VM is a Debian VM, with rootfs and InfluxDB data directory on two separate volumes:

  • rootfs: 4 GB
  • InfluxDB data: 32 GB (may increase as needed), mounted on /mnt/data
    • /var/lib/mysql is symlinked to /mnt/data/mysql
    • /var/lib/influxdb is symlinked to /mnt/data/influxdb
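The symlink layout above can be checked with a few lines of shell. This is only a sketch: check_link is a hypothetical helper name, not something installed on the monitor VM, and it assumes readlink is available (it is on Debian).

```shell
#!/bin/sh
# Sketch: confirm that a data directory is a symlink into the data volume.
# check_link is a hypothetical helper, not part of the original setup.

check_link() {
    # usage: check_link LINK EXPECTED_TARGET
    link=$1
    expected=$2
    if [ ! -L "$link" ]; then
        echo "$link is not a symlink" >&2
        return 1
    fi
    actual=$(readlink "$link")
    if [ "$actual" != "$expected" ]; then
        echo "$link points to $actual, expected $expected" >&2
        return 1
    fi
    echo "$link -> $actual"
}

# Example (on the monitor VM):
#   check_link /var/lib/mysql /mnt/data/mysql
#   check_link /var/lib/influxdb /mnt/data/influxdb
```

Running both example calls after a volume resize or migration is a quick way to confirm that neither database has silently fallen back to the 4 GB rootfs.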

Note

The following steps document the installation procedure. They do not need to be repeated, but may be of interest to maintainers.

Install InfluxDB

InfluxDB 1.8 is installed from the official APT repository. Note that we use InfluxDB v1 for better compatibility with other components.

wget -O /etc/apt/trusted.gpg.d/influxdb.asc https://mirrors.ustc.edu.cn/influxdata/influxdata-archive_compat.key
echo "deb https://mirrors.ustc.edu.cn/influxdata/debian bullseye stable" > /etc/apt/sources.list.d/influxdb.list
apt update
apt install --no-install-recommends influxdb

Notes on the commands:

  • apt recognizes GPG keys in armored format (i.e. ASCII), but the file name must end with .asc.
  • At the time of writing, InfluxData has not introduced the bookworm distribution yet, so we use the bullseye repository instead.
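To illustrate the first note, the following sketch checks that a key file has the .asc suffix and actually contains an ASCII armor header. check_apt_key is a hypothetical helper name introduced here; it only inspects the file and does not validate the key itself.

```shell
#!/bin/sh
# Sketch: sanity-check a key file before dropping it into trusted.gpg.d.
# check_apt_key is a hypothetical helper, not part of apt.

check_apt_key() {
    # usage: check_apt_key KEYFILE
    keyfile=$1
    case "$keyfile" in
        *.asc) ;;
        *)
            echo "$keyfile: name must end with .asc for an armored key" >&2
            return 1
            ;;
    esac
    # Armored OpenPGP public keys start with this header line.
    if ! grep -q -- '-----BEGIN PGP PUBLIC KEY BLOCK-----' "$keyfile"; then
        echo "$keyfile: no ASCII armor header found" >&2
        return 1
    fi
    echo "$keyfile: looks like an armored key"
}

# Example:
#   check_apt_key /etc/apt/trusted.gpg.d/influxdb.asc
```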

Then we create the databases and users as needed:

CREATE USER admin WITH PASSWORD 'redacted' WITH ALL PRIVILEGES;

CREATE DATABASE monitor;
CREATE USER telegraf WITH PASSWORD 'redacted';
CREATE USER grafana WITH PASSWORD 'redacted';
GRANT WRITE ON monitor TO telegraf;
GRANT READ ON monitor TO grafana;

Then we enable authentication (which is off by default) by uncommenting the auth-enabled line in the [http] section of /etc/influxdb/influxdb.conf and setting it to auth-enabled = true, followed by systemctl restart influxdb.

In case we need to recover the admin user, change auth-enabled = false in influxdb.conf and restart InfluxDB. Don't forget to re-enable authentication after recovery work.
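Toggling auth-enabled, whether for the initial setup or for admin recovery, can be scripted instead of edited by hand. The sketch below is one possible approach, not the procedure that was actually used; set_auth_enabled is a hypothetical helper, and it assumes the key appears once in the file, either commented out or not.

```shell
#!/bin/sh
# Sketch: set auth-enabled to true or false in an InfluxDB 1.x config file.
# set_auth_enabled is a hypothetical helper introduced for illustration.

set_auth_enabled() {
    # usage: set_auth_enabled CONF true|false
    conf=$1
    value=$2
    case "$value" in
        true|false) ;;
        *) echo "value must be true or false" >&2; return 1 ;;
    esac
    tmp=$(mktemp) || return 1
    # Match the line whether or not it is commented out and keep its
    # indentation. The anchor at the start of the line keeps similarly named
    # keys such as ping-auth-enabled untouched.
    if ! sed -E "s/^([[:space:]]*)#?[[:space:]]*auth-enabled[[:space:]]*=.*/\1auth-enabled = $value/" "$conf" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    cat "$tmp" > "$conf"   # overwrite in place to keep owner and permissions
    rm -f "$tmp"
}

# Example (followed by a restart):
#   set_auth_enabled /etc/influxdb/influxdb.conf true
#   systemctl restart influxdb
```

For recovery, the same helper would be called with false, and again with true once the admin user has been fixed.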

«Admin mode» script

I left a script in /root/run-influx.sh to log in as the admin user conveniently. Its content is as simple as it needs to be:

/root/run-influx.sh
#!/bin/sh
exec influx -username admin -password redacted

Install MariaDB

MariaDB installation is easier, as Debian provides the package mariadb-server in the official repository.

apt install --no-install-recommends mariadb-server

Then run the following in the mysql shell:

CREATE DATABASE grafana;
GRANT ALL PRIVILEGES ON grafana.* TO 'grafana'@'127.0.0.1' IDENTIFIED BY 'redacted';

Install Grafana

Grafana, however, is installed and managed with Docker.

First we create the config file:

/srv/grafana/grafana.ini
[server]
protocol = http
http_addr = 127.0.0.1
http_port = 3000
root_url = https://monitor.acsalab.com/

[database]
url = mysql://grafana:redacted@127.0.0.1:3306/grafana
ssl_mode = false

[analytics]
reporting_enabled = false

[security]
cookie_secure = true

[auth.anonymous]
enabled = true
org_name = ACSA

[log]
mode = console
level = debug
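Since Grafana is meant to listen only on the loopback address, it can be useful to read individual values back out of grafana.ini before starting the container. The following is a minimal sketch of an INI lookup in awk; ini_get is a hypothetical helper, and it handles only the simple key = value layout used above.

```shell
#!/bin/sh
# Sketch: print the value of KEY inside [SECTION] of a simple INI file.
# ini_get is a hypothetical helper, not a Grafana tool.

ini_get() {
    # usage: ini_get FILE SECTION KEY
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[/ {
            # Section header: strip the brackets and remember the name.
            line = $0
            gsub(/^[ \t]*\[|\][ \t]*$/, "", line)
            current = line
            next
        }
        current == section {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Example:
#   ini_get /srv/grafana/grafana.ini server http_addr
```

With the config shown above, the example call should print the loopback address; an empty result means the key is missing from that section.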

Then we launch the Docker container:

/root/docker-grafana.sh
#!/bin/sh

NAME=grafana
docker rm -f "$NAME"
docker run -itd \
  --name="$NAME" \
  --restart=always \
  --net=host \
  -v /srv/grafana:/etc/grafana:ro \
  grafana/grafana:latest

Updating Grafana

Because we store Grafana data in MariaDB, the container can be destroyed and recreated at any time without data loss. This makes updating Grafana very easy: Just pull the latest image and recreate the container.

docker pull grafana/grafana:latest
/root/docker-grafana.sh
# optional
docker image prune -f

Install Cloudflared

Cloudflared is used to expose Grafana to the Internet. It is installed from the official APT repository. For brevity, the rest of the steps are omitted; just follow the official guide.

monitor.acsalab.com is the public URL for Grafana.

Configure Grafana

Grafana needs to be configured in two parts: the data source and the dashboard.

Add the InfluxDB datasource:

  • Go to Connections → Add new connection → Select InfluxDB from datasources
  • Fill in the information as required. Use the grafana user for InfluxDB, and set the HTTP method to POST

Create a dashboard:

  • Go to Dashboard → New dashboard → Import dashboard
  • Enter the dashboard ID 20268 (928 was used previously) to import that dashboard, then press Load
  • Enter a suitable UID, which will become the URL segment as in https://monitor.acsalab.com/d/UID

Clients

Telegraf is the collector agent to be installed on monitored machines. It is also a project from InfluxData so it shares the same APT repository as InfluxDB.

wget -O /etc/apt/trusted.gpg.d/influxdb.asc https://mirrors.ustc.edu.cn/influxdata/influxdata-archive_compat.key
echo "deb https://mirrors.ustc.edu.cn/influxdata/debian bullseye stable" > /etc/apt/sources.list.d/influxdb.list
apt update
apt install --no-install-recommends telegraf

The only difference from installing InfluxDB is that we install the telegraf package. All other commands remain identical.

We need to add our custom configuration file set, which is stored in the GitHub repository ACSAlab/telegraf-config. To apply the configuration from the repository:

  • Clear the default configuration file /etc/telegraf/telegraf.conf. Use truncate -s 0 or :> (if you know what this does) to clear the file content without deleting it.
  • Clone the repository to /etc/telegraf/repo. You can configure a Deploy Key for the repository (more on that later) for convenience.
  • Look at the files in the repository, and symlink the ones you need to /etc/telegraf/telegraf.d.

    • Every host should include base.conf, disk-default.conf and influxdb-acsa.conf.
    • The NFS server should additionally include disk-nfs.conf.
    • Any GPU server should additionally include nvidia.conf.
    • Example steps:

      cd /etc/telegraf/telegraf.d
      ln -sf ../repo/{base,disk-default,influxdb-acsa}.conf .
      
  • Restart Telegraf with systemctl restart telegraf.
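The configuration steps above can be sketched as a single shell function. This is an illustration under a few assumptions: link_telegraf_confs and the role names nfs and gpu are made up for this sketch, and the repository is assumed to be already cloned to the repo directory next to telegraf.d.

```shell
#!/bin/sh
# Sketch: empty the default config and symlink the needed files from the
# cloned repository. link_telegraf_confs and the role names "nfs" and "gpu"
# are hypothetical, introduced only for this example.

link_telegraf_confs() {
    # usage: link_telegraf_confs TELEGRAF_DIR [ROLE...]
    dir=$1
    shift
    # Files that every host should include.
    names="base disk-default influxdb-acsa"
    for role in "$@"; do
        case "$role" in
            nfs) names="$names disk-nfs" ;;
            gpu) names="$names nvidia" ;;
            *) echo "unknown role: $role" >&2; return 1 ;;
        esac
    done
    : > "$dir/telegraf.conf"    # empty the default config, keep the file
    mkdir -p "$dir/telegraf.d"
    for name in $names; do
        if [ ! -f "$dir/repo/$name.conf" ]; then
            echo "missing $dir/repo/$name.conf" >&2
            return 1
        fi
        # Relative link, same shape as the manual ln -sf example.
        ln -sf "../repo/$name.conf" "$dir/telegraf.d/$name.conf"
    done
}

# Example for a GPU server:
#   link_telegraf_confs /etc/telegraf gpu
#   systemctl restart telegraf
```

Relative links keep working if /etc/telegraf is ever copied or mounted elsewhere, which is why the sketch mirrors the ../repo/ form rather than using absolute paths.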

You should now see stats from the host in Grafana after a refresh.

ASC

There are a few Docker containers running monitoring software for ASC. See ASC for more details.