Building a Raspberry Pi HPC cluster
Introduction
This document describes how to build a fully functioning high-performance computing (HPC) cluster using Raspberry Pi 4 single-board computers, model B, circa May 2022. The purpose of the Raspberry Pi HPC cluster is to create a teaching environment for computational science (e.g. computational chemistry) where students can learn about the Linux operating system and the tools and workflow in scientific computing. This project has been funded by the 2022 Teaching Innovation Project calls (INIE-PINN-2022-23-124538) at the University of Oviedo, which provided the necessary funds to acquire the hardware.
For this installation, we need at least two Raspberry Pi 4 model B (rPI) computers, an equivalent amount of SD cards, ethernet cables, USB-C cables for powering the rPIs, and an SD card reader. In addition, a monitor, keyboard, and mouse are required to perform the head node installation, as well as a desktop computer running Linux to flash the SDs and perform other ancillary operations. If you plan on incorporating more than two rPIs into the cluster, you also need an ethernet switch for the local network, probably a USB power bank for powering all the rPIs, and a case to keep things organized. The head node connects to a wireless network (perhaps at home or at your workplace) and the rPIs use their ethernet interfaces to communicate with each other over the local network. The instructions below apply to rPI 4 model B, but they can probably be adapted to other models as well as other computer types, single-board or not.
The compute nodes in our cluster run in diskless mode. This means that they have no OS installed and, instead, boot over the network, with the head node acting as the server. This simplifies the administration of a (small) cluster considerably, as there is no need to install an OS on the compute nodes or worry about how they are provisioned. In the rest of this document, commands issued as root on the head node are denoted by:
head# <command>
Commands issued as root on any of the compute nodes are:
compute# <command>
The contents of the disk image served to the compute nodes are modified by using a chroot environment on the head node. Whenever we issue commands inside the chroot, it appears as:
chroot# <command>
Commands issued on the auxiliary desktop PC are:
aux# <command>
Lastly, we also sometimes issue commands as unprivileged users (for instance, when submitting a job), which are denoted by the prompts head$, compute$, etc.
The Debian operating system is used for the installation. Debian has an excellent and well-curated repository of high-quality packages, is stable and reliable, and has been tested on the rPI architecture. The ultimate reason for this choice, however, comes down to personal preference. The SLURM workload manager is used as the scheduling system.
Step 1: Flash the SD card
In the auxiliary PC, download the latest Debian OS image from the raspi Debian website. Insert your SD card and find the device file it corresponds to. Then, uncompress the image and flash it to the SD card:
aux# unxz 20220121_raspi_4_bookworm.img.xz
aux# dd if=20220121_raspi_4_bookworm.img of=/dev/sde
Warning: Make sure you get the device file correct and the SD card contains nothing you care about before flashing.
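If you are unsure which device file corresponds to the SD card, a quick check is to compare the list of block devices before and after inserting it (the /dev/sde above is just the device on my system; yours may differ):
aux# lsblk -o NAME,SIZE,MODEL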
Insert the SD card into the rPI that will act as the head node. Connect it to the monitor, keyboard, and mouse, and boot it up into Debian.
Step 2: Create users and set root password
In the rPI you can now log in as root (requires no password) and add as many users as you need. You should also change the root password:
head# adduser --ingroup users user1
...
head# passwd
Step 3: Set up the network
Next we configure the wireless network to be able to talk to the outside. The wpa_supplicant program can be used for this, as described in this guide. First, create the configuration file with your wifi ESSID and password:
head# wpa_passphrase <ESSID> <password> > /etc/wpa_supplicant/wpa_supplicant.conf
and then add the following lines to the same file:
#### /etc/wpa_supplicant/wpa_supplicant.conf
...
ctrl_interface=/run/wpa_supplicant
update_config=1
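Note that wpa_passphrase leaves the original passphrase in a commented #psk= line; one way to strip it (assuming the default file location) is:
head# sed -i '/^\s*#psk=/d' /etc/wpa_supplicant/wpa_supplicant.conf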
Double-check that no clear-text passwords remain in the file. Now find the wireless interface name with:
head# iw dev
In my case, the interface is wlan0. Connect to the wifi and enable wpa_supplicant on startup with:
head# systemctl enable wpa_supplicant.service
head# service wpa_supplicant restart
head# wpa_supplicant -B -Dwext -i wlan0 -c/etc/wpa_supplicant/wpa_supplicant.conf
head# dhclient -v
Verify the connection is working:
head# ping www.debian.org
Set up the interfaces file for wlan0 so the system connects to the same network when booted:
#### /etc/network/interfaces.d/wlan0
allow-hotplug wlan0
iface wlan0 inet dhcp
wpa-ssid <from-wpa_supplicant.conf>
wpa-psk <from-wpa_supplicant.conf>
where the wpa-ssid and wpa-psk values can be copied from the wpa_supplicant.conf configuration file above.
Now that we have internet access, we install the net-tools package, which simplifies configuring the network:
head# apt update
head# apt install net-tools
and finally, restart the network:
head# ifdown wlan0
head# ifup wlan0
and check that it can ping the outside.
Step 4: Configure the package manager and upgrade the system
Modify the /etc/apt/sources.list file to set up the source of your packages and the particular Debian version you want. I like the testing distribution because I find it reasonably stable while still having fairly recent packages.
#### /etc/apt/sources.list
# testing
deb http://deb.debian.org/debian/ testing main non-free contrib
deb-src http://deb.debian.org/debian/ testing main non-free contrib
# testing/updates
deb http://deb.debian.org/debian/ testing-updates main contrib non-free
deb-src http://deb.debian.org/debian/ testing-updates main contrib non-free
# testing security
deb http://security.debian.org/debian-security testing-security main
deb-src http://security.debian.org/debian-security testing-security main
Update the package list and upgrade your distribution:
head# apt update
head# apt full-upgrade
This may take a while.
Step 5: Install a few basic packages
We need the SSH server and client for connecting to the nodes, as well as editors and a few basic utilities. We do not need e-mail utilities or anacron, since the system will be running continuously:
head# apt install openssh-server openssh-client
head# apt install emacs nfs-common openssh-client openssh-server xz-utils \
      locales bind9-host dbus man apt-file unzip file
head# apt remove 'exim4*'
head# apt remove anacron
You should now have access to the head node from your auxiliary PC. To find the IP of the head node, do:
head# /sbin/ifconfig -a
If this is the case, you can remove the monitor, keyboard, and mouse, and work remotely to minimize clutter. You can also copy your SSH key from the auxiliary PC, if you have one, or generate one with ssh-keygen:
aux# ssh-copy-id 192.168.1.135
aux# ssh 192.168.1.135
to simplify connecting to it.
Step 6: Create the compute node image
Now we start configuring the image that will be served to the compute nodes when they boot up. The image for the nodes resides in /image/pi4. If you have several kinds of rPIs (or other computers), you can serve them different images according to their MAC address.
head# mkdir /image
head# chmod go-rX /image
Create the Debian base system for the image by using the debootstrap program from the repository:
head# apt install debootstrap
head# debootstrap testing /image/pi4
It is important that you do NOT set go-rX permissions on this directory or it won't be possible to chroot into it. Lastly, copy the package manager configuration over to the compute node image:
head# cp /etc/apt/sources.list /image/pi4/etc/apt/sources.list
Step 7: Configure the hostname and the names file
To configure the hostname in the head node:
head# echo "sebe" > /etc/hostname
Then edit the /etc/hosts file and fill in the names of the head node and the compute nodes. In my cluster the head node is sebe (or b01 on the local network) and the compute nodes are b02, b03, etc.
#### /etc/hosts
127.0.0.1 localhost
10.0.0.1 sebe b01
10.0.0.2 b02
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Note that we are using the IPs 10.0.0.1, 10.0.0.2, … for the local network, with the first one being the IP of the head node (b01).
Finally, propagate the hosts file to the compute node image:
head# cp /etc/hosts /image/pi4/etc/hosts
Step 8: Configure the local (wired) network
To configure the local network, first find the ethernet interface name of the head node with:
head# /sbin/ifconfig
Mine is eth0. Then, create the corresponding interfaces file:
#### /etc/network/interfaces.d/eth0
allow-hotplug eth0
iface eth0 inet static
address 10.0.0.1
netmask 255.0.0.0
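To apply the new configuration without rebooting, you can bring the interface up and check its address; a quick sketch, assuming the interface has not already been configured by another tool:
head# ifup eth0
head# /sbin/ifconfig eth0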
Step 9: Set up and test the chroot environment
To create the chroot environment, you need to mount the /dev, /proc, and /sys directories from the host system, in this case the head node:
head# mount --rbind /dev /image/pi4/dev
head# mount --make-rslave /image/pi4/dev
head# mount -t proc /proc /image/pi4/proc
head# mount --rbind /sys /image/pi4/sys
head# mount --make-rslave /image/pi4/sys
head# mount --rbind /tmp /image/pi4/tmp
More information about these commands can be found in the kernel documentation. Now you can chroot into the compute node image:
head# chroot /image/pi4
Exit the chroot with exit or Ctrl-D.
Since we will be chrooting into the image quite a number of times, it is best if the system mounts the above directories automatically on start-up. To do this, put the following lines in the head node’s fstab file:
#### /etc/fstab
...
/dev /image/pi4/dev none defaults,rbind,rslave 0 0
/proc /image/pi4/proc proc defaults 0 0
/sys /image/pi4/sys none defaults,rbind,rslave 0 0
/tmp /image/pi4/tmp none defaults,rbind 0 0
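Since these directories were already mounted by hand above, a quick way to confirm that the entries are in place (for instance after a reboot) is to check the mount table:
head# mount | grep /image/pi4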
Step 10: Configure the root bashrc
Edit the head node's root bashrc (/root/.bashrc) and add as many aliases and tricks as you need for comfortable work. Mine are:
# alias
alias mv='mv -f'
alias cp='cp -p'
alias rm='rm -f'
alias ls='ls --color=auto'
alias ll='ls -lrth'
alias la='ls -a'
alias m='less -R'
alias mi='less -i -R'
alias s='cd ..'
alias ss='cd ../.. '
alias sss='cd ../../.. '
alias grep='grep --color=auto'
alias fgrep='fgrep --color=auto'
alias egrep='egrep --color=auto'
alias ec='rm -f *~ .*~ >& /dev/null'
alias ecr="find . -name '*~' -delete"
alias b='cd $OLDPWD'
alias last='last -a'
alias p="pwd | sed 's/^\/home\/'$USER'/~/'"
# add paths
export PATH=./:/root/bin:/sbin:/usr/sbin/:$PATH
# emacs
alias e='emacs -nw --no-splash --no-x-resources'
export VISUAL='emacs -nw --no-splash --no-x-resources'
export EDITOR='emacs -nw --no-splash --no-x-resources'
export ALTERNATE_EDITOR='emacs -nw --no-splash --no-x-resources'
# pager
PAGER=/usr/bin/less
export PAGER
Copy over the same bashrc to the image:
head# cp /root/.bashrc /image/pi4/root/.bashrc
In the image, add the following at the end of the bashrc:
#### /image/pi4/root/.bashrc
...
if [[ "$HOSTNAME" != "sebe" && "$HOSTNAME" != "b01" ]] ; then
export PS1="\[\]\[\]\h:\W#\[\] "
else
export PS1="chroot:\W# "
fi
This way, you will be able to tell when you are in the chroot and when you are connected to a compute node via SSH. (Replace “sebe” and “b01” with your own names for the head node.)
Lastly, add the following alias to your head node bashrc:
#### /root/.bashrc
...
alias pi4='chroot /image/pi4'
With this, you will be able to access the image chroot by simply doing:
head# pi4
Step 11: Configure the locale
Configure the locale in the head node with:
head# locale-gen en_US.UTF-8
head# dpkg-reconfigure locales
Install your own locale (instead of en_US.UTF-8) and select it in the drop-down menu. Repeat in the image after installing the corresponding package:
chroot# apt install locales
chroot# locale-gen en_US.UTF-8
chroot# dpkg-reconfigure locales
Step 12: Upgrade and install packages in the image
Update and upgrade the distribution in the image:
chroot# apt update
chroot# apt full-upgrade
and then install the kernel image, the firmware, the SSH server, and the utility packages. Remove exim and anacron:
chroot# apt install linux-image-arm64 firmware-linux firmware-linux-nonfree firmware-misc-nonfree
chroot# apt install raspi-firmware wireless-regdb firmware-brcm80211 bluez-firmware
chroot# apt install net-tools emacs nfs-common openssh-client openssh-server xz-utils bind9-host dbus
chroot# apt install openssh-client openssh-server apt-file
chroot# apt remove 'exim4*'
chroot# apt remove anacron
Step 13: Blacklist the wifi and bluetooth in the image
Blacklisting the wifi and bluetooth prevents unnecessary services from starting and also avoids boot-up errors when operating diskless. Add the following lines to the blacklist.conf file in the image:
#### /image/pi4/etc/modprobe.d/blacklist.conf
...
blacklist brcmfmac
blacklist brcmutil
blacklist btbcm
blacklist hci_uart
Step 14: Modify the initramfs in the image to be able to network boot
To have the compute nodes boot over the network, we need to configure their initial ramdisk and filesystem. Modify the image initramfs configuration file:
#### /image/pi4/etc/initramfs-tools/initramfs.conf
MODULES=netboot
BUSYBOX=y
KEYMAP=n
COMPRESS=xz
DEVICE=
NFSROOT=auto
BOOT=nfs
and then re-generate the initial ramdisk inside the chroot:
chroot# update-initramfs -u
Make a note of the initrd file that has been generated. In my case, this was /boot/initrd.img-5.17.0-1-arm64.
Step 15: Configure ramdisk for temporary files in the image
The compute nodes will have a read-only root filesystem (/), so temporary files cannot be written to /tmp in the usual way. To prevent this from causing errors, create a RAM partition for temporary files, so they are written to memory instead. To do this, insert the following lines in the image configuration files:
head# echo ASYNCMOUNTNFS=no >> /image/pi4/etc/default/rcS
head# echo RAMTMP=yes >> /image/pi4/etc/default/tmpfs
Step 16: Configure the TFTP server
The image is served to the compute nodes via a TFTP server installed on the head node. First make the directory to contain the served files:
head# mkdir /srv/tftp
and then install the TFTP server itself:
head# apt install tftpd-hpa tftp-hpa
The TFTP server configuration resides in the /etc/default/tftpd-hpa file of the head node:
#### /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS=":69"
TFTP_OPTIONS="-s -v"
Lastly, restart the TFTP server:
head# systemctl restart tftpd-hpa
To check that the TFTP server works, you can run a simple test. First create a temporary file in the TFTP server directory, then try to copy it via TFTP:
head# cd
head# uname -a > /srv/tftp/bleh
head# chmod a+r /srv/tftp/bleh
head# tftp 10.0.0.1
tftp> get bleh
Check that the bleh file was downloaded and then clean up:
head# rm bleh /srv/tftp/bleh
At any point in this process you can check what the server is doing with:
head# tail -f /var/log/syslog
This command will also be useful to check whether the image is booting over the network and accessing the TFTP server files correctly.
Step 17: Configure SSH access to the compute nodes
First, we will allow root access to the compute nodes via SSH, for ease of use:
#### /image/pi4/etc/ssh/sshd_config
...
PermitRootLogin yes
TCPKeepAlive yes
...
and then we create an ssh key for root and copy it over to the image:
head# ssh-keygen
head# cp -r /root/.ssh/ /image/pi4/root/
head# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
head# cp /root/.ssh/authorized_keys /image/pi4/root/.ssh/
Step 18: Prepare the compute nodes to boot over the network
To have the compute nodes boot over the network, their EEPROM needs to be flashed with the appropriate image. The easiest way of doing this is to flash an SD card with Raspberry Pi OS (Raspbian), then use the tools that come with that distro to flash the EEPROM with the image we want. First download the latest Raspberry Pi OS image. Then, burn it to a clean SD card:
aux# unzip 2020-02-13-raspbian-buster.zip
aux# dd if=2020-02-13-raspbian-buster.img of=/dev/sde
Warning: Once again, make sure you get the device file for the SD card right before using dd.
Insert the new SD card into the compute node, connect it to the monitor, mouse, and keyboard, and boot it up. Then, as root, copy the most recent EEPROM image from the firmware directory:
compute# cd
compute# ls /lib/firmware/raspberrypi/bootloader/beta
compute# cp /lib/firmware/raspberrypi/bootloader/beta/pieeprom-2020-01-18.bin .
and write its configuration to a file:
compute# rpi-eeprom-config pieeprom-2020-01-18.bin > boot.conf
Edit the file and add the line:
#### boot.conf
...
BOOT_ORDER=0xf21
to prioritize booting over the network. Then, write the image to the EEPROM:
compute# rpi-eeprom-config --out new.bin --config boot.conf pieeprom-2020-01-18.bin
compute# rpi-eeprom-update -d -f new.bin
Lastly, find the MAC address of the compute node by doing:
compute# ip addr show eth0
and write it down for when we configure the head node DHCP server in
the next step. In my case, the MAC address of the compute node is
dc:a6:32:f6:e0:b4
.
Repeat the EEPROM flashing for all the compute nodes you have. From the instructions above, you only need to run the rpi-eeprom-update command, since the new.bin image we want to flash has already been saved on the SD card. Remember to write down the MAC addresses of all compute nodes.
The MAC address for the head node is easily found with the same command:
head# ip addr show eth0
Mine is e4:5f:01:0b:7f:2a.
Step 19: Install and configure the DHCP server
Install the DHCP server in the head node from the package repository:
head# apt install isc-dhcp-server
Then configure the server, first by modifying the /etc/default/isc-dhcp-server file to indicate that only the IPv4 interface is used:
#### /etc/default/isc-dhcp-server
INTERFACESv4="eth0"
INTERFACESv6=""
and then the /etc/dhcp/dhcpd.conf file for the details of how the compute nodes will boot over the network:
#### /etc/dhcp/dhcpd.conf
ddns-update-style none;
use-host-decl-names on;
default-lease-time 600;
max-lease-time 7200;
subnet 10.0.0.0 netmask 255.0.0.0 {
    option routers 10.0.0.1;
    group {
        option tftp-server-name "10.0.0.1";                       # option 66
        option vendor-class-identifier "PXEClient";               # option 60
        option vendor-encapsulated-options "Raspberry Pi Boot";   # option 43
        filename "pxelinux.0";
        host b01 {
            hardware ethernet 38:68:DD:5C:2B:D9; ## sebe/b01
            fixed-address 10.0.0.1;
            option host-name "b01";
        }
        host b02 {
            hardware ethernet 38:68:DD:5C:2A:C1; ## b02
            fixed-address 10.0.0.2;
            option host-name "b02";
            option root-path "10.0.0.1:/image/pi4,v3,tcp,hard,rsize=8192,wsize=8192";
        }
    }
}
Note we used the MAC address of each node and associated the corresponding IPs and names to them. We also indicated the location of the NFS root filesystem of the compute nodes. If you have different computers on your cluster that require different images, they can be served with different image directories by modifying this file.
If you want to do some testing, a simple alternative that gives out the IPs dynamically is:
#### /etc/dhcp/dhcpd.conf
ddns-update-style none;
use-host-decl-names on;
default-lease-time 600;
max-lease-time 7200;
subnet 10.0.0.0 netmask 255.0.0.0 {
    option routers 10.0.0.1;
    option tftp-server-name "10.0.0.1";                       # option 66
    option vendor-class-identifier "PXEClient";               # option 60
    option vendor-encapsulated-options "Raspberry Pi Boot";   # option 43
    range 10.0.0.1 10.0.0.255;
    filename "pxelinux.0";
}
but naturally this will not be the final configuration we use, since each particular compute node needs to be associated with a particular IP.
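Before restarting the server, it does not hurt to check the configuration file for syntax errors; a quick sketch:
head# dhcpd -t -cf /etc/dhcp/dhcpd.conf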
Now, restart the DHCP server:
head# systemctl restart isc-dhcp-server
(Note: the DHCP server won't start if the /var/run/dhcpd.pid file exists. Remove this file if necessary.) At any point, you can check the activity of the DHCP server in the same way as for the TFTP server:
head# tail -f /var/log/syslog
This is useful for checking what the compute nodes are doing when they boot up over the network. You can also use this to find the MAC address of the compute nodes, as it will be shown in the logs.
Step 20: Prepare the network boot image
Copy the rPI firmware files over to the TFTP server directory:
head# cp /boot/firmware/* /srv/tftp
Then edit the cmdline.txt file:
#### /srv/tftp/cmdline.txt
console=tty0 ip=dhcp root=/dev/nfs ro nfsroot=10.0.0.1:/image/pi4,vers=3,nolock panic=60 ipv6.disable=1 rootwait systemd.gpt_auto=no
Remove the SD card from the compute node, connect it to the server with an ethernet cable, and reboot it. If you have more than one compute node, you should install the ethernet switch now and connect the head node and all compute nodes to it. Keep an eye on the syslog on the server to make sure the files are being served properly:
head# tail -f /var/log/syslog
The compute node should boot only part of the way: it cannot start completely because no root filesystem can be served yet, since the NFS server has not been installed.
Step 21: Set up the NFS server
Install the NFS server in the head node:
head# apt install nfs-kernel-server nfs-common
Then configure the exports file to serve the image as a read-only root filesystem for the compute nodes and the /home directory as a read-write home:
#### /etc/exports
/image/pi4 10.0.0.0/24(ro,sync,no_root_squash,no_subtree_check)
/home 10.0.0.0/24(rw,sync,no_root_squash,no_subtree_check)
Begin exporting by doing:
head# exportfs -rv
Then, verify that the NFS server is operative by mounting the image somewhere:
head# cd
head# mkdir temp
head# mount 10.0.0.1:/image/pi4 temp/
head# df -h
Make sure the NFS mount appears in the output of df -h, then clean up:
head# umount temp/
head# rm -r temp
Finally, install the NFS client in the image:
chroot# apt install nfs-common
Step 22: Final tweaks and compute node boot up
Configure the mtab in the image by doing:
chroot# ln -s /proc/self/mounts /etc/mtab
and then comment out the corresponding line in the /usr/lib/tmpfiles.d/debian.conf of the image:
#### /image/pi4/usr/lib/tmpfiles.d/debian.conf
...
#L+ /etc/mtab - - - - ../proc/self/mounts
...
Next, disable rpcbind and remove the hostname in the image, or the boot-up may get stuck:
chroot# systemctl disable rpcbind
chroot# rm /etc/hostname
If you boot up the compute node now, it should be able to fetch the image and mount the necessary partitions over NFS, and get you to the login screen. Congratulations!
Step 23: Add the users to the image
The user and group IDs on the head node and the compute nodes must be synchronized. Although there are more heavy-duty tools for this, a simple way is to copy the relevant user and root lines, as well as the corresponding groups, from /etc/passwd, /etc/shadow, and /etc/group to the corresponding files in the /image/pi4 directory tree, as sketched below. If you add a new user, you need to propagate the changes in those files from the head node to the image.
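For example, for the user1 account created in step 2, a minimal sketch could look like this (the debootstrapped image already contains root and the system accounts, so only the new user's lines need to be appended; the users group normally exists already, but check that its GID matches):
head# grep '^user1:' /etc/passwd >> /image/pi4/etc/passwd
head# grep '^user1:' /etc/shadow >> /image/pi4/etc/shadow
head# grep '^users:' /etc/group /image/pi4/etc/group   # check that the GIDs match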
Lastly, copy the home directories of the users you created in step 2 over to the image:
head# cp -r /home/user1 /image/pi4/home/
head# ...
Step 24: Configure the fstab file and the volatile directories in the image
The root filesystem in the compute nodes is mounted read-only, so we need to provide RAM partitions for every directory where the compute node's OS needs to write. Note that the contents of the RAM mounts disappear when the compute node is rebooted, so things like log files will not survive a reboot. First set up the temporary filesystems in the fstab file, so the compute node mounts them automatically on boot-up:
#### /image/pi4/etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
10.0.0.1:/image/pi4 / nfs tcp,nolock,ro,v4 1 1
proc /proc proc defaults 0 0
none /tmp tmpfs defaults 0 0
none /media tmpfs defaults 0 0
none /var tmpfs defaults 0 0
none /run tmpfs defaults 0 0
10.0.0.1:/home /home nfs tcp,nolock,rw,v4 0 0
Then configure systemd in the image to create some directories on start-up. Even though /var is mounted as tmpfs, various services running on the compute nodes will complain if these directories do not exist:
#### /image/pi4/etc/tmpfiles.d/vardirs.conf
d /var/backups 0755 root root -
d /var/cache 0755 root root -
d /var/lib 0755 root root -
d /var/local 4755 root root -
d /var/log 0755 root root -
d /var/mail 4755 root root -
d /var/opt 0755 root root -
d /var/spool 0755 root root -
d /var/spool/rsyslog 0755 root root -
d /var/tmp 1777 root root -
Step 25: Time synchronization between head and compute nodes
It is important to keep the head and compute nodes synchronized, because files are written by both on the shared filesystems. If they are not in sync, tools that depend on timestamps (like make) will have a hard time working. First install the ntp server in the head node:
head# apt install ntp
and then the timesyncd service from systemd in the image:
chroot# apt install systemd-timesyncd
Configure the timesyncd service in the image to get its time from the NTP server in the head node:
#### /image/pi4/etc/systemd/timesyncd.conf
[Time]
FallbackNTP=10.0.0.1
and then set up your time zone in both the head node and the image:
head# rm /etc/localtime
head# ln -s /usr/share/zoneinfo/Europe/Madrid /etc/localtime
chroot# rm /etc/localtime
chroot# ln -s /usr/share/zoneinfo/Europe/Madrid /etc/localtime
Step 26: Compute node reboot and final checks
Reboot the compute node now. It should boot up all the way to the login prompt. Check the messages in the journal and verify that there are no outstanding errors:
compute# journalctl -xb
Also check the system status and verify that there are no errors:
compute# systemctl status
compute# systemctl list-units --type=service
Check if the volatile directory creation was successful with:
compute# systemctl status systemd-tmpfiles-setup.service
(Note: you can run the volatile directory creation again at any point with systemd-tmpfiles --create.) Lastly, check that the time is synchronized between the compute node and the head node:
compute# timedatectl status
compute# timedatectl show-timesync --all
Step 27: Set up SD cards as scratch space in the compute nodes
The spare SD cards can be mounted in the compute nodes and used as additional local scratch space for the occasional calculation. Insert the SD card in the compute node and do:
compute# fdisk -l
to identify the local disk. To format it, use fdisk on the disk device you just identified (in my case /dev/mmcblk1):
compute# fdisk /dev/mmcblk1
and choose to delete all partitions (d), make a new partition table if necessary (GPT type, with g), then make a new partition (n), and finally write the partitions and exit (w). Once the scratch partition is ready, create an EXT4 filesystem on it:
compute# mkfs.ext4 /dev/mmcblk1p1
where the device file is identified from the output of fdisk. You can now try to mount it on a temporary directory:
compute# cd /home
compute# mkdir temp
compute# mount /dev/mmcblk1p1 temp
Check that the new partition works and then clean up:
compute# umount temp
compute# rm -r temp
The local SD card partition will be mounted automatically on /scratch on start-up. For this, create the scratch directory in the image:
head# mkdir /image/pi4/scratch
head# chmod a+rX /image/pi4/scratch
and add the corresponding line in the image’s fstab file:
#### /image/pi4/etc/fstab
...
/dev/disk/by-path/platform-fe340000.mmc-part1 /scratch ext4 defaults,nofail 0 2
where the use of the "by-path" link ensures that the SD card is mounted regardless of the device name. You can check the particular name your system uses by listing the contents of the by-path directory on the compute node (see the example below). Finally, reboot the compute node and verify the disk is mounted.
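For instance, to see which by-path names are available for the SD card on the compute node:
compute# ls -l /dev/disk/by-path/ | grep mmc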
Step 28: Create the scratch and opt partitions on the head node
Shut down both nodes and take out the SD card of the head node. Connect the SD card to the auxiliary desktop PC and use gparted to resize the / partition. Then, create two new partitions: one for /opt and one for /scratch. The /opt partition will hold the SLURM configuration as well as any computing software you may want to install. Calculate how much room you will need for /opt and then use the rest for /scratch.
Insert the SD card into the head node again and boot it up. Now create the entries in the head node's fstab for the two new partitions:
#### /etc/fstab
...
/dev/disk/by-path/platform-fe340000.mmc-part3 /scratch ext4 defaults,nofail 0 2
/dev/disk/by-path/platform-fe340000.mmc-part4 /opt ext4 defaults,nofail 0 2
and make the new /scratch and /opt directories:
head# mkdir /scratch
head# mkdir /opt
head# chmod a+rX /scratch
head# chmod a+rX /opt
Share the /opt partition via NFS by modifying the exports file:
#### /etc/exports
...
/opt 10.0.0.0/24(ro,sync,no_root_squash,no_subtree_check)
and re-export with:
head# exportfs -rv
Enter the corresponding opt line in the fstab of the image, so the compute nodes mount it over NFS on start-up:
#### /image/pi4/etc/fstab
...
10.0.0.1:/opt /opt nfs tcp,nolock,ro,v4 0 2
Optionally, create a software directory in the opt mount to place your programs:
head# mkdir /opt/software
Finally, reboot the head node and check the two partitions are mounted.
Step 29: Install SLURM
The rPI cluster is now completely installed and ready for operation. Now we need to install the scheduling system (SLURM). Boot up the head node. Once it is up, boot up all the compute nodes. In the head node, install munge:
head# apt install libmunge-dev libmunge2 munge
Then generate the MUNGE key and set the permissions:
head# dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
head# chown munge:munge /etc/munge/munge.key
head# chmod 400 /etc/munge/munge.key
Restart munge:
head# systemctl restart munge
To check that munge is working on the head node, generate a credential on stdout:
head# munge -n
Also, check if a credential can be locally decoded:
head# munge -n | unmunge
and you can also run a quick benchmark with:
head# remunge
On the head node, install SLURM from the package repository:
head# apt install slurm-wlm slurm-wlm-doc slurm-client
Copy the "munge" and "slurm" users and groups over from the head node to the compute nodes by propagating them from /etc/group, /etc/passwd, and /etc/shadow into the corresponding files in the image directory tree (/image/pi4). In the image, install the munge packages:
chroot# apt install libmunge-dev libmunge2 munge
Configure new volatile directories using the tmpfiles service of systemd, so that munge can find its directories /var/log/munge and /var/lib/munge on the compute nodes:
chroot# echo "d /var/log/munge 0700 munge munge -" >> /etc/tmpfiles.d/vardirs.conf
chroot# echo "d /var/lib/munge 0700 munge munge -" >> /etc/tmpfiles.d/vardirs.conf
Copy the munge.key file from the head node to the image:
head# cp /etc/munge/munge.key /image/pi4/etc/munge/
and change the permissions of the munge key in the image:
chroot# chown munge.munge /etc/munge/munge.key
chroot# chmod 400 /etc/munge/munge.key
Boot up (or reboot) the compute node and check that munge is working by decoding a credential remotely (you can also restart munge without rebooting with systemctl restart munge if the compute node is already up):
head# munge -n | ssh 10.0.0.2 unmunge
where 10.0.0.2 is the IP of the corresponding compute node.
Finally, install the SLURM daemon in the image:
chroot# apt install slurmd slurm-client
Step 30: Configure SLURM
The SLURM configuration file is /etc/slurm/slurm.conf, and all nodes in the cluster must share the same configuration. To make this easier, we will have this file include the contents of another file that is shared via NFS, /opt/slurm/slurm.conf:
head# mkdir /opt/slurm
head# touch /opt/slurm/slurm.conf
head# echo "include /opt/slurm/slurm.conf" > /etc/slurm/slurm.conf
head# echo "include /opt/slurm/slurm.conf" > /image/pi4/etc/slurm/slurm.conf
and now we put our configuration in the configuration file that resides in the NFS-shared opt partition instead.
To make a configuration file, visit the SLURM configurator and fill in the entries. For simplicity, I did not use ProctrackType=proctrack/cgroup, because that would require setting up cgroups. Once the configuration is generated, write it to /opt/slurm/slurm.conf.
The most important part of the SLURM configuration file is at the end, where the node and partition characteristics are specified. You can get the configuration details of a particular node by SSHing to it and doing:
head# slurmd -C
compute# slurmd -C
Replace the corresponding lines in slurm.conf with the relevant information from the output of these commands. In my cluster, the head node (b01) is also used as a compute node, since there aren't very many nodes in the cluster anyway. However, I would like the head node to be used only when all the compute nodes are busy. For this to happen, the weight of the head node needs to be higher:
#### /opt/slurm/slurm.conf
...
NodeName=b01 CPUs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7807 weight=100 State=UNKNOWN
NodeName=b02 CPUs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3789 weight=1 State=UNKNOWN
PartitionName=pmain Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Note that, in my case, the head node has 8GB memory but the compute node only 4GB.
Finally, set the permissions, create the SLURM working directory, and enable and restart the SLURM server:
head# chmod a+r /etc/slurm/slurm.conf
head# mkdir /var/spool/slurmctld
head# chown -R slurm.slurm /var/spool/slurmctld
head# systemctl enable slurmctld
head# systemctl restart slurmctld
head# systemctl restart slurmd
You can check that the server is running with:
head# systemctl status slurmctld
head# systemctl status slurmd
head# scontrol ping
(Note: if there is an error, you can look up the problem in the logs located in /var/log/slurm*. You can also run the SLURM daemon with verbose options, slurmd -Dvvvv, on the node to obtain more information.)
Enable the SLURM daemon in the image:
chroot# systemctl enable slurmd
If the compute node is down, reboot it. Otherwise, log into it and restart the SLURM server:
compute# systemctl restart slurmd
and verify it works:
compute# systemctl status slurmd
Finally, check that all the nodes are up by doing:
head# sinfo -N
A quirk of SLURM is that, by default, nodes that are down (because of a crash or because they were rebooted) are not automatically brought back into production. They need to be resumed with:
head# scontrol update NodeName=b02 State=RESUME
If you want, you can power cycle the whole HPC cluster by shutting down all the nodes. Then boot up the head node and, once it is up, start all the compute nodes. Resume the compute nodes and check that the scheduler is working with sinfo -N.
Step 31: Submit a test job
As a non-privileged user, submit a test job on the head node. The job can be:
#### bleh.sub
#! /bin/bash
#SBATCH -t 28-00:00
#SBATCH -J test
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -c 1
cat /dev/urandom > /dev/null
and to submit it, do:
head$ sbatch bleh.sub
head$ sbatch bleh.sub
head$ squeue
You should see the first job picked up by the compute node and the second one by the head node. Cancel them with:
head$ scancel 1
head$ scancel 2
Step 32: SLURM prolog, epilog, and taskprolog scripts
These scripts create the temporary directory in the scratch partition and set the appropriate permissions and environment variables. First, set up the script names in the SLURM configuration:
#### /opt/slurm/slurm.conf
...
Epilog=/opt/slurm/epilog.sh
...
Prolog=/opt/slurm/prolog.sh
...
TaskProlog=/opt/slurm/taskprolog.sh
...
We place these scripts in the shared opt partition so the compute nodes have access to exactly the same version as the head node. The scripts create the SLURM_TMPDIR directory in the scratch partition (the SD card each compute node has) and set the appropriate environment variable. When the job finishes, the temporary directory is removed. They are:
#### /opt/slurm/epilog.sh
#! /bin/bash
# remove the temporary directory
export SLURM_TMPDIR=/scratch/${SLURM_JOB_USER}-${SLURM_JOBID}
if [ -d "$SLURM_TMPDIR" ] ; then
rm -rf "$SLURM_TMPDIR"
fi
exit 0
#### /opt/slurm/prolog.sh
#! /bin/bash
# prepare the temporary directory.
export SLURM_TMPDIR=/scratch/${SLURM_JOB_USER}-${SLURM_JOBID}
mkdir $SLURM_TMPDIR
chown ${SLURM_JOB_USER}:users $SLURM_TMPDIR
chmod 700 $SLURM_TMPDIR
exit 0
#### /opt/slurm/taskprolog.sh
#! /bin/bash
echo export SLURM_TMPDIR=/scratch/${SLURM_JOB_USER}-${SLURM_JOBID}
exit 0
Finally, make the three scripts executable with:
head# chmod a+rx /opt/slurm/*.sh
And re-read the SLURM configuration with:
head# scontrol reconfigure
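To check that the task prolog is being picked up, you can print the variable from inside a job; a quick test as an unprivileged user:
head$ srun -n 1 bash -c 'echo $SLURM_TMPDIR'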
Step 33: Install Quantum ESPRESSO
To show how the rPI cluster can be used to run scientific calculations, we now install one of the most popular packages for electronic structure calculations in periodic solids, Quantum ESPRESSO (QE). QE has been packaged in the Debian repository, so installing the program can be done via the package manager in the head node and in the chroot:
head# apt install quantum-espresso
chroot# apt install quantum-espresso
Now we run a simple total energy calculation on the silicon crystal as an unprivileged user. First copy over the pseudopotential:
head$ cd
head$ cp /usr/share/espresso/pseudo/Si.pbe-rrkj.UPF si.UPF
and then write the input file:
#### ~/si.scf.in
&control
title='crystal',
prefix='crystal',
pseudo_dir='.',
/
&system
ibrav=0,
nat=2,
ntyp=1,
ecutwfc=40.0,
ecutrho=400.0,
/
&electrons
conv_thr = 1d-8,
/
ATOMIC_SPECIES
Si 28.085500 si.UPF
ATOMIC_POSITIONS crystal
Si 0.12500000 0.12500000 0.12500000
Si 0.87500000 0.87500000 0.87500000
K_POINTS automatic
4 4 4 1 1 1
CELL_PARAMETERS bohr
0.000000000000 5.131267854931 5.131267854931
5.131267854931 0.000000000000 5.131267854931
5.131267854931 5.131267854931 0.000000000000
Next, write the submission script:
#### ~/si.sub
#! /bin/bash
#SBATCH -t 28-00:00
#SBATCH -J si
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -c 1
cd
mpirun -np 4 pw.x < si.scf.in > si.scf.out
and finally submit the calculation with:
head$ sbatch si.sub
You can follow the progress of the calculation with squeue. It should eventually finish and produce the si.scf.out output file containing the result of your calculation.