# Building a Raspberry Pi HPC cluster

### Introduction

This document describes how to build a fully functioning high-performance computing (HPC) cluster using Raspberry Pi 4 model B single-board computers, circa May 2022. The purpose of the Raspberry Pi HPC cluster is to create a teaching environment for computational science (e.g. computational chemistry) where students can learn about the Linux operating system and the tools and workflows of scientific computing. This project was funded by the 2022 Teaching Innovation Project call (INIE-PINN-2022-23-124538) at the University of Oviedo, which provided the necessary funds to acquire the hardware.

For this installation, we need at least two Raspberry Pi 4 model B (rPI) computers, the same number of SD cards, ethernet cables, USB-C cables for powering the rPIs, and an SD card reader. In addition, a monitor, keyboard, and mouse are required to perform the head node installation, as well as a desktop computer running Linux to flash the SD cards and perform other ancillary operations. If you plan on incorporating more than two rPIs into the cluster, you also need an ethernet switch for the local network, probably a USB power bank for powering all the rPIs, and a case to keep things organized. The head node connects to a wireless network (perhaps at home or at your workplace), and the rPIs use their ethernet interfaces to communicate with each other over the local network. The instructions below apply to the rPI 4 model B, but they can probably be adapted to other models, as well as to other computer types, single-board or not.

The compute nodes in our cluster run in diskless mode. This means that they have no OS installed and, instead, boot over the network, with the head node acting as the server. This simplifies the administration of a (small) cluster considerably, as there is no need to install an OS on the compute nodes or worry about how they are provisioned. In the rest of this document, commands issued as root on the head node are denoted by:

head# <command>


Commands issued as root on any of the compute nodes are:

compute# <command>


The contents of the disk image served to the compute nodes are modified by using a chroot environment on the head node. Whenever we issue commands inside the chroot, it appears as:

chroot# <command>


Commands issued on the auxiliary desktop PC are:

aux# <command>


Lastly, we also sometimes issue commands as unprivileged users (for instance, submitting a job), which are denoted by the prompts head$, compute$, etc.

The Debian operating system is used for the installation. Debian has an excellent and well-curated repository of high-quality packages, is stable and reliable, and has been tested on the rPI architecture. The ultimate reason for this choice, however, comes down to personal preference. The SLURM workload manager is used as the scheduling system.

### Step 1: Flash the SD card

On the auxiliary PC, download the latest Debian OS image from the Debian raspi images website. Insert your SD card and find the device file it corresponds to. Then, uncompress the image and flash it to the SD card:

aux# unxz 20220121_raspi_4_bookworm.img.xz
aux# dd if=20220121_raspi_4_bookworm.img of=/dev/sde


Warning: Make sure you get the device file correct and the SD card contains nothing you care about before flashing.
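
Before running dd, it is worth double-checking which device file the SD card was assigned. A quick way to do this (the /dev/sde above is just the device name on my system; yours may differ):

aux# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
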

Insert the SD card into the rPI that will act as the head node. Connect it to the monitor, keyboard, and mouse, and boot it up into Debian.

### Step 2: Create users and set root password

On the rPI you can now log in as root (no password is required) and add as many users as you need. You should also change the root password:

head# adduser --ingroup users user1
...
head# passwd


### Step 3: Set up the network

Next, we configure the wireless network so the head node can talk to the outside world. The wpa_supplicant program can be used for this, as described in this guide. First, create the configuration file with your wifi ESSID and password:

head# wpa_passphrase <ESSID> <password> > /etc/wpa_supplicant/wpa_supplicant.conf


and then add the following lines to the same file:

#### /etc/wpa_supplicant/wpa_supplicant.conf
...
ctrl_interface=/run/wpa_supplicant
update_config=1


For the sake of security, delete the commented line containing the clear-text password that wpa_passphrase writes to the file. Now find the wireless interface name with:

head# iw dev


In my case, the interface is wlan0. Enable wpa_supplicant on startup and connect to the wifi with:

head# systemctl enable wpa_supplicant.service
head# wpa_supplicant -B -Dwext -i wlan0 -c/etc/wpa_supplicant/wpa_supplicant.conf


Verify the connection is working:

head# ping www.debian.org


Set up the interfaces file for wlan0 so the system connects to the same network when booted:

#### /etc/network/interfaces.d/wlan0
allow-hotplug wlan0
iface wlan0 inet dhcp
wpa-ssid <from-wpa_supplicant.conf>
wpa-psk <from-wpa_supplicant.conf>


where the wpa-ssid and wpa-psk can be copied from the wpa_supplicant.conf configuration file above.

Now that we have internet access, we install the net-tools package, which simplifies configuring the network:

head# apt update
head# apt install net-tools


and finally, restart the network:

head# ifdown wlan0
head# ifup wlan0


and check that it can ping the outside.

### Step 4: Configure the package manager and upgrade the system

Modify the /etc/apt/sources.list file to set up the source of your packages and the particular Debian version you want. I like the testing distribution because I find it reasonably stable while still having fairly recent packages.

#### /etc/apt/sources.list
# testing
deb http://deb.debian.org/debian/ testing main non-free contrib
deb-src http://deb.debian.org/debian/ testing main non-free contrib

deb http://deb.debian.org/debian/ testing-updates main contrib non-free
deb-src http://deb.debian.org/debian/ testing-updates main contrib non-free

# testing security
deb http://security.debian.org/debian-security testing-security main
deb-src http://security.debian.org/debian-security testing-security main


head# apt update


This may take a while.

### Step 5: Install a few basic packages

We need the SSH server and client for connecting to the nodes, as well as editors and a few basic utilities. We do not need e-mail utilities or anacron, since the system will be running continuously:

head# apt install openssh-server openssh-client
head# apt install emacs nfs-common xz-utils \
  locales bind9-host dbus man-db apt-file unzip file


Find the IP address assigned to the head node on the wireless network:

head# /sbin/ifconfig -a


and verify that you can reach it over SSH from the auxiliary PC.


If this is the case, you can remove the monitor, keyboard, and mouse, and work remotely to minimize clutter. You can also copy your SSH key from the auxiliary PC, if you have one, or generate one with ssh-keygen:

aux# ssh-copy-id 192.168.1.135
aux# ssh 192.168.1.135


to simplify connecting to it.
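
If there is no SSH key on the auxiliary PC yet, generate one first. A minimal sketch (the ed25519 key type is a choice here, not prescribed above):

aux# ssh-keygen -t ed25519
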

### Step 6: Create the compute node image

Now we start configuring the image that will be served to the compute nodes when they boot up. The image for the nodes resides in /image/pi4. If you have several kinds of rPIs (or other computers) you can have different images served for them, according to their MAC address.

head# mkdir /image


Create the Debian base system for the image using the debootstrap program. First, install it from the repository:

head# apt install debootstrap
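

The debootstrap invocation itself is not shown above; a minimal sketch, assuming the arm64 architecture of the rPI and the testing suite configured in step 4 (adjust the suite and mirror as needed):

head# debootstrap --arch=arm64 testing /image/pi4 http://deb.debian.org/debian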


It is important that you do NOT remove the go-rX (read and execute for group and others) permissions from this directory, or it won't be possible to chroot into it. Lastly, copy the package manager configuration over to the compute node image:

head# cp /etc/apt/sources.list /image/pi4/etc/apt/sources.list


### Step 7: Configure the hostname and the names file

To configure the hostname in the head node:

head# echo "sebe" > /etc/hostname


Then edit the /etc/hosts file and fill in the names of the head node and the compute nodes. In my cluster the head node is sebe (or b01 in the local network) and the compute nodes are b02, b03, etc.

#### /etc/hosts
127.0.0.1       localhost

10.0.0.1 sebe b01
10.0.0.2 b02

::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters


Note that we are using the IPs 10.0.0.1, 10.0.0.2,… for the local network, with the first one being the IP for the head node (b01).

Finally, propagate the hosts file to the compute node image:

head# cp /etc/hosts /image/pi4/etc/hosts


### Step 8: Configure the local (wired) network

To configure the local network, first find the ethernet interface name of the head node with:

head# /sbin/ifconfig


Mine is eth0. Then, create the corresponding interfaces file:

#### /etc/network/interfaces.d/eth0
allow-hotplug eth0
iface eth0 inet static
  address 10.0.0.1
  netmask 255.0.0.0
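
As with the wireless interface, the new configuration can be applied without a reboot (assuming the ifupdown tools used earlier for wlan0):

head# ifdown eth0
head# ifup eth0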


### Step 9: Set up and test the chroot environment

To create the chroot environment, you need to mount the /dev, /proc and /sys directories from the host system, in this case the head node:

head# mount --rbind /dev /image/pi4/dev
head# mount -t proc /proc /image/pi4/proc
head# mount --rbind /sys /image/pi4/sys


More information about these commands can be found in the kernel documentation. Now you can chroot into the compute node image:

head# chroot /image/pi4


Exit the chroot with exit or control-d.

Since we will be chrooting into the image quite a number of times, it is best if the system mounts the above directories automatically on start-up. To do this, put the following lines in the head node’s fstab file:

#### /etc/fstab
...
/dev /image/pi4/dev none defaults,rbind,rslave 0 0
/proc /image/pi4/proc proc defaults 0 0
/sys /image/pi4/sys none defaults,rbind,rslave 0 0
/tmp /image/pi4/tmp none defaults,rbind 0 0
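

These entries take effect on the next boot; to apply them immediately you can try mount -a (standard mount behavior, though it may complain about entries that are already mounted):

head# mount -a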


### Step 10: Configure the root bashrc

Edit the head node’s root bashrc (/root/.bashrc) and enter as many aliases and tricks as you need for comfortable work. Mine are:

# alias
alias mv='mv -f'
alias cp='cp -p'
alias rm='rm -f'
alias ls='ls --color=auto'
alias ll='ls -lrth'
alias la='ls -a'
alias m='less -R'
alias mi='less -i -R'
alias s='cd ..'
alias ss='cd ../.. '
alias sss='cd ../../.. '
alias grep='grep --color=auto'
alias fgrep='fgrep --color=auto'
alias egrep='egrep --color=auto'
alias ec='rm -f *~ .*~ >& /dev/null'
alias ecr="find . -name '*~' -delete"
alias b='cd $OLDPWD'
alias last='last -a'
alias p="pwd | sed 's/^\/home\/'$USER'/~/'"

export PATH=./:/root/bin:/sbin:/usr/sbin/:$PATH

# emacs
alias e='emacs -nw --no-splash --no-x-resources'
export VISUAL='emacs -nw --no-splash --no-x-resources'
export EDITOR='emacs -nw --no-splash --no-x-resources'
export ALTERNATE_EDITOR='emacs -nw --no-splash --no-x-resources'

# pager
PAGER=/usr/bin/less
export PAGER


Copy over the same bashrc to the image:

head# cp /root/.bashrc /image/pi4/root/.bashrc


In the image, add the following at the end of the bashrc:

#### /image/pi4/root/.bashrc
...
if [[ "$HOSTNAME" != "sebe" && "$HOSTNAME" != "b01" ]] ; then
  export PS1="\h:\W# "
else
  export PS1="chroot:\W# "
fi


This way, you will be able to tell when you are in the chroot and when you are connected to a compute node via SSH. (Replace “sebe” and “b01” with your own names for the head node.) Lastly, add the following alias to your head node bashrc:

#### /root/.bashrc
...
alias pi4='chroot /image/pi4'


With this, you will be able to access the image chroot by simply doing:

head# pi4


### Step 11: Configure the locale

Configure the locale in the head node with:

head# locale-gen en_US.UTF-8
head# dpkg-reconfigure locales


Install your own locale (instead of en_US.UTF-8) and select it in the drop-down menu. Repeat in the image after installing the corresponding package:

chroot# apt install locales
chroot# locale-gen en_US.UTF-8
chroot# dpkg-reconfigure locales


### Step 12: Upgrade and install packages in the image

Update and upgrade the distribution in the image:

chroot# apt update
chroot# apt full-upgrade


and then install the kernel image, the firmware, the SSH server, and the utility packages. Remove exim and anacron:

chroot# apt install linux-image-arm64 firmware-linux firmware-linux-nonfree firmware-misc-nonfree
chroot# apt install raspi-firmware wireless-regdb firmware-brcm80211 bluez-firmware
chroot# apt install net-tools emacs nfs-common openssh-client openssh-server xz-utils bind9-host dbus
chroot# apt install apt-file
chroot# apt remove 'exim4*'
chroot# apt remove anacron


### Step 13: Blacklist the wifi and bluetooth in the image

Blacklisting the wifi and bluetooth modules prevents unnecessary services from starting and avoids boot-up errors when operating diskless. Add the following lines to the blacklist.conf file in the image:

#### /image/pi4/etc/modprobe.d/blacklist.conf
...
blacklist brcmfmac
blacklist brcmutil
blacklist btbcm
blacklist hci_uart


### Step 14: Modify the initramfs in the image to be able to network boot

To have the compute nodes boot over the network, we need to configure their initial ramdisk and filesystem. Modify the image initramfs configuration file:

#### /image/pi4/etc/initramfs-tools/initramfs.conf
MODULES=netboot
BUSYBOX=y
KEYMAP=n
COMPRESS=xz
DEVICE=
NFSROOT=auto
BOOT=nfs


and then re-generate the initial ramdisk inside the chroot:

chroot# update-initramfs -u


Make a note of the initrd file that has been generated. In my case, this was /boot/initrd.img-5.17.0-1-arm64.

### Step 15: Configure ramdisk for temporary files in the image

The compute nodes will have a read-only root filesystem (/), so temporary files cannot be written to /tmp in the usual way. To prevent this from causing errors, create a RAM partition for temporary files, so they are written to memory instead.
To do this, insert the following lines in the image configuration files:

head# echo ASYNCMOUNTNFS=no >> /image/pi4/etc/default/rcS
head# echo RAMTMP=yes >> /image/pi4/etc/default/tmpfs


### Step 16: Configure the TFTP server

The image is served to the compute nodes via a TFTP server installed on the head node. First make the directory that will contain the served files:

head# mkdir /srv/tftp


and then install the TFTP server itself:

head# apt install tftpd-hpa tftp-hpa


The TFTP server configuration resides in the /etc/default/tftpd-hpa file of the head node:

#### /etc/default/tftpd-hpa
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/tftp"
TFTP_ADDRESS=":69"
TFTP_OPTIONS="-s -v"


Lastly, restart the TFTP server:

head# systemctl restart tftpd-hpa


To check that the TFTP server works, you can run a simple test. First create a temporary file in the TFTP server directory, then try to copy it via TFTP:

head# cd
head# uname -a > /srv/tftp/bleh
head# chmod a+r /srv/tftp/bleh
head# tftp 10.0.0.1
tftp> get bleh


Check that the bleh file was downloaded and then clean up:

head# rm bleh /srv/tftp/bleh


At any point in this process you can check what the server is doing with:

head# tail -f /var/log/syslog


This command will also be useful to check whether the image is booting over the network and accessing the TFTP server files correctly.

### Step 17: Configure SSH access to the compute nodes

First, we allow root access to the compute nodes via SSH, for ease of use:

#### /image/pi4/etc/ssh/sshd_config
...
PermitRootLogin yes
TCPKeepAlive yes
...


and then we create an SSH key for root and copy it over to the image:

head# ssh-keygen
head# cp -r /root/.ssh/ /image/pi4/root/
head# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
head# cp /root/.ssh/authorized_keys /image/pi4/root/.ssh/


### Step 18: Prepare the compute nodes to boot over the network

To have the compute nodes boot over the network, their EEPROM needs to be flashed with the appropriate image. The easiest way of doing this is to flash an SD card with Raspberry Pi OS, then use the tools that come with this distro to flash the EEPROM with the image we want. First download the latest Raspberry Pi OS. Then, burn it to a clean SD card:

aux# unzip 2020-02-13-raspbian-buster.zip
aux# dd if=2020-02-13-raspbian-buster.img of=/dev/sde


Warning: Once again, make sure you get the device file for the SD card right before using dd.

Insert the new SD card into the compute node, connect it to the monitor, mouse, and keyboard, and boot it up. Then, as root, copy the most recent EEPROM image from the firmware directory:

compute# cd
compute# ls /lib/firmware/raspberrypi/bootloader/beta
compute# cp /lib/firmware/raspberrypi/bootloader/beta/pieeprom-2020-01-18.bin .


and write its configuration to a file:

compute# rpi-eeprom-config pieeprom-2020-01-18.bin > boot.conf


Edit the file and add the line:

#### boot.conf
...
BOOT_ORDER=0xf21


to prioritize booting over the network. Then, write the image to the EEPROM:

compute# rpi-eeprom-config --out new.bin --config boot.conf pieeprom-2020-01-18.bin
compute# rpi-eeprom-update -d -f new.bin


Lastly, find the MAC address of the compute node by doing:

compute# ip addr show eth0


and write it down for when we configure the head node DHCP server in the next step. In my case, the MAC address of the compute node is dc:a6:32:f6:e0:b4. Repeat the EEPROM flashing for all the compute nodes you have.
(When repeating this on the remaining nodes, you only need to run the rpi-eeprom-update command, since the new.bin image we want to flash has already been saved to the SD card.) Remember to write down the MAC addresses of all compute nodes. The MAC address of the head node is easily found with the same command:

head# ip addr show eth0


Mine is e4:5f:01:0b:7f:2a.

### Step 19: Install and configure the DHCP server

Install the DHCP server in the head node from the package repository:

head# apt install isc-dhcp-server


Then configure the server, first by modifying the /etc/default/isc-dhcp-server file to indicate that only the IPv4 interface is used:

#### /etc/default/isc-dhcp-server
INTERFACESv4="eth0"
INTERFACESv6=""


and then /etc/dhcp/dhcpd.conf for the details of how the compute nodes will boot over the network:

#### /etc/dhcp/dhcpd.conf
ddns-update-style none;
use-host-decl-names on;
default-lease-time 600;
max-lease-time 7200;

subnet 10.0.0.0 netmask 255.0.0.0 {
  option routers 10.0.0.1;
  group {
    option tftp-server-name "10.0.0.1";                      # option 66
    option vendor-class-identifier "PXEClient";              # option 60
    option vendor-encapsulated-options "Raspberry Pi Boot";  # option 43
    filename "pxelinux.0";

    host b01 {
      hardware ethernet 38:68:DD:5C:2B:D9; ## sebe/b01
      fixed-address 10.0.0.1;
      option host-name "b01";
    }

    host b02 {
      hardware ethernet 38:68:DD:5C:2A:C1; ## b02
      fixed-address 10.0.0.2;
      option host-name "b02";
      option root-path "10.0.0.1:/image/pi4,v3,tcp,hard,rsize=8192,wsize=8192";
    }
  }
}


Note we used the MAC address of each node and associated the corresponding IPs and names with them. We also indicated the location of the NFS root filesystem of the compute nodes. If you have different computers in your cluster that require different images, they can be served different image directories by modifying this file. If you want to do some testing, a simple alternative that gives out the IPs dynamically is:

#### /etc/dhcp/dhcpd.conf
ddns-update-style none;
use-host-decl-names on;
default-lease-time 600;
max-lease-time 7200;

subnet 10.0.0.0 netmask 255.0.0.0 {
  option routers 10.0.0.1;
  option tftp-server-name "10.0.0.1";                      # option 66
  option vendor-class-identifier "PXEClient";              # option 60
  option vendor-encapsulated-options "Raspberry Pi Boot";  # option 43
  range 10.0.0.1 10.0.0.255;
  filename "pxelinux.0";
}


but naturally this will not be the final configuration we use, since each particular compute node needs to be associated with a particular IP. Now, restart the DHCP server:

head# systemctl restart isc-dhcp-server


(Note: the DHCP server won’t start if the /var/run/dhcpd.pid file exists. Remove this file if necessary.)

At any point, you can check the activity of the DHCP server in the same way as for the TFTP server:

head# tail -f /var/log/syslog


This is useful for checking what the compute nodes are doing when they boot up over the network. You can also use this to find the MAC addresses of the compute nodes, as they will be shown in the logs.

### Step 20: Prepare the network boot image

Copy the rPI firmware files over to the TFTP server directory:

head# cp /boot/firmware/* /srv/tftp


Then edit the cmdline.txt file:

#### /srv/tftp/cmdline.txt
console=tty0 ip=dhcp root=/dev/nfs ro nfsroot=10.0.0.1:/image/pi4,vers=3,nolock panic=60 ipv6.disable=1 rootwait systemd.gpt_auto=no


Remove the SD card from the compute node, connect the node to the server with an ethernet cable, and reboot it. If you have more than one compute node, you should install the ethernet switch now and connect the head node and all compute nodes to it.
Keep an eye on the syslog in the server to make sure the files are being served properly:

head# tail -f /var/log/syslog


The compute node should boot only up to a point: it cannot start completely, because no root filesystem can be served until the NFS server is installed.

### Step 21: Set up the NFS server

Install the NFS server in the head node:

head# apt install nfs-kernel-server nfs-common


Then configure the exports file to serve the image as a read-only root filesystem for the compute nodes and the /home directory as a read-write home:

#### /etc/exports
/image/pi4 10.0.0.0/24(ro,sync,no_root_squash,no_subtree_check)
/home 10.0.0.0/24(rw,sync,no_root_squash,no_subtree_check)


Begin exporting by doing:

head# exportfs -rv


Then, verify that the NFS server is operative by mounting the image somewhere:

head# cd
head# mkdir temp
head# mount 10.0.0.1:/image/pi4 temp/
head# df -h


Make sure the NFS mount appears in the output of df -h, then clean up:

head# umount temp/
head# rm -r temp


Finally, install the NFS client in the image:

chroot# apt install nfs-common


### Step 22: Final tweaks and compute node boot-up

Configure the mtab in the image by doing:

chroot# ln -s /proc/self/mounts /etc/mtab


and then comment out the corresponding line in the /usr/lib/tmpfiles.d/debian.conf of the image:

#### /image/pi4/usr/lib/tmpfiles.d/debian.conf
...
#L+ /etc/mtab - - - - ../proc/self/mounts
...


Next, disable rpcbind and remove the hostname in the image, or the boot-up may get stuck:

chroot# systemctl disable rpcbind
chroot# rm /etc/hostname


If you boot up the compute node now, it should be able to fetch the image, mount the necessary partitions over NFS, and get you to the login screen. Congratulations!

### Step 23: Add the users to the image

The user and group IDs in the head node and the compute nodes must be synchronized. Although there are more heavy-duty tools for this, a simple way is to copy the lines for any relevant users (and for root), as well as the corresponding groups, from /etc/passwd, /etc/shadow, and /etc/group to the corresponding files in the /image/pi4 directory tree. If you add a new user, you need to propagate the changes in those files from the head node to the image. Lastly, copy the home directories of the users you created in step 2 over to the image:

head# cp -r /home/user1 /image/pi4/home/
head# ...


### Step 24: Configure the fstab file and the volatile directories in the image

The root filesystem in the compute nodes is mounted read-only, so we need to provide RAM partitions for every directory where the compute node’s OS needs to write. Note that the contents of the RAM mounts will disappear once the compute node is rebooted, so things like log files will not survive a reboot. First set up the temporary filesystems in the fstab file, so the compute node mounts them automatically on boot-up:

#### /image/pi4/etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
10.0.0.1:/image/pi4 / nfs tcp,nolock,ro,v4 1 1
proc /proc proc defaults 0 0
none /tmp tmpfs defaults 0 0
none /media tmpfs defaults 0 0
none /var tmpfs defaults 0 0
none /run tmpfs defaults 0 0
10.0.0.1:/home /home nfs tcp,nolock,rw,v4 0 0


Then configure systemd in the image to create some directories on start-up.
Even though /var is mounted as tmpfs, various services running on the compute nodes will complain if these directories do not exist:

#### /image/pi4/etc/tmpfiles.d/vardirs.conf
d /var/backups 0755 root root -
d /var/cache 0755 root root -
d /var/lib 0755 root root -
d /var/local 4755 root root -
d /var/log 0755 root root -
d /var/mail 4755 root root -
d /var/opt 0755 root root -
d /var/spool 0755 root root -
d /var/spool/rsyslog 0755 root root -
d /var/tmp 1777 root root -


### Step 25: Time synchronization between head and compute nodes

It is important to keep the head and compute nodes synchronized, because files are written by both on the shared filesystems. If they are not in sync, tools that depend on timestamps (like make) will have a hard time working. First install the NTP server in the head node:

head# apt install ntp


and then the timesyncd service from systemd in the image:

chroot# apt install systemd-timesyncd


Configure the timesyncd service in the image to get its time from the NTP server in the head node:

#### /image/pi4/etc/systemd/timesyncd.conf
[Time]
FallbackNTP=10.0.0.1


and then set up your time zone in both the head node and the image:

head# rm /etc/localtime
head# ln -s /usr/share/zoneinfo/Europe/Madrid /etc/localtime
chroot# rm /etc/localtime
chroot# ln -s /usr/share/zoneinfo/Europe/Madrid /etc/localtime


### Step 26: Compute node reboot and final checks

Reboot the compute node now. It should boot up all the way to the login prompt. Check the messages in the journal and verify that there are no outstanding errors:

compute# journalctl -xb


Also check the system status and verify that there are no errors there either:

compute# systemctl status
compute# systemctl list-units --type=service


Check that the volatile directory creation was successful with:

compute# systemctl status systemd-tmpfiles-setup.service


(Note: you can run the volatile directory creation again at any point with systemd-tmpfiles --create.)

Lastly, check that the time is synchronized between the compute node and the head node:

compute# timedatectl status
compute# timedatectl show-timesync --all


### Step 27: Set up SD cards as scratch space in the compute nodes

The spare SD cards can be mounted in the compute nodes and used as additional local scratch space for the occasional calculation. Insert the SD card in the compute node and do:

compute# fdisk -l


to identify the local disk. To format it, run fdisk on the corresponding device (in my case, /dev/mmcblk1):

compute# fdisk /dev/mmcblk1


and choose to delete all partitions (d) and make a new partition table if necessary (GPT type, with g). Then, make a new partition (n) and finally write the partitions and exit (w). Once the scratch partition is ready, create an EXT4 filesystem on it:

compute# mkfs.ext4 /dev/mmcblk1p1


where the device file is identified from the output of fdisk. You can now try to mount it in a temporary directory:

compute# cd /home
compute# mkdir temp
compute# mount /dev/mmcblk1p1 temp


Check that the new partition works and then clean up:

compute# umount temp
compute# rm -r temp


The local SD card partition will be mounted automatically in /scratch on start-up. For this, create the scratch directory in the image:

head# mkdir /image/pi4/scratch
head# chmod a+rX /image/pi4/scratch


and add the corresponding line in the image’s fstab file:

#### /image/pi4/etc/fstab
...
/dev/disk/by-path/platform-fe340000.mmc-part1 /scratch ext4 defaults,nofail 0 2


where the use of the “by-path” link ensures that the SD card is mounted regardless of the device name.
You can check the particular name your system uses by listing the contents of the by-path directory on the compute node. Finally, reboot the client and verify that the disk is mounted.

### Step 28: Create the scratch and opt partitions on the head node

Shut down both nodes and take out the head node’s SD card. Connect the SD card to the auxiliary desktop PC and use gparted to resize the / partition. Then, create two new partitions: one for /opt and one for /scratch. The /opt partition will hold the SLURM configuration as well as any computing software you may want to install. Calculate how much room you will need for /opt and then use the rest for /scratch. Insert the SD card into the head node again and boot it up. Now create the entries in the head node’s fstab for the two new partitions:

#### /etc/fstab
...
/dev/disk/by-path/platform-fe340000.mmc-part3 /scratch ext4 defaults,nofail 0 2
/dev/disk/by-path/platform-fe340000.mmc-part4 /opt ext4 defaults,nofail 0 2


and make the new /scratch and /opt directories:

head# mkdir /scratch
head# mkdir /opt
head# chmod a+rX /scratch
head# chmod a+rX /opt


Share the /opt partition via NFS by modifying the exports file:

#### /etc/exports
...
/opt 10.0.0.0/24(ro,sync,no_root_squash,no_subtree_check)


and re-export with:

head# exportfs -rv


Enter the corresponding opt line in the fstab of the image, so the compute nodes mount it over NFS on start-up:

#### /image/pi4/etc/fstab
...
10.0.0.1:/opt /opt nfs tcp,nolock,ro,v4 0 2


Optionally, create a software directory in the opt mount to place your programs:

head# mkdir /opt/software


Finally, reboot the head node and check that the two partitions are mounted.

### Step 29: Install SLURM

The rPI cluster is now completely installed and ready for operation. Now we need to install the scheduling system (SLURM). Boot up the head node. Once it is up, boot up all the compute nodes. In the head node, install munge:

head# apt install libmunge-dev libmunge2 munge


Then generate the MUNGE key and set the permissions:

head# dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
head# chown munge:munge /etc/munge/munge.key
head# chmod 400 /etc/munge/munge.key


Restart munge:

head# systemctl restart munge


To check that munge is working on the head node, generate a credential on stdout:

head# munge -n


Also, check that a credential can be decoded locally:

head# munge -n | unmunge


and you can also run a quick benchmark with:

head# remunge


On the head node, install SLURM from the package repository:

head# apt install slurm-wlm slurm-wlm-doc slurm-client


Copy the “munge” and “slurm” users and groups over from the head node to the compute nodes by propagating them from /etc/group, /etc/passwd, and /etc/shadow into the corresponding files in the image directory tree (/image/pi4).
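
For instance, a minimal sketch of that propagation using grep, assuming these entries do not already exist in the image files:

head# for u in munge slurm ; do for f in passwd shadow group ; do grep "^$u:" /etc/$f >> /image/pi4/etc/$f ; done ; done
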
In the image, install the munge packages:

chroot# apt install libmunge-dev libmunge2 munge


Configure new volatile directories using the tmpfiles service of systemd, so that munge can find its directories /var/log/munge and /var/lib/munge on the compute nodes:

chroot# echo "d /var/log/munge 0700 munge munge -" >> /etc/tmpfiles.d/vardirs.conf
chroot# echo "d /var/lib/munge 0700 munge munge -" >> /etc/tmpfiles.d/vardirs.conf


Copy the munge.key file from the head node to the image:

head# cp /etc/munge/munge.key /image/pi4/etc/munge/


and change the permissions of the munge key in the image:

chroot# chown munge:munge /etc/munge/munge.key
chroot# chmod 400 /etc/munge/munge.key


Boot up (or reboot) the compute node and check that munge is working by decoding a credential remotely. (You can also restart munge without rebooting with systemctl restart munge if the compute node is already up.):

head# munge -n | ssh 10.0.0.2 unmunge


where 10.0.0.2 is the IP of the corresponding compute node. Finally, install the SLURM daemon in the image:

chroot# apt install slurmd slurm-client


### Step 30: Configure SLURM

The SLURM configuration file is /etc/slurm/slurm.conf, and all nodes in the cluster must share the same configuration file. To make this easier, we direct this file to read the contents of another file that is shared via NFS, /opt/slurm/slurm.conf:

head# mkdir /opt/slurm
head# touch /opt/slurm/slurm.conf
head# echo "include /opt/slurm/slurm.conf" > /etc/slurm/slurm.conf
head# echo "include /opt/slurm/slurm.conf" > /image/pi4/etc/slurm/slurm.conf


and now we put our configuration in the configuration file that resides in the NFS-shared opt partition instead. To make a configuration file, visit the SLURM configurator and fill in the entries. For simplicity, I did not use ProctrackType=cgroup, because that would require setting up the cgroups. Once the configuration is generated, write it to /opt/slurm/slurm.conf. The most important part of the SLURM configuration file is at the end, where the node and partition characteristics are specified. You can get the configuration details of a particular node by SSHing into it and doing:

head# slurmd -C
compute# slurmd -C


Replace the node lines in slurm.conf with the relevant information from the output of these commands. In my cluster, I use the head node (b01) also as a compute node, since there aren’t very many nodes in the cluster anyway. However, I would like the head node to be used only when all the compute nodes are busy. For this to happen, the weight of the head node needs to be higher:

#### /opt/slurm/slurm.conf
...
NodeName=b01 CPUs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7807 Weight=100 State=UNKNOWN
NodeName=b02 CPUs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3789 Weight=1 State=UNKNOWN
PartitionName=pmain Nodes=ALL Default=YES MaxTime=INFINITE State=UP


Note that, in my case, the head node has 8GB of memory but the compute node only 4GB. Finally, set the permissions, create the SLURM working directory, and enable and restart the SLURM daemons:

head# chmod a+r /etc/slurm/slurm.conf
head# mkdir /var/spool/slurmctld
head# chown -R slurm:slurm /var/spool/slurmctld
head# systemctl enable slurmctld
head# systemctl restart slurmctld
head# systemctl restart slurmd


You can check that the server is running with:

head# systemctl status slurmctld
head# systemctl status slurmd
head# scontrol ping


(Note: if there is an error, you can look up the problem in the logs located in /var/log/slurm*.
You can also run the SLURM daemon with verbose options, slurmd -Dvvvv, on the node to obtain more information.)

Enable slurmd in the image:

chroot# systemctl enable slurmd


If the compute node is down, reboot it. Otherwise, log into it and restart the SLURM daemon:

compute# systemctl restart slurmd


and verify that it works:

compute# systemctl status slurmd


Finally, check that all the nodes are up by doing:

head# sinfo -N


A quirk of SLURM is that, by default, nodes that are down (because of a crash or because they were rebooted) are not automatically brought back into production. They need to be resumed with:

head# scontrol update NodeName=b02 State=RESUME


If you want, you can do a power cycle of the whole HPC cluster by shutting down all the nodes. Then boot up the head node and, once it is up, start all the compute nodes. Resume the compute nodes and check that the scheduler is working with sinfo -N.

### Step 31: Submit a test job

As a non-privileged user, submit a test job on the head node. The job can be:

#### bleh.sub
#! /bin/bash
#SBATCH -t 28-00:00
#SBATCH -J test
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -c 1

cat /dev/urandom > /dev/null


and to submit it twice, do:

head$ sbatch bleh.sub
head$ sbatch bleh.sub
head$ squeue


You should see the first job picked up by the compute node and the second by the head node (the head node has the larger weight, so it is used last). Cancel them with:

head$ scancel 1
head$ scancel 2


### Step 32: SLURM prolog, epilog, and taskprolog scripts

These scripts create the temporary directory in the scratch partition and set the appropriate permissions and environment variables. First, set up the script names in the SLURM configuration:

#### /opt/slurm/slurm.conf
...
Epilog=/opt/slurm/epilog.sh
...
Prolog=/opt/slurm/prolog.sh
...
TaskProlog=/opt/slurm/taskprolog.sh
...


We place these scripts in the shared opt partition so the compute nodes have access to exactly the same version as the head node. The scripts create the SLURM_TMPDIR directory in the scratch partition (the SD card each compute node has) and set the appropriate environment variable. When the job finishes, the temporary directory is removed. They are:

#### /opt/slurm/epilog.sh
#! /bin/bash

# remove the temporary directory
export SLURM_TMPDIR=/scratch/${SLURM_JOB_USER}-${SLURM_JOBID}
if [ -d "$SLURM_TMPDIR" ] ; then
  rm -rf "$SLURM_TMPDIR"
fi

exit 0

#### /opt/slurm/prolog.sh
#! /bin/bash

# prepare the temporary directory.
export SLURM_TMPDIR=/scratch/${SLURM_JOB_USER}-${SLURM_JOBID}
mkdir $SLURM_TMPDIR
chown ${SLURM_JOB_USER}:users $SLURM_TMPDIR
chmod 700 $SLURM_TMPDIR

exit 0

#### /opt/slurm/taskprolog.sh
#! /bin/bash

echo export SLURM_TMPDIR=/scratch/${SLURM_JOB_USER}-${SLURM_JOBID}

exit 0


Finally, make the three scripts executable with:

head# chmod a+rx /opt/slurm/*.sh


And re-read the SLURM configuration with:

head# scontrol reconfigure
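

Once the configuration is re-read, a quick sanity check that the taskprolog exports the variable (the single quotes keep the expansion on the compute side; this check is an addition to the original recipe):

head$ srun bash -c 'echo $SLURM_TMPDIR'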


### Step 33: Install Quantum ESPRESSO

To show how the rPI cluster can be used to run scientific calculations, we now install one of the most popular packages for electronic structure calculations in periodic solids, Quantum ESPRESSO (QE). QE has been packaged in the Debian repository, so installing the program can be done via the package manager in the head node and in the chroot:

head# apt install quantum-espresso
chroot# apt install quantum-espresso


Now we run a simple total energy calculation on the silicon crystal as an unprivileged user. First copy over the pseudopotential:

head$ cd
head$ cp /usr/share/espresso/pseudo/Si.pbe-rrkj.UPF si.UPF


and then write the input file:

#### ~/si.scf.in
&control
title='crystal',
prefix='crystal',
pseudo_dir='.',
/
&system
ibrav=0,
nat=2,
ntyp=1,
ecutwfc=40.0,
ecutrho=400.0,
/
&electrons
conv_thr = 1d-8,
/
ATOMIC_SPECIES
Si    28.085500 si.UPF

ATOMIC_POSITIONS crystal
Si    0.12500000     0.12500000     0.12500000
Si    0.87500000     0.87500000     0.87500000

K_POINTS automatic
4 4 4  1 1 1

CELL_PARAMETERS bohr
0.000000000000     5.131267854931     5.131267854931
5.131267854931     0.000000000000     5.131267854931
5.131267854931     5.131267854931     0.000000000000


Next, write the submission script:

#### ~/si.sub
#! /bin/bash
#SBATCH -t 28-00:00
#SBATCH -J si
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -c 1

cd
mpirun -np 4 pw.x < si.scf.in > si.scf.out


and finally submit the calculation with:

head$ sbatch si.sub


You can follow the progress of the calculation with squeue. It should eventually finish and produce the si.scf.out output file containing the result of your calculation.
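
When the calculation completes, you can pull the converged total energy from the output file; pw.x marks that line with a leading “!”:

head$ grep '^!' si.scf.out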
