Pegasus Docs

Debugging #

Remember to treat debugging like any interactive job: use it with care and sparingly. Please run only one session at a time and stop the job once you’re done debugging. Any long-running computation has to happen in standard batch jobs.

Running code on the cluster instead of your PC means you do not have direct access to the running processes and thus cannot easily connect a debugger. While it’s possible to run gdb python on the console within an interactive job, the experience is subpar compared to modern, full-featured visual debuggers. Some IDEs, such as PyCharm Professional and VSCode, support connecting to a remote machine and debugging there, but their configuration can be complex (more so because you cannot know in advance which node(s) your job will run on) and requires installing a server component on the remote machine. We also cannot give advice for every IDE.

Instead, it is now easier to run the IDE itself on the cluster. Meet code-server, a lightly patched version of VSCode that can be run anywhere and accessed from a browser. Existing VSCode users can even copy over their config, since code-server accepts the same configuration.

HTTPS Certificate #

Before you start, generate a self-signed certificate for use with HTTPS. The certificate is valid for 1 year (-days 365), so repeat the process when it expires.

mkdir -p ~/.cert
openssl req -x509 -newkey rsa:4096 -days 365 -nodes \
  -keyout ~/.cert/pegasus.key -out ~/.cert/pegasus.crt -sha256 \
  -subj "/C=DE/ST=Rheinland-Pfalz/L=Kaiserslautern/O=DFKI/OU=Pegasus/CN=kl.dfki.de"
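
To see whether the certificate is due for renewal, openssl can report the expiry date and test it against a time window. The following is a minimal sketch, assuming the certificate path from the command above; the function name cert_valid_for_days is made up for this example.

```shell
# Hypothetical helper: succeeds if the certificate exists and stays valid
# for at least the given number of days.
cert_valid_for_days() {
    cert="$1"
    days="$2"
    # A missing certificate counts as "needs to be generated"
    [ -f "$cert" ] || return 1
    # -checkend N exits non-zero if the certificate expires within N seconds
    openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}

# Example: warn when fewer than 30 days remain
# cert_valid_for_days ~/.cert/pegasus.crt 30 || echo "Renew ~/.cert/pegasus.crt soon"
# Print the expiry date itself:
# openssl x509 -in ~/.cert/pegasus.crt -noout -enddate
```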

Script template #

Anyone with access to the IDE can run arbitrary commands in your name, so protect it with a password. Use the following command to generate the hash for your password:
echo -n <your password> | argon2 $(openssl rand -base64 32) -e
Set the HASHED_PASSWORD variable in the script to this hash. Do not reuse your cluster password. Run chmod u+x,go-rwx on your script so that only you can read it.
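
Typing the password on the command line leaves it in your shell history. A minimal sketch of a variant that reads the password from the terminal without echoing it, assuming the same argon2 and openssl tools are installed; the function name hash_password is hypothetical.

```shell
# Hypothetical helper: prompt for a password and print its argon2 hash.
# The body runs in a subshell so the password variable does not leak.
hash_password() (
    # Disable terminal echo only when stdin is a terminal
    if [ -t 0 ]; then stty -echo; fi
    printf 'Password: ' >&2
    IFS= read -r password
    if [ -t 0 ]; then stty echo; fi
    printf '\n' >&2
    # Same hashing command as in the text, with a fresh random salt
    printf '%s' "$password" | argon2 "$(openssl rand -base64 32)" -e
)
```

Calling hash_password prints the encoded hash, which can then be pasted into HASHED_PASSWORD.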

Copy and modify the following shell script to start an IDE on the cluster:

#!/bin/bash
# Replace <PASSWORD HASH> below with the argon2 hash of the
# password of your choice.
# Run the following command and copy the output here:
#
# echo -n <your password> | argon2 $(openssl rand -base64 32) -e
#
export HASHED_PASSWORD='<PASSWORD HASH>'

# Choose a port based on the job id
export PORT=$(((${SLURM_JOB_ID} + 10007) % 16384 + 49152))

# Use the latest version of code server
export CODE_SERVER="$(find /netscratch/software/ -name 'code-server-*-linux-amd64.tar.gz' -printf "%T@ %p\n" | sort -n | cut -d ' ' -f2 | tail -1)"
if [ -z "$CODE_SERVER" ]
then
      echo "ERROR: no code server package found; check that /netscratch/software is in --container-mounts"
      exit 1
fi

# Print the URL where the IDE will become available
echo
echo =========================================
echo =========================================
echo =========================================
echo
echo using $CODE_SERVER
echo
echo IDE will be available at:
echo
echo $HOSTNAME.kl.dfki.de:$PORT
echo
echo Please wait for setup to finish.
echo
echo =========================================
echo =========================================
echo =========================================
echo

# Extract the IDE files
tar -f "$CODE_SERVER" -C /tmp/ -xz

# Install extensions; append further --install-extension lines as needed,
# e.g. --install-extension="ms-python.vscode-pylance"
/tmp/code-server-*/bin/code-server \
    --user-data-dir=.code-server \
    --install-extension="ms-python.python"

# Start the IDE
/tmp/code-server-*/bin/code-server \
    --disable-telemetry \
    --disable-update-check \
    --bind-addr=$HOSTNAME.kl.dfki.de:$PORT \
    --auth password \
    --cert "/home/$SLURM_JOB_USER/.cert/pegasus.crt" \
    --cert-key "/home/$SLURM_JOB_USER/.cert/pegasus.key" \
    --user-data-dir=.code-server \
    "$(pwd)"

The script includes a step to install extensions. By default it installs the Python extension, but you can add or remove extensions as needed. Check the VSCode marketplace and add an --install-extension line with the “Unique Identifier” of the extension you want to install.
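
Adding several extensions means repeating the --install-extension flag once per identifier. As a rough sketch, the flags could be generated from a list; the helper name install_flags and the second identifier are placeholders, not part of the original script. Note that code-server typically installs from the Open VSX registry, so an extension found on the VSCode marketplace may not be available there.

```shell
# Hypothetical helper: print one --install-extension flag per identifier
install_flags() {
    for ext in "$@"; do
        printf '%s=%s\n' '--install-extension' "$ext"
    done
}

# Usage inside the start script (identifiers contain no spaces, so the
# unquoted command substitution splits into one argument per flag):
# /tmp/code-server-*/bin/code-server \
#     --user-data-dir=.code-server \
#     $(install_flags ms-python.python example.some-extension)
```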

Save the script to a file, e.g., start_code_server.sh. Remember to chmod u+x,go-rwx the file to make it executable and keep others from accessing it.
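
The port formula in the script maps any job id into the range 49152 to 65535, the dynamic/private port range. A small sketch of the same arithmetic as a standalone function (the name code_server_port is made up here) can help when you want to predict the port for a given job id:

```shell
# Same arithmetic as the PORT line in the script
code_server_port() {
    job_id="$1"
    echo $(( (job_id + 10007) % 16384 + 49152 ))
}

code_server_port 123456   # prints 51543
```

Job ids that differ by a multiple of 16384 map to the same port, so the formula spreads ports out but does not guarantee uniqueness.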

Starting an IDE #

You can run your saved script through srun like any other script. The following command assumes the script is in the current directory and opens the IDE in the current directory (note the /netscratch/software mount, so we don’t have to download the code-server archive from GitHub all the time):

srun \
  --container-image=/enroot/nvcr.io_nvidia_pytorch_22.03-py3.sqsh \
  --container-mounts=/netscratch/software:/netscratch/software:ro,"`pwd`":"`pwd`" \
  --container-workdir="`pwd`" \
  --time=01:00:00 \
  start_code_server.sh

This starts the IDE inside the PyTorch 22.03 environment. The address to access it is printed on your console; open it in your browser as an https:// URL. Because the certificate is self-signed, you will probably need to add a security exception in your browser. Log in with your password and you should see a website that is functionally almost identical to VSCode running on your PC, except that it runs on a cluster node and has access to the job’s resources. It opens the current directory when first loaded. You can change your code and, most importantly, use the visual debugger as you would with VSCode on your PC. Follow the official Python debugging tutorial if you are new to VSCode.
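
Since the script prints only host and port, the full address has to be assembled with the https:// scheme. A minimal sketch, assuming the host naming used in the script; the function name ide_url, the example node name, and the curl check are illustrative additions.

```shell
# Hypothetical helper: build the browser URL from node name and port
ide_url() {
    printf 'https://%s.kl.dfki.de:%s/\n' "$1" "$2"
}

# Optional reachability check from a machine that can reach the node;
# -k accepts the self-signed certificate, -I requests headers only:
# curl -k -s -I "$(ide_url examplenode 51543)"
```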

Overall, the experience is pretty good, if a bit clunky. For example, the setup wizard is displayed every time (just ignore it), and breakpoints are not stored between runs.

Troubleshooting #

The example script is set up to use .code-server in the current directory for data storage. Some issues may be resolved by stopping the server and removing this folder.

Users who are running a remote VSCode SSH session have reported that startup may fail with an error similar to this:

[2022-10-24T07:44:58.646Z] error got error from Code
{"error":{"errno":-2,"code":"ENOENT","syscall":"connect","address":"..."}}

The error was resolved by stopping the VSCode session.