# HPC Deployment

## Overview

NIKI works well on HPC clusters. Key considerations:

- Install NIKI in a conda or virtualenv environment
- Run the daemon on a login node or a dedicated node
- Use `niki attach` to monitor jobs that are already running under Slurm

## Installation on HPC

```sh
# Create and activate an environment
conda create -n niki python=3.10
conda activate niki

# Install NIKI (quote the extras so the shell doesn't expand the brackets)
pip install "niki[all]"

# Initialize the configuration
niki config init
```

## Running the Daemon

### Option 1: Screen/tmux session

```sh
# Start a screen session
screen -S niki

# Start the daemon in the foreground
niki daemon start --foreground

# Detach: Ctrl+A, D
# Reattach later with:
screen -r niki
```
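
The tmux equivalent, if your cluster favors it over screen:

```sh
# Start a tmux session and run the daemon inside it
tmux new -s niki
niki daemon start --foreground

# Detach: Ctrl+B, D
# Reattach later with:
tmux attach -t niki
```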

### Option 2: Systemd user service

```sh
# Start as a systemd service (if available)
niki daemon start

# Check status
systemctl --user status niki
```
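
If `niki daemon start` cannot register a service on your cluster, you can write a user unit by hand. The following is a minimal sketch, not a documented unit: the file name, the `ExecStart` path, and the conda env location are assumptions to adjust for your install:

```ini
# ~/.config/systemd/user/niki.service  (hypothetical; verify paths for your setup)
[Unit]
Description=NIKI daemon

[Service]
# Assumes niki lives in the conda env created above; adjust if yours differs
ExecStart=%h/.conda/envs/niki/bin/niki daemon start --foreground
Restart=on-failure

[Install]
WantedBy=default.target
```

Reload and enable it with `systemctl --user daemon-reload` and `systemctl --user enable --now niki`. On shared login nodes, user services usually need `loginctl enable-linger $USER` to keep running after you log out.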

## Monitoring Slurm Jobs

### Method 1: Wrap with `niki watch`

Wrap the training command in your SBATCH script:

```sh
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --time=24:00:00

# niki watch runs the command and monitors it
niki watch --name "GPU Training" -- python train.py
```

### Method 2: Attach to a running job

```sh
# Find which node the job is running on
squeue -u $USER

# SSH to that node and find the main process PID
ssh node-042
pgrep -f train.py

# Attach to the process
niki attach <PID> --name "Training job"
```

## Dealing with Compute Nodes

If compute nodes can't reach the internet (for notifications):

1. Run the daemon on the login node - the daemon sends notifications on the jobs' behalf
2. Use an SSH tunnel - forward the daemon port from the login node to compute nodes
3. Use the internal network - configure the daemon to listen on an internal IP (see the sketch after this list)
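
For option 3, the exact configuration key depends on your NIKI version; the `daemon.host` key and the address below are assumptions, so check the file generated by `niki config init` for the real option names:

```sh
# Hypothetical: bind the daemon to the login node's cluster-internal IP.
# The subcommand and key name are assumptions; verify against your config.
niki config set daemon.host 10.0.0.1
niki daemon start
```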

### SSH Tunnel Approach

```sh
# On the compute node, in your job script:
# forward local port 7432 to the daemon's port on the login node
ssh -N -L 7432:localhost:7432 login-node &

# Now niki watch reaches the daemon as if it were local
niki watch -- python train.py
```

## Environment Variables

Remember to set API keys in your job environment:

```sh
# In ~/.bashrc or your job script
export ANTHROPIC_API_KEY="sk-ant-..."
```
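
Slurm propagates the submission shell's environment by default (`--export=ALL`), but if your site overrides that default you can forward the key explicitly at submit time; `job.sh` here is a stand-in for your own script:

```sh
# Forward the key into the job environment explicitly
sbatch --export=ALL,ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" job.sh
```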

## Example SBATCH Script

```sh
#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=logs/%j.out

source ~/.bashrc
conda activate niki

niki watch --name "ResNet Training" -- \
    python train.py \
        --model resnet50 \
        --epochs 1000 \
        --lr 0.001
```
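
Slurm cannot create the `%j.out` file if the `logs/` directory is missing, so create it before you submit; `train_job.sh` is a stand-in name for the script above:

```sh
mkdir -p logs          # --output=logs/%j.out needs the directory to exist
sbatch train_job.sh    # submit; niki watch starts once the job is scheduled
```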