# HPC Deployment

## Overview

NIKI works well on HPC clusters. Key considerations:
- Install in a conda/virtualenv environment
- Run daemon on login node or dedicated node
- Use `niki attach` for existing Slurm jobs

## Installation on HPC

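On many clusters, conda is not on the default PATH and is loaded through the environment-module system first. The module name below is only an example and will differ between sites; check `module avail` on your cluster.

```bash
# Load your site's conda module before creating the environment
# (the module name here is an assumption - it varies by cluster)
module load anaconda3
```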
```bash
# Create environment
conda create -n niki python=3.10
conda activate niki

# Install NIKI
pip install niki[all]

# Configure
niki config init
```

## Running the Daemon

### Option 1: Screen/tmux session

```bash
# Start a screen session
screen -S niki

# Start daemon
niki daemon start --foreground

# Detach: Ctrl+A, D
# Reattach: screen -r niki
```
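The option name mentions tmux as well; for sites where screen is not installed, the equivalent tmux workflow is:

```bash
# Start a tmux session
tmux new -s niki

# Start daemon (inside the session)
niki daemon start --foreground

# Detach: Ctrl+B, D
# Reattach: tmux attach -t niki
```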

### Option 2: Systemd user service

```bash
# Start as systemd service (if available)
niki daemon start

# Check status
systemctl --user status niki
```
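Many systems stop a user's systemd services when their last login session ends. If your site permits it (an assumption; some centers disable this), enabling lingering keeps the daemon running after you log out:

```bash
# Keep this user's services running after logout
loginctl enable-linger $USER

# Verify
loginctl show-user $USER | grep -i linger
```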

## Monitoring Slurm Jobs

### Method 1: Wrap with `niki watch`

```bash
# In your SBATCH script
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --time=24:00:00

niki watch --name "GPU Training" -- python train.py
```
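If your site launches job steps with `srun` (for task placement or GPU binding), the wrapper composes the same way. This is a sketch that assumes `niki watch` simply runs whatever command follows `--`, as in the other examples on this page:

```bash
# srun handles placement and binding; niki watch monitors the whole job step
niki watch --name "GPU Training" -- srun python train.py
```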

### Method 2: Attach to running job

```bash
# Find the main process PID
squeue -u $USER
ssh node-042        # SSH to the node
pgrep -f train.py

# Attach
niki attach <PID> --name "Training job"
```
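Some clusters block direct SSH to compute nodes. An alternative, assuming Slurm 20.11 or newer (which added `--overlap`), is to open a shell inside the job's existing allocation and find the PID from there:

```bash
# Get a shell co-located with the running job, then find the PID as above
srun --jobid=<JOBID> --overlap --pty bash
pgrep -f train.py
```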

## Dealing with Compute Nodes

If compute nodes can't reach the internet (which notifications require), there are a few options:
- Run the daemon on the login node - the daemon then sends notifications from a node with internet access
- Use an SSH tunnel - forward the daemon port from the login node to compute nodes (see the SSH Tunnel Approach below)
- Use the internal network - configure the daemon to listen on an internal IP that compute nodes can reach (a quick connectivity check follows this list)
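Before choosing, it is worth checking whether compute nodes can already reach the daemon over the cluster's internal network. A quick test (assuming `nc` is available on the compute node and the daemon listens on port 7432, the port used in the tunnel example below):

```bash
# Run from a compute node: resolve the login node and probe the daemon port
getent hosts login-node
nc -z login-node 7432 && echo "daemon reachable" || echo "daemon not reachable"
```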

### SSH Tunnel Approach

```bash
# On the compute node, in your job script:
# forward the daemon port on the login node to localhost on this node
ssh -N -L 7432:localhost:7432 login-node &

# Now niki watch works normally
niki watch -- python train.py
```
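The backgrounded tunnel above fails silently if the port cannot be bound or the connection drops. A slightly hardened sketch, assuming passwordless SSH from compute nodes to `login-node`:

```bash
# Fail fast if forwarding is refused, keep the connection alive,
# and tear the tunnel down when the job script exits
ssh -N -L 7432:localhost:7432 \
    -o ExitOnForwardFailure=yes \
    -o ServerAliveInterval=30 \
    login-node &
TUNNEL_PID=$!
trap 'kill "$TUNNEL_PID" 2>/dev/null' EXIT

niki watch -- python train.py
```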

## Environment Variables

Remember to set API keys in your job environment:
```bash
# In ~/.bashrc or job script
export ANTHROPIC_API_KEY="sk-ant-..."
```
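Job scripts and Slurm output files are often readable by other users, so avoid pasting keys directly into them. One common pattern (the file path here is a convention, not something NIKI requires) is to keep the key in a private file and source it:

```bash
# One-time setup on the login node
mkdir -p ~/.config/niki
chmod 700 ~/.config/niki
cat > ~/.config/niki/env <<'EOF'
export ANTHROPIC_API_KEY="sk-ant-..."
EOF
chmod 600 ~/.config/niki/env

# In your job script
source ~/.config/niki/env
```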

## Example SBATCH Script

```bash
#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=logs/%j.out

source ~/.bashrc
conda activate niki

niki watch --name "ResNet Training" -- \
    python train.py \
    --model resnet50 \
    --epochs 1000 \
    --lr 0.001
```
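Submit and monitor the job with the standard Slurm commands (the script filename is just an example):

```bash
mkdir -p logs             # the --output path above requires this directory to exist
sbatch train_job.sh       # submit the script shown above
squeue -u $USER           # check queue status
tail -f logs/<JOBID>.out  # follow the output once the job starts
```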