# HPC Deployment

## Overview

NIKI works well on HPC clusters. Key considerations:

- Install NIKI in a conda or virtualenv environment
- Run the daemon on a login node or a dedicated node
- Use `niki attach` to monitor jobs that are already running under Slurm

## Installation on HPC

```sh
# Create and activate an environment
conda create -n niki python=3.10
conda activate niki

# Install NIKI (quote the extras so the shell doesn't expand the brackets)
pip install "niki[all]"

# Initialize the configuration
niki config init
```

## Running the Daemon

### Option 1: Screen/tmux session

```sh
# Start a screen session
screen -S niki

# Start the daemon in the foreground
niki daemon start --foreground

# Detach: Ctrl+A, D
# Reattach later with:
screen -r niki
```
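
The tmux equivalent, if your cluster favors it over screen:

```sh
# Start a tmux session and run the daemon inside it
tmux new -s niki
niki daemon start --foreground

# Detach: Ctrl+B, D
# Reattach later with:
tmux attach -t niki
```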

### Option 2: Systemd user service

```sh
# Start as a systemd service (if available)
niki daemon start

# Check status
systemctl --user status niki
```
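
If `niki daemon start` cannot register a service on your cluster, you can write a user unit by hand. The following is a minimal sketch, not a documented unit: the file name, the `ExecStart` path, and the conda env location are assumptions to adjust for your install:

```ini
# ~/.config/systemd/user/niki.service  (hypothetical; verify paths for your setup)
[Unit]
Description=NIKI daemon

[Service]
# Assumes niki lives in the conda env created above; adjust if yours differs
ExecStart=%h/.conda/envs/niki/bin/niki daemon start --foreground
Restart=on-failure

[Install]
WantedBy=default.target
```

Reload and enable it with `systemctl --user daemon-reload` and `systemctl --user enable --now niki`. On shared login nodes, user services usually need `loginctl enable-linger $USER` to keep running after you log out.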

## Monitoring Slurm Jobs

### Method 1: Wrap with `niki watch`

Wrap the training command in your SBATCH script:

```sh
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --time=24:00:00

# niki watch runs the command and monitors it
niki watch --name "GPU Training" -- python train.py
```

### Method 2: Attach to a running job

```sh
# Find which node the job is running on
squeue -u $USER

# SSH to that node and find the main process PID
ssh node-042
pgrep -f train.py

# Attach to the process
niki attach <PID> --name "Training job"
```

## Dealing with Compute Nodes

If compute nodes can't reach the internet (for notifications):

1. Run the daemon on the login node - the daemon sends notifications on the jobs' behalf
2. Use an SSH tunnel - forward the daemon port from the login node to compute nodes
3. Use the internal network - configure the daemon to listen on an internal IP (see the sketch after this list)
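
For option 3, the exact configuration key depends on your NIKI version; the `daemon.host` key and the address below are assumptions, so check the file generated by `niki config init` for the real option names:

```sh
# Hypothetical: bind the daemon to the login node's cluster-internal IP.
# The subcommand and key name are assumptions; verify against your config.
niki config set daemon.host 10.0.0.1
niki daemon start
```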

### SSH Tunnel Approach

```sh
# On the compute node, in your job script:
# forward local port 7432 to the daemon's port on the login node
ssh -N -L 7432:localhost:7432 login-node &

# Now niki watch reaches the daemon as if it were local
niki watch -- python train.py
```

## Environment Variables

Remember to set API keys in your job environment:

```sh
# In ~/.bashrc or your job script
export ANTHROPIC_API_KEY="sk-ant-..."
```
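
Slurm propagates the submission shell's environment by default (`--export=ALL`), but if your site overrides that default you can forward the key explicitly at submit time; `job.sh` here is a stand-in for your own script:

```sh
# Forward the key into the job environment explicitly
sbatch --export=ALL,ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" job.sh
```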

## Example SBATCH Script

```sh
#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=logs/%j.out

source ~/.bashrc
conda activate niki

niki watch --name "ResNet Training" -- \
    python train.py \
        --model resnet50 \
        --epochs 1000 \
        --lr 0.001
```
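
Slurm cannot create the `%j.out` file if the `logs/` directory is missing, so create it before you submit; `train_job.sh` is a stand-in name for the script above:

```sh
mkdir -p logs          # --output=logs/%j.out needs the directory to exist
sbatch train_job.sh    # submit; niki watch starts once the job is scheduled
```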