PACE Setup

Work in Progress

Getting Access to PACE

Account Types and Limits

  • Student Accounts: Free tier with limited compute hours
  • Research Allocations: Group allocations with shared compute time
  • Storage: Home directory (50GB) + group storage allocation

Connecting to PACE via SSH

See the SSH and Git Setup guide for detailed instructions on configuring your SSH connection.
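
Once configured, a minimal `~/.ssh/config` entry that makes the `ssh pace` shorthand below work might look like the following. Both the hostname and the username here are placeholders — substitute the actual login host and your GT account name from the SSH guide:

```
Host pace
    HostName login-phoenix.pace.gatech.edu
    User gburdell3
```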

First Login Setup

# After first successful login
ssh pace

# Check your environment
hostname
whoami
pwd
df -h $HOME

# Check available modules
module avail

Understanding the PACE Environment

Cluster Architecture

PACE consists of multiple clusters:

  • Phoenix: Primary cluster with modern hardware
      • Login nodes: General access, file management, job submission
      • Compute nodes: CPU and GPU nodes for actual computation
      • Storage: High-performance parallel file systems
  • ICE: Specialized cluster for certain workloads
      • Different hardware configurations
      • May have different software availability
Node Types

Login Nodes

  • Purpose: File management, job submission, light development
  • Limitations:
      • No intensive computation (long-running processes are killed after about 30 minutes)
      • Shared among all users
      • Limited memory and CPU
  • Use for: Editing files, submitting jobs, basic testing
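
Because the 30-minute limit applies to individual processes, it is easy to lose a long build or download started on a login node by accident. A small guard like the following can catch this in your own scripts. This is a sketch: the `login-` hostname prefix is an assumption about PACE's node naming, so adjust the pattern to match what `hostname` actually prints:

```shell
# Sketch: refuse heavy work when the hostname looks like a login node.
# The "login-" prefix is an assumption; adjust to the cluster's naming.
is_login_node() {
  case "$1" in
    login-*) return 0 ;;
    *)       return 1 ;;
  esac
}

if is_login_node "$(hostname)"; then
  echo "This looks like a login node -- submit heavy work with sbatch instead."
fi
```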

Compute Nodes

  • CPU Nodes: Various configurations (8-64 cores, 32GB-1TB RAM)
  • GPU Nodes: NVIDIA GPUs (V100, A100, RTX series)
  • Access via: SLURM job scheduler only
  • Use for: Training models, running experiments, intensive computation

Software Environment

Module System

PACE uses environment modules to manage software:

# List available modules
module avail

# Search for specific software
module avail python
module avail cuda
module avail torch

# Load modules
module load python/3.11
module load cuda/11.8

# List loaded modules
module list

# Unload modules
module unload python/3.11
module purge  # unload all

# Show module details
module show python/3.11
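
Under the hood, `module load` mostly amounts to editing environment variables such as `PATH` and `LD_LIBRARY_PATH` for the chosen package (run `module show` to see exactly what a module changes). A rough sketch of the effect — the install directory below is a made-up example, not PACE's actual layout:

```shell
# Roughly what `module load python/3.11` does: prepend the package's
# bin directory to PATH so its executables shadow the system ones.
# The directory below is an illustrative, made-up path.
export PATH="/opt/apps/python/3.11/bin:$PATH"

# The first PATH entry now points at the loaded package
echo "$PATH" | cut -d: -f1  # → /opt/apps/python/3.11/bin
```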

Resource Allocation System

Quality of Service (QoS) Levels

  • inferno: Default QoS; higher priority, supports long-running jobs
  • embers: Low-priority, preemptible jobs with 1 hour of guaranteed runtime

Account Structure

# Check your allocations
pace-quota

File System and Storage

TODO: Add details about file systems, storage options, and best practices for data management.

Basic SLURM Commands

Job Submission

Interactive Jobs

# Request interactive session
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=2:00:00 --qos=inferno

# Request GPU node
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --gres=gpu:1 --time=1:00:00 --qos=inferno

# Exit interactive session
exit

Batch Jobs

Create a SLURM script job.slurm:

#!/bin/bash
#SBATCH --job-name=my-experiment
#SBATCH --account=paceship-dsgt_clef2025
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=4:00:00
#SBATCH --qos=inferno
#SBATCH --output=job-%j.out
#SBATCH --error=job-%j.err

# Load modules
module load python/3.11
module load cuda/11.8

# Activate environment
source ~/.venvs/research-env/bin/activate

# Run your script
python train_model.py --config configs/bert.yaml

Submit the job:

sbatch job.slurm
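
On success, `sbatch` prints a line like `Submitted batch job 12345`. For scripting, either pass `--parsable` (which makes `sbatch` print only the job ID) or extract the ID from the default output, as sketched below:

```shell
# Extract the job ID from sbatch's default output so follow-up
# commands (scontrol, scancel) can reference it.
extract_job_id() {
  # $1: a line like "Submitted batch job 12345"
  echo "$1" | awk '{print $4}'
}

jobid=$(extract_job_id "Submitted batch job 12345")
echo "$jobid"  # → 12345
```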

Job Management

# Check job queue
squeue -u $USER

# Check all jobs for your account
squeue -A paceship-dsgt_clef2025

# Check job details
scontrol show job JOBID

# Cancel job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

Job Monitoring

# Check running jobs
squeue -u $USER -t RUNNING

# Monitor resource usage of a running job:
# first find which node the job is on, then SSH to it
squeue -u $USER -o "%i %N"
ssh <node-name>

# On the compute node:
htop        # CPU and memory usage
nvidia-smi  # GPU usage

Common SLURM Parameters

Resource Requests

# CPU jobs
--nodes=1                    # Number of nodes
--ntasks=1                   # Number of tasks (usually 1 for Python)
--cpus-per-task=8            # CPU cores per task
--mem-per-cpu=4G             # Memory per CPU core
--time=4:00:00               # Wall time (HH:MM:SS)

# GPU jobs
--gres=gpu:1                 # Request 1 GPU
--gres=gpu:rtx_6000:1        # Request specific GPU type
--gres=gpu:2                 # Request 2 GPUs

# Memory options
--mem=32G                    # Total memory for job
--mem-per-cpu=4G             # Memory per CPU core
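
Note that `--mem` and `--mem-per-cpu` are mutually exclusive ways to express the same request: with `--mem-per-cpu`, the total allocation is the per-core amount times `--cpus-per-task`. A quick sanity check of the arithmetic:

```shell
# Total memory = cpus-per-task * mem-per-cpu
# e.g. 8 cores at 4G each is a 32G allocation
cpus_per_task=8
mem_per_cpu_gb=4
total_gb=$((cpus_per_task * mem_per_cpu_gb))
echo "${total_gb}G total"  # → 32G total
```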

Job Control

--job-name=my-job           # Job name
--output=job-%j.out         # Output file (%j = job ID)
--error=job-%j.err          # Error file
--mail-type=ALL             # Email notifications
--mail-user=you@gatech.edu  # Email address

Best Practices

Resource Management

  1. Start small: Test with short jobs first
  2. Request only what you need: Don't waste resources
  3. Use checkpointing: Save progress for long jobs
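
For long jobs, SLURM can deliver a signal shortly before the wall-time limit (e.g. `#SBATCH --signal=B:USR1@60` requests `SIGUSR1` 60 seconds early), which a script can trap to save state before being killed. A minimal sketch — the checkpoint file name and the `step` variable are illustrative, not a prescribed format:

```shell
#!/bin/bash
# Save enough state to resume later; here just a loop counter.
step=0
save_checkpoint() {
  echo "step=${step}" > checkpoint.txt
}

# Trap the early-warning signal (and TERM) to checkpoint before exiting.
trap 'save_checkpoint; exit 0' USR1 TERM

for step in 1 2 3; do
  save_checkpoint  # also checkpoint periodically during normal work
done

cat checkpoint.txt  # → step=3
```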