PACE Setup
Work in Progress
Getting Access to PACE
Account Types and Limits
- Student Accounts: Free tier with limited compute hours
- Research Allocations: Group allocations with shared compute time
- Storage: Home directory (50GB) + group storage allocation
Connecting to PACE via SSH
SSH Configuration (Recommended)
See the SSH and Git Setup guide for detailed instructions on configuring your SSH connection.
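A minimal sketch of an SSH config entry that makes the ssh pace shortcut below work; the hostname and username here are placeholders, so use the values from that guide:

# ~/.ssh/config
Host pace
    HostName login-phoenix.pace.gatech.edu  # assumed login host; confirm against the setup guide
    User your-gt-username
    IdentityFile ~/.ssh/id_ed25519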
First Login Setup
# After first successful login
ssh pace
# Check your environment
hostname
whoami
pwd
df -h $HOME
# Check available modules
module avail
Understanding the PACE Environment
Cluster Architecture
PACE consists of multiple clusters:
- Phoenix: Primary cluster with modern hardware
  - Login nodes: General access, file management, job submission
  - Compute nodes: CPU and GPU nodes for actual computation
  - Storage: High-performance parallel file systems
- ICE: Specialized cluster for certain workloads
  - Different hardware configurations
  - May have different software availability
Node Types
Login Nodes
- Purpose: File management, job submission, light development
- Limitations:
  - No intensive computation (long-running processes are killed after about 30 minutes)
  - Shared among all users
  - Limited memory and CPU
- Use for: Editing files, submitting jobs, basic testing
Compute Nodes
- CPU Nodes: Various configurations (8-64 cores, 32GB-1TB RAM)
- GPU Nodes: NVIDIA GPUs (V100, A100, RTX series)
- Access via: SLURM job scheduler only
- Use for: Training models, running experiments, intensive computation
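If you want to see what node configurations are actually available before requesting resources, the scheduler can report them; a sketch using standard SLURM queries (partition names and GPU types will vary):

# Partitions with node count, CPUs, memory, and GPUs (gres) per node
sinfo -o "%P %D %c %m %G"
# Detailed view of a single node (NODENAME is a placeholder)
scontrol show node NODENAME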
Software Environment
Module System
PACE uses environment modules to manage software:
# List available modules
module avail
# Search for specific software
module avail python
module avail cuda
module avail torch
# Load modules
module load python/3.11
module load cuda/11.8
# List loaded modules
module list
# Unload modules
module unload python/3.11
module purge # unload all
# Show module details
module show python/3.11
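A typical follow-up is to pair a loaded Python module with a virtual environment; the batch example later in this guide assumes one at ~/.venvs/research-env. A minimal sketch, using the module versions shown above:

# One-time setup of a project virtual environment
module load python/3.11
python -m venv ~/.venvs/research-env
source ~/.venvs/research-env/bin/activate
pip install --upgrade pip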
Resource Allocation System
Quality of Service (QoS) Levels
- inferno: Default QoS; higher priority, intended for long-running jobs
- embers: Low priority; preemptible jobs with a guaranteed runtime of 1 hour
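To see the wall-time and resource limits attached to each QoS, a standard SLURM accounting query looks like the sketch below; the exact columns shown (and whether the command is open to all users) depend on the cluster's configuration:

# QoS names with priority and per-user limits
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPerUser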
Account Structure
# Check your allocations
pace-quota
File System and Storage
TODO: Add details about file systems, storage options, and best practices for data management.
Basic SLURM Commands
Job Submission
Interactive Jobs
# Request interactive session
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=2:00:00 --qos=inferno
# Request GPU node
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --gres=gpu:1 --time=1:00:00 --qos=inferno
# Exit interactive session
exit
Batch Jobs
Create a SLURM script named job.slurm:
#!/bin/bash
#SBATCH --job-name=my-experiment
#SBATCH --account=paceship-dsgt_clef2025
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=4:00:00
#SBATCH --qos=inferno
#SBATCH --output=job-%j.out
#SBATCH --error=job-%j.err
# Load modules
module load python/3.11
module load cuda/11.8
# Activate environment
source ~/.venvs/research-env/bin/activate
# Run your script
python train_model.py --config configs/bert.yaml
Submit the job:
sbatch job.slurm
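For sweeps over many configurations, SLURM job arrays reuse the same script; this is a sketch, and how you map the array index to a config file is up to you (the config path below is hypothetical):

# Submit 10 copies of the script, indexed 0-9
sbatch --array=0-9 job.slurm
# Inside the script, select work by index, e.g.:
# python train_model.py --config configs/config_${SLURM_ARRAY_TASK_ID}.yaml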
Job Management
# Check job queue
squeue -u $USER
# Check all jobs for your account
squeue -A paceship-dsgt_clef2025
# Check job details
scontrol show job JOBID
# Cancel job
scancel JOBID
# Cancel all your jobs
scancel -u $USER
Job Monitoring
# Check running jobs
squeue -u $USER -t RUNNING
# Monitor resource usage of running job
ssh NODENAME # node name from the squeue NODELIST column
htop
nvidia-smi # for GPU usage
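Once a job has finished, the scheduler's accounting records are often more useful than live monitoring; a sketch, assuming accounting is enabled and the seff utility is installed on the cluster:

# Runtime, peak memory, and final state of a completed job
sacct -j JOBID --format=JobID,JobName,Elapsed,MaxRSS,State
# Quick CPU/memory efficiency summary, if seff is available
seff JOBID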
Common SLURM Parameters
Resource Requests
# CPU jobs
--nodes=1 # Number of nodes
--ntasks=1 # Number of tasks (usually 1 for Python)
--cpus-per-task=8 # CPU cores per task
--mem-per-cpu=4G # Memory per CPU core
--time=4:00:00 # Wall time (HH:MM:SS)
# GPU jobs
--gres=gpu:1 # Request 1 GPU
--gres=gpu:rtx_6000:1 # Request specific GPU type
--gres=gpu:2 # Request 2 GPUs
# Memory options
--mem=32G # Total memory for job
--mem-per-cpu=4G # Memory per CPU core
Job Control
--job-name=my-job # Job name
--output=job-%j.out # Output file (%j = job ID)
--error=job-%j.err # Error file
--mail-type=ALL # Email notifications
--mail-user=you@gatech.edu # Email address
Best Practices
Resource Management
- Start small: Test with short jobs first
- Request only what you need: Don't waste resources
- Use checkpointing: Save progress for long jobs so they can resume after preemption or timeout (see the sketch below)
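One common checkpointing pattern on SLURM is to request a warning signal shortly before the time limit, then requeue the job so it resumes from the last saved state. The sketch below covers only the SLURM side; it assumes your training script writes its own checkpoints and can resume from them:

#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH --signal=B:USR1@300 # send SIGUSR1 to the batch shell 5 minutes before the limit
#SBATCH --requeue

# On the warning signal, requeue this job so it restarts from the latest checkpoint
trap 'scontrol requeue $SLURM_JOB_ID' USR1

# Run in the background and wait so the trap can fire while the job runs
python train_model.py --config configs/bert.yaml &
wait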