01 Connecting to your HPC cluster
A terminal (also called a command prompt or shell) is a window where you type commands instead of clicking buttons. Here is how to open one on your machine:
On macOS: press Cmd + Space, type Terminal, and press Enter. On Linux: press Ctrl + Alt + T.
Once your terminal is open, connect to the cluster using SSH (Secure Shell). Your institution will provide you with a hostname, username, and password:
Replace username and hpc.university.edu with the credentials from your institution or instructor. When prompted for your password, type it and press Enter — the cursor will not move while you type, which is normal.
Here are the three commands you will use most — one to connect, two to upload files. Substitute your own username and hostname:
ssh username@hpc.university.edu
scp myscript.py username@hpc.university.edu:~/my_project/
scp -r my_project/ username@hpc.university.edu:~/my_project/
The first time you connect, you may see a message asking whether you trust the server. Type yes and press Enter.
When you SSH in, you land on a login node — a shared entry point not meant for heavy computation. Only run tiny test commands here. For real workloads, always use Slurm (see Part 8).
02 Navigating the terminal
Once connected, you are in a text environment. Instead of clicking folders, you type commands. Here are the essential ones:
| Command | What it does |
|---|---|
| ls | List files and folders in the current location |
| ls -lh | List files with sizes and details (human-readable) |
| cd foldername | Enter a folder (directory) |
| cd .. | Go up one level (back to the parent folder) |
| cd ~ | Go directly to your home folder |
| pwd | Show where you currently are |
| mkdir name | Create a new folder called name |
| rm filename | Delete a file — no undo, be careful! |
| rm -r foldername | Delete an entire folder and its contents |
| cat file.txt | Display the contents of a text file |
| head -n 20 file.txt | Show only the first 20 lines of a file |
| tail -n 20 file.txt | Show only the last 20 lines of a file |
1. ls — see what is in your home folder
2. mkdir my_project — create a new project folder
3. cd my_project — enter that folder
4. ls — it is empty for now
03 Copying and moving files — cp and mv
Two of the most useful commands for managing files on an HPC cluster are cp (copy) and mv (move/rename). Think of them like drag-and-drop — but in the terminal.
Copying files — cp
cp makes a duplicate of a file. The original stays in place.
| Command | What it does |
|---|---|
| cp file.py copy.py | Make a copy of file.py called copy.py |
| cp file.py folder/ | Copy file.py into folder/ |
| cp -r folder/ backup/ | Copy an entire folder into backup/ |
Use cp -r to duplicate a whole project folder before making big changes — a handy safety net!
Moving and renaming files — mv
mv moves a file to a new location or renames it. Unlike cp, the original is removed.
If a file with the same name already exists at the destination, mv will silently overwrite it. Always double-check before moving.
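To make the two mv forms concrete, here is a minimal sketch you can run safely in a scratch directory — the filenames are made up for illustration:

```shell
# Work in a throwaway directory so nothing real is touched
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Create a dummy file to play with
echo "print('hello')" > old_name.py

# Rename a file: mv source target
mv old_name.py new_name.py

# Move a file into a folder: mv file folder/
mkdir scripts
mv new_name.py scripts/

# The file now lives at scripts/new_name.py; the original name is gone
ls scripts/
```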
04 Editing files in the terminal — vim
On an HPC cluster you usually cannot open a graphical text editor. Instead, you edit files directly in the terminal using vim (or the simpler nano). vim feels strange at first, but a few commands get you a long way.
vim has two modes: in Normal mode, keys are commands; in Insert mode, keys type text. Enter Insert mode with i. Switch back to Normal with Esc.
Opening, saving, and closing
| Command | What it does |
|---|---|
| :w | Save (write) the file |
| :q | Quit vim (only if no unsaved changes) |
| :wq | Save and quit |
| :q! | Quit WITHOUT saving (force quit) |
| :w newname.py | Save as a different filename |
Moving around (Normal mode)
| Command | What it does |
|---|---|
| Arrow keys | Move the cursor |
| gg | Jump to the very top of the file |
| G | Jump to the very bottom of the file |
| 0 | Jump to the start of the current line |
| $ | Jump to the end of the current line |
| /searchterm | Search for a word (press n for next match) |
Editing text
| Command | What it does |
|---|---|
| i | Enter Insert mode before the cursor |
| o | Insert a new line below and enter Insert mode |
| Esc | Return to Normal mode |
| dd | Delete the entire current line |
| u | Undo the last change |
| Ctrl + r | Redo (undo the undo) |
| yy | Copy (yank) the current line |
| p | Paste the copied line below the cursor |
1. Open a file: vim myscript.py
2. Start editing: press i
3. Make your changes
4. Stop editing: press Esc
5. Save and quit: type :wq then press Enter
If you get stuck, press Esc a few times then type :q! to exit without saving.
nano — the beginner-friendly alternative
If vim feels overwhelming, nano is much simpler. Commands are shown at the bottom of the screen.
| Command | What it does |
|---|---|
| Ctrl + O, Enter | Save the file |
| Ctrl + X | Exit nano |
| Ctrl + K | Cut (delete) the current line |
| Ctrl + W | Search for text |
05 Uploading your Python files
Before running a Python script on the cluster, you need to transfer it from your laptop. Open a new terminal on your local machine (not the cluster one) and use SCP:
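A minimal sketch — the username, hostname, and filenames are the placeholder values from Part 1; substitute your own:

```
# Run this on YOUR LAPTOP, not on the cluster.
# Upload a single file into ~/my_project/ on the cluster:
scp myscript.py username@hpc.university.edu:~/my_project/

# Or upload the whole project folder at once (-r = recursive):
scp -r my_project/ username@hpc.university.edu:~/
```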
Once transferred, you can also load Python and run a quick sanity check directly — but only on the login node for tiny tests:
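For example, a quick check that Python loads and your script at least starts — module names vary by cluster, so `Python` here is an assumption:

```
# On the cluster login node — tiny tests only
module load Python     # load the cluster's Python module
python --version       # confirm which Python you got
cd ~/my_project
python myscript.py     # should finish in seconds; anything heavy goes through Slurm
```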
06 Python environments — venv & uv
On an HPC cluster, the system Python is shared by everyone. Installing packages globally is either forbidden or will break things for others. The solution is a virtual environment — an isolated folder that holds its own Python interpreter and packages, separate from the system and from other projects.
Option A — venv (built into Python, no install needed)
venv is Python's built-in tool. It is available everywhere Python is installed, making it the safe default on any HPC cluster.
| Command | What it does |
|---|---|
| python -m venv .venv | Create a virtual environment in the .venv/ folder |
| source .venv/bin/activate | Activate the environment (must do this every session) |
| deactivate | Leave the environment (back to system Python) |
| pip install package | Install a package into the active environment |
| pip install -r requirements.txt | Install all packages listed in requirements.txt |
| pip freeze > requirements.txt | Save all installed packages and their versions to a file |
| pip list | Show all installed packages in the current environment |
| pip show package | Show details about a specific installed package |
| pip uninstall package | Remove a package from the environment |
| which python | Confirm you are using the environment's Python, not the system one |
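Putting the table together, a first-time setup might look like this (the package name numpy is just an example):

```
cd ~/my_project
python -m venv .venv             # create the environment (one time)
source .venv/bin/activate        # activate it (every session)
which python                     # should print .../my_project/.venv/bin/python
pip install numpy                # installs go into .venv/, not the system
pip freeze > requirements.txt    # record versions for reproducibility
deactivate                       # back to the system Python
```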
A Slurm job starts a fresh shell — your activation from the terminal is not carried over. Always add source .venv/bin/activate to your job script, after module load Python:
module load Python
source /home/username/my_project/.venv/bin/activate
python train.py
Option B — uv (fast, modern, recommended for new projects)
uv is a newer Python package manager written in Rust. It is dramatically faster than pip — often 10–100× — and handles both virtual environments and package installation in one tool. It may not be pre-installed on your cluster, but installing it for yourself takes seconds.
Once installed, the workflow is similar to venv but faster and with smarter dependency resolution:
| Command | What it does |
|---|---|
| uv venv | Create a virtual environment in .venv/ |
| uv venv --python 3.11 | Create an environment with a specific Python version |
| uv pip install package | Install a package (much faster than pip) |
| uv pip install -r requirements.txt | Install from a requirements file |
| uv pip install -e . | Install current project in editable mode (from pyproject.toml) |
| uv pip freeze | List installed packages and versions |
| uv run python train.py | Run a script inside the environment without activating first |
| uv pip compile requirements.in | Resolve and lock dependencies to a requirements.txt |
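A typical uv session might look like this — the install command is uv's standard installer script; check uv's documentation if it has changed:

```
# One-time install into your home directory (no admin rights needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

cd ~/my_project
uv venv                          # create .venv/
source .venv/bin/activate        # activate as usual
uv pip install -r requirements.txt
uv run python train.py           # or skip activation entirely
```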
venv + pip: always available, no setup, works everywhere. Use it when you are on a new cluster and are not sure what is installed.
uv: dramatically faster installs, smarter dependency handling, and a nicer overall experience. Use it for any new project where you can do a one-time install. On a shared cluster, install it into your own home directory — no admin rights required.
Recommended project layout with an environment
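One reasonable layout — the exact names are illustrative, not required:

```
my_project/
├── .venv/               # virtual environment (not committed)
├── .gitignore           # lists .venv/, data/, and other ignored files
├── requirements.txt     # pinned dependencies
├── job.slurm            # Slurm job script
├── train.py             # main entry point
└── data/                # datasets (not committed)
```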
Add .venv/ to your .gitignore. Virtual environments are large, machine-specific, and fully reproducible from requirements.txt — there is no reason to commit them.
07 Using tmux for persistent sessions
If you close your laptop or lose your Wi-Fi connection, any running program in your SSH session will be killed. tmux solves this by keeping your session alive on the server even when you disconnect — like leaving a TV on at home while you go out.
The basic workflow:
1. Start a session: tmux new -s mysession — give it a memorable name.
2. Run your program inside it: python myscript.py
3. Detach: press Ctrl + B, then press D. You are back at the normal shell; the session is still alive.
4. Reconnect later: tmux attach -t mysession
| Command | What it does |
|---|---|
| tmux new -s name | Start a new session called name |
| Ctrl+B, D | Detach (leave session running) |
| tmux attach -t name | Reconnect to a session |
| tmux ls | List all sessions |
| tmux kill-session -t name | Delete a session |
tmux: quick tests, debugging, interactive work that takes a few minutes.
Slurm: long training runs, batch jobs, anything needing a GPU.
08 Submitting jobs with Slurm
Slurm is a job scheduler. Instead of running your code directly, you write a small script describing your job (how long, how many CPUs, whether you need a GPU), and Slurm runs it on a compute node when resources are available.
Exploring your cluster first
| Command | What it does |
|---|---|
| sinfo | List all partitions (queues) and how many nodes are idle/busy |
| squeue | Show all currently running and queued jobs on the cluster |
| squeue -u username | Show only your own jobs |
| sacctmgr show user | Show your account and billing group information |
| module avail | List all software modules available to load |
| module list | Show which modules you currently have loaded |
Creating a CPU job script
Create the file with nano job.slurm, write the following, then save with Ctrl+O, Enter, Ctrl+X:
#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load Python
source /home/username/my_project/.venv/bin/activate
python train.py
| Flag | Meaning |
|---|---|
| --job-name | A friendly name to identify your job in the queue |
| --time=00:30:00 | Maximum time allowed (HH:MM:SS) — job is killed if exceeded |
| --nodes=1 | Number of machines to use (almost always 1 for Python) |
| --ntasks=1 | Number of parallel tasks (1 for a single Python script) |
| --cpus-per-task=4 | How many CPU cores to allocate |
| --mem=8G | How much RAM to allocate |
| --partition=name | The queue to submit to — clusters have different partitions (e.g. short, long, gpu). Check your cluster's docs. |
| --account=name | Your billing account — required on some HPC systems. Your welcome email will specify this. |
Every HPC cluster has its own partition names, time limits, and account requirements. Always check your cluster's documentation or welcome email. Use sinfo to list partitions and sacctmgr show user to see your accounts.
Submitting, monitoring, and reading output
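A typical submit-and-check cycle — the job ID 123456 is illustrative:

```
sbatch job.slurm         # submit — prints "Submitted batch job 123456"
squeue -u username       # check status: PD = pending, R = running
scancel 123456           # cancel the job if something is wrong
cat slurm-123456.out     # read the output once it finishes
```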
When your job finishes, Slurm creates slurm-JOBID.out in your current folder. This file contains everything your Python script printed — if something went wrong, the error message is here.
Interactive sessions
Instead of a batch job, you can request an interactive shell on a compute node to test things in real time:
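A minimal request — the flags follow the same conventions as the batch script, and the values are examples:

```
# Ask for 4 cores and 8 GB for one hour, then get a shell on a compute node
srun --time=01:00:00 --cpus-per-task=4 --mem=8G --pty bash

# When you are done, leave the node (and release the resources):
exit
```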
09 Running GPU jobs & monitoring with nvidia-smi
GPU jobs follow the same pattern as CPU jobs, with a few extra flags:
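A sketch of a GPU job script — partition and module names vary by cluster, so treat gpu and CUDA as placeholders, and check whether your cluster uses --gres=gpu:1 instead of --gpus:

```
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --time=02:00:00
#SBATCH --partition=gpu        # GPU queue — exact name varies by cluster
#SBATCH --gpus=1               # request one GPU (some clusters use --gres=gpu:1)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load Python
module load CUDA               # GPU libraries for PyTorch/TensorFlow
source /home/username/my_project/.venv/bin/activate
python train.py
```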
The key additions: --partition=gpu routes your job to a GPU node (the exact name varies by cluster), --gpus=1 requests one GPU, and module load CUDA loads the GPU libraries needed by PyTorch or TensorFlow.
GPU nodes are in high demand. If your job is stuck as PD (pending) for a long time, try requesting fewer GPUs or a shorter time limit.
Monitoring GPU usage with nvidia-smi
Once your job is running on a GPU node, nvidia-smi lets you check how the GPU is being used — whether your code is actually using it, and how much memory it consumes.
| Column | What it tells you |
|---|---|
| GPU-Util | % of time the GPU is actively computing — should be high (>70%) during training |
| Memory-Usage | GPU RAM used vs. total — if it hits 100%, your job crashes with an OOM (Out of Memory) error |
| Temp | GPU temperature in °C — normal range is 30–85°C |
| Pwr:Usage/Cap | Power draw — useful for estimating job cost |
Useful nvidia-smi variants
| Command | What it does |
|---|---|
| nvidia-smi | One-time snapshot of all GPUs |
| nvidia-smi -l 2 | Live view, refreshes every 2 seconds (Ctrl+C to stop) |
| nvidia-smi -L | List all available GPUs and their names |
| watch -n 2 nvidia-smi | Alternative live view using the watch command |
| nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv | Print only GPU %, memory used, and total as CSV |
nvidia-smi only works on a GPU compute node — not on the login node. Use it inside an interactive srun session, or add it to your job script to log GPU stats to the output file.
If GPU-Util stays near 0% during training, your model is likely not on the GPU — check that you are calling .to(device) or .cuda() on your model and tensors.
10 Using Git on the cluster
Git is a version control system — it tracks changes to your code, lets you collaborate with others, and makes it easy to bring your project onto the cluster directly from GitHub or GitLab. Instead of uploading files with scp every time you make a change, you push from your laptop to GitHub and pull on the cluster.
First-time setup
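Before your first commit, tell Git who you are — use your own name and email in place of the placeholders:

```shell
git config --global user.name "Your Name"
git config --global user.email "you@university.edu"

# Optional but handy: make new repositories start on a branch called main
git config --global init.defaultBranch main
```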
The everyday workflow
1. Get the project onto the cluster: git clone https://github.com/yourname/yourproject.git
2. Check what changed: git status
3. Stage your changes: git add . (all files) or git add myscript.py (one file)
4. Commit with a message: git commit -m "Add training loop for experiment 1"
5. Sync: git push to upload your commits; git pull to download the latest changes.
Write a short message that describes what changed and why.
Good: "Fix learning rate scheduler bug"
Bad: "stuff" or "changes"
Branches, history, and undoing mistakes
| Command | What it does |
|---|---|
| git log --oneline | Show a compact list of recent commits |
| git diff | Show what you changed but haven't staged yet |
| git branch name | Create a new branch |
| git checkout name | Switch to that branch |
| git checkout -b name | Create and switch in one step |
| git merge name | Merge a branch into the current one |
| git restore file.py | Discard unsaved changes to a file |
| git stash | Temporarily set aside uncommitted changes |
| git stash pop | Bring stashed changes back |
Always create a new branch for each experiment or feature. Keep main clean and working — this way you can always go back to a known good state.
Ignoring files — .gitignore
Some files should never be committed — large datasets, model weights, caches. Create a .gitignore file in your project root:
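For example — these patterns are typical for ML projects; adjust them to your own:

```
.venv/
__pycache__/
data/
*.pt
*.ckpt
slurm-*.out
```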
git reset --hard permanently discards uncommitted changes with no undo. Use git restore or git stash instead.
11 Common first-run issues
Most frustrating first attempts fail for a small set of predictable reasons. Check this section before assuming the cluster is broken or your code is cursed.
SSH says permission denied
Double-check your username and hostname first. If they are correct, confirm whether your cluster requires VPN access, key-based login, or a one-time password instead of a normal password.
Slurm rejects your job script
Errors like invalid account or unknown partition usually mean the cluster expects different queue names or billing settings. Run sinfo and compare with your cluster documentation.
Python or pip is missing inside the job
Interactive shells and Slurm jobs do not always share the same environment. Load the Python module inside the job script itself, then activate your .venv in that script too.
ModuleNotFoundError after install
You probably installed packages into one interpreter and ran the job with another. Check which python after activation and confirm the same path is used inside the Slurm job.
Job stays pending forever
GPU jobs and long runtimes wait the longest. Ask for fewer resources, shorten the time limit, or start with a CPU test job to confirm the workflow before scaling up.
The output file looks empty
Make sure your script actually prints progress and that it reaches the relevant code path. If it exits immediately, the real error is usually at the top of slurm-JOBID.out.
12 Quick reference — all commands
Connecting & uploading files
| Command | What it does |
|---|---|
| ssh user@hpc.university.edu | Connect to the HPC cluster |
| scp file.py user@hpc.university.edu:~/ | Upload a file |
| scp -r folder/ user@hpc.university.edu:~/ | Upload a whole folder |
Terminal navigation
| Command | What it does |
|---|---|
| ls / ls -lh | List files (plain / with details) |
| cd name / cd .. | Enter folder / go up a level |
| cd ~ | Go to home folder |
| pwd | Show current location |
| mkdir name | Create a folder |
| rm file / rm -r folder | Delete file / delete folder |
| cat / head / tail | Show full file / first N lines / last N lines |
| cp file copy / cp -r a/ b/ | Copy file / copy entire folder |
| mv old new / mv file folder/ | Rename / move a file |
vim editor
| Command | What it does |
|---|---|
| vim file.py | Open a file |
| i | Enter Insert mode |
| Esc | Return to Normal mode |
| :wq | Save and quit |
| :q! | Quit without saving |
| dd / u | Delete current line / undo |
| /term | Search for text |
tmux sessions
| Command | What it does |
|---|---|
| tmux new -s name | Start a new session |
| Ctrl+B, D | Detach (leave running) |
| tmux attach -t name | Reconnect |
| tmux ls | List all sessions |
| tmux kill-session -t name | Delete a session |
Slurm job management
| Command | What it does |
|---|---|
| sbatch job.slurm | Submit a job |
| squeue -u username | Check your job status |
| scancel JOBID | Cancel a job |
| cat slurm-JOBID.out | Read job output |
| sinfo | List partitions and availability |
| srun --pty bash | Start an interactive compute session |
| module load Python | Load Python module |
| module load CUDA | Load GPU libraries |
| module avail | List all available modules |
GPU monitoring — nvidia-smi
| Command | What it does |
|---|---|
| nvidia-smi | Snapshot of all GPUs (usage, memory, temperature) |
| nvidia-smi -l 2 | Live view, refreshes every 2 seconds |
| nvidia-smi -L | List all available GPUs and their names |
| watch -n 2 nvidia-smi | Alternative live view using watch |
Python environments — venv & uv
| Command | What it does |
|---|---|
| python -m venv .venv | Create a virtual environment |
| source .venv/bin/activate | Activate the environment |
| deactivate | Leave the environment |
| pip install package | Install a package |
| pip install -r requirements.txt | Install from requirements file |
| pip freeze > requirements.txt | Save installed packages to file |
| pip list | Show all installed packages |
| which python | Confirm you are using the venv Python |
| uv venv | Create environment with uv (faster) |
| uv venv --python 3.11 | Create environment with specific Python version |
| uv pip install package | Install package with uv (10–100× faster than pip) |
| uv pip install -r requirements.txt | Install from requirements file with uv |
| uv run python train.py | Run script inside environment without activating |
Git
| Command | What it does |
|---|---|
| git clone URL | Download a repository |
| git status | See what changed |
| git add . / git add file | Stage all / a specific file |
| git commit -m "msg" | Save a snapshot |
| git push / git pull | Upload / download changes |
| git log --oneline | View recent commits |
| git branch name | Create a branch |
| git checkout name | Switch branch |
| git merge name | Merge branch into current |
| git restore file | Discard changes to a file |
| git stash / git stash pop | Shelve / restore uncommitted work |
Start small: connect to the cluster, upload a simple Python script, and submit a CPU job first. Once that works, move to GPU jobs. If anything goes wrong, the .out file is your best debugging companion.