01 Connecting to your HPC cluster

A terminal (also called a command prompt or shell) is a window where you type commands instead of clicking buttons. Here is how to open one on your machine:

Windows: Press the Windows key, search for PowerShell or Windows Terminal, and open it.
Mac: Press Cmd + Space, type Terminal, and press Enter.
Linux: Press Ctrl + Alt + T.

Once your terminal is open, connect to the cluster using SSH (Secure Shell). Your institution will provide you with a hostname, username, and password:

ssh username@hpc.university.edu

Replace username and hpc.university.edu with the credentials from your institution or instructor. When prompted for your password, type it and press Enter — the cursor will not move while you type, which is normal.
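
If you connect often, you can save these details in an SSH config file so a short alias works instead of the full address. A minimal sketch, assuming your cluster allows standard OpenSSH logins (the alias name hpc is arbitrary):

# ~/.ssh/config
Host hpc
    HostName hpc.university.edu
    User username

# now this is equivalent to ssh username@hpc.university.edu
ssh hpc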

💡 First-time connection

You may see a message asking whether you trust the server's host key. Type yes and press Enter; the key is remembered, so the question only appears the first time.
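
The prompt typically looks like this (the address and fingerprint here are placeholders):

The authenticity of host 'hpc.university.edu (192.0.2.10)' can't be established.
ED25519 key fingerprint is SHA256:...
Are you sure you want to continue connecting (yes/no/[fingerprint])?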

⚠ Login nodes vs. compute nodes

When you SSH in, you land on a login node — a shared entry point not meant for heavy computation. Only run tiny test commands here. For real workloads, always use Slurm (see Part 8).

02 Navigating the terminal

Once connected, you are in a text-only environment. Instead of clicking folders, you type commands. Here are the essential ones:

| Command | What it does |
| --- | --- |
| ls | List files and folders in the current location |
| ls -lh | List files with sizes and details (human-readable) |
| cd foldername | Enter a folder (directory) |
| cd .. | Go up one level (back to the parent folder) |
| cd ~ | Go directly to your home folder |
| pwd | Show where you currently are |
| mkdir name | Create a new folder called name |
| rm filename | Delete a file — no undo, be careful! |
| rm -r foldername | Delete an entire folder and its contents |
| cat file.txt | Display the contents of a text file |
| head -n 20 file.txt | Show only the first 20 lines of a file |
| tail -n 20 file.txt | Show only the last 20 lines of a file |

📂 Example walkthrough

1. ls — see what is in your home folder
2. mkdir my_project — create a new project folder
3. cd my_project — enter that folder
4. ls — it is empty for now
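
Put together, a first session might look like this (a sketch; your home path will differ):

$ pwd
/home/username
$ mkdir my_project
$ cd my_project
$ pwd
/home/username/my_project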

03 Copying and moving files — cp and mv

Two of the most useful commands for managing files on an HPC cluster are cp (copy) and mv (move/rename). Think of them like drag-and-drop — but in the terminal.

Copying files — cp

cp makes a duplicate of a file. The original stays in place.

# copy a file, giving it a new name
cp source.py destination.py

# copy into a folder
cp myscript.py backup/myscript.py

# copy an entire folder (-r means recursive)
cp -r my_project/ my_project_backup/

| Command | What it does |
| --- | --- |
| cp file.py copy.py | Make a copy of file.py called copy.py |
| cp file.py folder/ | Copy file.py into folder/ |
| cp -r folder/ backup/ | Copy an entire folder into backup/ |

💡 Tip

Use cp -r to duplicate a whole project folder before making big changes — a handy safety net!

Moving and renaming files — mv

mv moves a file to a new location or renames it. Unlike cp, the original is removed.

# rename a file
mv oldname.py newname.py

# move into a folder
mv myscript.py scripts/

# move a whole folder
mv results/ outputs/results/

⚠ mv overwrites without asking!

If a file with the same name already exists at the destination, mv will silently overwrite it. Always double-check before moving.
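
One guard worth knowing: both cp and mv accept the -i (interactive) flag, which prompts before overwriting an existing file. A small sketch:

# ask for confirmation before replacing anything at the destination
mv -i results.csv outputs/
cp -i config.yaml backup/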

04 Editing files in the terminal — vim

On an HPC cluster you usually cannot open a graphical text editor. Instead, you edit files directly in the terminal using vim (or the simpler nano). vim feels strange at first, but a few commands get you a long way.

🎛️ vim has two modes

Normal mode — for navigating and running commands — is where vim always starts. Insert mode — for actually typing text — is entered by pressing i. Switch back to Normal mode with Esc.

Opening, saving, and closing

vim myscript.py # open (or create) a file
| Command | What it does |
| --- | --- |
| :w | Save (write) the file |
| :q | Quit vim (only if no unsaved changes) |
| :wq | Save and quit |
| :q! | Quit WITHOUT saving (force quit) |
| :w newname.py | Save as a different filename |

Moving around (Normal mode)

| Command | What it does |
| --- | --- |
| Arrow keys | Move the cursor |
| gg | Jump to the very top of the file |
| G | Jump to the very bottom of the file |
| 0 | Jump to the start of the current line |
| $ | Jump to the end of the current line |
| /searchterm | Search for a word (press n for next match) |

Editing text

| Command | What it does |
| --- | --- |
| i | Enter Insert mode before the cursor |
| o | Insert a new line below and enter Insert mode |
| Esc | Return to Normal mode |
| dd | Delete the entire current line |
| u | Undo the last change |
| Ctrl + r | Redo (undo the undo) |
| yy | Copy (yank) the current line |
| p | Paste the copied line below the cursor |

🛟 Minimal vim survival guide

1. Open a file: vim myscript.py
2. Start editing: press i
3. Make your changes
4. Stop editing: press Esc
5. Save and quit: type :wq then press Enter

If you get stuck, press Esc a few times then type :q! to exit without saving.
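
Once you are comfortable, a tiny configuration file makes vim friendlier. A minimal ~/.vimrc sketch using standard options (all of them optional):

" ~/.vimrc
syntax on          " enable syntax highlighting
set number         " show line numbers
set tabstop=4      " display tabs as 4 spaces wide
set expandtab      " insert spaces when you press Tab (good for Python)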

nano — the beginner-friendly alternative

If vim feels overwhelming, nano is much simpler. Commands are shown at the bottom of the screen.

nano myscript.py
| Command | What it does |
| --- | --- |
| Ctrl + O, Enter | Save the file |
| Ctrl + X | Exit nano |
| Ctrl + K | Cut (delete) the current line |
| Ctrl + W | Search for text |

05 Uploading your Python files

Before running a Python script on the cluster, you need to transfer it from your laptop. Open a new terminal on your local machine (not the cluster one) and use SCP:

# upload a single file
scp myscript.py username@hpc.university.edu:~/my_project/

# upload an entire folder (-r flag)
scp -r my_project/ username@hpc.university.edu:~/
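
If rsync is installed on both your laptop and the cluster, it is often nicer than scp for repeated transfers because it only sends files that changed. A sketch using the same placeholder hostname:

# sync a folder to the cluster; re-running only transfers modified files
rsync -avz my_project/ username@hpc.university.edu:~/my_project/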

Once transferred, you can also load Python and run a quick sanity check directly — but only on the login node for tiny tests:

module load Python
python myscript.py

06 Python environments — venv & uv

On an HPC cluster, the system Python is shared by everyone. Installing packages globally is either forbidden or will break things for others. The solution is a virtual environment — an isolated folder that holds its own Python interpreter and packages, separate from the system and from other projects.

📦 Why bother with virtual environments?

Two projects might need different versions of the same package (e.g. PyTorch 1.x vs 2.x). Environments let each project have exactly what it needs without conflict — and you can wipe and recreate them cleanly at any time.

Option A — venv (built into Python, no install needed)

venv is Python's built-in tool. It is available everywhere Python is installed, making it the safe default on any HPC cluster.

# 1. Load Python via the module system first
module load Python

# 2. Create a virtual environment called .venv in your project folder
python -m venv .venv

# 3. Activate it — your prompt will change to show (.venv)
source .venv/bin/activate

# 4. Install packages — they go into .venv/, not the system
pip install torch numpy pandas

# 5. Install from a requirements file
pip install -r requirements.txt

# 6. Save your current packages to a requirements file
pip freeze > requirements.txt

# 7. Deactivate when done
deactivate

| Command | What it does |
| --- | --- |
| python -m venv .venv | Create a virtual environment in the .venv/ folder |
| source .venv/bin/activate | Activate the environment (must do this every session) |
| deactivate | Leave the environment (back to system Python) |
| pip install package | Install a package into the active environment |
| pip install -r requirements.txt | Install all packages listed in requirements.txt |
| pip freeze > requirements.txt | Save all installed packages and their versions to a file |
| pip list | Show all installed packages in the current environment |
| pip show package | Show details about a specific installed package |
| pip uninstall package | Remove a package from the environment |
| which python | Confirm you are using the environment's Python, not the system one |
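
After activating, a quick sanity check confirms that the environment's interpreter and packages are the ones in use (a sketch; torch is just an example package):

# should print a path ending in .venv/bin/python
which python

# should import from the environment, not fail or fall back to a system copy
python -c "import torch; print(torch.__version__)"
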
⚠ Remember to activate in your Slurm script

A Slurm job starts a fresh shell — your activation from the terminal is not carried over. Always add source .venv/bin/activate to your job script, after module load Python:

module load Python
source /home/username/my_project/.venv/bin/activate
python train.py

Option B — uv (fast, modern, recommended for new projects)

uv is a newer Python package manager written in Rust. It is dramatically faster than pip — often 10–100× — and handles both virtual environments and package installation in one tool. It may not be pre-installed on your cluster, but installing it for yourself takes seconds.

# Install uv into your home directory (no admin rights needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Reload your shell so the 'uv' command is found
source ~/.bashrc

Once installed, the workflow is similar to venv but faster and with smarter dependency resolution:

# Create a virtual environment (uses .venv/ by default)
uv venv

# Or specify a Python version explicitly
uv venv --python 3.11

# Activate (same as venv)
source .venv/bin/activate

# Install packages — much faster than pip
uv pip install torch numpy pandas

# Install from requirements.txt
uv pip install -r requirements.txt

# Install from pyproject.toml (if your project uses it)
uv pip install -e .

# Run a script directly without activating (uv handles it)
uv run python train.py

| Command | What it does |
| --- | --- |
| uv venv | Create a virtual environment in .venv/ |
| uv venv --python 3.11 | Create an environment with a specific Python version |
| uv pip install package | Install a package (much faster than pip) |
| uv pip install -r requirements.txt | Install from a requirements file |
| uv pip install -e . | Install current project in editable mode (from pyproject.toml) |
| uv pip freeze | List installed packages and versions |
| uv run python train.py | Run a script inside the environment without activating first |
| uv pip compile requirements.in | Resolve and lock dependencies to a requirements.txt |

⚡ venv vs. uv — which should I use?

venv + pip: always available, no setup, works everywhere. Use it when you are on a new cluster and are not sure what is installed.
uv: dramatically faster installs, smarter dependency handling, and a nicer overall experience. Use it for any new project where you can do a one-time install. On a shared cluster its installer places the binary in your home directory (commonly ~/.local/bin or ~/.cargo/bin, depending on the version) — no admin rights required.

Recommended project layout with an environment

my_project/
├── .venv/              # virtual environment — never commit this!
├── data/               # datasets
├── scripts/            # your Python files
├── outputs/            # results, logs, saved models
├── job.slurm           # Slurm job script
├── requirements.txt    # pinned package versions
├── pyproject.toml      # optional: modern project metadata
└── .gitignore          # include .venv/ here!

🙈 Always gitignore your environment

Add .venv/ to your .gitignore. Virtual environments are large, machine-specific, and fully reproducible from requirements.txt — there is no reason to commit them.

07 Using tmux for persistent sessions

If you close your laptop or lose your Wi-Fi connection, any running program in your SSH session will be killed. tmux solves this by keeping your session alive on the server even when you disconnect — like leaving a TV on at home while you go out.

1. Start a new session: tmux new -s mysession — give it a memorable name.
2. Run your script as normal: python myscript.py
3. Detach — your script keeps running: press Ctrl + B, then press D. You are back at the normal shell; the session is still alive.
4. Reconnect later: tmux attach -t mysession

| Command | What it does |
| --- | --- |
| tmux new -s name | Start a new session called name |
| Ctrl+B, D | Detach (leave session running) |
| tmux attach -t name | Reconnect to a session |
| tmux ls | List all sessions |
| tmux kill-session -t name | Delete a session |
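
You can also launch a script in a detached session with a single command, which is convenient for short tests. A sketch (the session name and script are placeholders):

# start a detached session named 'test' that runs the script immediately
tmux new -s test -d 'python myscript.py'

# check on it whenever you like
tmux attach -t test
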
🔀 tmux vs. Slurm — when to use which

tmux: quick tests, debugging, interactive work that takes a few minutes.
Slurm: long training runs, batch jobs, anything needing a GPU.

08 Submitting jobs with Slurm

Slurm is a job scheduler. Instead of running your code directly, you write a small script describing your job (how long, how many CPUs, whether you need a GPU), and Slurm runs it on a compute node when resources are available.

Exploring your cluster first

| Command | What it does |
| --- | --- |
| sinfo | List all partitions (queues) and how many nodes are idle/busy |
| squeue | Show all currently running and queued jobs on the cluster |
| squeue -u username | Show only your own jobs |
| sacctmgr show user | Show your account and billing group information |
| module avail | List all software modules available to load |
| module list | Show which modules you currently have loaded |

Creating a CPU job script

Create the file with nano job.slurm, write the following, then save with Ctrl+O, Enter, Ctrl+X:

#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load Python
python myscript.py

| Flag | Meaning |
| --- | --- |
| --job-name | A friendly name to identify your job in the queue |
| --time=00:30:00 | Maximum time allowed (HH:MM:SS) — the job is killed if it exceeds this |
| --nodes=1 | Number of machines to use (almost always 1 for Python) |
| --ntasks=1 | Number of parallel tasks (1 for a single Python script) |
| --cpus-per-task=4 | How many CPU cores to allocate |
| --mem=8G | How much RAM to allocate |
| --partition=name | The queue to submit to — clusters have different partitions (e.g. short, long, gpu). Check your cluster's docs. |
| --account=name | Your billing account — required on some HPC systems. Your welcome email will specify this. |

🗂 Cluster-specific settings

Every HPC cluster has its own partition names, time limits, and account requirements. Always check your cluster's documentation or welcome email. Use sinfo to list partitions and sacctmgr show user to see your accounts.

Submitting, monitoring, and reading output

# submit the job
sbatch job.slurm

# check status — PD = pending, R = running, CG = completing
squeue -u username

# read the output when done
cat slurm-12345.out

# cancel a running or pending job
scancel 12345

When your job finishes, Slurm creates slurm-JOBID.out in your current folder. This file contains everything your Python script printed — if something went wrong, the error message is here.
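
If job accounting is enabled on your cluster, sacct shows what happened to a finished job, which squeue cannot. A sketch (the job ID is a placeholder):

# state, runtime, and peak memory use of a completed job
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS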

Interactive sessions

Instead of a batch job, you can request an interactive shell on a compute node to test things in real time:

srun --partition=gpu --gpus=1 --time=00:30:00 --pty bash
# type 'exit' when done to release resources

09 Running GPU jobs & monitoring with nvidia-smi

GPU jobs follow the same pattern as CPU jobs, with a few extra flags:

#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --time=02:00:00
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gpus=1

module load Python
module load CUDA
python train.py

The key additions: --partition=gpu routes your job to a GPU node (the exact name varies by cluster), --gpus=1 requests one GPU, and module load CUDA loads the GPU libraries needed by PyTorch or TensorFlow.

⏳ GPU availability

GPU nodes are in high demand. If your job is stuck as PD (pending) for a long time, try requesting fewer GPUs or a shorter time limit.

Monitoring GPU usage with nvidia-smi

Once your job is running on a GPU node, nvidia-smi lets you check how the GPU is being used — whether your code is actually using it, and how much memory it consumes.

# one-time snapshot
nvidia-smi

# example output (abridged):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85       Driver Version: 525.85       CUDA Version: 12.0     |
|---------------------------------------+---------------------+---------------|
| GPU  Name        Temp  Pwr:Usage/Cap  |  Memory-Usage       | GPU-Util      |
|   0  A100-SXM4    42C    68W / 400W   |  12543MiB/40960MiB  |  87% Default  |
+-----------------------------------------------------------------------------+

| Column | What it tells you |
| --- | --- |
| GPU-Util | % of time the GPU is actively computing — should be high (>70%) during training |
| Memory-Usage | GPU RAM used vs. total — if it hits 100%, your job crashes with an OOM (Out of Memory) error |
| Temp | GPU temperature in °C — normal range is 30–85°C |
| Pwr:Usage/Cap | Power draw — useful for estimating job cost |

Useful nvidia-smi variants

| Command | What it does |
| --- | --- |
| nvidia-smi | One-time snapshot of all GPUs |
| nvidia-smi -l 2 | Live view, refreshes every 2 seconds (Ctrl+C to stop) |
| nvidia-smi -L | List all available GPUs and their names |
| watch -n 2 nvidia-smi | Alternative live view using the watch command |
| nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv | Print only GPU %, memory used, and total as CSV |

📍 Where to run nvidia-smi

nvidia-smi only works on a GPU compute node — not on the login node. Use it inside an interactive srun session, or add it to your job script to log GPU stats to the output file.

If GPU-Util stays near 0% during training, your model is likely not on the GPU — check that you are calling .to(device) or .cuda() on your model and tensors.
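
A minimal, self-contained PyTorch sketch of that device pattern (the toy model here stands in for your own):

import torch
import torch.nn as nn

# use the GPU when one was allocated, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = nn.Linear(10, 1).to(device)       # move the model's weights to the device
x = torch.randn(32, 10, device=device)    # tensors must live on the same device
y = model(x)
print(y.shape)                            # torch.Size([32, 1])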

10 Using Git on the cluster

Git is a version control system — it tracks changes to your code, lets you collaborate with others, and makes it easy to bring your project onto the cluster directly from GitHub or GitLab. Instead of uploading files with scp every time you make a change, you push from your laptop to GitHub and pull on the cluster.

First-time setup

git config --global user.name "Your Name"
git config --global user.email "you@example.com"

The everyday workflow

1. Clone — download the repository: git clone https://github.com/yourname/yourproject.git
2. Check what changed: git status
3. Stage your changes: git add . (all files) or git add myscript.py (one file)
4. Commit — save a snapshot with a message: git commit -m "Add training loop for experiment 1"
5. Push / pull — sync with GitHub: git push to upload your commits; git pull to download the latest changes.

✏️ Good commit messages matter

Write a short message that describes what changed and why.
Good: "Fix learning rate scheduler bug"
Bad: "stuff" or "changes"

Branches, history, and undoing mistakes

| Command | What it does |
| --- | --- |
| git log --oneline | Show a compact list of recent commits |
| git diff | Show what you changed but haven't staged yet |
| git branch name | Create a new branch |
| git checkout name | Switch to that branch |
| git checkout -b name | Create and switch in one step |
| git merge name | Merge a branch into the current one |
| git restore file.py | Discard uncommitted changes to a file |
| git stash | Temporarily set aside uncommitted changes |
| git stash pop | Bring stashed changes back |

🌿 Branch for every experiment

Always create a new branch for each experiment or feature. Keep main clean and working — this way you can always go back to a known good state.
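
A typical experiment cycle might look like this (the branch name is a placeholder):

git checkout -b exp-lr-schedule   # new branch for one experiment
# ... edit, commit, submit jobs ...
git checkout main                 # return to the known-good state
git merge exp-lr-schedule         # keep the changes only if the experiment worked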

Ignoring files — .gitignore

Some files should never be committed — large datasets, model weights, caches. Create a .gitignore file in your project root:

# inside .gitignore
data/
outputs/
*.pt
*.pth
__pycache__/
.ipynb_checkpoints/

⚠ Avoid git reset --hard

git reset --hard permanently discards uncommitted changes with no undo. Use git restore or git stash instead.

11 Common first-run issues

Most frustrating first attempts fail for a small set of predictable reasons. Check this section before assuming the cluster is broken or your code is cursed.

SSH says permission denied

Double-check your username and hostname first. If they are correct, confirm whether your cluster requires VPN access, key-based login, or a one-time password instead of a normal password.

Slurm rejects your job script

Errors like invalid account or unknown partition usually mean the cluster expects different queue names or billing settings. Run sinfo and compare with your cluster documentation.

Python or pip is missing inside the job

Interactive shells and Slurm jobs do not always share the same environment. Load the Python module inside the job script itself, then activate your .venv in that script too.

ModuleNotFoundError after install

You probably installed packages into one interpreter and ran the job with another. Check which python after activation and confirm the same path is used inside the Slurm job.
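
A quick diagnostic, worth running both in your interactive shell and inside the job script (a sketch):

# both should print a path inside your project's .venv
which python
python -c "import sys; print(sys.executable)"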

Job stays pending forever

GPU jobs and long runtimes wait the longest. Ask for fewer resources, shorten the time limit, or start with a CPU test job to confirm the workflow before scaling up.

The output file looks empty

Make sure your script actually prints progress and that it reaches the relevant code path. If it exits immediately, the real error is usually at the top of slurm-JOBID.out.

12 Quick reference — all commands

Connecting & uploading files

| Command | What it does |
| --- | --- |
| ssh user@hpc.university.edu | Connect to the HPC cluster |
| scp file.py user@hpc.university.edu:~/ | Upload a file |
| scp -r folder/ user@hpc.university.edu:~/ | Upload a whole folder |

Terminal navigation

| Command | What it does |
| --- | --- |
| ls / ls -lh | List files (plain / with details) |
| cd name / cd .. | Enter folder / go up a level |
| cd ~ | Go to home folder |
| pwd | Show current location |
| mkdir name | Create a folder |
| rm file / rm -r folder | Delete file / delete folder |
| cat / head / tail | Show full file / first N lines / last N lines |
| cp file copy / cp -r a/ b/ | Copy file / copy entire folder |
| mv old new / mv file folder/ | Rename / move a file |

vim editor

| Command | What it does |
| --- | --- |
| vim file.py | Open a file |
| i | Enter Insert mode |
| Esc | Return to Normal mode |
| :wq | Save and quit |
| :q! | Quit without saving |
| dd / u | Delete current line / undo |
| /term | Search for text |

tmux sessions

| Command | What it does |
| --- | --- |
| tmux new -s name | Start a new session |
| Ctrl+B, D | Detach (leave running) |
| tmux attach -t name | Reconnect |
| tmux ls | List all sessions |
| tmux kill-session -t name | Delete a session |

Slurm job management

| Command | What it does |
| --- | --- |
| sbatch job.slurm | Submit a job |
| squeue -u username | Check your job status |
| scancel JOBID | Cancel a job |
| cat slurm-JOBID.out | Read job output |
| sinfo | List partitions and availability |
| srun --pty bash | Start an interactive compute session |
| module load Python | Load Python module |
| module load CUDA | Load GPU libraries |
| module avail | List all available modules |

GPU monitoring — nvidia-smi

| Command | What it does |
| --- | --- |
| nvidia-smi | Snapshot of all GPUs (usage, memory, temperature) |
| nvidia-smi -l 2 | Live view, refreshes every 2 seconds |
| nvidia-smi -L | List all available GPUs and their names |
| watch -n 2 nvidia-smi | Alternative live view using watch |

Python environments — venv & uv

| Command | What it does |
| --- | --- |
| python -m venv .venv | Create a virtual environment |
| source .venv/bin/activate | Activate the environment |
| deactivate | Leave the environment |
| pip install package | Install a package |
| pip install -r requirements.txt | Install from requirements file |
| pip freeze > requirements.txt | Save installed packages to file |
| pip list | Show all installed packages |
| which python | Confirm you are using the venv Python |
| uv venv | Create environment with uv (faster) |
| uv venv --python 3.11 | Create environment with specific Python version |
| uv pip install package | Install package with uv (10–100× faster than pip) |
| uv pip install -r requirements.txt | Install from requirements file with uv |
| uv run python train.py | Run script inside environment without activating |

Git

| Command | What it does |
| --- | --- |
| git clone URL | Download a repository |
| git status | See what changed |
| git add . / git add file | Stage all / a specific file |
| git commit -m "msg" | Save a snapshot |
| git push / git pull | Upload / download changes |
| git log --oneline | View recent commits |
| git branch name | Create a branch |
| git checkout name | Switch branch |
| git merge name | Merge branch into current |
| git restore file | Discard changes to a file |
| git stash / git stash pop | Shelve / restore uncommitted work |

🚀 You are ready to go!

Start small: connect to the cluster, upload a simple Python script, and submit a CPU job first. Once that works, move to GPU jobs. If anything goes wrong, the .out file is your best debugging companion.

Next steps

Keep the momentum going

Once you finish this guide, pick one concrete follow-up action — such as submitting your own script as a CPU job — and do it right away so the workflow sticks.