01 Connecting to your HPC cluster
A terminal (also called a command prompt or shell) is a window where you type commands instead of clicking buttons. Here is how to open one on your machine:
On macOS: press Cmd + Space, type Terminal, and press Enter. On Linux: press Ctrl + Alt + T.
Once your terminal is open, connect to the cluster using SSH (Secure Shell). Your institution will provide you with a hostname, username, and password:
Replace username and hpc.university.edu with the credentials from your institution or instructor. When prompted for your password, type it and press Enter — the cursor will not move while you type, which is normal.
Here are the three commands you will use most — one to connect, two to upload files. Substitute your own username and hostname:
ssh username@hpc.university.edu
scp myscript.py username@hpc.university.edu:~/my_project/
scp -r my_project/ username@hpc.university.edu:~/my_project/
The first time you connect, you may see a message asking whether you trust the server. Type yes and press Enter.
When you SSH in, you land on a login node — a shared entry point not meant for heavy computation. Only run tiny test commands here. For real workloads, always use Slurm (see Part 8).
02 Navigating the terminal
Once connected, you are in a text environment. Instead of clicking folders, you type commands. Here are the essential ones:
| Command | What it does |
|---|---|
| ls | List files and folders in the current location |
| ls -lh | List files with sizes and details (human-readable) |
| cd foldername | Enter a folder (directory) |
| cd .. | Go up one level (back to the parent folder) |
| cd ~ | Go directly to your home folder |
| pwd | Show where you currently are |
| mkdir name | Create a new folder called name |
| rm filename | Delete a file — no undo, be careful! |
| rm -r foldername | Delete an entire folder and its contents |
| cat file.txt | Display the contents of a text file |
| head -n 20 file.txt | Show only the first 20 lines of a file |
| tail -n 20 file.txt | Show only the last 20 lines of a file |
1. ls — see what is in your home folder
2. mkdir my_project — create a new project folder
3. cd my_project — enter that folder
4. ls — it is empty for now
03 Copying and moving files — cp and mv
Two of the most useful commands for managing files on an HPC cluster are cp (copy) and mv (move/rename). Think of them like drag-and-drop — but in the terminal.
Copying files — cp
cp makes a duplicate of a file. The original stays in place.
| Command | What it does |
|---|---|
| cp file.py copy.py | Make a copy of file.py called copy.py |
| cp file.py folder/ | Copy file.py into folder/ |
| cp -r folder/ backup/ | Copy an entire folder into backup/ |
Use cp -r to duplicate a whole project folder before making big changes — a handy safety net!
Moving and renaming files — mv
mv moves a file to a new location or renames it. Unlike cp, the original is removed.
If a file with the same name already exists at the destination, mv will silently overwrite it. Always double-check before moving.
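To make the two mv forms concrete, here is a minimal sketch you can run safely in a scratch directory — the filenames are made up for illustration:

```shell
# Work in a throwaway directory so nothing real is touched
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Create a dummy file to play with
echo "print('hello')" > old_name.py

# Rename a file: mv source target
mv old_name.py new_name.py

# Move a file into a folder: mv file folder/
mkdir scripts
mv new_name.py scripts/

# The file now lives at scripts/new_name.py; the original name is gone
ls scripts/
```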
04 Editing files in the terminal — vim
On an HPC cluster you usually cannot open a graphical text editor. Instead, you edit files directly in the terminal using vim (or the simpler nano). vim feels strange at first, but a few commands get you a long way.
vim has two modes: in Normal mode, keys are commands; in Insert mode, keys type text. Enter Insert mode with i. Switch back to Normal with Esc.
Opening, saving, and closing
| Command | What it does |
|---|---|
| :w | Save (write) the file |
| :q | Quit vim (only if no unsaved changes) |
| :wq | Save and quit |
| :q! | Quit WITHOUT saving (force quit) |
| :w newname.py | Save as a different filename |
Moving around (Normal mode)
| Command | What it does |
|---|---|
| Arrow keys | Move the cursor |
| gg | Jump to the very top of the file |
| G | Jump to the very bottom of the file |
| 0 | Jump to the start of the current line |
| $ | Jump to the end of the current line |
| /searchterm | Search for a word (press n for next match) |
Editing text
| Command | What it does |
|---|---|
| i | Enter Insert mode before the cursor |
| o | Insert a new line below and enter Insert mode |
| Esc | Return to Normal mode |
| dd | Delete the entire current line |
| u | Undo the last change |
| Ctrl + r | Redo (undo the undo) |
| yy | Copy (yank) the current line |
| p | Paste the copied line below the cursor |
1. Open a file: vim myscript.py
2. Start editing: press i
3. Make your changes
4. Stop editing: press Esc
5. Save and quit: type :wq then press Enter
If you get stuck, press Esc a few times then type :q! to exit without saving.
nano — the beginner-friendly alternative
If vim feels overwhelming, nano is much simpler. Commands are shown at the bottom of the screen.
| Command | What it does |
|---|---|
| Ctrl + O, Enter | Save the file |
| Ctrl + X | Exit nano |
| Ctrl + K | Cut (delete) the current line |
| Ctrl + W | Search for text |
05 Uploading your Python files
Before running a Python script on the cluster, you need to transfer it from your laptop. Open a new terminal on your local machine (not the cluster one) and use SCP:
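A minimal sketch — the username, hostname, and filenames are the placeholder values from Part 1; substitute your own:

```
# Run this on YOUR LAPTOP, not on the cluster.
# Upload a single file into ~/my_project/ on the cluster:
scp myscript.py username@hpc.university.edu:~/my_project/

# Or upload the whole project folder at once (-r = recursive):
scp -r my_project/ username@hpc.university.edu:~/
```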
Once transferred, you can also load Python and run a quick sanity check directly — but only on the login node for tiny tests:
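For example, a quick check that Python loads and your script at least starts — module names vary by cluster, so `Python` here is an assumption:

```
# On the cluster login node — tiny tests only
module load Python     # load the cluster's Python module
python --version       # confirm which Python you got
cd ~/my_project
python myscript.py     # should finish in seconds; anything heavy goes through Slurm
```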
06 Python environments — venv & uv
On an HPC cluster, the system Python is shared by everyone. Installing packages globally is either forbidden or will break things for others. The solution is a virtual environment — an isolated folder that holds its own Python interpreter and packages, separate from the system and from other projects.
Option A — venv (built into Python, no install needed)
venv is Python's built-in tool. It is available everywhere Python is installed, making it the safe default on any HPC cluster.
| Command | What it does |
|---|---|
| python -m venv .venv | Create a virtual environment in the .venv/ folder |
| source .venv/bin/activate | Activate the environment (must do this every session) |
| deactivate | Leave the environment (back to system Python) |
| pip install package | Install a package into the active environment |
| pip install -r requirements.txt | Install all packages listed in requirements.txt |
| pip freeze > requirements.txt | Save all installed packages and their versions to a file |
| pip list | Show all installed packages in the current environment |
| pip show package | Show details about a specific installed package |
| pip uninstall package | Remove a package from the environment |
| which python | Confirm you are using the environment's Python, not the system one |
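Putting the table together, a first-time setup might look like this (the package name numpy is just an example):

```
cd ~/my_project
python -m venv .venv             # create the environment (one time)
source .venv/bin/activate        # activate it (every session)
which python                     # should print .../my_project/.venv/bin/python
pip install numpy                # installs go into .venv/, not the system
pip freeze > requirements.txt    # record versions for reproducibility
deactivate                       # back to the system Python
```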
A Slurm job starts a fresh shell — your activation from the terminal is not carried over. Always add source .venv/bin/activate to your job script, after module load Python:
module load Python
source /home/username/my_project/.venv/bin/activate
python train.py
Option B — uv (fast, modern, recommended for new projects)
uv is a newer Python package manager written in Rust. It is dramatically faster than pip — often 10–100× — and handles both virtual environments and package installation in one tool. It may not be pre-installed on your cluster, but installing it for yourself takes seconds.
Once installed, the workflow is similar to venv but faster and with smarter dependency resolution:
| Command | What it does |
|---|---|
| uv venv | Create a virtual environment in .venv/ |
| uv venv --python 3.11 | Create an environment with a specific Python version |
| uv pip install package | Install a package (much faster than pip) |
| uv pip install -r requirements.txt | Install from a requirements file |
| uv pip install -e . | Install current project in editable mode (from pyproject.toml) |
| uv pip freeze | List installed packages and versions |
| uv run python train.py | Run a script inside the environment without activating first |
| uv pip compile requirements.in | Resolve and lock dependencies to a requirements.txt |
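A typical uv session might look like this — the install command is uv's standard installer script; check uv's documentation if it has changed:

```
# One-time install into your home directory (no admin rights needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

cd ~/my_project
uv venv                          # create .venv/
source .venv/bin/activate        # activate as usual
uv pip install -r requirements.txt
uv run python train.py           # or skip activation entirely
```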
venv + pip: always available, no setup, works everywhere. Use it when you are on a new cluster and are not sure what is installed.
uv: dramatically faster installs, smarter dependency handling, and a nicer overall experience. Use it for any new project where you can do a one-time install. On a shared cluster, install it into your own home directory — no admin rights required.
Recommended project layout with an environment
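One reasonable layout — the exact names are illustrative, not required:

```
my_project/
├── .venv/               # virtual environment (not committed)
├── .gitignore           # lists .venv/, data/, and other ignored files
├── requirements.txt     # pinned dependencies
├── job.slurm            # Slurm job script
├── train.py             # main entry point
└── data/                # datasets (not committed)
```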
Add .venv/ to your .gitignore. Virtual environments are large, machine-specific, and fully reproducible from requirements.txt — there is no reason to commit them.
07 Using tmux for persistent sessions
If you close your laptop or lose your Wi-Fi connection, any running program in your SSH session will be killed. tmux solves this by keeping your session alive on the server even when you disconnect — like leaving a TV on at home while you go out.
The basic workflow:
1. Start a session: tmux new -s mysession — give it a memorable name.
2. Run your program inside it: python myscript.py
3. Detach: press Ctrl + B, then press D. You are back at the normal shell; the session is still alive.
4. Reconnect later: tmux attach -t mysession
| Command | What it does |
|---|---|
| tmux new -s name | Start a new session called name |
| Ctrl+B, D | Detach (leave session running) |
| tmux attach -t name | Reconnect to a session |
| tmux ls | List all sessions |
| tmux kill-session -t name | Delete a session |
tmux: quick tests, debugging, interactive work that takes a few minutes.
Slurm: long training runs, batch jobs, anything needing a GPU.
08 Submitting jobs with Slurm
Slurm is a job scheduler. Instead of running your code directly, you write a small script describing your job (how long, how many CPUs, whether you need a GPU), and Slurm runs it on a compute node when resources are available.
Exploring your cluster first
| Command | What it does |
|---|---|
| sinfo | List all partitions (queues) and how many nodes are idle/busy |
| squeue | Show all currently running and queued jobs on the cluster |
| squeue -u username | Show only your own jobs |
| sacctmgr show user | Show your account and billing group information |
| module avail | List all software modules available to load |
| module list | Show which modules you currently have loaded |
Creating a CPU job script
Create the file with nano job.slurm, write the following, then save with Ctrl+O, Enter, Ctrl+X:
#!/bin/bash
#SBATCH --job-name=my_first_job
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load Python
source /home/username/my_project/.venv/bin/activate
python train.py
| Flag | Meaning |
|---|---|
| --job-name | A friendly name to identify your job in the queue |
| --time=00:30:00 | Maximum time allowed (HH:MM:SS) — job is killed if exceeded |
| --nodes=1 | Number of machines to use (almost always 1 for Python) |
| --ntasks=1 | Number of parallel tasks (1 for a single Python script) |
| --cpus-per-task=4 | How many CPU cores to allocate |
| --mem=8G | How much RAM to allocate |
| --partition=name | The queue to submit to — clusters have different partitions (e.g. short, long, gpu). Check your cluster's docs. |
| --account=name | Your billing account — required on some HPC systems. Your welcome email will specify this. |
Every HPC cluster has its own partition names, time limits, and account requirements. Always check your cluster's documentation or welcome email. Use sinfo to list partitions and sacctmgr show user to see your accounts.
Submitting, monitoring, and reading output
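A typical submit-and-check cycle — the job ID 123456 is illustrative:

```
sbatch job.slurm         # submit — prints "Submitted batch job 123456"
squeue -u username       # check status: PD = pending, R = running
scancel 123456           # cancel the job if something is wrong
cat slurm-123456.out     # read the output once it finishes
```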
When your job finishes, Slurm creates slurm-JOBID.out in your current folder. This file contains everything your Python script printed — if something went wrong, the error message is here.
Interactive sessions
Instead of a batch job, you can request an interactive shell on a compute node to test things in real time:
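A minimal request — the flags follow the same conventions as the batch script, and the values are examples:

```
# Ask for 4 cores and 8 GB for one hour, then get a shell on a compute node
srun --time=01:00:00 --cpus-per-task=4 --mem=8G --pty bash

# When you are done, leave the node (and release the resources):
exit
```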
09 Running GPU jobs & monitoring with nvidia-smi
GPU jobs follow the same pattern as CPU jobs, with a few extra flags:
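A sketch of a GPU job script — partition and module names vary by cluster, so treat gpu and CUDA as placeholders, and check whether your cluster uses --gres=gpu:1 instead of --gpus:

```
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --time=02:00:00
#SBATCH --partition=gpu        # GPU queue — exact name varies by cluster
#SBATCH --gpus=1               # request one GPU (some clusters use --gres=gpu:1)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load Python
module load CUDA               # GPU libraries for PyTorch/TensorFlow
source /home/username/my_project/.venv/bin/activate
python train.py
```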
The key additions: --partition=gpu routes your job to a GPU node (the exact name varies by cluster), --gpus=1 requests one GPU, and module load CUDA loads the GPU libraries needed by PyTorch or TensorFlow.
GPU nodes are in high demand. If your job is stuck as PD (pending) for a long time, try requesting fewer GPUs or a shorter time limit.
Monitoring GPU usage with nvidia-smi
Once your job is running on a GPU node, nvidia-smi lets you check how the GPU is being used — whether your code is actually using it, and how much memory it consumes.
| Column | What it tells you |
|---|---|
| GPU-Util | % of time the GPU is actively computing — should be high (>70%) during training |
| Memory-Usage | GPU RAM used vs. total — if it hits 100%, your job crashes with an OOM (Out of Memory) error |
| Temp | GPU temperature in °C — normal range is 30–85°C |
| Pwr:Usage/Cap | Power draw — useful for estimating job cost |
Useful nvidia-smi variants
| Command | What it does |
|---|---|
| nvidia-smi | One-time snapshot of all GPUs |
| nvidia-smi -l 2 | Live view, refreshes every 2 seconds (Ctrl+C to stop) |
| nvidia-smi -L | List all available GPUs and their names |
| watch -n 2 nvidia-smi | Alternative live view using the watch command |
| nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv | Print only GPU %, memory used, and total as CSV |
nvidia-smi only works on a GPU compute node — not on the login node. Use it inside an interactive srun session, or add it to your job script to log GPU stats to the output file.
If GPU-Util stays near 0% during training, your model is likely not on the GPU — check that you are calling .to(device) or .cuda() on your model and tensors.
10 Using Git on the cluster
Git is a version control system — it tracks changes to your code, lets you collaborate with others, and makes it easy to bring your project onto the cluster directly from GitHub or GitLab. Instead of uploading files with scp every time you make a change, you push from your laptop to GitHub and pull on the cluster.
First-time setup
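Before your first commit, tell Git who you are — use your own name and email in place of the placeholders:

```shell
git config --global user.name "Your Name"
git config --global user.email "you@university.edu"

# Optional but handy: make new repositories start on a branch called main
git config --global init.defaultBranch main
```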
The everyday workflow
1. Get the project onto the cluster: git clone https://github.com/yourname/yourproject.git
2. Check what changed: git status
3. Stage your changes: git add . (all files) or git add myscript.py (one file)
4. Commit with a message: git commit -m "Add training loop for experiment 1"
5. Sync: git push to upload your commits; git pull to download the latest changes.
Write a short message that describes what changed and why.
Good: "Fix learning rate scheduler bug"
Bad: "stuff" or "changes"
Branches, history, and undoing mistakes
| Command | What it does |
|---|---|
| git log --oneline | Show a compact list of recent commits |
| git diff | Show what you changed but haven't staged yet |
| git branch name | Create a new branch |
| git checkout name | Switch to that branch |
| git checkout -b name | Create and switch in one step |
| git merge name | Merge a branch into the current one |
| git restore file.py | Discard unsaved changes to a file |
| git stash | Temporarily set aside uncommitted changes |
| git stash pop | Bring stashed changes back |
Always create a new branch for each experiment or feature. Keep main clean and working — this way you can always go back to a known good state.
Ignoring files — .gitignore
Some files should never be committed — large datasets, model weights, caches. Create a .gitignore file in your project root:
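For example — these patterns are typical for ML projects; adjust them to your own:

```
.venv/
__pycache__/
data/
*.pt
*.ckpt
slurm-*.out
```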
git reset --hard permanently discards uncommitted changes with no undo. Use git restore or git stash instead.
11 Common first-run issues
Most frustrating first attempts fail for a small set of predictable reasons. Check this section before assuming the cluster is broken or your code is cursed.
SSH says permission denied
Double-check your username and hostname first. If they are correct, confirm whether your cluster requires VPN access, key-based login, or a one-time password instead of a normal password.
Slurm rejects your job script
Errors like invalid account or unknown partition usually mean the cluster expects different queue names or billing settings. Run sinfo and compare with your cluster documentation.
Python or pip is missing inside the job
Interactive shells and Slurm jobs do not always share the same environment. Load the Python module inside the job script itself, then activate your .venv in that script too.
ModuleNotFoundError after install
You probably installed packages into one interpreter and ran the job with another. Check which python after activation and confirm the same path is used inside the Slurm job.
Job stays pending forever
GPU jobs and long runtimes wait the longest. Ask for fewer resources, shorten the time limit, or start with a CPU test job to confirm the workflow before scaling up.
The output file looks empty
Make sure your script actually prints progress and that it reaches the relevant code path. If it exits immediately, the real error is usually at the top of slurm-JOBID.out.
12 Quick reference — all commands
Connecting & uploading files
| Command | What it does |
|---|---|
| ssh user@hpc.university.edu | Connect to the HPC cluster |
| scp file.py user@hpc.university.edu:~/ | Upload a file |
| scp -r folder/ user@hpc.university.edu:~/ | Upload a whole folder |
Terminal navigation
| Command | What it does |
|---|---|
| ls / ls -lh | List files (plain / with details) |
| cd name / cd .. | Enter folder / go up a level |
| cd ~ | Go to home folder |
| pwd | Show current location |
| mkdir name | Create a folder |
| rm file / rm -r folder | Delete file / delete folder |
| cat / head / tail | Show full file / first N lines / last N lines |
| cp file copy / cp -r a/ b/ | Copy file / copy entire folder |
| mv old new / mv file folder/ | Rename / move a file |
vim editor
| Command | What it does |
|---|---|
| vim file.py | Open a file |
| i | Enter Insert mode |
| Esc | Return to Normal mode |
| :wq | Save and quit |
| :q! | Quit without saving |
| dd / u | Delete current line / undo |
| /term | Search for text |
tmux sessions
| Command | What it does |
|---|---|
| tmux new -s name | Start a new session |
| Ctrl+B, D | Detach (leave running) |
| tmux attach -t name | Reconnect |
| tmux ls | List all sessions |
| tmux kill-session -t name | Delete a session |
Slurm job management
| Command | What it does |
|---|---|
| sbatch job.slurm | Submit a job |
| squeue -u username | Check your job status |
| scancel JOBID | Cancel a job |
| cat slurm-JOBID.out | Read job output |
| sinfo | List partitions and availability |
| srun --pty bash | Start an interactive compute session |
| module load Python | Load Python module |
| module load CUDA | Load GPU libraries |
| module avail | List all available modules |
GPU monitoring — nvidia-smi
| Command | What it does |
|---|---|
| nvidia-smi | Snapshot of all GPUs (usage, memory, temperature) |
| nvidia-smi -l 2 | Live view, refreshes every 2 seconds |
| nvidia-smi -L | List all available GPUs and their names |
| watch -n 2 nvidia-smi | Alternative live view using watch |
Python environments — venv & uv
| Command | What it does |
|---|---|
| python -m venv .venv | Create a virtual environment |
| source .venv/bin/activate | Activate the environment |
| deactivate | Leave the environment |
| pip install package | Install a package |
| pip install -r requirements.txt | Install from requirements file |
| pip freeze > requirements.txt | Save installed packages to file |
| pip list | Show all installed packages |
| which python | Confirm you are using the venv Python |
| uv venv | Create environment with uv (faster) |
| uv venv --python 3.11 | Create environment with specific Python version |
| uv pip install package | Install package with uv (10–100× faster than pip) |
| uv pip install -r requirements.txt | Install from requirements file with uv |
| uv run python train.py | Run script inside environment without activating |
Git
| Command | What it does |
|---|---|
| git clone URL | Download a repository |
| git status | See what changed |
| git add . / git add file | Stage all / a specific file |
| git commit -m "msg" | Save a snapshot |
| git push / git pull | Upload / download changes |
| git log --oneline | View recent commits |
| git branch name | Create a branch |
| git checkout name | Switch branch |
| git merge name | Merge branch into current |
| git restore file | Discard changes to a file |
| git stash / git stash pop | Shelve / restore uncommitted work |
Start small: connect to the cluster, upload a simple Python script, and submit a CPU job first. Once that works, move to GPU jobs. If anything goes wrong, the .out file is your best debugging companion.