Tutorial
PyTorch Training on Cloud GPUs: The 2026 Practical Guide
GPU selection, setup, checkpointing, and real costs
PyTorch remains the dominant framework for ML research and production fine-tuning in 2026. Its flexibility and the depth of its ecosystem — Transformers, PEFT, Lightning, vLLM — make it the default choice for teams training or fine-tuning models from scratch on their own data.
The bottleneck is almost always the same: access to cheap, reliable GPU compute with enough VRAM to fit your model. This guide covers everything you need to know to run PyTorch training jobs on cloud GPUs efficiently in 2026 — from picking the right GPU to managing costs and ensuring your checkpoints are safe.
GPU Selection for PyTorch: VRAM Requirements by Model Size
VRAM is the primary constraint when sizing a GPU for PyTorch training. In fp16, the weights alone take about 2 bytes per parameter — roughly 2× the parameter count in GB (a 7B model needs ~14 GB just to hold its weights). Full fine-tuning adds gradients and AdamW optimizer states on top of that, which is why parameter-efficient methods dominate in practice: with 4-bit quantization (QLoRA), the frozen base weights shrink to about 0.5 bytes per parameter, and only the small adapter carries gradients and optimizer state.
| Model size | Min VRAM | Recommended GPU | GhostNexus price |
|---|---|---|---|
| < 1B params | 4 GB | RTX 3090 / RTX 4070 | $0.25–$0.35/hr |
| 1B–7B params | 16 GB | RTX 4090 (24 GB) | $0.50/hr |
| 7B–13B params (fp16) | 24 GB | RTX 4090 or A100 40GB | $0.50–$2.20/hr |
| 13B–34B params (fp16) | 48–80 GB | A100 80GB | $2.20/hr |
| 34B–70B params (4-bit) | 40 GB | A100 80GB | $2.20/hr |
| > 70B params | Multi-GPU | 2× A100 80GB | $4.40/hr |
VRAM estimates assume fp16 mixed precision with the AdamW optimizer. QLoRA stores the frozen base weights in 4-bit, cutting their footprint to roughly a quarter of fp16. Prices on GhostNexus as of April 2026.
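The weights-only side of these estimates is simple arithmetic. Here is a back-of-envelope sketch using the bytes-per-parameter figures above — it deliberately ignores activations, gradients, and optimizer states, which add more during training:

```python
# Back-of-envelope VRAM estimate (weights only), using rule-of-thumb
# bytes-per-parameter figures. Training overhead is NOT included.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_vram_gb(n_params: float, dtype: str = "fp16") -> float:
    """Approximate GB needed to hold the weights alone."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

# A 7B model: ~14 GB in fp16, ~3.5 GB in 4-bit.
print(round(weight_vram_gb(7e9, "fp16"), 1))  # 14.0
print(round(weight_vram_gb(7e9, "nf4"), 1))   # 3.5
```

Treat the result as a floor, not a budget: batch size, sequence length, and activation checkpointing settings decide how much headroom you need beyond it.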
For most practical fine-tuning workloads — adapting a 7B or 8B model on a custom dataset — the RTX 4090 at $0.50/hr is the sweet spot. It handles Llama-3.1 8B in 4-bit without offloading, supports all major PEFT methods (LoRA, QLoRA, prefix tuning), and is fast enough to complete a 10k-sample fine-tune in under 4 hours.
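For reference, a QLoRA setup of the kind described above takes only a few lines with the Hugging Face transformers and peft libraries. This is a minimal sketch — the model name and hyperparameters (rank, alpha, target modules) are illustrative, not prescriptive:

```python
# Minimal QLoRA sketch: 4-bit base model + LoRA adapters via peft.
# Model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # adapters are a tiny fraction of 8B
```

Only the adapter parameters are trainable, which is what keeps the whole job inside 24 GB.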
Setting Up a PyTorch Training Job on GhostNexus
GhostNexus runs sandboxed jobs: your script is uploaded, executed on the selected GPU, and logs are returned. There is no persistent container to manage. Install the SDK once, then launch jobs directly from your local machine or CI pipeline.
# 3-line setup — install, configure, run
pip install ghostnexus

import ghostnexus as gn
result = gn.run(script="train.py", gpu="rtx-4090", region="eu-west")
The gn.run() call uploads your train.py, allocates the GPU, executes the job, streams logs in real time, and returns a result object containing your outputs, logs, and a full audit trail. Per-second billing means you pay only for actual GPU time consumed.
You can also pass additional files (datasets, config files) and environment variables:
result = gn.run(
script="train.py",
gpu="rtx-4090",
region="eu-west",
files=["config.yaml", "data/train.jsonl"],
env={"HF_TOKEN": "hf_xxx", "WANDB_API_KEY": "xxx"},
timeout=14400 # 4h max, safety cutoff
)

DataLoader Best Practices for Cloud Training
Cloud sandboxes have fast local SSDs but no persistent network storage between jobs. Follow these practices to avoid common pitfalls:
Use local temp files — never network paths inside the job
Mount your dataset as a local file passed via the files= parameter. Inside your train.py, reference it by local path (e.g. /tmp/train.jsonl). Fetching datasets from a remote URL inside the sandbox adds latency and can time out on large files.
Pre-tokenize and cache before uploading
Run tokenization locally and save the result as a .pt or .arrow file. Tokenizing inside the sandbox wastes billable GPU time. A 50k-sample tokenized dataset stored as a PyTorch tensor file is typically under 500 MB — fast to upload and instant to load.
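The caching pattern itself is a save/load round trip. A minimal sketch — the token IDs here are dummy values standing in for real tokenizer output, and the shapes are scaled down from the 50k-sample example:

```python
# Sketch: cache tokenized data as a tensor file locally, then load it
# instantly inside the sandbox. Dummy IDs stand in for tokenizer output.
import torch

input_ids = torch.randint(0, 32000, (1000, 128), dtype=torch.long)
attention_mask = torch.ones_like(input_ids)

# Run locally, before uploading via files=["train_tokenized.pt"]:
torch.save({"input_ids": input_ids, "attention_mask": attention_mask},
           "train_tokenized.pt")

# Inside train.py on the GPU: one fast load, no tokenization on the clock.
batch = torch.load("train_tokenized.pt")
print(batch["input_ids"].shape)  # torch.Size([1000, 128])
```

Loading a pre-built tensor file takes seconds even for large datasets, versus minutes of billable GPU time spent re-tokenizing.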
Set num_workers carefully for sandboxed environments
Sandboxed jobs share host CPU resources. Set DataLoader num_workers=2 or num_workers=4 as a starting point. Using num_workers=0 is safe but slower; too many workers in a shared environment can cause fork errors or memory pressure.
Enable pin_memory=True for GPU jobs
When using a CUDA GPU, set pin_memory=True in your DataLoader. This speeds up host-to-device transfers by using page-locked memory, at the cost of slightly higher CPU RAM usage — a worthwhile trade-off for training loops.
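The two DataLoader recommendations above combine into a few lines. A sketch with a toy in-memory dataset — the batch size and worker count are starting points, not tuned values:

```python
# DataLoader tuned for a sandboxed GPU job: modest worker count,
# pinned memory only when a CUDA device is actually present.
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(
    ds,
    batch_size=32,
    shuffle=True,
    num_workers=2,                        # 2-4 is a safe start in shared sandboxes
    pin_memory=torch.cuda.is_available(), # page-locked memory for faster H2D copies
)

xb, yb = next(iter(loader))
print(xb.shape)  # torch.Size([32, 16])
```

Gating `pin_memory` on `torch.cuda.is_available()` lets the same script run unchanged on a CPU-only machine during local debugging.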
Checkpointing Strategy for Cloud Jobs
Cloud jobs can be interrupted — by timeouts, preemption (on community tiers), or network issues. A solid checkpointing strategy means you never lose more than N steps of training.
# Save checkpoint every 500 steps — safe default for long runs
from transformers import TrainingArguments
args = TrainingArguments(
output_dir="./checkpoints",
save_steps=500,
save_total_limit=3, # Keep last 3 checkpoints only
logging_steps=50,
report_to="none" # Disable W&B/MLflow if not configured
)

GhostNexus jobs write outputs to ./outputs/ by default, which is automatically returned to you via result.outputs at job completion. For long-running jobs, you can retrieve intermediate logs in real time:
# Stream logs in real time during training
for log_line in gn.stream_logs(result.job_id):
print(log_line)
# Retrieve output files after completion
result.download_outputs("./local_checkpoints/")

- Save every 500 steps for jobs under 10k steps; every 1000 steps for longer runs.
- Keep save_total_limit=3 to avoid filling disk with redundant checkpoints.
- Log loss and learning rate every 50 steps to enable early stopping without restarting from step 0.
- Store your best checkpoint separately: trainer.save_model('./best_model') after evaluation.
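If you write your own training loop instead of using the transformers Trainer, the same strategy is a few lines of plain PyTorch. A minimal sketch — the model, optimizer, and paths are toy stand-ins:

```python
# Minimal save/resume checkpoint pattern for a hand-rolled training loop.
# Model, optimizer, and path are illustrative stand-ins.
import os
import torch

model = torch.nn.Linear(8, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
ckpt_path = "outputs/ckpt_last.pt"

def save_checkpoint(step: int) -> None:
    os.makedirs("outputs", exist_ok=True)
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": opt.state_dict()}, ckpt_path)

def load_checkpoint() -> int:
    """Restore model/optimizer state; return the step to resume from."""
    if not os.path.exists(ckpt_path):
        return 0                      # fresh run, start at step 0
    state = torch.load(ckpt_path)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["optimizer"])
    return state["step"]

save_checkpoint(500)
print(load_checkpoint())  # 500
```

Saving the optimizer state alongside the weights matters: resuming AdamW without its moment estimates effectively restarts the optimizer and can destabilize the loss for the first few hundred steps.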
Cost Examples: What Common PyTorch Training Jobs Actually Cost
Per-second billing means you pay only for actual GPU time. The following estimates are based on real runs on GhostNexus hardware in April 2026.
| Training job | GPU | Duration | Est. cost | Notes |
|---|---|---|---|---|
| ResNet-50 on ImageNet (90 epochs) | RTX 4090 | ~6h | ~$3.00 | Batch size 256, fp16 AMP |
| BERT-base fine-tuning (GLUE/SST-2, 5 epochs) | RTX 4090 | ~45 min | ~$0.38 | Seq len 128, batch 32 |
| Llama-3.1 8B QLoRA (10k samples, 3 epochs) | RTX 4090 | ~3.5h | ~$1.75 | 4-bit quant, LoRA r=16 |
| Llama-3.1 70B QLoRA (10k samples, 1 epoch) | A100 80GB | ~8h | ~$17.60 | 4-bit quant, LoRA r=8 |
Costs are estimates based on GhostNexus pricing (April 2026). Actual durations vary with dataset size, hardware warm-up time, and model architecture. Per-second billing means no waste on short jobs.
For context: a Llama-3.1 8B QLoRA fine-tune on 10k samples at $1.75 compares to roughly $8–12 on AWS (a p3.2xlarge billed by the hour) or $6–9 on RunPod Secure Cloud. The per-second billing advantage on GhostNexus is especially significant for runs under 2 hours, where hourly-billing platforms round up to the next full hour.
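The rounding effect is simple arithmetic. A sketch using the article's RTX 4090 rate ($0.50/hr); the $2.50/hr hourly-billed comparison rate is purely illustrative:

```python
# Per-second vs hourly billing for a short job. The $2.50/hr
# hourly-billed comparison rate is an illustrative assumption.
import math

def per_second_cost(hours: float, rate_per_hr: float) -> float:
    return hours * rate_per_hr             # billed to the second, no rounding

def hourly_cost(hours: float, rate_per_hr: float) -> float:
    return math.ceil(hours) * rate_per_hr  # rounded up to the next full hour

# 3.5 h Llama-3.1 8B QLoRA run:
print(per_second_cost(3.5, 0.50))  # 1.75
print(hourly_cost(3.5, 2.50))      # 10.0
```

The shorter the job, the worse the rounding penalty: a 65-minute run on an hourly platform pays for two full hours.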
Run Your First PyTorch Job on GhostNexus
Get $15 in free credits — enough for a full Llama-3.1 8B QLoRA fine-tune or 30 hours of ResNet training. No contract, no minimum spend, per-second billing.
Start free — $15 bonus credits