Tutorial

PyTorch Training on Cloud GPUs: The 2026 Practical Guide
GPU selection, setup, checkpointing, and real costs

8 min read · By the GhostNexus team

PyTorch remains the dominant framework for ML research and production fine-tuning in 2026. Its flexibility and the depth of its ecosystem — Transformers, PEFT, Lightning, vLLM — make it the default choice for teams training or fine-tuning models from scratch on their own data.

The bottleneck is almost always the same: access to cheap, reliable GPU compute with enough VRAM to fit your model. This guide covers everything you need to know to run PyTorch training jobs on cloud GPUs efficiently in 2026 — from picking the right GPU to managing costs and ensuring your checkpoints are safe.

GPU Selection for PyTorch: VRAM Requirements by Model Size

VRAM is the primary constraint when sizing a GPU for PyTorch training. As a rule of thumb, fp16 model weights alone take approximately 2× the parameter count in GB (e.g., a 7B parameter model needs roughly 14 GB just to hold its weights); full fine-tuning adds gradients and optimizer states on top, which is why the table below recommends 24 GB for 7B-class models. With 4-bit quantization (QLoRA), the frozen base weights drop to around 0.5× the parameter count in GB — but activations and the optimizer states for the trainable adapters still require additional memory.
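The rule of thumb above can be expressed as a small helper for back-of-the-envelope sizing (a sketch — the bytes-per-parameter constants are rough heuristics for weight memory only, not exact measurements):

```python
def estimate_weight_vram_gb(params_billion: float, precision: str = "fp16") -> float:
    """Rough VRAM needed just to hold model weights, in GB.

    Heuristic bytes per parameter: fp32=4, fp16/bf16=2, int8=1, 4-bit=0.5.
    Training adds gradients and optimizer states on top of this.
    """
    bytes_per_param = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}
    return params_billion * bytes_per_param[precision]

# A 7B model in fp16: ~14 GB for weights alone
print(estimate_weight_vram_gb(7, "fp16"))  # → 14.0
# The same model with a 4-bit quantized base (QLoRA): ~3.5 GB
print(estimate_weight_vram_gb(7, "4bit"))  # → 3.5
```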

Model size | Min VRAM | Recommended GPU | GhostNexus price
< 1B params | 4 GB | RTX 3090 / RTX 4070 | $0.25–$0.35/hr
1B–7B params | 16 GB | RTX 4090 (24 GB) | $0.50/hr
7B–13B params (fp16) | 24 GB | RTX 4090 or A100 40GB | $0.50–$2.20/hr
13B–34B params (fp16) | 48–80 GB | A100 80GB | $2.20/hr
34B–70B params (4-bit) | 40 GB | A100 80GB | $2.20/hr
> 70B params | Multi-GPU | 2× A100 80GB | $4.40/hr

VRAM estimates assume fp16 mixed precision with the AdamW optimizer. QLoRA (4-bit) shrinks the frozen base weights to roughly a quarter of their fp16 size. Prices on GhostNexus as of April 2026.

For most practical fine-tuning workloads — adapting a 7B or 8B model on a custom dataset — the RTX 4090 at $0.50/hr is the sweet spot. It handles Llama-3.1 8B in 4-bit without offloading, supports all major PEFT methods (LoRA, QLoRA, prefix tuning), and is fast enough for 10k-sample datasets to complete in under 4 hours.
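A typical QLoRA setup for an 8B model on a 24 GB card looks roughly like this (a sketch using the Hugging Face transformers and peft libraries; the model id and hyperparameters are illustrative, not a tuned recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit NF4 quantization (weights stay frozen)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",      # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```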

Setting Up a PyTorch Training Job on GhostNexus

GhostNexus runs sandboxed jobs: your script is uploaded, executed on the selected GPU, and logs are returned. There is no persistent container to manage. Install the SDK once, then launch jobs directly from your local machine or CI pipeline.

# 3-line setup — install, configure, run
pip install ghostnexus

import ghostnexus as gn
result = gn.run(script="train.py", gpu="rtx-4090", region="eu-west")

The gn.run() call uploads your train.py, allocates the GPU, executes the job, streams logs in real time, and returns a result object containing your outputs, logs, and a full audit trail. Per-second billing means you pay only for actual GPU time consumed.

You can also pass additional files (datasets, config files) and environment variables:

result = gn.run(
    script="train.py",
    gpu="rtx-4090",
    region="eu-west",
    files=["config.yaml", "data/train.jsonl"],
    env={"HF_TOKEN": "hf_xxx", "WANDB_API_KEY": "xxx"},
    timeout=14400   # 4h max, safety cutoff
)

DataLoader Best Practices for Cloud Training

Cloud sandboxes have fast local SSDs but no persistent network storage between jobs. Follow these practices to avoid common pitfalls:

Use local temp files — never network paths inside the job

Mount your dataset as a local file passed via the files= parameter. Inside your train.py, reference it by local path (e.g. /tmp/train.jsonl). Fetching datasets from a remote URL inside the sandbox adds latency and can time out on large files.

Pre-tokenize and cache before uploading

Run tokenization locally and save the result as a .pt or .arrow file. Tokenizing inside the sandbox wastes billable GPU time. A 50k-sample tokenized dataset stored as a PyTorch tensor file is typically under 500 MB — fast to upload and instant to load.
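The caching pattern is simple (a minimal sketch — `toy_tokenize` is a hypothetical stand-in for your real tokenizer, e.g. transformers' `AutoTokenizer`):

```python
import torch

def toy_tokenize(texts, vocab):
    """Stand-in for a real tokenizer: maps whitespace-split words to integer ids."""
    return [torch.tensor([vocab.get(w, 0) for w in t.split()]) for t in texts]

# Run this locally, once, before uploading the job
vocab = {"hello": 1, "world": 2}
token_ids = toy_tokenize(["hello world", "world hello hello"], vocab)
torch.save({"input_ids": token_ids}, "train_tokens.pt")

# Inside train.py (in the sandbox): loading is instant, no billable GPU time
# spent on tokenization
cached = torch.load("train_tokens.pt")
print(len(cached["input_ids"]))  # → 2
```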

Set num_workers carefully for sandboxed environments

Sandboxed jobs share host CPU resources. Set DataLoader num_workers=2 or num_workers=4 as a starting point. Using num_workers=0 is safe but slower; too many workers in a shared environment can cause fork errors or memory pressure.

Enable pin_memory=True for GPU jobs

When using a CUDA GPU, set pin_memory=True in your DataLoader. This speeds up host-to-device transfers by using page-locked memory, at the cost of slightly higher CPU RAM usage — a worthwhile trade-off for training loops.
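Putting the two settings together, a sandbox-friendly DataLoader might look like this (a sketch with a toy dataset; batch size and worker count are starting points to tune, not fixed recommendations):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your pre-tokenized tensors
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2,                         # conservative default for shared hosts
    pin_memory=torch.cuda.is_available(),  # page-locked memory only helps with a GPU
)

xb, yb = next(iter(loader))
print(xb.shape)  # → torch.Size([32, 16])
```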

Checkpointing Strategy for Cloud Jobs

Cloud jobs can be interrupted — by timeouts, preemption (on community tiers), or network issues. A solid checkpointing strategy means you never lose more than N steps of training.

# Save checkpoint every 500 steps — safe default for long runs
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    save_steps=500,
    save_total_limit=3,   # Keep last 3 checkpoints only
    logging_steps=50,
    report_to="none"      # Disable W&B/MLflow if not configured
)

GhostNexus jobs write outputs to ./outputs/ by default, which is automatically returned to you via result.outputs at job completion. For long-running jobs, you can retrieve intermediate logs in real time:

# Stream logs in real time during training
for log_line in gn.stream_logs(result.job_id):
    print(log_line)

# Retrieve output files after completion
result.download_outputs("./local_checkpoints/")

  • Save every 500 steps for jobs under 10k steps; every 1000 steps for longer runs.
  • Keep save_total_limit=3 to avoid filling disk with redundant checkpoints.
  • Log loss and learning rate every 50 steps to enable early stopping without restarting from step 0.
  • Store your best checkpoint separately: trainer.save_model('./best_model') after evaluation.
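If you are running a raw PyTorch loop instead of the Trainer, the same strategy takes only a few lines (a sketch with a stand-in model; the path and step counter are placeholders — on GhostNexus you would write under ./outputs/ so checkpoints come back with the job):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)  # stand-in for your model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step, path="ckpt.pt"):
    # Save everything needed to resume: weights, optimizer state, step counter
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(path="ckpt.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step

save_checkpoint(500)
print(load_checkpoint())  # → 500
```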

Cost Examples: What Common PyTorch Training Jobs Actually Cost

Per-second billing means you pay only for actual GPU time. The following estimates are based on real runs on GhostNexus hardware in April 2026.

Training job | GPU | Duration | Est. cost | Notes
ResNet-50 on ImageNet (90 epochs) | RTX 4090 | ~6h | ~$3.00 | Batch size 256, fp16 AMP
BERT-base fine-tuning (GLUE/SST-2, 5 epochs) | RTX 4090 | ~45 min | ~$0.38 | Seq len 128, batch 32
Llama-3.1 8B QLoRA (10k samples, 3 epochs) | RTX 4090 | ~3.5h | ~$1.75 | 4-bit quant, LoRA r=16
Llama-3.1 70B QLoRA (10k samples, 1 epoch) | A100 80GB | ~8h | ~$17.60 | 4-bit quant, LoRA r=8

Costs are estimates based on GhostNexus pricing (April 2026). Actual durations vary with dataset size, hardware warm-up time, and model architecture. Per-second billing means no waste on short jobs.

For context: a Llama-3.1 8B QLoRA fine-tune on 10k samples at $1.75 compares to roughly $8–12 on AWS (p3.2xlarge with hourly billing, no GDPR compliance) or $6–9 on RunPod Secure Cloud. The per-second billing advantage on GhostNexus is especially significant for runs under 2 hours, where hourly-billing platforms round up to the next full hour.
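The rounding effect is easy to quantify (a sketch; the $0.50/hr rate is the RTX 4090 price quoted above):

```python
import math

def per_second_cost(hours: float, rate_per_hr: float) -> float:
    """Per-second billing: pay exactly for the time used."""
    return round(hours * rate_per_hr, 2)

def hourly_cost(hours: float, rate_per_hr: float) -> float:
    """Hourly billing: partial hours round up to the next full hour."""
    return round(math.ceil(hours) * rate_per_hr, 2)

# 3.5h QLoRA run on an RTX 4090 at $0.50/hr
print(per_second_cost(3.5, 0.50))   # → 1.75
print(hourly_cost(3.5, 0.50))       # → 2.0
# 45-minute BERT fine-tune: the gap is proportionally larger
print(per_second_cost(0.75, 0.50))  # → 0.38
print(hourly_cost(0.75, 0.50))      # → 0.5
```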

Run Your First PyTorch Job on GhostNexus

Get $15 in free credits — enough for a full Llama-3.1 8B QLoRA fine-tune or 30 hours of ResNet training. No contract, no minimum spend, per-second billing.

Start free — $15 bonus credits