
ML Engineering Guide

How to Train an LLM on Cloud GPU: Step-by-Step Python Guide 2026
QLoRA fine-tuning with HuggingFace Transformers and EU compute

12 min read · By the GhostNexus team

Training or fine-tuning a large language model on a cloud GPU is the most cost-effective path for teams that need domain-specific models without paying hyperscaler prices. In 2026, the combination of QLoRA quantization, the HuggingFace Transformers ecosystem, and affordable cloud GPU rentals makes it possible to fine-tune a 7B–13B parameter model for under $20, often in under two hours.

This guide walks through the complete pipeline: dataset preparation, environment setup, QLoRA fine-tuning configuration, job submission to a cloud GPU via the GhostNexus SDK, and model export. All examples use Python and are tested on A100 80GB (single node) and RTX 4090 (single node, 24GB VRAM) configurations.

If your team processes EU personal data during training, the guide also covers how to route your jobs through GDPR-compliant EU infrastructure using a single parameter — no architecture changes required.

Prerequisites and Environment Setup

You will need Python 3.10+, a HuggingFace account (free), and a GhostNexus account with at least $5 in credits. The full fine-tune in this guide costs approximately $3.50 on an A100 80GB and $4.80 on a 2x RTX 4090 configuration.

# Create a virtual environment
python -m venv llm-env
source llm-env/bin/activate  # Linux/macOS
# llm-env\Scripts\activate  # Windows

# Install dependencies
pip install transformers==4.40.0 \
            datasets==2.19.0 \
            peft==0.10.0 \
            bitsandbytes==0.43.0 \
            accelerate==0.29.0 \
            trl==0.8.6 \
            ghostnexus>=0.5.0

Step 1: Prepare Your Dataset

Fine-tuning works best when your dataset closely matches the target task. For instruction fine-tuning (the most common use case), your data should be formatted as instruction-response pairs. The standard format used by most open-source models is the Alpaca format:

# dataset_prep.py
from datasets import Dataset
import json

# Load your data (JSONL format)
# Each line: {"instruction": "...", "input": "...", "output": "..."}
def load_dataset_from_jsonl(path: str) -> Dataset:
    records = []
    with open(path) as f:
        for line in f:
            records.append(json.loads(line.strip()))
    return Dataset.from_list(records)

def format_alpaca(example: dict) -> dict:
    """Format as instruction-following prompt."""
    if example.get("input"):
        prompt = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        prompt = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": prompt}

dataset = load_dataset_from_jsonl("data/train.jsonl")
dataset = dataset.map(format_alpaca)
dataset.save_to_disk("data/formatted")
print(f"Dataset ready: {len(dataset)} examples")
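
Before paying for GPU time, it is worth validating the raw JSONL locally. A stdlib-only check like the sketch below catches malformed lines and missing fields; the required keys match the Alpaca-style records above, but the strictness rules (non-empty values, "input" treated as optional) are my own assumptions:

```python
# validate_jsonl.py — sanity-check a JSONL instruction dataset (stdlib only)
import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional in Alpaca format

def validate_jsonl_line(line: str, lineno: int) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line {lineno}: invalid JSON ({exc})"]
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"line {lineno}: missing keys {sorted(missing)}")
    for key in REQUIRED_KEYS & record.keys():
        if not str(record[key]).strip():
            problems.append(f"line {lineno}: empty '{key}'")
    return problems

def validate_file(path: str) -> list[str]:
    """Validate every non-blank line of a JSONL file."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if line.strip():
                problems.extend(validate_jsonl_line(line, lineno))
    return problems
```

Run it once before `dataset_prep.py`; a single malformed line discovered mid-training wastes the whole GPU run.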

Aim for at least 500 examples. Quality matters more than quantity — 500 high-quality domain examples outperform 10,000 noisy ones. If your dataset contains personal data (names, emails, etc.), ensure you have a legal basis for processing and use a GDPR-compliant cloud for training.
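
For a first pass at stripping obvious identifiers before upload, a regex sweep over the text fields catches emails and phone-like numbers. This is illustrative only: regex redaction is not sufficient for GDPR compliance on its own, and the patterns below are assumptions you should tune for your data:

```python
# redact_pii.py — naive first-pass redaction of emails and phone numbers
# NOTE: a regex sweep is NOT a complete anonymization strategy; treat it as
# a pre-filter ahead of proper review, not as a GDPR compliance measure.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Very rough international phone pattern; tune for your data.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

You can apply it during formatting, e.g. `dataset.map(lambda ex: {"text": redact(ex["text"])})`.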

Step 2: QLoRA Fine-Tuning Script

QLoRA (Quantized Low-Rank Adaptation) is the standard approach for fine-tuning 7B–70B models on single-GPU configurations. It loads the base model in 4-bit quantization and trains only a small number of adapter parameters — typically 0.1–1% of total model parameters.
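
That adapter footprint is easy to verify with back-of-envelope arithmetic: a LoRA adapter attached to a weight matrix of shape (d_out, d_in) adds r × (d_in + d_out) trainable parameters. The sketch below plugs in Mistral-7B-style shapes (hidden size 4096, grouped-query attention with 1024-dim k/v projections, 32 layers); these dimensions are assumptions taken from the public model config, so double-check them against your base model:

```python
# lora_param_estimate.py — back-of-envelope LoRA adapter size
# Dimensions are assumptions from the public Mistral-7B config.
R = 16          # LoRA rank (r in LoraConfig)
HIDDEN = 4096   # hidden size
KV_OUT = 1024   # k/v projection output (8 KV heads x head_dim 128)
N_LAYERS = 32

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two low-rank matrices: (r x d_in) and (d_out x r)."""
    return r * (d_in + d_out)

# target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"]
per_layer = (
    lora_params(HIDDEN, HIDDEN, R)    # q_proj: 4096 -> 4096
    + lora_params(HIDDEN, KV_OUT, R)  # k_proj: 4096 -> 1024
    + lora_params(HIDDEN, KV_OUT, R)  # v_proj: 4096 -> 1024
    + lora_params(HIDDEN, HIDDEN, R)  # o_proj: 4096 -> 4096
)
total = per_layer * N_LAYERS
print(f"Trainable adapter params: {total / 1e6:.1f}M "
      f"({100 * total / 7.25e9:.2f}% of ~7.25B base params)")
```

With r=16 on the four attention projections this comes to roughly 13.6M trainable parameters, about 0.19% of the base model, squarely in the 0.1–1% range mentioned above.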

# train.py — QLoRA fine-tuning with HuggingFace
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# ── Configuration ──────────────────────────────────────────────
BASE_MODEL = "mistralai/Mistral-7B-v0.3"    # or meta-llama/Meta-Llama-3-8B
OUTPUT_DIR = "/tmp/output"
DATASET_PATH = "/tmp/data/formatted"

QLORA_CONFIG = LoraConfig(
    r=16,                    # rank — higher rank = more trainable params, often better quality
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

BNB_CONFIG = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

TRAINING_ARGS = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch = 16
    learning_rate=2e-4,
    fp16=False,
    bf16=True,                           # bf16 on Ampere or newer (A100/H100/RTX 4090)
    logging_steps=25,
    save_steps=200,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    report_to="none",
)

# ── Load model and tokenizer ────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=BNB_CONFIG,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False
model = get_peft_model(model, QLORA_CONFIG)
model.print_trainable_parameters()

# ── Train ───────────────────────────────────────────────────────
dataset = load_from_disk(DATASET_PATH)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=TRAINING_ARGS,
    packing=False,
)

trainer.train()
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("Training complete. Model saved to", OUTPUT_DIR)

Step 3: Submit to Cloud GPU via GhostNexus SDK

Once your script is ready locally, submit it to a cloud GPU using the GhostNexus SDK. The SDK handles file upload, job scheduling, GPU allocation, and result download automatically.

# submit_job.py
import ghostnexus as gn

# Authenticate (set GHOSTNEXUS_API_KEY env var or pass api_key=)
gn.configure(api_key="your_api_key_here")

job = gn.Job(
    script="train.py",
    files={
        "data/formatted": "./local_data/formatted",   # upload dataset
    },
    gpu="a100-80gb",          # or "rtx-4090", "h100-sxm"
    region="eu-west",         # GDPR-compliant EU routing
    timeout_hours=4,
    env={
        "HF_TOKEN": "hf_...",          # HuggingFace token for model download
        "TRANSFORMERS_CACHE": "/tmp/hf_cache",
    },
    result_paths=["/tmp/output"],      # download on completion
)

print(f"Job submitted: {job.id}")
print(f"Estimated cost: ${job.estimated_cost_usd}")

# Wait and stream logs
for log_line in job.stream_logs():
    print(log_line, end="")

# Download results
job.download_results("./local_output")
print(f"Final cost: ${job.cost_usd}")
print(f"Audit log: {job.audit_log_url}")   # AI Act compliance record

Step 4: Merge LoRA Adapters and Export

After training, you have a base model plus LoRA adapters. For deployment, merge them into a single model. This runs locally on a CPU-only machine in 5–10 minutes; budget roughly 16 GB of free RAM to hold a 7B model in fp16:

# merge_and_export.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL = "mistralai/Mistral-7B-v0.3"
ADAPTER_PATH = "./local_output"
MERGED_OUTPUT = "./merged_model"

print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",
)

print("Loading adapters and merging...")
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model = model.merge_and_unload()

print("Saving merged model...")
model.save_pretrained(MERGED_OUTPUT)
AutoTokenizer.from_pretrained(ADAPTER_PATH).save_pretrained(MERGED_OUTPUT)
print("Done. Merged model at:", MERGED_OUTPUT)

GPU and Cost Reference for LLM Fine-Tuning (2026)

GPU            VRAM     Max model size    GhostNexus price   Est. 7B fine-tune cost
RTX 4090       24 GB    7B (QLoRA)        $0.69/hr           ~$2.10
A100 40GB      40 GB    13B (QLoRA)       $1.60/hr           ~$2.40
A100 80GB      80 GB    30B (QLoRA)       $2.20/hr           ~$3.30
H100 SXM       80 GB    30B+ (bf16)       $3.80/hr           ~$3.80
2x A100 80GB   160 GB   65B (QLoRA)       $4.40/hr           ~$6.60

Estimated fine-tune cost assumes 3 epochs on 1,000 examples at 2,048 tokens max length. Actual costs vary by dataset size and training configuration. All GhostNexus nodes are EU/EEA.
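
The per-run estimates follow from simple arithmetic: total tokens processed (examples × max length × epochs) divided by training throughput gives wall-clock hours, which you multiply by the hourly rate. The helper below re-runs that estimate for your own numbers; the throughput figure used is back-solved from the A100 row of the table, not a benchmark, so substitute a measured value:

```python
# cost_estimate.py — rough fine-tune cost from dataset size and GPU rate
def estimate_cost_usd(
    n_examples: int,
    epochs: int,
    max_seq_len: int,
    tokens_per_sec: float,   # effective training throughput; measure your own
    hourly_rate_usd: float,
) -> float:
    """Rough wall-clock-based cost: total tokens / throughput * hourly rate."""
    total_tokens = n_examples * max_seq_len * epochs
    hours = total_tokens / tokens_per_sec / 3600
    return round(hours * hourly_rate_usd, 2)

# Table scenario: 1,000 examples, 3 epochs, 2,048-token sequences on an
# A100 80GB at $2.20/hr. ~1,100 tokens/sec is an assumed throughput that
# roughly reproduces the table's ~$3.30 estimate.
print(estimate_cost_usd(1000, 3, 2048, 1100, 2.20))
```

Because the formula is linear in dataset size and epochs, doubling either roughly doubles the bill; throughput is the only term you need to measure.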

GDPR Compliance for LLM Training

If your training dataset contains personal data — customer support tickets, medical records, HR data, or any data relating to identified EU individuals — GDPR applies to the training process itself. This means your GPU provider must be able to sign a GDPR-compliant DPA, and processing must occur within the EU/EEA or under adequate safeguards.

The single line region="eu-west" in the GhostNexus SDK ensures your job is routed exclusively to EU/EEA nodes. Combined with the available DPA, this satisfies GDPR Article 28 requirements and provides the processing records required by Article 30.

Start Training Your LLM on Cloud GPU

$15 free credits. No credit card required. GDPR-compliant EU infrastructure. Fine-tune your first model in under 2 hours.

Get $15 free credits — start training
