Fine-Tune Llama 3 on a Cloud GPU for Under $5
Fine-tuning Llama 3 8B used to mean paying $50+ for a reserved GPU instance and fighting with SSH, Docker, and driver versions. With QLoRA and per-second cloud billing, a full 1-epoch fine-tuning run now costs under $5. This guide walks through the complete setup — from dataset to trained adapter — using HuggingFace PEFT and GhostNexus.
TL;DR
- Llama 3 8B + QLoRA fits in 12 GB VRAM (RTX 4070 or better)
- 1 epoch on 1,000 examples takes ~20–30 minutes on an RTX 4090
- At $0.50/hr that's $0.17–$0.25 per run
- The trained LoRA adapter is ~100 MB — easy to ship
Prerequisites
- Python 3.10+ locally (just to prepare the script)
- HuggingFace account + access to meta-llama/Meta-Llama-3-8B
- GhostNexus account with $5 of credits (get $15 free here)
- The GhostNexus client: pip install ghostnexus
Why QLoRA Makes This Cheap
Fully fine-tuning Llama 3 8B means storing all 8 billion parameters in fp16 — about 16 GB for the weights alone, plus activations, gradients, and optimizer state. You'd need an A100 80GB and hours of training time.
QLoRA (Quantized Low-Rank Adaptation) solves this two ways: the base model loads in 4-bit quantization (~4 GB for 8B params), and only a small adapter is actually trained (~10M trainable parameters vs 8B frozen). The result: the whole setup fits in 12 GB VRAM, and each run costs a fraction of full fine-tuning.
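To sanity-check that trainable-parameter count: LoRA replaces each frozen (d_out × d_in) weight with two small trainable factors, B (d_out × r) and A (r × d_in), so each adapted module adds r·(d_in + d_out) parameters. Using Llama 3 8B's published shapes (hidden size 4096, 32 layers, grouped-query attention so k/v project to 1024), a back-of-the-envelope count for r=16 on the four attention projections lands in the same ballpark as the ~10M quoted above:

```python
# Back-of-the-envelope LoRA parameter count for Llama 3 8B's attention layers.
HIDDEN = 4096   # hidden size
KV_DIM = 1024   # 8 KV heads x 128 head dim (grouped-query attention)
LAYERS = 32
RANK = 16

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A is (r, d_in), B is (d_out, r) -> r * (d_in + d_out) new parameters.
    return r * (d_in + d_out)

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total = per_layer * LAYERS
print(f"{total:,} trainable params")  # → 13,631,488 trainable params
```

The exact figure peft reports depends on which modules you target — adding the MLP projections (gate/up/down) roughly triples it.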
VRAM requirements with QLoRA
| Model | VRAM (QLoRA 4-bit) | Recommended GPU | Cost |
|---|---|---|---|
| Llama 3 8B | ~10–12 GB | RTX 4070 / 4080 | $0.25–$0.35/hr |
| Llama 3 8B (batch=4) | ~18 GB | RTX 4090 (24 GB) | $0.50/hr |
| Llama 3 70B | ~48 GB | RTX A6000 (48 GB) | $0.70/hr |
| Llama 3 70B (full) | ~140 GB | A100 80GB ×2 | $4.40/hr |
The Fine-Tuning Script
Save this as finetune_llama3.py. It uses the HuggingFace peft library for QLoRA and transformers for the model. The dataset here is a small instruction-following example — swap it for your own.
import os
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# ── Config ─────────────────────────────────────────────────────────────
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
HF_TOKEN = os.environ["HF_TOKEN"]  # set via GhostNexus env vars
OUTPUT = "./llama3-finetuned-adapter"
MAX_LEN = 512
EPOCHS = 1
BATCH = 4

# ── Quantization (loads model in 4-bit — fits in 12 GB VRAM) ───────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# ── Load model & tokenizer ─────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
)
model.config.use_cache = False

# ── LoRA adapter config ────────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,  # rank — higher = more capacity, more VRAM
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~13.6M (~0.17% of 8B)

# ── Dataset ────────────────────────────────────────────────────────────
# Replace with your own instruction-response pairs
examples = [
    {"text": "<|system|>You are a helpful assistant.<|user|>What is gradient descent?<|assistant|>Gradient descent is an optimization algorithm that minimizes a loss function by iteratively moving in the direction of steepest descent. Each step size is controlled by the learning rate."},
    {"text": "<|system|>You are a helpful assistant.<|user|>Explain overfitting in one sentence.<|assistant|>Overfitting happens when a model memorizes training data so well that it fails to generalize to new, unseen examples."},
    # ... add your 1000+ examples here
]
dataset = Dataset.from_list(examples)

# ── Training ───────────────────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir=OUTPUT,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH,
    gradient_accumulation_steps=2,
    warmup_ratio=0.03,
    learning_rate=2e-4,
    bf16=True,  # matches the bfloat16 compute dtype set in bnb_config
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=MAX_LEN,
)
print("Starting fine-tuning...")
trainer.train()
trainer.save_model(OUTPUT)
print(f"Adapter saved to {OUTPUT}")

Install dependencies first
Create a requirements.txt in the same folder:
torch>=2.2.0
transformers>=4.40.0
peft>=0.10.0
trl>=0.8.6
bitsandbytes>=0.43.0
datasets>=2.18.0
accelerate>=0.28.0
Submit to GhostNexus
You don't need to SSH into anything. Just call client.run() — GhostNexus uploads your script to an RTX 4090 node, installs your requirements.txt automatically, and streams the output back.
import ghostnexus

client = ghostnexus.Client(api_key="gn_live_YOUR_KEY")

job = client.run(
    "finetune_llama3.py",
    task_name="llama3-qlora-run1",
    # Pass your HuggingFace token as an env var (never hardcode it)
    env={"HF_TOKEN": "hf_YOUR_TOKEN"},
)
print(f"Job dispatched: {job.job_id}")

result = job.wait(timeout=3600)  # wait up to 1 hour
print(f"Status: {result.status}")
print(f"Duration: {result.duration_seconds:.0f}s")
print(f"Cost: ${result.cost_credits:.4f}")
print("\n--- Training output ---")
print(result.output)

Real Cost Breakdown
Here's what you can expect for a 1-epoch run on 1,000 instruction examples:
| Scenario | GPU | Duration | Cost |
|---|---|---|---|
| 1,000 examples, 1 epoch | RTX 4090 | ~22 min | $0.18 |
| 5,000 examples, 1 epoch | RTX 4090 | ~90 min | $0.75 |
| 10,000 examples, 3 epochs | RTX 4090 | ~6 hours | $3.00 |
| 50,000 examples, 3 epochs | A100 80GB | ~8 hours | $17.60 |
Billed per second. Costs include compute only — no startup fees, no minimum.
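Per-second billing makes the table easy to verify yourself: cost is just the hourly rate prorated over wall-clock time. A quick check of the first row, assuming the $0.50/hr RTX 4090 rate used throughout this guide:

```python
def job_cost(hourly_rate: float, minutes: float) -> float:
    # Per-second billing: prorate the hourly rate, no startup fee or minimum.
    return hourly_rate / 60.0 * minutes

# 1,000 examples, 1 epoch on an RTX 4090 at $0.50/hr, ~22 minutes:
print(f"${job_cost(0.50, 22):.2f}")  # → $0.18
```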
Retrieving the Trained Adapter
After training completes, the adapter weights are included in the job output logs. To get the actual binary files, modify the training script to upload to HuggingFace Hub at the end of the run — the simplest way to persist outputs:
# Upload adapter to HuggingFace Hub after training
from huggingface_hub import HfApi

api = HfApi(token=HF_TOKEN)
api.upload_folder(
    folder_path=OUTPUT,
    repo_id="YOUR_HF_USERNAME/llama3-my-adapter",
    repo_type="model",
)
print("Adapter uploaded to HuggingFace Hub")

Tips to Keep Costs Down
1. Test on a tiny dataset first. Run 50 examples, 1 epoch. Costs ~$0.01 but catches import errors and shape mismatches before the full run.
2. Use gradient checkpointing if you run out of VRAM: add gradient_checkpointing=True to TrainingArguments. Halves VRAM usage at ~20% speed cost.
3. Lower LoRA rank for faster experimentation. r=8 is half the adapter VRAM and trains faster — good for early iterations. Use r=64 only for the final production run.
4. Cache your dataset. If you're iterating on hyperparameters, pre-tokenize your dataset and serialize it to disk so you're not spending compute time on preprocessing.
Try it with $15 free credits
That's enough for 30+ fine-tuning experiments on Llama 3 8B. No credit card required to start.
Start fine-tuning free →

Code WELCOME15 applied automatically.