GhostNexus/Blog/Fine-tune Llama 3
GPU Guides·April 2026·12 min read

Fine-Tune Llama 3 on a Cloud GPU for Under $5

Fine-tuning Llama 3 8B used to mean paying $50+ for a reserved GPU instance and fighting with SSH, Docker, and driver versions. With QLoRA and per-second cloud billing, a full 1-epoch fine-tuning run now costs under $5. This guide walks through the complete setup — from dataset to trained adapter — using HuggingFace PEFT and GhostNexus.

TL;DR

  • Llama 3 8B + QLoRA fits in 12 GB VRAM (RTX 4070 or better)
  • 1 epoch on 1,000 examples takes ~20–30 minutes on an RTX 4090
  • At $0.50/hr that's $0.17–$0.25 per run
  • The trained LoRA adapter is ~100 MB — easy to ship

Prerequisites

  • Python 3.10+ locally (just to prepare the script)
  • HuggingFace account + access to meta-llama/Meta-Llama-3-8B
  • GhostNexus account with $5 of credits (get $15 free here)
  • pip install ghostnexus

Why QLoRA Makes This Cheap

Full fine-tuning of Llama 3 8B means storing all 8 billion parameters in fp16 — about 16 GB just for the weights, before you add gradients, optimizer state, and activations. You'd need an A100 80GB and hours of training time.

QLoRA (Quantized Low-Rank Adaptation) solves this in two ways: the base model loads in 4-bit quantization (~4 GB for 8B params), and only a small adapter is actually trained (on the order of 10M trainable parameters vs. 8B). The result: the whole setup fits in 12 GB of VRAM and trains roughly 10× faster.
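The weight-memory arithmetic is easy to check yourself. A quick sketch, assuming a round 8.0B parameter count (the real figure is slightly higher):

```python
# Back-of-the-envelope weight memory for Llama 3 8B (assuming a round
# 8.0B parameters; the real count is slightly higher).
params = 8_000_000_000

fp16_gb = params * 2 / 1e9    # fp16: 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9   # 4-bit NF4: 0.5 bytes per parameter

print(f"fp16 weights:  {fp16_gb:.1f} GB")  # → 16.0 GB
print(f"4-bit weights: {nf4_gb:.1f} GB")   # → 4.0 GB
```

This covers weights only — activations, gradients for the adapter, and optimizer state add a few more GB, which is why the practical floor is ~10–12 GB rather than 4.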

VRAM requirements with QLoRA

Model                     | VRAM (QLoRA 4-bit) | Recommended GPU   | Cost
--------------------------|--------------------|-------------------|----------------
Llama 3 8B                | ~10–12 GB          | RTX 4070 / 4080   | $0.25–$0.35/hr
Llama 3 8B (batch=4)      | ~18 GB             | RTX 4090 (24 GB)  | $0.50/hr
Llama 3 70B               | ~48 GB             | RTX A6000 (48 GB) | $0.70/hr
Llama 3 70B (full)        | ~140 GB            | A100 80GB ×2      | $4.40/hr

The Fine-Tuning Script

Save this as finetune_llama3.py. It uses the HuggingFace peft library for QLoRA and transformers for the model. The dataset here is a small instruction-following example — swap it for your own.

finetune_llama3.py (Python)
import os
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# ── Config ─────────────────────────────────────────────────────────────
MODEL_ID  = "meta-llama/Meta-Llama-3-8B-Instruct"
HF_TOKEN  = os.environ["HF_TOKEN"]        # set via GhostNexus env vars
OUTPUT    = "./llama3-finetuned-adapter"
MAX_LEN   = 512
EPOCHS    = 1
BATCH     = 4

# ── Quantization (loads model in 4-bit — fits in 12 GB VRAM) ───────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# ── Load model & tokenizer ─────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
)
model.config.use_cache = False

# ── LoRA adapter config ────────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,                        # rank — higher = more capacity, more VRAM
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~13.6M (≈0.17% of 8B total)

# ── Dataset ────────────────────────────────────────────────────────────
# Replace with your own instruction-response pairs. Note: these tags are
# illustrative only — Llama 3 Instruct defines its own chat template, so
# for best results format examples via tokenizer.apply_chat_template.
examples = [
    {"text": "<|system|>You are a helpful assistant.<|user|>What is gradient descent?<|assistant|>Gradient descent is an optimization algorithm that minimizes a loss function by iteratively moving in the direction of steepest descent. Each step size is controlled by the learning rate."},
    {"text": "<|system|>You are a helpful assistant.<|user|>Explain overfitting in one sentence.<|assistant|>Overfitting happens when a model memorizes training data so well that it fails to generalize to new, unseen examples."},
    # ... add your 1000+ examples here
]
dataset = Dataset.from_list(examples)

# ── Training ───────────────────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir=OUTPUT,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH,
    gradient_accumulation_steps=2,
    warmup_ratio=0.03,
    learning_rate=2e-4,
    bf16=True,                   # match bnb_4bit_compute_dtype above
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=MAX_LEN,
)

print("Starting fine-tuning...")
trainer.train()
trainer.save_model(OUTPUT)
print(f"Adapter saved to {OUTPUT}")
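If you want to sanity-check the trainable-parameter count before paying for a GPU, the LoRA arithmetic is simple enough to do by hand. A sketch, assuming Llama 3 8B's published shape (32 transformer layers, hidden size 4096, GQA key/value projection width 1024):

```python
# Hand-check of what print_trainable_parameters will report for r=16
# on q_proj, k_proj, v_proj, o_proj across all layers.
n_layers, d_model, d_kv, r = 32, 4096, 1024, 16

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds two small matrices per target module:
    # A is (d_in x rank), B is (rank x d_out)
    return rank * d_in + rank * d_out

per_layer = (
    lora_params(d_model, d_model, r)    # q_proj: 4096 -> 4096
    + lora_params(d_model, d_kv, r)     # k_proj: 4096 -> 1024 (GQA)
    + lora_params(d_model, d_kv, r)     # v_proj: 4096 -> 1024 (GQA)
    + lora_params(d_model, d_model, r)  # o_proj: 4096 -> 4096
)
total = per_layer * n_layers
print(f"{total:,} trainable LoRA params")  # → 13,631,488
```

That's roughly 0.17% of the base model's parameters — the same order of magnitude regardless of the exact dimensions, which is why the adapter stays so small on disk.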

Install dependencies first

Create a requirements.txt in the same folder:

requirements.txt
torch>=2.2.0
transformers>=4.40.0
peft>=0.10.0
trl>=0.8.6
bitsandbytes>=0.43.0
datasets>=2.18.0
accelerate>=0.28.0

Submit to GhostNexus

You don't need to SSH into anything. Just call client.run() — GhostNexus uploads your script to an RTX 4090 node, installs your requirements.txt automatically, and streams the output back.

submit.py (run this locally)
import ghostnexus

client = ghostnexus.Client(api_key="gn_live_YOUR_KEY")

job = client.run(
    "finetune_llama3.py",
    task_name="llama3-qlora-run1",
    # Pass your HuggingFace token as an env var (never hardcode it)
    env={"HF_TOKEN": "hf_YOUR_TOKEN"},
)

print(f"Job dispatched: {job.job_id}")
result = job.wait(timeout=3600)  # wait up to 1 hour

print(f"Status:   {result.status}")
print(f"Duration: {result.duration_seconds:.0f}s")
print(f"Cost:     ${result.cost_credits:.4f}")
print("\n--- Training output ---")
print(result.output)
Note on HuggingFace token: Llama 3 is a gated model. You need to request access at huggingface.co/meta-llama and create an access token in your HF account settings. Pass it as an environment variable — never embed it in your script.

Real Cost Breakdown

Here's what you can expect for a 1-epoch run on 1,000 instruction examples:

Scenario                  | GPU       | Duration | Cost
--------------------------|-----------|----------|--------
1,000 examples, 1 epoch   | RTX 4090  | ~22 min  | $0.18
5,000 examples, 1 epoch   | RTX 4090  | ~90 min  | $0.75
10,000 examples, 3 epochs | RTX 4090  | ~6 hours | $3.00
50,000 examples, 3 epochs | A100 80GB | ~8 hours | $17.60

Billed per second. Costs include compute only — no startup fees, no minimum.
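With per-second billing, estimating a run before submitting it is just duration × hourly rate. A throwaway helper (rates taken from the tables above) reproduces these figures:

```python
# Reproduce the cost table: cost = duration x hourly rate, with no
# startup fee or minimum (per-second billing).
def run_cost(minutes: float, rate_per_hour: float) -> float:
    return round(minutes / 60 * rate_per_hour, 2)

print(run_cost(22, 0.50))      # → 0.18  (1,000 examples, RTX 4090)
print(run_cost(90, 0.50))      # → 0.75
print(run_cost(6 * 60, 0.50))  # → 3.0
print(run_cost(8 * 60, 2.20))  # → 17.6  (single A100 80GB)
```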

Retrieving the Trained Adapter

After training completes, the job output stream gives you the training logs, but the adapter files themselves are written to disk on the remote node. The simplest way to persist them is to upload to the HuggingFace Hub at the end of the run — add this to the training script:

add to end of finetune_llama3.py
# Upload adapter to HuggingFace Hub after training
from huggingface_hub import HfApi

api = HfApi(token=HF_TOKEN)
api.upload_folder(
    folder_path=OUTPUT,
    repo_id="YOUR_HF_USERNAME/llama3-my-adapter",
    repo_type="model",
)
print("Adapter uploaded to HuggingFace Hub")

Tips to Keep Costs Down

  1. Test on a tiny dataset first. Run 50 examples, 1 epoch. It costs ~$0.01 but catches import errors and shape mismatches before the full run.
  2. Use gradient checkpointing if you run out of VRAM: add gradient_checkpointing=True to TrainingArguments. It halves VRAM usage at a ~20% speed cost.
  3. Lower the LoRA rank for faster experimentation. r=8 halves the adapter's footprint and trains faster — good for early iterations. Save r=64 for the final production run.
  4. Cache your dataset. If you're iterating on hyperparameters, pre-tokenize your dataset and serialize it to disk so you're not spending GPU time on preprocessing.
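For tip 4, the pattern is a load-or-build function keyed on a cache file. A minimal stdlib-only sketch with a stand-in tokenizer — in the real script you'd call the HF tokenizer and use datasets' save_to_disk / load_from_disk instead:

```python
# Sketch of dataset caching: tokenize once, reuse on every later run.
# fake_tokenize is a hypothetical stand-in for tokenizer(text)["input_ids"].
import json
from pathlib import Path

CACHE = Path("tokenized_cache.json")

def fake_tokenize(text: str) -> list[int]:
    # Stand-in for a real tokenizer call
    return [ord(c) % 256 for c in text]

def load_or_tokenize(texts: list[str]) -> list[list[int]]:
    if CACHE.exists():
        # Cache hit: skip tokenization entirely on repeat runs
        return json.loads(CACHE.read_text())
    tokenized = [fake_tokenize(t) for t in texts]
    CACHE.write_text(json.dumps(tokenized))
    return tokenized

ids = load_or_tokenize(["What is gradient descent?"])
print(len(ids), "example(s) ready")
```

On a hyperparameter sweep this means only the first run pays the preprocessing cost; every subsequent run starts training immediately.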

Try it with $15 free credits

That's enough for 30+ fine-tuning experiments on Llama 3 8B. No credit card required to start.

Start fine-tuning free →

Code WELCOME15 applied automatically