
Fine-tuning an LLM for Under $20: The Practical 2026 Guide

10 min read · By the GhostNexus team

Fine-tuning a Large Language Model is often presented as an expensive operation reserved for large enterprises. In 2026, that's simply not true. With LoRA and QLoRA techniques and a rented GPU, you can specialize Mistral 7B or LLaMA 3.1 8B for your domain for under $2 — and well under $20 even for longer runs.

$1.20 · Total cost for 4h of fine-tuning
GPU: RTX 3090 · 24 GB VRAM · $0.30/hr
Model: Mistral 7B or LLaMA 3.1 8B
Technique: LoRA / QLoRA (4-bit)

Choosing the right base model

The first decision is choosing the foundation model. In 2026, two families dominate for low-cost fine-tuning:

Mistral 7B (and its derivatives)

Mistral 7B still offers one of the best performance-to-size ratios available in open weights. Its 32k-token context window and Grouped-Query Attention architecture make it particularly efficient for both inference and fine-tuning. Mistral-7B-Instruct-v0.3 is an excellent starting point for adapting a conversational assistant to your domain; for classification or extraction tasks, the base model Mistral-7B-v0.1 is preferable.

LLaMA 3.1 8B

Meta's LLaMA 3.1 8B is slightly more memory-hungry than Mistral 7B but performs better on reasoning and code benchmarks. It's available on Hugging Face after accepting Meta's license (free). The Instruct version is optimized for dialogue; the Base version is better suited to completion or supervised fine-tuning tasks.

Our recommendation: start with Mistral 7B Instruct for a domain assistant or chatbot. Choose LLaMA 3.1 8B if your task involves multi-step reasoning or structured code generation.

Hardware requirements: why the RTX 3090 is more than enough

An RTX 3090 has 24 GB of VRAM — exactly what's needed to fine-tune a 7-8B model with QLoRA (4-bit quantization) or standard LoRA. Here's the memory breakdown:

| Configuration | Model VRAM | Total VRAM |
| --- | --- | --- |
| Mistral 7B · QLoRA 4-bit · batch 4 | ~5 GB | ~14 GB ✓ |
| LLaMA 3.1 8B · QLoRA 4-bit · batch 4 | ~6 GB | ~16 GB ✓ |
| Mistral 7B · LoRA bf16 · batch 8 | ~14 GB | ~22 GB ✓ |

For larger models, you'll want an RTX 4090 (same 24 GB, but noticeably faster, and enough for a 13B model with QLoRA) or multiple GPUs in parallel for 34B. But for the vast majority of domain use cases, a fine-tuned 7-8B model outperforms a larger, unspecialized one.
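As a sanity check before renting a GPU, you can sketch the arithmetic behind the table above. This back-of-envelope estimator is an illustration, not a measurement: the ~0.5 byte per parameter for 4-bit weights and ~16 bytes per LoRA parameter (bf16 weights plus Adam state) are standard rules of thumb, and `overhead_gb` is a rough allowance we assume for activations, gradients, and the CUDA context at batch 4:

```python
def qlora_vram_estimate_gb(params_billion: float,
                           lora_params_million: float = 42.0,
                           overhead_gb: float = 10.0) -> float:
    """Back-of-envelope VRAM estimate for QLoRA fine-tuning.

    - 4-bit (NF4) base weights cost ~0.5 byte per parameter.
    - Each LoRA parameter carries bf16 weights plus optimizer
      state: roughly 16 bytes in total.
    - overhead_gb is a coarse allowance for activations,
      gradients, and the CUDA context.
    """
    base_gb = params_billion * 1e9 * 0.5 / 1e9   # quantized base weights
    lora_gb = lora_params_million * 1e6 * 16 / 1e9  # adapters + Adam state
    return base_gb + lora_gb + overhead_gb

print(f"Mistral 7B:   ~{qlora_vram_estimate_gb(7.2):.0f} GB")
print(f"LLaMA 3.1 8B: ~{qlora_vram_estimate_gb(8.0):.0f} GB")
```

Both estimates land comfortably under the RTX 3090's 24 GB, consistent with the table.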

Cost breakdown: 4h on RTX 3090 = $1.20

The RTX 3090 rate on GhostNexus is $0.30/hr. A LoRA fine-tuning of Mistral 7B on a dataset of 10,000 examples of 512 tokens, with 3 epochs, takes approximately 3.5–4 hours on an RTX 3090:

GPU: RTX 3090 · $0.30/hr
Estimated duration (3 epochs, 10k examples): 4h
Total cost: $1.20

For a larger dataset (100k examples) or more epochs, costs stay very reasonable: 5–10h of compute is $1.50–$3.00. Well under $20 in all common scenarios.

For comparison, the same job on an AWS p3.2xlarge (V100 16 GB) would cost ~$3.06/hr — over $12 for the same run. On Google Colab Pro+, GPU quota is limited and unpredictable. On GhostNexus, resources are available immediately, billed to the second, with no quota.
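Per-second billing makes the arithmetic above easy to check. A minimal sketch (the $0.30/hr rate is the one quoted above; the function itself is just illustration):

```python
def job_cost(hours: float, rate_per_hour: float = 0.30) -> float:
    """Cost of a GPU session billed per second at an hourly rate."""
    seconds = hours * 3600
    return round(seconds * (rate_per_hour / 3600), 2)

print(job_cost(4.0))   # 4h LoRA run on an RTX 3090 → 1.2
print(job_cost(10.0))  # longer run (100k examples or extra epochs) → 3.0
```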

Step-by-step: fine-tuning on GhostNexus with the Python SDK

Step 1 — Install the GhostNexus SDK

terminal
pip install ghostnexus-sdk
# or with poetry:
poetry add ghostnexus-sdk

Step 2 — Configure authentication

Create your account at ghostnexus.net/login and retrieve your API key from the dashboard. Export it as an environment variable:

terminal
export GHOSTNEXUS_API_KEY="gn_sk_xxxxxxxxxxxxxxxxxxxx"

Step 3 — Launch LoRA fine-tuning

Here is a complete Python script that launches a LoRA fine-tuning of Mistral 7B Instruct on GhostNexus, handling GPU provisioning, model loading, and training:

fine_tune.py
import ghostnexus as gn
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch

# -- 1. Connect to GhostNexus and reserve GPU --
client = gn.Client()  # reads GHOSTNEXUS_API_KEY from environment

session = client.sessions.create(
    gpu_type="RTX_3090",   # 24 GB VRAM, $0.30/hr
    region="eu-west",      # EU-localized data
    disk_gb=50,
)
print(f"Session started: {session.id} | GPU: {session.gpu_type}")
print(f"Rate: {session.price_per_hour} $/hr")

# -- 2. Load base model with 4-bit quantization --
model_name = "mistralai/Mistral-7B-Instruct-v0.3"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# -- 3. LoRA configuration --
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~42M of ~7.2B total (~0.6%) — very fast to train

# -- 4. Load dataset --
dataset = load_dataset("json", data_files={"train": "data/train.jsonl"})

def format_prompt(example):
    # Append the EOS token so the model learns to stop after its answer
    return {
        "text": f"[INST] {example['instruction']} [/INST] {example['output']}{tokenizer.eos_token}"
    }

dataset = dataset.map(format_prompt)

# -- 5. Training --
training_args = TrainingArguments(
    output_dir="./outputs/mistral-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
    optim="paged_adamw_8bit",
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=training_args,
    peft_config=lora_config,
)

trainer.train()

# -- 6. Save LoRA adapters --
model.save_pretrained("./outputs/mistral-7b-lora-adapter")
tokenizer.save_pretrained("./outputs/mistral-7b-lora-adapter")
print("Fine-tuning complete. Adapters saved.")

# -- 7. Download artifacts and stop session --
session.download_artifacts(
    remote_path="./outputs/",
    local_path="./local_model/",
)
session.stop()
print(f"Total session cost: {session.total_cost:.4f} $")

Step 4 — Prepare your dataset

The expected format for instruction fine-tuning (SFT) is a JSONL file with instruction/response pairs. Each line is a JSON object:

data/train.jsonl
{"instruction": "Summarize this contract in 3 key points.", "output": "1. Duration: 12 months renewable..."}
{"instruction": "Identify the termination clauses.", "output": "Article 14 — Termination for cause..."}
{"instruction": "Translate this passage into plain language.", "output": "In plain terms, this means that..."}

For quality fine-tuning, aim for 1,000 to 10,000 well-crafted examples. Data quality beats quantity — 500 perfectly formatted examples often outperform 5,000 noisy ones.
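A single malformed line can skew or crash a run you're paying for, so it's worth validating the JSONL locally before launching. A small checker along these lines (the required keys match the format shown above; the function name is our own):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}

def validate_jsonl(path: str) -> list[str]:
    """Return a list of problems found in an instruction/output JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are harmless, skip them
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {i}: invalid JSON")
                continue
            if not isinstance(row, dict):
                problems.append(f"line {i}: not a JSON object")
                continue
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                problems.append(f"line {i}: missing keys {sorted(missing)}")
            elif not str(row["instruction"]).strip() or not str(row["output"]).strip():
                problems.append(f"line {i}: empty instruction or output")
    return problems
```

An empty list means the file is ready to upload.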

Step 5 — Evaluate and merge the model

LoRA adapters are lightweight (~100–300 MB). For production deployment, you can merge them with the base model or load them dynamically at inference via PEFT. Merging produces a standalone model compatible with vLLM, Ollama, or llama.cpp:

merge_model.py
from peft import AutoPeftModelForCausalLM
import torch

model = AutoPeftModelForCausalLM.from_pretrained(
    "./outputs/mistral-7b-lora-adapter",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Merge LoRA weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./outputs/mistral-7b-merged")
print("Merged model saved — ready for vLLM or Ollama.")
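One detail that's easy to miss: the merged model only behaves as fine-tuned when inference prompts reuse the exact template seen during training. A tiny helper (our own, mirroring the [INST] wrapper from format_prompt in the training script) keeps the two in sync:

```python
def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the same [INST] template used at
    training time; the model completes the text after [/INST]."""
    return f"[INST] {instruction.strip()} [/INST]"

print(build_prompt("Identify the termination clauses."))
# → [INST] Identify the termination clauses. [/INST]
```

Whether you serve the merged model with vLLM, Ollama, or llama.cpp, route every request through the same wrapper.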

Tips to stay under $20

  • Always use QLoRA (4-bit) for 7-8B models. The quality loss is negligible compared to the VRAM and compute savings.
  • Cap sequence length to what you actually need. Going from max_seq_length=4096 to 2048 roughly halves activation memory, and cuts attention memory even more, since it scales quadratically with sequence length.
  • Run short validation passes before the full training run: 100 steps to verify metrics are converging, then restart for the full duration.
  • Save only LoRA adapters, not the full model at every checkpoint — this saves storage and transfer time.
  • Stop the session immediately after downloading your artifacts. GhostNexus bills per second, not per hour.
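The short-validation-pass tip can be made concrete. The sketch below (plain dicts, our own naming) derives a 100-step probe from the full-run settings; with the transformers Trainer, max_steps takes precedence over num_train_epochs, which is what makes the probe cheap:

```python
FULL_RUN = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-4,
    "save_strategy": "epoch",
    "logging_steps": 50,
}

def probe_config(full: dict, steps: int = 100) -> dict:
    """Derive a throwaway smoke-test config from the full-run settings."""
    cfg = dict(full)                    # leave the full-run config untouched
    cfg.pop("num_train_epochs", None)   # max_steps replaces epoch counting
    cfg.update(
        max_steps=steps,
        save_strategy="no",             # no checkpoints for a throwaway run
        logging_steps=10,               # log often enough to judge the loss curve
    )
    return cfg

print(probe_config(FULL_RUN))
```

If the loss is converging after 100 steps, relaunch with the full settings; if not, you've spent cents instead of dollars finding out.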

Launch your first fine-tuning today

Create your GhostNexus account for free and get starter credits to test your first GPU job. RTX 3090 available immediately, billed per second, EU-localized data.

Questions about setup or your use case? contact@ghostnexus.net