Choosing the right base model
The first decision is choosing the foundation model. In 2026, two families dominate for low-cost fine-tuning:
Mistral 7B (and its derivatives)
Mistral 7B remains one of the best performance-to-size ratios available in open weights. Its 32k-token context window and grouped-query attention make it particularly efficient to fine-tune and to serve. Mistral-7B-Instruct-v0.3 is an excellent starting point for adapting a conversational assistant to your domain; for classification or extraction tasks, the base Mistral-7B-v0.1 is preferable.
LLaMA 3.1 8B
Meta's LLaMA 3.1 8B is slightly more memory-hungry than Mistral 7B but performs better on reasoning and code benchmarks. It's available on Hugging Face after accepting Meta's free community license. The Instruct version is optimized for dialogue; the Base version, for completion or supervised fine-tuning tasks.
Our recommendation: start with Mistral 7B Instruct for a domain assistant or chatbot. Choose LLaMA 3.1 8B if your task involves multi-step reasoning or structured code generation.
Hardware requirements: why the RTX 3090 is more than enough
An RTX 3090 has 24 GB of VRAM — exactly what's needed to fine-tune a 7-8B model with QLoRA (4-bit quantization) or standard LoRA. Here's the memory breakdown:
| Configuration | Model VRAM | Total VRAM |
|---|---|---|
| Mistral 7B · QLoRA 4-bit · batch 4 | ~5 GB | ~14 GB ✓ |
| LLaMA 3.1 8B · QLoRA 4-bit · batch 4 | ~6 GB | ~16 GB ✓ |
| Mistral 7B · LoRA bf16 · batch 8 | ~14 GB | ~22 GB ✓ |
For larger models (13B, 34B), 24 GB gets tight: a 13B still fits with QLoRA on an RTX 4090 (same 24 GB, but faster), while anything bigger calls for two GPUs in parallel. For the vast majority of domain use cases, though, fine-tuned 7-8B models outperform larger unspecialized ones.
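The first column of the table can be sanity-checked with simple arithmetic: a model's weight footprint is parameter count times bytes per parameter. A minimal sketch (the helper `weight_vram_gb` is illustrative; real QLoRA usage adds quantization constants, adapters, gradients, and activations on top of this floor):

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """GB needed just to hold the weights: params x bits / 8 bits per byte."""
    return params_billion * bits_per_param / 8

print(weight_vram_gb(7, 4))   # Mistral 7B, 4-bit (nf4): 3.5 GB before overhead
print(weight_vram_gb(8, 4))   # LLaMA 3.1 8B, 4-bit: 4.0 GB before overhead
print(weight_vram_gb(7, 16))  # Mistral 7B in bf16: 14.0 GB, as in the table
```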
Cost breakdown: 4h on RTX 3090 = $1.20
The RTX 3090 rate on GhostNexus is $0.30/hr. A LoRA fine-tuning of Mistral 7B on a dataset of 10,000 examples of 512 tokens, with 3 epochs, takes approximately 3.5–4 hours on an RTX 3090, i.e. $1.05–$1.20 of compute.
For a larger dataset (100k examples) or more epochs, costs stay very reasonable: 5–10 h of compute runs $1.50–$3.00, well under the $20 mark in all common scenarios.
For comparison, the same job on an AWS p3.2xlarge (V100 16 GB) would cost ~$3.06/hr — over $12 for the same run. On Google Colab Pro+, GPU quota is limited and unpredictable. On GhostNexus, resources are available immediately, billed to the second, with no quota.
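These estimates are easy to reproduce: with an effective batch of 16 (batch 4 × gradient accumulation 4, as in the training script below), 10k examples over 3 epochs is 1,875 optimizer steps. A small sketch (`finetune_cost` is a hypothetical helper, and the ~7.5 s/step throughput is an assumption, not a measured benchmark):

```python
def finetune_cost(examples: int, epochs: int, batch_size: int,
                  grad_accum: int, secs_per_step: float,
                  rate_per_hour: float) -> tuple[int, float, float]:
    """Estimate optimizer steps, wall-clock hours, and cost for an SFT run."""
    steps = examples * epochs // (batch_size * grad_accum)
    hours = steps * secs_per_step / 3600
    return steps, hours, hours * rate_per_hour

# 10k examples x 3 epochs, effective batch 16, assumed ~7.5 s/step at $0.30/hr
steps, hours, cost = finetune_cost(10_000, 3, 4, 4, 7.5, 0.30)
print(f"{steps} steps, ~{hours:.1f} h, ~${cost:.2f}")  # → 1875 steps, ~3.9 h, ~$1.17
```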
Step-by-step: fine-tuning on GhostNexus with the Python SDK
Step 1 — Install the GhostNexus SDK
pip install ghostnexus-sdk
# or with poetry:
poetry add ghostnexus-sdk
Step 2 — Configure authentication
Create your account at ghostnexus.net/login and retrieve your API key from the dashboard. Export it as an environment variable:
export GHOSTNEXUS_API_KEY="gn_sk_xxxxxxxxxxxxxxxxxxxx"
Step 3 — Launch LoRA fine-tuning
Here is a complete Python script that launches a LoRA fine-tuning of Mistral 7B Instruct on GhostNexus, handling GPU provisioning, model loading, and training:
import ghostnexus as gn
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
import torch

# -- 1. Connect to GhostNexus and reserve GPU --
client = gn.Client()  # reads GHOSTNEXUS_API_KEY from environment
session = client.sessions.create(
    gpu_type="RTX_3090",  # 24 GB VRAM, $0.30/hr
    region="eu-west",     # EU-localized data
    disk_gb=50,
)
print(f"Session started: {session.id} | GPU: {session.gpu_type}")
print(f"Rate: {session.price_per_hour} $/hr")

# -- 2. Load base model with 4-bit quantization --
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships no pad token; reuse EOS
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# -- 3. LoRA configuration --
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: ~42M / 7B total (~0.6%) — very fast to train

# -- 4. Load dataset --
dataset = load_dataset("json", data_files={"train": "data/train.jsonl"})

def format_prompt(example):
    return {
        "text": f"[INST] {example['instruction']} [/INST] {example['output']}"
    }

dataset = dataset.map(format_prompt)

# -- 5. Training --
training_args = TrainingArguments(
    output_dir="./outputs/mistral-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
    optim="paged_adamw_8bit",
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=training_args,
    peft_config=lora_config,
)
trainer.train()

# -- 6. Save LoRA adapters --
model.save_pretrained("./outputs/mistral-7b-lora-adapter")
tokenizer.save_pretrained("./outputs/mistral-7b-lora-adapter")
print("Fine-tuning complete. Adapters saved.")

# -- 7. Download artifacts and stop session --
session.download_artifacts(
    remote_path="./outputs/",
    local_path="./local_model/",
)
session.stop()
print(f"Total session cost: {session.total_cost:.4f} $")
Step 4 — Prepare your dataset
The expected format for instruction fine-tuning (SFT) is a JSONL file with instruction/response pairs. Each line is a JSON object:
{"instruction": "Summarize this contract in 3 key points.", "output": "1. Duration: 12 months renewable..."}
{"instruction": "Identify the termination clauses.", "output": "Article 14 — Termination for cause..."}
{"instruction": "Translate this passage into plain language.", "output": "In plain terms, this means that..."}
For quality fine-tuning, aim for 1,000 to 10,000 well-crafted examples. Data quality beats quantity — 500 perfectly formatted examples often outperform 5,000 noisy ones.
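Before paying for GPU time, it's worth validating the JSONL locally. A minimal sketch (`validate_jsonl_lines` is an illustrative helper, checking only the two required keys from the format above):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}

def validate_jsonl_lines(lines):
    """Yield parsed records, raising on malformed JSON or missing keys."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises on invalid JSON
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        yield record

sample = [
    '{"instruction": "Summarize this contract in 3 key points.", "output": "1. Duration..."}',
    '{"instruction": "Identify the termination clauses.", "output": "Article 14..."}',
]
records = list(validate_jsonl_lines(sample))
print(f"{len(records)} valid examples")  # → 2 valid examples
# In practice: validate_jsonl_lines(open("data/train.jsonl", encoding="utf-8"))
```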
Step 5 — Evaluate and merge the model
LoRA adapters are lightweight (~100–300 MB). For production deployment, you can merge them with the base model or load them dynamically at inference via PEFT. Merging produces a standalone model compatible with vLLM, Ollama, or llama.cpp:
from peft import AutoPeftModelForCausalLM
import torch

model = AutoPeftModelForCausalLM.from_pretrained(
    "./outputs/mistral-7b-lora-adapter",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Merge LoRA weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./outputs/mistral-7b-merged")
print("Merged model saved — ready for vLLM or Ollama.")
Tips to stay under $20
- Always use QLoRA (4-bit) for 7-8B models. The quality loss is negligible compared to the VRAM and compute savings.
- Cap sequence length to what you actually need. Going from max_seq_length=4096 to 2048 nearly halves attention memory usage.
- Run short validation passes before the full training run: 100 steps to verify metrics are converging, then restart for the full duration.
- Save only LoRA adapters, not the full model at every checkpoint — this saves storage and transfer time.
- Stop the session immediately after downloading your artifacts. GhostNexus bills per second, not per hour.
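The third tip is easy to wire up: keep one base configuration and overlay a short smoke-run variant before committing to the full job. A sketch using plain dicts (the keys mirror `TrainingArguments`; note that `max_steps` caps the run regardless of the epoch count):

```python
FULL_RUN = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "save_strategy": "epoch",
}

# Overrides for a cheap validation pass: ~100 steps, no checkpoints.
SMOKE_RUN = {
    "max_steps": 100,        # hard cap: stop after 100 optimizer steps
    "save_strategy": "no",   # skip checkpoints entirely
    "logging_steps": 10,     # log often enough to see the loss trend
}

def build_args(smoke: bool = False) -> dict:
    """Merge the base config with smoke-run overrides when requested."""
    return {**FULL_RUN, **SMOKE_RUN} if smoke else dict(FULL_RUN)

args = build_args(smoke=True)
print(args["max_steps"], args["save_strategy"])
```

Once the smoke run's loss is trending down, relaunch with `build_args()` for the full job; the two configs can't drift apart because they share the same base dict.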