Stable Diffusion Cloud GPU 2026: Run SDXL & Flux Without a Local GPU
SDXL needs 8 GB VRAM. Flux.1-dev needs 16 GB. Buying a GPU that sits idle 90% of the time makes no sense. Here is how to run the best open-source image models on cloud GPUs for fractions of a cent per image — GDPR-compliant, no setup, pay per second.
TL;DR
- SDXL runs well on 8 GB VRAM (RTX 3080 class); Flux.1-dev needs 16 GB, or 8 GB with fp8 quantization
- Cloud GPU costs: $0.50–$0.74/hr for an RTX 4090, so a 1,000-image SDXL batch costs under $0.70
- EU image studios need GDPR-compliant providers, and most GPU clouds don't qualify
- GhostNexus offers RTX 4090s at $0.50/hr, EU-hosted, DPA-ready, billed per second
VRAM Requirements by Model
Your choice of model determines the GPU you need. Here's the minimum VRAM for each major image model in 2026 (float16 unless noted):
| Model | Min VRAM | Batch size | Recommended GPU |
|---|---|---|---|
| SD 1.5 | 4 GB | 1–4 at 512px | RTX 3070 |
| SDXL 1.0 | 8 GB | 1–2 at 1024px | RTX 3080 |
| SDXL + ControlNet | 12 GB | 1 at 1024px | RTX 3080 Ti / 4070 |
| Flux.1-schnell | 12 GB | 1–2 at 1024px | RTX 4070 / 3090 |
| Flux.1-dev | 16 GB | 1 at 1024px | RTX 4080 / A100 |
| Flux.1-dev (fp8) | 8 GB | 1 at 1024px | RTX 3080 |
| Stable Video Diffusion | 20 GB | 1 video | RTX 3090 / A100 |
RTX 4090 (24 GB VRAM) covers all models above without quantization. It's the default GPU on GhostNexus at $0.50/hr.
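If you're unsure which tier a given card falls into, you can query total VRAM at runtime with PyTorch's standard CUDA API and pick a model accordingly. A small sketch, whose thresholds simply mirror the table above:

```python
# Query total VRAM on the current GPU and map it to a model tier from the table.
import torch

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")

if vram_gb >= 16:
    tier = "Flux.1-dev (float16/bf16)"
elif vram_gb >= 12:
    tier = "Flux.1-schnell or SDXL + ControlNet"
elif vram_gb >= 8:
    tier = "SDXL, or Flux.1-dev with fp8 weights"
else:
    tier = "SD 1.5"
print("Suggested tier:", tier)
```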
Cost Per Image on Cloud GPU
An RTX 4090 generates SDXL images in roughly 4–6 seconds each (20 steps, DPM++ 2M). At $0.50/hr, that works out to $0.0006–$0.0008 per image, so a batch of 1,000 SDXL images costs under $0.70 on GhostNexus. Compare that to Midjourney Pro at $60/month for 1,000 images, or DALL-E 3 at $0.04–$0.08 per image.
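The arithmetic behind those numbers, as a tiny sketch (the helper name is ours):

```python
# Back-of-envelope cost for per-second GPU billing.
def batch_cost_usd(images: int, sec_per_image: float, usd_per_hour: float) -> float:
    return images * sec_per_image * usd_per_hour / 3600

print(f"${batch_cost_usd(1000, 5.0, 0.50):.2f}")  # $0.69 for 1,000 SDXL images at 5 s each
```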
GPU Cloud Provider Comparison (Image Generation)
| Provider | GPU | Price | GDPR | Notes |
|---|---|---|---|---|
| GhostNexus | RTX 4090 (24 GB) | $0.50/hr | Yes | EU-hosted, pay-per-second |
| RunPod | RTX 4090 | $0.74/hr | Partial | Community = no EU DPA |
| Vast.ai | RTX 4090 | $0.35–0.55/hr | No | Random hosts, no compliance |
| Lambda Labs | A10 (24 GB) | $0.60/hr | No | US-only data centers |
| Google Colab Pro+ | A100 (40 GB) | $57/mo flat | No | Session limits, queued access |
| Paperspace | A100 (40 GB) | $3.09/hr | No | US and EU nodes available |
Run SDXL on Cloud GPU: Full Script
Submit this script to GhostNexus with the Python SDK — it generates 4 images and prints the generation time:
```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import time

# Load SDXL with optimized scheduler
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()

prompts = [
    "A cyberpunk city at night, neon lights, ultra detailed, 8k",
    "Portrait of an astronaut on Mars, cinematic, Hasselblad",
    "Ancient forest waterfall, golden hour, photorealistic",
    "Abstract geometric art, vibrant colors, minimalist",
]

t0 = time.perf_counter()
for i, prompt in enumerate(prompts):
    image = pipe(
        prompt=prompt,
        num_inference_steps=20,
        guidance_scale=7.5,
        width=1024, height=1024,
    ).images[0]
    image.save(f"/tmp/output_{i}.png")
    print(f"Image {i+1}/4 done — {time.perf_counter()-t0:.1f}s elapsed")

total = time.perf_counter() - t0
print(f"\n4 images in {total:.1f}s ({total/4:.1f}s/image)")
print(f"Estimated cost: ${total/3600 * 0.50:.5f}")
```

Submit it with the SDK:

```python
import ghostnexus

client = ghostnexus.Client()
job = client.run("sdxl_batch.py", task_name="sdxl-batch-4")
for chunk in job.stream_logs():
    print(chunk, end="", flush=True)

# Output:
# Image 1/4 done — 5.2s elapsed
# Image 2/4 done — 10.4s elapsed
# Image 3/4 done — 15.7s elapsed
# Image 4/4 done — 20.9s elapsed
#
# 4 images in 20.9s (5.2s/image)
# Estimated cost: $0.00290
```

Running Flux.1 on Cloud GPU
Flux.1-dev produces significantly better image quality than SDXL but requires more VRAM and takes longer per image. On an RTX 4090 (24 GB):
- Flux.1-schnell: 4 inference steps, ~25s/image, best quality/speed ratio. Cost: $0.0035/image.
- Flux.1-dev: 20–50 steps, ~100s/image, maximum quality. Cost: $0.014/image.
- Flux.1-dev (fp8 quantized): runs in 8 GB VRAM on an RTX 3080, ~150s/image, minor quality loss (see the sketch below).
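For the fp8 route, one common approach is weight-only fp8 quantization with optimum-quanto. A minimal sketch, assuming that package is installed (actual memory use and speed vary by setup):

```python
# Sketch: fp8 weight quantization of Flux.1-dev via optimum-quanto (assumed installed).
import torch
from diffusers import FluxPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
quantize(pipe.transformer, weights=qfloat8)  # quantize the DiT backbone weights
freeze(pipe.transformer)
pipe.enable_model_cpu_offload()  # offload idle modules to keep peak VRAM low
```

The schnell variant needs no quantization on a 24 GB card; the script below runs it as-is: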
```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

image = pipe(
    "A hyperrealistic portrait of a woman in Renaissance style",
    num_inference_steps=4,  # schnell = 4 steps
    guidance_scale=0.0,
    height=1024, width=1024,
).images[0]
image.save("/tmp/flux_output.png")
```

GDPR and Image Generation
If you generate images of real people, use training data from EU users, or run an image generation service for EU customers, GDPR applies to your GPU infrastructure.
Most GPU cloud providers process data on US servers with no valid EU Standard Contractual Clauses (SCCs). That creates legal exposure under GDPR Art. 44–49 (international transfers).
GhostNexus runs on EU infrastructure (Frankfurt), signs a Data Processing Agreement on request, and maintains a sub-processor list in compliance with GDPR Art. 28.
ControlNet and img2img on Cloud
ControlNet requires an additional 1–2 GB VRAM on top of the base model. On an RTX 4090, SDXL + ControlNet generates images in 8–12 seconds at 1024×1024 (a minimal example follows the list below):
- Canny edge, depth, pose, and tile ControlNets all work out of the box
- Upload your source images via `inline=True` with base64-encoded data
- Output images are returned in job logs as base64 or saved to a persistent volume (coming Q3 2026)
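A minimal SDXL + Canny ControlNet sketch with diffusers. The model IDs are the public Hugging Face checkpoints; the source image path and prompt are placeholders, and opencv-python is assumed for edge detection:

```python
# Sketch: SDXL + Canny ControlNet. Source image path is a placeholder.
import torch
import numpy as np
import cv2
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Build a Canny edge map from the source image as the conditioning input
source = np.array(Image.open("/tmp/source.png").convert("RGB"))
edges = cv2.Canny(source, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    "A futuristic building, golden hour, photorealistic",
    image=control_image,
    num_inference_steps=20,
    controlnet_conditioning_scale=0.7,
).images[0]
image.save("/tmp/controlnet_output.png")
```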
Try It Now — $15 Free Credits
Sign up and run your first SDXL or Flux.1 batch in under 5 minutes. No credit card required for the free credits.
Use code WELCOME15 at signup
Frequently Asked Questions
Can I use my own custom fine-tuned SDXL model?
Yes. Upload your model weights as part of your script or download them from Hugging Face at job start. The container has internet access during setup only (outbound HTTPS for model downloads).
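In practice that looks something like the sketch below; the repo and file names are placeholders:

```python
# Load a custom SDXL fine-tune at job start (names are placeholders).
import torch
from diffusers import StableDiffusionXLPipeline

# Option 1: pull from a (possibly private) Hugging Face repo during setup
pipe = StableDiffusionXLPipeline.from_pretrained(
    "your-username/your-sdxl-finetune", torch_dtype=torch.float16
)

# Option 2: load a single .safetensors checkpoint shipped with the job
# pipe = StableDiffusionXLPipeline.from_single_file("my_finetune.safetensors")
```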
How do I get the generated images back?
Encode images as base64 in your script and print them to stdout — they appear in the job output logs. Persistent storage and S3-compatible output volumes are on the Q3 2026 roadmap.
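For example, at the end of a generation script (a sketch; the IMG_B64 prefix is just a log-parsing convention we made up):

```python
# Emit a PIL image as base64 on stdout so it appears in the job output logs.
import base64
import io

def emit_image(image, tag: str) -> None:
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    print(f"IMG_B64 {tag}: {base64.b64encode(buf.getvalue()).decode()}")

# emit_image(pipe(...).images[0], "output_0")
```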
How long does it take to start a job?
Cold start (model download + container launch): 30–90 seconds for a 6 GB SDXL checkpoint. Subsequent jobs with cached models: under 10 seconds.
Can I run multiple GPU jobs at once?
Yes. Use the async client to dispatch jobs concurrently. Each job runs on a separate GPU node. There is no multi-GPU single-job support currently.
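A sketch of concurrent dispatch, under the assumption that the async client mirrors the sync API shown earlier; the AsyncClient name and awaitable run() are illustrative, not confirmed API:

```python
# Hypothetical sketch: dispatch several jobs concurrently.
# AsyncClient and the awaitable run() are assumed names, not confirmed API.
import asyncio
import ghostnexus

async def main():
    client = ghostnexus.AsyncClient()
    jobs = await asyncio.gather(
        *(client.run("sdxl_batch.py", task_name=f"sdxl-batch-{i}") for i in range(4))
    )
    print(f"Dispatched {len(jobs)} jobs, one GPU node each")

asyncio.run(main())
```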
Is ComfyUI or InvokeAI supported?
These tools require a UI/browser session, which is not supported. The platform runs Python scripts only. You can use their underlying pipelines (diffusers, comfy-script) directly in your script.