Exoscale’s Dedicated Inference
Exoscale Dedicated Inference is a fully managed inference service that lets you turn any LLM into a production-ready API endpoint in minutes, without the complexity of managing infrastructure or scaling manually.

With dedicated NVIDIA GPUs, predictable performance, and fully sovereign European hosting, you get everything needed to run real-time inference, RAG pipelines, agent workloads, and enterprise AI applications.

Why Deploy Your Inference on Exoscale?

Dedicated, Predictable Performance for AI Workloads

Run your models on dedicated NVIDIA GPUs. Your inference workloads get consistent throughput and ultra-low latency, making them ideal for real-time AI services such as chat assistants, agent-based workloads, and RAG pipelines.

Fully Managed Service, Zero Infrastructure Overhead

No need for Kubernetes, Docker, load balancers, or GPU lifecycle management. You focus on product and AI logic, while we operate everything else behind the scenes.

OpenAI-Compatible API, No Vendor Lock-In

Use the familiar OpenAI API format to integrate your model instantly. No code changes, no vendor lock-in: your applications work out of the box on a sovereign European cloud.
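Because the endpoint speaks the OpenAI chat-completions format, a plain HTTP client is enough to call it. A minimal Python sketch, assuming placeholder values for the endpoint URL, API key, and model name (substitute the real ones reported by `exo dedicated-inference deployment show`):

```python
import json
from urllib import request

# Placeholders: replace with the endpoint URL and API key of your deployment.
ENDPOINT = "https://your-deployment.example/v1/chat/completions"
API_KEY = "your-api-key"

# Standard OpenAI chat-completions request body. Because the API is
# OpenAI-compatible, any existing OpenAI client or SDK works unchanged.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# request.urlopen(req) would send the request; it is left out here because
# the endpoint above is a placeholder.
print(req.get_method())  # POST (urllib infers POST when data is set)
```

Existing OpenAI SDK clients work the same way: point their base URL at the deployment endpoint and keep the rest of the integration unchanged.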

Transparent Pricing

The managed Dedicated Inference service layer is included at no extra cost. You pay only for the GPU time you use and for storage, billed per second with no hidden fees or surprise bills. Calculate your price.

Secure & Sovereign by Design

Your data never leaves Europe. The service is hosted entirely in European, GDPR-compliant data centers, making it ideal for finance, healthcare, government, and other regulated industries.

Yours exclusively

Fully dedicated, down to logs and GPU memory: only your data and your users. Want to go further? Bring your own domain, such as inference.company.com (coming soon).

Deploying Your Model: 3-Step Process

1. Load the Model

Point to any model on Hugging Face, public, gated, or private. We securely cache it in Object Storage for instant access.

exo dedicated-inference model create...
2. Launch the Deployment

Select your GPU size and hit deploy. We handle the provisioning, drivers, and orchestration automatically.

exo dedicated-inference deployment create...
3. Get the Endpoint & Integrate

Receive your OpenAI-compatible API key instantly. Start sending inference requests immediately.

exo dedicated-inference deployment show...

Recommendations for optimal performance

Choosing the right GPU is essential for fast, cost-efficient inference. These recommended configurations balance performance and price to help you scale reliably at every stage, from prototyping to production.

Model                                          Suggested GPU                         GPU price per hour
openai/gpt-oss-20b                             GPU A5000                             €1.34
openai/gpt-oss-120b                            4x GPU A5000 or 1x GPU RTX Pro 6000   €4.55 or €2.15
QuantTrio/Qwen3-Coder-480B-A35B-Instruct-AWQ   4x GPU RTX Pro 6000                   €15.97

Designed for the AI Workloads That Matter

Exoscale Dedicated Inference is built for production at any scale, from small teams testing new AI capabilities to enterprises operating business-critical AI.

Real-Time Chatbots & AI Assistants

Deploy conversational AI systems that respond instantly, with low latency and predictable performance. Ideal for customer support, HR helpdesks, and banking assistants that handle thousands of parallel conversations.

Recommendation Engines & Personalization

Deliver personalized experiences that drive revenue. Build product recommendation systems for e-commerce, personalized content feeds, and dynamic inventory optimization.

Retrieval-Augmented Generation (RAG)

Combine LLMs with your private data for smarter outputs. Build legal, financial, or technical assistants trained on internal documents without ever sending data to public AI services.
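The retrieval half of a RAG pipeline can be sketched in a few lines: score your private documents against the query embedding and prepend the best match to the prompt. The vectors and file names below are toy values; in practice the embeddings come from a model served on your endpoint.

```python
# Toy RAG retrieval: pick the most relevant internal document by cosine
# similarity, then build an augmented prompt for the LLM.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy embeddings; real ones come from an embedding model.
documents = {
    "expense-policy.md": [0.9, 0.1, 0.0],
    "oncall-runbook.md": [0.1, 0.8, 0.2],
}
query_vec = [0.85, 0.15, 0.05]  # embedding of the user's question

best = max(documents, key=lambda name: cosine(query_vec, documents[name]))
prompt = f"Using only {best}, answer: what is the travel expense limit?"
print(best)  # expense-policy.md
```

The retrieved document never leaves your infrastructure; only the augmented prompt is sent to your own dedicated endpoint.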

AI Embedding, Search, and Vectorization

Generate embeddings for semantic search, similarity matching, and knowledge graph building. Power next-gen search engines that understand context and meaning, detect duplicates, and enable clustering based on vector representations.
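As a sketch of embedding-based duplicate detection: two texts are flagged as duplicates when their vectors point in nearly the same direction. The vectors and the 0.95 threshold below are illustrative, not tuned values.

```python
# Duplicate detection with embeddings: texts whose vectors are nearly
# parallel (cosine similarity above a threshold) are treated as duplicates.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

THRESHOLD = 0.95  # tune per embedding model and domain

vec_a = [0.70, 0.30, 0.10]  # e.g. "reset my password"
vec_b = [0.69, 0.31, 0.12]  # e.g. "how do I reset my password?"
vec_c = [0.05, 0.10, 0.90]  # e.g. "invoice for March"

is_dup_ab = cosine(vec_a, vec_b) > THRESHOLD
is_dup_ac = cosine(vec_a, vec_c) > THRESHOLD
print(is_dup_ab, is_dup_ac)  # True False
```

The same similarity scores also drive clustering and semantic search: rank all candidates by cosine similarity instead of applying a single threshold.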

Combine Dedicated Inference with...

…other Exoscale products. It integrates easily with GPU Cloud Computing, Managed PGVector, and Managed Vector Search.

GPU Cloud Computing

NVIDIA-powered GPU instances for machine learning, data processing, and compute-intensive workloads.

Managed PGVector

PostgreSQL with pgvector extension. Perfect for hybrid workloads needing both relational and vector data.

Managed Vector Search

OpenSearch-based vector database. Optimized for pure AI workloads and large-scale semantic search.


Get Started with Dedicated Inference Today

Access to Dedicated Inference is currently limited to preview participants. To get started, please contact our team and request access.

Contact us

Explore More Exoscale Services

Expand your infrastructure with services that boost availability, optimize performance, and provide expert support for all your workloads. Exoscale offers everything you need to scale your projects successfully.

Scalable Kubernetes Service

Deploy containerized applications on a production-ready Kubernetes cluster in under two minutes. Use SKS as the control layer for your virtual machine instances, with support for CLI, API, Terraform, and other DevOps tools.

Simple Object Storage

Use a highly scalable and S3-compatible storage solution for unstructured data. Ideal for storing backups, logs, static assets, or media, fully integrated with Exoscale regions and access-controlled via API.

Support Plans

Get the help you need to run your infrastructure with confidence through flexible support plans, designed to provide expert guidance, faster response times, and dedicated assistance tailored to your business.


Frequently Asked Questions about Dedicated Inference

How do I handle scaling when traffic spikes?

Scaling is controlled through a simple API call: adjust the replicas parameter to instantly deploy more inference engines and handle the increased load.

Which GPU types can I use with Dedicated Inference?

You can run models on A30, A40, A5000, 3080 Ti, and RTX Pro 6000 GPUs in the zones where they are offered. Check availability and pricing here.

How does the pricing work?

Pricing is simple and transparent. You pay per second for GPU compute usage (billing starts when the model is ready and stops when scaled to zero). Additionally, standard Object Storage (SOS) rates apply for storing your cached models. The managed service layer is included at no extra cost. You can calculate your costs by using the Exoscale Calculator.
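For a rough sense of per-second billing, take the A5000 rate of €1.34 per hour quoted above (Object Storage fees for the cached model are extra and not modelled here):

```python
# Per-second GPU billing sketch, assuming the A5000 hourly rate quoted on
# this page (EUR 1.34/h). Object Storage fees are not included.
price_per_hour = 1.34
price_per_second = price_per_hour / 3600

# Example: the deployment serves traffic for 45 minutes, then scales to zero.
seconds_billed = 45 * 60
gpu_cost = price_per_second * seconds_billed
print(f"EUR {gpu_cost:.3f}")  # EUR 1.005
```

Once scaled to zero, the per-second GPU meter stops and only the small storage cost for the cached model remains.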

What is the minimum cost?

There are no upfront fees. The minimum cost is simply the GPU usage time (billed per second) plus the minimal standard Object Storage (SOS) fees for caching your model files. If you scale to zero, you only pay for storage.

Where is the service available?

Dedicated Inference is available in the Exoscale zones that currently offer GPU servers, such as AT-VIE-2, CH-GVA-2, DE-FRA-1, and HR-ZAG-1, keeping your data local and compliant.

Can I scale the service to zero?

Dedicated Inference supports manually scaling deployments down to zero, so you can stop GPU usage when the service is idle. Autoscaling capabilities are coming soon; please contact us to learn more about our roadmap and upcoming capabilities.