Exoscale’s Dedicated Inference
Exoscale Dedicated Inference is a fully managed inference service that lets you turn any LLM into a production-ready API endpoint in minutes, without the complexity of managing infrastructure or scaling manually.

With dedicated NVIDIA GPUs, predictable performance, and fully sovereign European hosting, you get everything needed to run real-time inference, RAG pipelines, agent workloads, and enterprise AI applications.

Why Deploy Your Inference on Exoscale?

Dedicated, Predictable Performance for AI Workloads

Run your models on dedicated NVIDIA GPUs. Your inference workloads get consistent throughput and ultra-low latency, making them ideal for real-time AI services such as chat assistants, agent-based workloads, and RAG pipelines.

Fully Managed Service, Zero Infrastructure Overhead

No need for Kubernetes, Docker, load balancers, or GPU lifecycle management. You focus on product and AI logic, while we operate everything else behind the scenes.

OpenAI-Compatible API, No Vendor Lock-In

Use the familiar OpenAI API format to integrate your model instantly. No code changes, no vendor lock-in: your applications work out of the box on a sovereign European cloud.
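Because the endpoint speaks the OpenAI chat-completions format, a plain HTTP client is enough to call it. A minimal Python sketch, assuming placeholder values for the endpoint URL, API key, and model name (substitute the real ones reported by `exo dedicated-inference deployment show`):

```python
import json
from urllib import request

# Placeholders: replace with the endpoint URL and API key of your deployment.
ENDPOINT = "https://your-deployment.example/v1/chat/completions"
API_KEY = "your-api-key"

# Standard OpenAI chat-completions request body. Because the API is
# OpenAI-compatible, any existing OpenAI client or SDK works unchanged.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# request.urlopen(req) would send the request; it is left out here because
# the endpoint above is a placeholder.
print(req.get_method())  # POST (urllib infers POST when data is set)
```

Existing OpenAI SDK clients work the same way: point their base URL at the deployment endpoint and keep the rest of the integration unchanged.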

Transparent Pricing

The managed Dedicated Inference service layer is included at no extra cost. You pay only for the GPU time you use and for storage, billed per second with no hidden fees or surprise bills. Calculate your price.

Secure & Sovereign by Design

Your data never leaves Europe. The service is hosted entirely in European, GDPR-compliant data centers, making it ideal for finance, healthcare, government, and other regulated industries.

Yours exclusively

Fully dedicated, down to logs and GPU memory: only your data and your users. Want to go further? Bring your own domain, such as inference.company.com (coming soon).

Deploying Your Model: 3-Step Process

1. Load the Model

Point to any model on Hugging Face, public, gated, or private. We securely cache it in Object Storage for instant access.

exo dedicated-inference model create...
2. Launch the Deployment

Select your GPU size and hit deploy. We handle the provisioning, drivers, and orchestration automatically.

exo dedicated-inference deployment create...
3. Get the Endpoint & Integrate

Receive your OpenAI-compatible API key instantly. Start sending inference requests immediately.

exo dedicated-inference deployment show...

Recommendations for optimal performance

Choosing the right GPU is essential for fast, cost-efficient inference. These recommended configurations balance performance and price to help you scale reliably at every stage, from prototyping to production.

Model                                          Suggested GPU                         GPU price per hour
openai/gpt-oss-20b                             GPU A5000                             €1.34
openai/gpt-oss-120b                            4x GPU A5000 or 1x GPU RTX Pro 6000   €4.55 or €2.15
QuantTrio/Qwen3-Coder-480B-A35B-Instruct-AWQ   4x GPU RTX Pro 6000                   €15.97

Designed for the AI Workloads That Matter

Exoscale Dedicated Inference is built for production at any scale, from small teams testing new AI capabilities to enterprises operating business-critical AI.

Real-Time Chatbots & AI Assistants

Deploy conversational AI systems that respond instantly, with low latency and predictable performance. Ideal for customer support, HR helpdesks, and banking assistants that handle thousands of parallel conversations.

Recommendation Engines & Personalization

Deliver personalized experiences that drive revenue. Build product recommendation systems for e-commerce, personalized content feeds, and dynamic inventory optimization.

Retrieval-Augmented Generation (RAG)

Combine LLMs with your private data for smarter outputs. Build legal, financial, or technical assistants trained on internal documents without ever sending data to public AI services.
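The retrieval half of a RAG pipeline can be sketched in a few lines: score your private documents against the query embedding and prepend the best match to the prompt. The vectors and file names below are toy values; in practice the embeddings come from a model served on your endpoint.

```python
# Toy RAG retrieval: pick the most relevant internal document by cosine
# similarity, then build an augmented prompt for the LLM.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy embeddings; real ones come from an embedding model.
documents = {
    "expense-policy.md": [0.9, 0.1, 0.0],
    "oncall-runbook.md": [0.1, 0.8, 0.2],
}
query_vec = [0.85, 0.15, 0.05]  # embedding of the user's question

best = max(documents, key=lambda name: cosine(query_vec, documents[name]))
prompt = f"Using only {best}, answer: what is the travel expense limit?"
print(best)  # expense-policy.md
```

The retrieved document never leaves your infrastructure; only the augmented prompt is sent to your own dedicated endpoint.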

AI Embedding, Search, and Vectorization

Generate embeddings for semantic search, similarity matching, and knowledge graph building. Power next-gen search engines that understand context and meaning, detect duplicates, and enable clustering based on vector representations.
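As a sketch of embedding-based duplicate detection: two texts are flagged as duplicates when their vectors point in nearly the same direction. The vectors and the 0.95 threshold below are illustrative, not tuned values.

```python
# Duplicate detection with embeddings: texts whose vectors are nearly
# parallel (cosine similarity above a threshold) are treated as duplicates.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

THRESHOLD = 0.95  # tune per embedding model and domain

vec_a = [0.70, 0.30, 0.10]  # e.g. "reset my password"
vec_b = [0.69, 0.31, 0.12]  # e.g. "how do I reset my password?"
vec_c = [0.05, 0.10, 0.90]  # e.g. "invoice for March"

is_dup_ab = cosine(vec_a, vec_b) > THRESHOLD
is_dup_ac = cosine(vec_a, vec_c) > THRESHOLD
print(is_dup_ab, is_dup_ac)  # True False
```

The same similarity scores also drive clustering and semantic search: rank all candidates by cosine similarity instead of applying a single threshold.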

Combine Dedicated Inference with...

…other Exoscale products. It integrates easily with GPU Cloud Computing, Managed PGVector, and Managed Vector Search.

GPU Cloud Computing

NVIDIA-powered GPU instances for machine learning, data processing, and compute-intensive workloads.

Managed PGVector

PostgreSQL with pgvector extension. Perfect for hybrid workloads needing both relational and vector data.

Managed Vector Search

OpenSearch-based vector database. Optimized for pure AI workloads and large-scale semantic search.


Get Started with Dedicated Inference Today

Access to Dedicated Inference is currently limited to preview participants. To get started, please contact our team and request access.

Contact us

Explore More Exoscale Services

Expand your infrastructure with services that boost availability, optimize performance, and provide expert support for all your workloads. Exoscale offers everything you need to scale your projects successfully.

Scalable Kubernetes Service

Deploy containerized applications on a production-ready Kubernetes cluster in under two minutes. Use SKS as the control layer for your virtual machine instances, with support for CLI, API, Terraform, and other DevOps tools.

Simple Object Storage

Use a highly scalable and S3-compatible storage solution for unstructured data. Ideal for storing backups, logs, static assets, or media, fully integrated with Exoscale regions and access-controlled via API.

Support Plans

Get the help you need to run your infrastructure with confidence through flexible support plans, designed to provide expert guidance, faster response times, and dedicated assistance tailored to your business.


Frequently Asked Questions about Dedicated Inference

How do I handle scaling when traffic spikes?

Scaling is controlled through a simple API call: adjust the replicas parameter to instantly deploy more inference engines and handle the increased load.

Which GPU types can I use with Dedicated Inference?

You can run models on A30, A40, A5000, 3080 Ti, and RTX Pro 6000 GPUs in the zones where they are offered. Check availability and pricing here.

How does the pricing work?

Pricing is simple and transparent. You pay per second for GPU compute usage (billing starts when the model is ready and stops when scaled to zero). Additionally, standard Object Storage (SOS) rates apply for storing your cached models. The managed service layer is included at no extra cost. You can calculate your costs by using the Exoscale Calculator.
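For a rough sense of per-second billing, take the A5000 rate of €1.34 per hour quoted above (Object Storage fees for the cached model are extra and not modelled here):

```python
# Per-second GPU billing sketch, assuming the A5000 hourly rate quoted on
# this page (EUR 1.34/h). Object Storage fees are not included.
price_per_hour = 1.34
price_per_second = price_per_hour / 3600

# Example: the deployment serves traffic for 45 minutes, then scales to zero.
seconds_billed = 45 * 60
gpu_cost = price_per_second * seconds_billed
print(f"EUR {gpu_cost:.3f}")  # EUR 1.005
```

Once scaled to zero, the per-second GPU meter stops and only the small storage cost for the cached model remains.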

What is the minimum cost?

There are no upfront fees. The minimum cost is simply the GPU usage time (billed per second) plus the minimal standard Object Storage (SOS) fees for caching your model files. If you scale to zero, you only pay for storage.

Where is the service available?

Dedicated Inference is available in the Exoscale zones that currently offer GPU servers, such as AT-VIE-2, CH-GVA-2, DE-FRA-1, and HR-ZAG-1, keeping your data local and compliant.

Can I scale the service to zero?

Dedicated Inference supports manually scaling deployments down to zero, so you can stop GPU usage when the service is idle. Autoscaling capabilities are coming soon; please contact us to learn more about our roadmap and upcoming capabilities.