The Azure AI Model Router: an infrastructure architect’s field notes

The request landed in my inbox last month. A dev team wanted to enable the Model Router in their Azure AI Foundry project. The pitch was simple: one endpoint, automatic model selection, lower costs. They wanted my approval to go live.

My job at that point is not to evaluate whether the router is a good product. It is to make sure the infrastructure, governance, and cost controls are in place before it touches a production workload. That took longer than the dev team expected.

This post covers what I checked, what I found, and what I required before saying yes.

What the router is, from an infra seat

The Model Router is a model deployment inside Azure AI Foundry. You deploy it like any other model, it sits inside your Foundry resource, and it calls underlying models on your behalf based on the complexity of the incoming prompt.

That last part matters more than the marketing makes it sound. “On your behalf” includes third-party models: Anthropic’s Claude, xAI’s Grok, DeepSeek, Meta’s Llama. If your routing mode is Balanced or Cost (both are defaults), a prompt could land on any of those depending on what the router decides is appropriate.

There are three routing modes:

Mode	What it does
Balanced (default)	Picks the most cost-effective model within a small quality range
Cost	Widens that quality band further to push toward cheaper models
Quality	Picks the highest-scoring model regardless of cost

You can also configure a model subset to restrict routing to a specific approved list. I will come back to why that is not optional for production.

Problem 1: your landing zone probably is not in the right region

The Model Router is only available in East US 2 and Sweden Central. That is a hard deployment constraint, not a soft recommendation.

For most enterprise landing zones this is a genuine issue. If your platform is built around a different primary region, you are looking at one of two options: deploy a Foundry resource outside your standard topology, or redesign your spoke placement to accommodate it.

Neither is technically difficult, but both need a decision made before any deployment conversation starts.

If your workloads are in EU regions for data residency reasons, Sweden Central is your only option. East US 2 puts prompt data in the US. Whether that is acceptable depends on your data classification policy, and that question needs an answer in writing before you proceed.

In our case the workload was not subject to EU residency requirements, so we went with East US 2 and created a dedicated Foundry spoke there. The network path from application VNets runs through our hub, private DNS resolves *.services.ai.azure.com to the private endpoint IP, and public network access is disabled on the Foundry resource.

If you have not set up private DNS for Foundry endpoints yet, do that before deploying anything. The Microsoft docs cover the required DNS zones. Test resolution from your application VNet before handing the endpoint to a dev team.

Problem 2: auto-update will quietly change what models handle your data

This one caught me off guard.

The active Model Router version is 2025-11-18. That version number does not change when Microsoft adds new models to it. New models, including third-party ones, are added in place. If you deploy with auto-update on and default routing settings, you can wake up one day to find a model you never approved handling production prompts.

There are two deployment modes. Quick Deploy gives you the full model pool and picks up new additions automatically. Custom Deploy lets you define a model subset, an explicit list of what the router is allowed to route to. New models Microsoft adds are excluded from your deployment until you deliberately add them.

For production, Custom Deploy with a defined subset is not optional. Any changes to that subset should go through your normal change control process.

The reason I flag this specifically to security and compliance teams: if your data classification policy distinguishes between Microsoft-hosted models and third-party models billed through the Azure Marketplace, that distinction needs to be encoded in the model subset. The router does not store prompts, but it does read them to make routing decisions. Where that read happens matters to some compliance frameworks.

To show what this looks like in practice: I ran the eval toolkit against a Quick Deploy with no model subset configured. Across three test runs, the router never once picked gpt-4o. Between 28% and 44% of prompts went to Grok each time. That happened with zero deliberate configuration. If your data classification policy requires knowing which model handled a given prompt, that is the scenario you are trying to prevent.

Model	Run 1	Run 2	Run 3
grok-4-1-fast-reasoning	28.6%	44.4%	37.5%
gpt-oss-120b	28.6%	22.2%	25.0%
gpt-5-mini	28.6%	22.2%	25.0%
gpt-5.4	14.3%	11.1%	12.5%
gpt-4o	0%	0%	0%

Three runs, Balanced mode, no model subset configured. gpt-4o was never selected once.

One more thing for the security team: set your content filter on the router deployment itself, not on each underlying model. The filter applies to everything passed to and from the router. If you apply filters per model thinking you are being thorough, you are working against the architecture. One filter on the router is the right approach.

Problem 3: the cost model has a layer most teams miss

The headline promise is cost savings. The actual billing has two components: a router markup charged on every input token, and the underlying model’s standard rate for that request. Your Cost Management dashboard shows a blend. You cannot tell from the default view which underlying model drove a spend spike.

A few things worth getting right before go-live.

Set a budget alert scoped to the Foundry resource group. Put alerts at 80% and 100% of expected monthly spend. The router’s model distribution shifts based on prompt content, which makes spending patterns less predictable than a fixed model deployment.

Capture a baseline spend number before you enable the router. Run the eval toolkit covered below against a fixed baseline model first. That gives you an actual delta to measure against rather than a theoretical one.

In my test runs against the sample dataset, the router saved between 52% and 62% compared to gpt-4o. The savings came entirely from model selection: the router never picked gpt-4o, routing instead to gpt-5-mini, gpt-oss-120b, and Grok. Your numbers will depend on what your workload gets routed to, which is exactly why you need the baseline first.

	Run 1	Run 2	Run 3
Router total cost	$0.0148	$0.0168	$0.0141
Baseline (gpt-4o) total cost	$0.0392	$0.0353	$0.0351
Savings	62.3%	52.4%	59.9%
Router timeouts	3	1	2
Baseline timeouts	0	0	0

10 prompts per run. Tier 1 quota (150 RPM). Timeouts are a quota constraint, not a router defect.

Tag the Foundry resource before it goes anywhere near production. In our landing zone we require CostCenter, WorkloadOwner, DataClassification, and RoutingMode. That last tag tells the FinOps team at a glance whether the deployment is in Cost, Balanced, or Quality mode, which affects expected cost per token significantly.

One other thing: if the workload plans to use Claude models, those require a separate pre-deployment before the router can route to them. That is an extra billable resource in your subscription before it handles a single request through the router. Factor that into the cost projection.

One runtime behaviour worth flagging to your dev teams before they go live: when the router selects an o-series reasoning model, it silently drops temperature, top_p, stop, presence_penalty, frequency_penalty, and logit_bias. No error is returned. The application just behaves differently than expected. If a workload relies on any of those parameters, test explicitly with prompts that trigger reasoning model selection before approving it for production.

Problem 4: quota tier requests take time

The default quota for a new Model Router deployment is Tier 1: 1,000 requests per minute and 1 million tokens per minute on Global Standard. Most production workloads need more than that. Tier 3 gives you 4,000 RPM and 4 million TPM. Tier 5 gets you to 10,000 RPM.

Quota increase requests take time to process. You cannot submit one in response to a 429 in production.

Make the quota request part of your workload onboarding process. When a team requests router access, get their estimated request volume and token throughput upfront and submit the increase alongside the deployment, not after.

Using the eval toolkit as a sign-off mechanism

Once the infrastructure is sorted, the remaining question is whether the router actually performs well for this specific workload. The open-source eval toolkit the Foundry team published handles that.

I want to be clear about how I use it: not as a developer benchmarking exercise, but as a platform team gate. The eval gives me a data-backed answer before I sign off. Without it, approval is based on the vendor’s general claim. With it, approval is based on a test against the actual prompts the workload will send.

The repo is at github.com/microsoft-foundry/Model-Router-Auto-Evaluation.

Setup:

			
# Windows
git clone https://github.com/microsoft-foundry/Model-Router-Auto-Evaluation
cd Model-Router-Auto-Evaluation
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .

		

			
# macOS / Linux
git clone https://github.com/microsoft-foundry/Model-Router-Auto-Evaluation
cd Model-Router-Auto-Evaluation
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

		

Run the demo first, no credentials needed:

			
# Windows
.\scripts\demo.ps1
# macOS / Linux
bash scripts/demo.sh

This generates a full mock report using synthetic data and opens results/demo/dashboard.html in your browser. You will see every chart the tool produces: cost comparison, latency at p50/p90/p99, model distribution, pairwise quality win rates, per-category breakdown. I always do this step first so I know what I am looking at before running a live eval.

Configure credentials for a live run:

			
copy .env.example .env   # Windows
cp .env.example .env     # macOS / Linux

You need three sets of credentials: the router endpoint, a baseline model endpoint (the fixed model the router gets compared against), and a judge model endpoint (a separate model that scores answer quality). Use at least GPT-4o as the judge. Weaker models produce noisy scores.

Validate before spending API calls:

			
python scripts/run_eval.py \
  --config configs/quick_test.yaml \
  --dataset their_prompts.jsonl \
  --dry-run

The dry-run checks that your config is valid and your dataset parses correctly without making any API calls. If this fails, fix it here rather than three minutes into a live run.

Run the eval:

			
python scripts/run_eval.py \
  --config configs/default.yaml \
  --dataset their_prompts.jsonl

For this lab I used Quick Deploy with the full model pool rather than a Custom Deploy with a model subset. That is intentional. Quick Deploy with no governance constraints is exactly what shows up in a real organisation when someone enables the router without platform oversight, and it is what makes the model distribution findings meaningful. For a production eval, you would configure Custom Deploy with your approved model subset first, so the distribution you measure actually reflects your routing policy rather than the router’s unconstrained choices.

I ran this on a test subscription with the 10-prompt sample dataset that ships with the repo. If the run is interrupted, re-run with --resume and it picks up from the last checkpoint.

Results land in results/run-<timestamp>/. Start with dashboard.html for the charts and detailed_results.csv for per-prompt analysis in Excel.

Comparing two runs:

python scripts/compare_results.py results/run-a results/run-b

Useful for before/after comparisons when a team wants to change the model subset or routing mode.

Monitoring the router in production

The eval toolkit tells you whether to enable the router. Azure Monitor tells you whether it is behaving once it is live.

To monitor performance, go to the Monitoring section of your Foundry resource in the Azure portal and open Metrics. Filter by your router deployment name. From there you can split the metrics by underlying model, which is the view that actually tells you something useful. If latency spikes, splitting by model shows you which underlying model is the source.

For cost monitoring the path is slightly different. In Cost Analysis, filter by Tag, set the tag type to Deployment, and set the value to your router deployment name. That scopes the cost view to just the router. Without that filter, the router’s spend blends into your broader Azure OpenAI resource cost and you cannot tell what the router specifically is spending.

Add both views to your platform team’s monitoring runbook before you hand the endpoint to a workload team. Once the router is live and routing autonomously, these are the two signals you check first when something looks wrong.

What I actually look at in the results

As the platform team reviewer I am not looking at the same numbers a developer would look at.

Model distribution chart: which underlying models got used? If anything outside the approved model subset appears, that is a problem to resolve before production.

Cost delta: is the saving real after the router markup? A headline saving can look different once high-token prompts absorb the per-token routing overhead.

Mean latency and p99 separately: the router was consistently faster at mean across my test runs (1.1x to 1.4x), but p99 was unpredictable. In one run the baseline was faster at p99. Which model the router picks for a given prompt type determines your tail, and you have no direct control over that in Balanced mode without a model subset.

	Run 1	Run 2	Run 3
Router mean	6,722ms	6,509ms	5,687ms
Baseline mean	9,105ms	6,895ms	7,686ms
Mean speedup	1.4x faster	1.1x faster	1.4x faster
Router p99	22,244ms	22,042ms	18,146ms
Baseline p99	24,925ms	14,590ms	17,990ms

Run 3 p99: the baseline was faster at the tail that run. Mean wins do not guarantee p99 wins.

Win rate confidence interval: the report shows a 95% CI alongside the win rate. A result of “52% router wins (CI: 41% to 63%)” is not a meaningful improvement over baseline. I do not approve a workload on a win rate whose lower confidence bound is below 45%.

My sign-off checklist

Before I approve a workload to use the router in production:

[ ] Foundry resource in East US 2 or Sweden Central with private endpoint and public access disabled
[ ] Custom Deploy with a model subset that matches the approved model list
[ ] Auto-update disabled or model subset changes subject to change control
[ ] Budget alert on the Foundry resource at 80% and 100%
[ ] Baseline spend captured before the router is enabled
[ ] Foundry resource tagged with CostCenter, WorkloadOwner, DataClassification, RoutingMode
[ ] Quota tier request submitted and approved
[ ] Eval run against at least 30 representative prompts from the actual workload
[ ] Model distribution in eval shows no unapproved models
[ ] p99 latency meets the workload SLA
[ ] Win rate CI lower bound above 45%

João Paulo Costa Azure MVP

Hands-on Azure engineer specialising in infrastructure and automation. Big on troubleshooting and real-world problem solving. Azure MVP in Compute Infrastructure.

The Azure AI Model Router: an infrastructure architect’s field notes

What the router is, from an infra seat

Problem 1: your landing zone probably is not in the right region

Problem 2: auto-update will quietly change what models handle your data

Problem 3: the cost model has a layer most teams miss

Problem 4: quota tier requests take time

Using the eval toolkit as a sign-off mechanism

Monitoring the router in production

What I actually look at in the results

My sign-off checklist

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from Get Practical