How to Build a Monitoring System for Azure Workloads

TL;DR

Monitoring is not a portal feature you switch on at the end. I treat it as a separate architecture plane with its own data flows, retention rules, access model, and cost envelope.
For most Azure workloads, the right baseline is: Azure Monitor + Log Analytics workspace + Application Insights + diagnostic settings + alerts + Workbooks. Then add VM Insights, Container Insights, and Network Watcher only where the workload needs deeper visibility.
The most common anti-pattern I see is collecting everything, correlating nothing, and alerting on symptoms instead of business impact. That gives you high spend, low trust, and poor incident response.
If you want monitoring to scale, design for four layers from day one: instrumentation, collection/storage, correlation/analysis, and visualization/action.

The Problem Worth Solving

Most teams still bolt monitoring on after the workload is already live. That is usually where the trouble starts.

In my experience, the failure is rarely “we had no telemetry at all.” The real failure is more subtle: metrics exist in one place, application traces in another, platform logs are either missing or too noisy, and nobody can tie a customer issue to a concrete execution path in under 15 minutes. At that point, monitoring is just expensive decoration.

I treat monitoring as a first-class architecture system. The functional stack serves the workload. The monitoring stack observes the workload. Those are different responsibilities and they need different design decisions. Microsoft’s Well-Architected guidance makes the same point: build a dedicated monitoring system that observes infrastructure, application behavior, and operational processes rather than treating monitoring as an afterthought.

This matters right now for a few reasons:

Azure estates are more distributed than they were even two years ago: App Service, Functions, containers, managed databases, AI endpoints, private networking, CI/CD automation, and external SaaS dependencies.
Teams are shipping faster. If your deployment frequency goes from weekly to daily, weak observability stops being an inconvenience and becomes a reliability risk.
Cost is no longer secondary. Azure Monitor Logs ingestion is often the biggest monitoring cost component, so bad telemetry design becomes a FinOps problem quickly.

I’ve seen this play out repeatedly. In my portfolio work, I built a production-grade Azure portfolio site with Application Insights monitoring, managed identity, CI/CD, and a documented architecture as a reference implementation. In another modernization project on AKS, introducing observability and deployment discipline helped move deployment frequency from weekly to daily, reduced mean time to recovery from hours to minutes, and cut infrastructure cost by 40%.

This affects:

Cloud and AI architects designing multi-service workloads
CTOs who care about uptime, team efficiency, and cloud spend
Platform engineers standardizing telemetry across teams
App teams who need real root-cause analysis, not vanity dashboards

What breaks when you do this wrong?

You miss platform logs because diagnostic settings were never enabled.
You get flooded with alerts because every CPU spike pages someone.
You cannot correlate a failed API request to a dependency call, database issue, or network path.
You retain verbose logs forever and wonder why the Azure bill keeps climbing.
During incidents, the first 20 minutes are spent asking “which dashboard is the right one?”

If your system starts throwing intermittent 5xx responses at 800 to 1,000 RPS and you cannot answer which operation, which dependency, which deployment, and which region within a few minutes, your monitoring architecture is incomplete.

Architecture Overview

Diagram img-1

The architecture I recommend has a simple principle: separate the workload plane from the observability plane, but connect them through consistent telemetry contracts.

At a high level:

Applications and services emit telemetry
- Application code emits traces, requests, dependencies, exceptions, and custom events through Application Insights and OpenTelemetry.
- Azure resources emit platform metrics and resource logs through Azure Monitor.
- VMs, AKS clusters, and network components emit deeper operational data through Azure Monitor Agent, Container Insights, VM Insights, Prometheus integration where needed, and Network Watcher. Microsoft’s Azure Monitor architecture explicitly positions Azure Monitor as the central data platform for logs, metrics, application telemetry, and Prometheus-style signals.
Telemetry lands in the right sink
- Near-real-time time-series signals go to Metrics
- Searchable operational and diagnostic records go to Log Analytics
- Application traces and request/dependency data go through Application Insights, ideally workspace-based
Correlation happens at query and trace level
- Operation IDs, trace IDs, request IDs, user/session identifiers, deployment metadata, and environment tags are attached to telemetry.
- Cross-service analysis happens through KQL in Log Analytics and Application Insights-backed logs.
Actioning and visualization sit on top
- Alerts route through Azure Monitor Alerts + Action Groups
- Investigative views live in Workbooks
- Executive or operational rollups live in Azure Dashboards
- Service-specific views come from VM Insights, Container Insights, and Network Watcher

The critical path is this:

Instrumentation quality -> diagnostic coverage -> correlation identifiers -> actionable alerts

If you get the first two right but skip correlation, you collect noise.
If you get correlation right but alert on the wrong signals, you still wake people up for nothing.

Here is the logical flow I use.

flowchart LR
    A[Azure Workloads] --> B[Application Telemetry]
    A --> C[Platform Metrics]
    A --> D[Resource Logs]
    A --> E[Network Diagnostics]

    B --> F[Application Insights]
    C --> G[Azure Monitor Metrics]
    D --> H[Log Analytics Workspace]
    E --> H

    F --> H
    G --> I[Alerts]
    H --> I
    H --> J[Workbooks]
    H --> K[Dashboards]
    F --> J

    I --> L[Action Groups]
    L --> M[Email SMS Webhook Function Logic App]

The official design guidance is aligned with this pattern: Azure Monitor is the collection and response platform, Log Analytics is the central log analysis engine, Application Insights is the APM layer, Workbooks provide deep analysis views, and specialized insights extend the core platform for VMs, containers, and networks.

Deep Dive: Instrumentation, Correlation, and Telemetry Design

This is the part most teams underinvest in.

Tools do not create observability. Instrumentation does.

If your code emits anonymous exceptions, untagged dependency calls, and inconsistent custom properties, the rest of the stack cannot save you. I recommend designing telemetry the same way you design APIs: define schema, cardinality boundaries, naming conventions, ownership, and retention.

1) Instrumentation: start with contracts, not SDK defaults

For application workloads, use Application Insights with OpenTelemetry where possible. Microsoft now positions Application Insights as an OpenTelemetry-enabled Azure Monitor feature for application performance monitoring.

My minimum telemetry contract per request is:

traceId
operationId
serviceName
serviceVersion
environment
tenantId or customer segment if applicable
correlationId from upstream gateway
business dimensions like orderId, workflowId, or caseId only when cardinality is controlled

What I do not log freely:

PII
raw prompts or payloads without redaction rules
high-cardinality junk like full URLs with random query strings
debug-level logs in production without time-bound purpose
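
One way to make the contract and the exclusion list enforceable is a small helper that every service calls before attaching attributes. This is a minimal sketch, not a library API: the function names, the `urlPath` attribute, and the choice of which keys are required versus optional are assumptions made for illustration.

```javascript
// Hypothetical contract: required keys taken from the list above.
const REQUIRED_KEYS = ["traceId", "operationId", "serviceName", "serviceVersion", "environment"];
// Optional keys that are allowed through; anything else is dropped (PII, raw payloads, ad hoc fields).
const OPTIONAL_KEYS = ["tenantId", "correlationId", "orderId", "workflowId", "caseId"];

// Removes the query string and fragment so random parameters do not inflate cardinality.
function stripQuery(url) {
  return url.split(/[?#]/)[0];
}

// Returns only contract-approved attributes, or throws if a required key is missing.
function buildTelemetryAttributes(input) {
  const missing = REQUIRED_KEYS.filter((key) => !input[key]);
  if (missing.length > 0) {
    throw new Error(`Telemetry contract violation, missing: ${missing.join(", ")}`);
  }
  const attributes = {};
  for (const key of [...REQUIRED_KEYS, ...OPTIONAL_KEYS]) {
    if (input[key] !== undefined) {
      attributes[key] = String(input[key]);
    }
  }
  if (typeof input.url === "string") {
    attributes.urlPath = stripQuery(input.url);
  }
  return attributes;
}
```

An allowlist is the safer default here: a new field has to be added to the contract deliberately instead of leaking into telemetry by accident.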

A practical Node.js example with Azure Monitor OpenTelemetry:

npm install @azure/monitor-opentelemetry @opentelemetry/api express

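// Azure Monitor's guidance is to call useAzureMonitor() before other modules load.
// Static ESM imports are hoisted, so verify that HTTP/Express auto-instrumentation
// is actually active in your setup.
import express from "express";
import { useAzureMonitor } from "@azure/monitor-opentelemetry";
import { SpanStatusCode, trace } from "@opentelemetry/api";

useAzureMonitor({
  azureMonitorExporterOptions: {
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING
  },
  enableLiveMetrics: true
});

const app = express();
const tracer = trace.getTracer("orders-api");

app.get("/healthz", (_req, res) => {
  res.status(200).send("ok");
});

app.get("/orders/:id", async (req, res) => {
  // Manual span for the business operation; it is always ended in the finally block.
  const span = tracer.startSpan("get-order");
  try {
    const orderId = req.params.id;
    // Order IDs are high-cardinality: acceptable on a span, not as a metric dimension.
    span.setAttribute("app.order_id", orderId);
    span.setAttribute("deployment.environment", process.env.NODE_ENV || "prod");

    // Simulate dependency work
    await new Promise(resolve => setTimeout(resolve, 120));

    res.json({ id: orderId, status: "processed" });
  } catch (err) {
    // Record the exception and mark the span as failed so it surfaces as an error.
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
    res.status(500).json({ error: "unexpected_failure" });
  } finally {
    span.end();
  }
});

app.listen(3000, () => {
  console.log("Orders API listening on port 3000");
});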

The gotcha here is cardinality. If you stamp every telemetry item with user-level or request-level fields that explode uniqueness, your queries slow down and your data loses analytical value.
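
A quick way to spot this before it reaches production is to count distinct values per attribute over a sample of telemetry items. The sketch below assumes the sample is an array of flat attribute objects; the function names and the threshold are illustrative, not part of any SDK.

```javascript
// Counts how many distinct values each attribute key takes across the sample.
function cardinalityByKey(records) {
  const seen = new Map();
  for (const record of records) {
    for (const [key, value] of Object.entries(record)) {
      if (!seen.has(key)) seen.set(key, new Set());
      seen.get(key).add(String(value));
    }
  }
  const counts = {};
  for (const [key, values] of seen) counts[key] = values.size;
  return counts;
}

// Returns the keys whose distinct-value count exceeds the allowed limit, sorted by name.
function highCardinalityKeys(records, maxDistinct) {
  const counts = cardinalityByKey(records);
  return Object.keys(counts)
    .filter((key) => counts[key] > maxDistinct)
    .sort();
}

// Example sample: environment stays bounded, requestId grows with every item.
const sample = [
  { environment: "prod", requestId: "r1" },
  { environment: "prod", requestId: "r2" },
  { environment: "staging", requestId: "r3" },
];
const offenders = highCardinalityKeys(sample, 2);
```

Keys that show up in `offenders` are candidates for removal, bucketing, or moving from dimensions to span-only attributes.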

2) Collection: enable platform telemetry deliberately

A lot of Azure teams assume the portal overview equals observability. It does not.

Azure Monitor automatically collects some platform metrics and activity logs, but detailed resource logs require diagnostic settings on the resource. Azure’s documentation is explicit about this: richer monitoring data from Azure resources typically requires diagnostic settings and additional collection configuration.

For example, on App Service or other PaaS resources, I usually provision:

a Log Analytics workspace
a workspace-based Application Insights resource
diagnostic settings forwarding logs and metrics
naming conventions that reflect environment and landing zone

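# Assumes the resource group already exists and the application-insights CLI extension is installed.
resourceGroup="rg-monitoring-prod"
location="eastus"
workspaceName="law-prod-eastus-01"
appInsightsName="appi-orders-prod-01"
webAppName="app-orders-prod-01"
subscriptionId=$(az account show --query id -o tsv)

# Create the workspace and capture its resource ID for later commands.
workspaceId=$(az monitor log-analytics workspace create \
  --resource-group "$resourceGroup" \
  --workspace-name "$workspaceName" \
  --location "$location" \
  --query id -o tsv)

# Pass the workspace resource ID so the component is workspace-based regardless of resource group.
az monitor app-insights component create \
  --app "$appInsightsName" \
  --location "$location" \
  --resource-group "$resourceGroup" \
  --workspace "$workspaceId"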

Then wire diagnostic settings:

az monitor diagnostic-settings create \
  --name "send-to-law" \
  --resource "/subscriptions/$subscriptionId/resourceGroups/$resourceGroup/providers/Microsoft.Web/sites/$webAppName" \
  --workspace $workspaceId \
  --logs '[
    {"category":"AppServiceHTTPLogs","enabled":true},
    {"category":"AppServiceConsoleLogs","enabled":true},
    {"category":"AppServiceAppLogs","enabled":true},
    {"category":"AppServiceAuditLogs","enabled":true}
  ]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'
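
To keep the enabled categories a deliberate choice rather than a copy-pasted blob, the `--logs` payload can be generated from an allowlist per environment. This is a sketch under assumptions: the environment names and the reduced `dev` list are hypothetical, and the category names are simply the App Service ones used above.

```javascript
// Hypothetical per-environment allowlist of diagnostic log categories.
const LOG_CATEGORIES = new Map([
  ["prod", ["AppServiceHTTPLogs", "AppServiceConsoleLogs", "AppServiceAppLogs", "AppServiceAuditLogs"]],
  ["dev", ["AppServiceHTTPLogs", "AppServiceAppLogs"]],
]);

// Builds the JSON string expected by the --logs argument of the diagnostic-settings command.
function buildLogsArgument(environment) {
  const categories = LOG_CATEGORIES.get(environment);
  if (!categories) {
    throw new Error(`No log categories defined for "${environment}"`);
  }
  // De-duplicate so a repeated category never produces a conflicting entry.
  const settings = [...new Set(categories)].map((category) => ({ category, enabled: true }));
  return JSON.stringify(settings);
}
```

Reviewing this allowlist in a pull request is a cheap checkpoint for the question "do we actually need this category in this environment?"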

For VMs and servers, use Azure Monitor Agent and VM Insights. For AKS, use Container Insights and, where appropriate, managed Prometheus support in Azure Monitor. Microsoft’s enterprise monitoring architecture calls out Azure Monitor Agent, Application Insights, Logs ingestion API, Workbooks, and managed Prometheus as core components of the observability platform.

3) Correlation: where monitoring becomes useful

This is the layer that separates “we have logs” from “we can investigate.”

For distributed workloads, I want to answer these in one query session:

Which request path failed?
Which downstream dependency was slow?
Was there a deployment in the preceding 15 minutes?
Was the issue isolated to one node, one availability zone, one region, or one tenant?
Did network or platform symptoms precede the app failure?
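
The deployment question in particular is easy to automate if deployment events are recorded with a finish time. A minimal sketch, assuming a hypothetical list of deployment records with `id` and `finishedAt` fields (ISO 8601 timestamps); none of these names come from an Azure API.

```javascript
// Returns deployments that finished within the window leading up to the incident start.
function deploymentsBefore(incidentStart, deployments, windowMinutes = 15) {
  const end = new Date(incidentStart).getTime();
  const start = end - windowMinutes * 60 * 1000;
  return deployments.filter((deployment) => {
    const finished = new Date(deployment.finishedAt).getTime();
    return finished >= start && finished <= end;
  });
}

// Hypothetical deployment history for one service.
const deployments = [
  { id: "rel-101", finishedAt: "2024-05-01T09:50:00Z" },
  { id: "rel-102", finishedAt: "2024-05-01T09:20:00Z" },
  { id: "rel-103", finishedAt: "2024-05-01T10:05:00Z" },
];
const suspects = deploymentsBefore("2024-05-01T10:00:00Z", deployments);
```

Here only `rel-101` falls inside the 15-minute window; `rel-102` is too old and `rel-103` finished after the incident began.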

A simple KQL query to correlate request failures with dependency pain:

let failedRequests =
    requests
    | where timestamp > ago(30m)
    | where success == false
    | project operation_Id, requestName=name, requestDuration=duration, resultCode, cloud_RoleName, timestamp;
let slowDependencies =
    dependencies
    | where timestamp > ago(30m)
    | where duration > 1000  // duration is stored in milliseconds
    | project operation_Id, dependencyTarget=target, dependencyName=name, dependencyDuration=duration, dependencySuccess=success;
failedRequests
| join kind=leftouter slowDependencies on operation_Id
| order by timestamp desc

This is why I push teams to standardize operation_Id propagation across services. Without it, cross-tier analysis degrades into timestamp guesswork.
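
When a service cannot rely on auto-instrumentation, propagation comes down to reading and forwarding the W3C `traceparent` header, whose trace ID is what ends up as the shared operation identifier. The sketch below handles only the version `00` four-field format and lowercases input for leniency; the function names are illustrative, and in most services the OpenTelemetry SDK does this for you.

```javascript
// version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parses a traceparent header; returns null when the header is malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_PATTERN.exec(String(header).trim().toLowerCase());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid in the W3C Trace Context format.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Builds the header for an outgoing call: same trace ID, new span ID, same sampled flag.
function childTraceparent(parent, newSpanId) {
  const flags = parent.sampled ? "01" : "00";
  return `00-${parent.traceId}-${newSpanId}-${flags}`;
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
const outgoing = childTraceparent(incoming, "b7ad6b7169203331");
```

The important property is that the trace ID never changes across hops; only the span ID does. That is what makes the join on `operation_Id` possible.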

4) Visualization: build for two audiences

I usually create two classes of views:

Operational views

On-call dashboard
Service health workbook
Dependency failure workbook
Top error signatures
SLO burn or error budget approximation if the team is mature enough

Leadership views

Availability trend
P95/P99 latency
Incident count by service
Mean time to detect
Mean time to recover
Cost of telemetry by workload
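
Two of the numbers in these views, P95/P99 latency and SLO burn, reduce to short calculations. A minimal sketch, assuming nearest-rank percentiles and a burn rate defined as the observed error ratio divided by the error budget (1 minus the SLO target); the sample values are made up.

```javascript
// Nearest-rank percentile: the smallest sample with at least p percent of values at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error("percentile of an empty sample is undefined");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Burn rate above 1 means the error budget is being spent faster than the SLO allows.
function burnRate(failedRequests, totalRequests, sloTarget) {
  if (totalRequests === 0) return 0;
  return (failedRequests / totalRequests) / (1 - sloTarget);
}

// Hypothetical latency sample in milliseconds with one slow outlier.
const latenciesMs = [120, 80, 300, 95, 110, 150, 2000, 90, 100, 130];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
// 30 failures out of 10,000 requests against a 99.9% availability target.
const currentBurn = burnRate(30, 10000, 0.999);
```

With this sample the median is 110 ms while P95 is 2000 ms, which is why tail percentiles belong on the dashboard; the burn rate comes out at roughly 3, meaning the budget would be exhausted in about a third of the SLO period if the rate held.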

Workbooks are where I spend most of my time because they support investigation better than static dashboards. Azure documentation positions Workbooks as the flexible analysis canvas for deep drill-down and custom visual reports.

What I Got Wrong the First Time

The first time I built a “complete” monitoring setup for a distributed workload, I over-collected and under-modeled.

I had metrics, logs, traces, dashboards, and alerts. On paper it looked mature. In practice, it was noisy and expensive. The root problem was that I had not made enough decisions up front about:

table plans
retention
which logs were operationally valuable
which custom dimensions were actually worth collecting and querying
which alerts mapped to customer impact

I see variants of this anti-pattern often: teams ingest every diagnostic category into a central workspace because it feels safer. A month later, nobody trusts the alerts, the KQL queries are messy, and log ingestion becomes a meaningful part of the Azure bill. Microsoft’s cost guidance is very clear that log ingestion is often the largest Azure Monitor cost driver and that data configuration choices heavily influence spend.

How to detect this in production:

ingestion jumps week over week without a matching reliability benefit
workbooks take too long to load
on-call ignores alerts because too many are low signal
post-incident reviews keep saying “insufficient context”
teams export data to spreadsheets because native analysis is too cluttered

The fix was architectural, not operational:

Redefine telemetry classes:
- health signals
- investigative signals
- audit signals
- debug signals
Apply different retention and table strategies
Cut custom dimensions that had no query value
Move from per-resource alert sprawl to service-level alert design
Standardize workbook templates across teams
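
The telemetry classes become useful once they are written down as policy that tooling can read. This is a sketch only: the retention values are placeholders rather than Azure defaults or limits, the plan assignment is an assumption, and `policyFor` is a hypothetical helper.

```javascript
// Hypothetical policy per telemetry class; tune retention and plan per workload and compliance need.
const TELEMETRY_CLASSES = new Map([
  ["health", { tablePlan: "Analytics", retentionDays: 30, alertable: true }],
  ["investigative", { tablePlan: "Analytics", retentionDays: 30, alertable: false }],
  ["audit", { tablePlan: "Analytics", retentionDays: 365, alertable: false }],
  ["debug", { tablePlan: "Basic", retentionDays: 7, alertable: false }],
]);

// Looks up the policy for a class and fails loudly for unclassified telemetry.
function policyFor(telemetryClass) {
  const policy = TELEMETRY_CLASSES.get(telemetryClass);
  if (!policy) {
    throw new Error(`Unclassified telemetry: "${telemetryClass}"`);
  }
  return policy;
}

// Lists the classes that alert rules are allowed to depend on.
function alertableClasses() {
  return [...TELEMETRY_CLASSES].filter(([, policy]) => policy.alertable).map(([name]) => name);
}
```

Failing on unclassified telemetry is the point of the design: a new log source has to be assigned a class, and therefore a retention and cost decision, before it is ingested.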

That same thinking is why I’m opinionated about observability in platform modernization. In my AKS modernization work, standardizing observability alongside GitOps and deployment practices helped reduce MTTR from hours to minutes. In platform engineering work, standardized observability was part of reducing onboarding from 3 days to 2 hours and driving CI/CD adoption to 100% of teams.

Performance & Cost Considerations

This is the part architects often skip, and it is where bad designs get punished.

Baseline performance assumptions

For most Azure production workloads, I think in three telemetry lanes:

Metrics for low-latency health and alerting
Logs for diagnostics and forensic detail
Traces for request-path correlation

Metrics are the cheapest and fastest way to alert on infrastructure and service-level behavior. Logs are the richest but most likely to drive cost. Traces are essential for app debugging but can become expensive if you keep 100% of them under heavy traffic.

A practical baseline:

At 100 RPS, full tracing may still be manageable for a critical API if retention is controlled.
At 1,000+ RPS, always-on 100% sampling is usually a bad default unless the workload is narrow and high-value.
For noisy container platforms, network flow logs can explode in volume quickly; Azure’s AKS network monitoring guidance explicitly warns about substantial log volume, throttling, and log loss if not tuned.
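
The RPS thresholds above are easier to reason about with a back-of-envelope volume estimate. The sketch below assumes four telemetry items per request and 1,000 bytes per item; both figures are illustrative assumptions, not measurements, and GB here means 10^9 bytes.

```javascript
const SECONDS_PER_DAY = 86400;

// Estimates daily trace ingestion in decimal gigabytes for a given sampling ratio (0 to 1).
function estimateDailyIngestionGb({ requestsPerSecond, itemsPerRequest, avgItemBytes, samplingRatio }) {
  const itemsPerDay = requestsPerSecond * SECONDS_PER_DAY * itemsPerRequest * samplingRatio;
  return (itemsPerDay * avgItemBytes) / 1e9;
}

// 100 RPS with every trace kept.
const fullAt100 = estimateDailyIngestionGb({
  requestsPerSecond: 100,
  itemsPerRequest: 4,
  avgItemBytes: 1000,
  samplingRatio: 1,
});

// 1,000 RPS with 10% of traces kept.
const sampledAt1000 = estimateDailyIngestionGb({
  requestsPerSecond: 1000,
  itemsPerRequest: 4,
  avgItemBytes: 1000,
  samplingRatio: 0.1,
});
```

Under these assumptions both cases land at about 34.56 GB per day, which shows why a sampling ratio that is fine at 100 RPS needs to drop by roughly the same factor as traffic grows if the spend is to stay flat.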

Cost levers that actually matter

The biggest levers are usually:

Ingestion volume
Retention duration
Table plan choice
Sampling
Number of duplicated sinks
Verbose diagnostic categories enabled by default

Azure Monitor pricing is regional and logs are billed largely by ingestion, retention, and export. Microsoft’s cost documentation repeatedly emphasizes that ingestion is the biggest component for many customers.

My practical recommendations:

Keep one central workspace per environment or per compliance boundary, not one per resource.
Use separate workspaces only when compliance, tenancy isolation, or operational ownership justifies the trade-off.
Sample application traces intelligently.
Put noisy debugging tables on lower-cost plans where query patterns allow it.
Create daily budget alerts around the monitoring resource group, not just the product workload.
Review top-ingesting tables monthly.
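
For the sampling recommendation, the property that matters is that every service makes the same keep-or-drop decision for a given trace, so sampled traces stay complete end to end. The sketch below illustrates that idea with a trace-ID-based ratio check; it is not the algorithm used by the Azure Monitor distro or the OpenTelemetry SDK, and `shouldSample` is a hypothetical name.

```javascript
const BUCKETS = 0x100000000; // 2^32 possible values from 8 hex characters

// Deterministic decision: the same trace ID and ratio always give the same answer.
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters of the trace ID as a number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * BUCKETS;
}

// Two illustrative 32-character trace IDs at opposite ends of the bucket range.
const lowTrace = "4bf92f3577b34da6a3ce929d00000000";
const highTrace = "4bf92f3577b34da6a3ce929dffffffff";
const keepLow = shouldSample(lowTrace, 0.1);
const keepHigh = shouldSample(highTrace, 0.1);
```

Because the decision depends only on the trace ID, a downstream service that receives the same ID reaches the same verdict without any coordination.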

Example: create an action group and a scheduled query alert.

resourceGroup="rg-monitoring-prod"
workspaceResourceId="/subscriptions/<sub-id>/resourceGroups/rg-monitoring-prod/providers/Microsoft.OperationalInsights/workspaces/law-prod-eastus-01"

# Receivers are passed as: --action TYPE NAME ADDRESS
az monitor action-group create \
  --name "ag-platform-oncall" \
  --resource-group "$resourceGroup" \
  --short-name "platcall" \
  --action email PlatformOnCall platform-oncall@contoso.com

# The rule is scoped to the workspace, so the query uses the workspace-based AppRequests table.
# The condition counts rows returned by the named query, so the query must not end in "| count".
# The evaluation window comes from --window-size rather than a time filter in the query.
az monitor scheduled-query create \
  --name "High 5xx Error Rate" \
  --resource-group "$resourceGroup" \
  --scopes "$workspaceResourceId" \
  --condition "count 'Failed5xx' > 25" \
  --condition-query Failed5xx="AppRequests | where toint(ResultCode) >= 500" \
  --description "Triggers when 5xx volume exceeds threshold in 5 minutes" \
  --evaluation-frequency "5m" \
  --window-size "5m" \
  --severity 2 \
  --action-groups "/subscriptions/<sub-id>/resourceGroups/rg-monitoring-prod/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

Also use dynamic thresholds selectively. They can reduce alert tuning effort, but I do not recommend them as a blanket substitute for understanding the workload. Microsoft supports dynamic thresholds for relevant metrics and, in some cases, log-query-based scenarios, but they work best where historical patterns are meaningful and stable.
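
The intuition behind a dynamic threshold can be shown with a simple statistical band: alert when the current value exceeds the historical mean by some number of standard deviations. This is only an illustration of the concept and not how Azure Monitor computes dynamic thresholds; the function names and the default sensitivity are assumptions.

```javascript
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Population standard deviation of the historical sample.
function stddev(values) {
  const average = mean(values);
  return Math.sqrt(mean(values.map((value) => (value - average) ** 2)));
}

// Upper bound of the expected range: mean plus sensitivity times standard deviation.
function dynamicThreshold(history, sensitivity = 3) {
  return mean(history) + sensitivity * stddev(history);
}

function isAnomalous(current, history, sensitivity = 3) {
  return current > dynamicThreshold(history, sensitivity);
}

// Hypothetical history of 5xx counts per evaluation window.
const history = [2, 4, 4, 4, 5, 5, 7, 9];
const threshold = dynamicThreshold(history);
```

For this history the mean is 5 and the standard deviation is 2, so the threshold is 11. A history with large swings widens the band until real incidents hide inside it, which is the practical reason a learned threshold only helps when the underlying pattern is stable.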

When does this architecture break?

It does not “break” all at once. It degrades in predictable ways:

At high scale: log ingestion cost climbs faster than value if you do not sample and prune.
Across regions: correlation becomes harder if naming, tagging, and workspace strategy are inconsistent.
Across teams: every team creates its own alerts and dashboards, and your platform loses standardization.
In AI-heavy workloads: prompt/response telemetry creates privacy, size, and cardinality problems if ungoverned.

When NOT to Use This

I recommend this architecture for most Azure-hosted enterprise workloads, but not universally.

Do not use the full stack if:

your app is a small internal tool with low criticality and one service boundary
your workload is ephemeral and cost sensitivity is extreme
your team does not have the operational maturity to maintain custom dashboards, queries, and alert tuning
you only need basic uptime checks and platform metrics

If your workload looks like this:

one App Service
one database
low traffic
one team
tolerant to short interruptions

Then don’t over-engineer it. Start with:

Azure Monitor metrics
basic Application Insights
two or three critical alerts
one workbook
a short retention period

If your workload is AKS-heavy, add Container Insights and Prometheus support where it helps, but be careful with log volume. If it is network-sensitive, prioritize Network Watcher connection monitoring and flow-level analysis. If it is VM-centric, VM Insights is often a better first investment than building custom workbook sprawl.

The bigger anti-pattern I want readers to avoid is this: buying more monitoring features instead of designing a monitoring architecture.

Tools do not fix weak signal design.

Key Takeaways

Design monitoring as a separate architecture plane with its own data flows, storage, access controls, and cost model.
Start with four layers: instrumentation, collection/storage, correlation/analysis, and visualization/action.
My default Azure baseline is Azure Monitor + Log Analytics + Application Insights + diagnostic settings + alerts + Workbooks, then add VM, container, or network insights only when justified.
Correlation IDs, environment tags, service version metadata, and disciplined custom dimensions matter more than having 50 dashboards.
The fastest way to waste money is to ingest everything. The fastest way to waste engineering time is to alert on everything.
If you want a reliable monitoring system, optimize for investigation speed, signal quality, and cost control together.

If you’re rethinking your current setup, my advice is simple: adopt a reference architecture, remove the anti-pattern of unmanaged log sprawl, and choose monitoring tools based on the workload’s failure modes rather than what is easiest to enable in the portal.

How to Build a Monitoring System for Azure Workloads

TL;DR

Monitoring is not a portal feature you switch on at the end. I treat it as a separate architecture plane with its own data flows, retention rules, access model, and cost envelope.
For most Azure workloads, the right baseline is: Azure Monitor + Log Analytics workspace + Application Insights + diagnostic settings + alerts + Workbooks. Then add VM Insights, Container Insights, and Network Watcher only where the workload needs deeper visibility.
The most common anti-pattern I see is collecting everything, correlating nothing, and alerting on symptoms instead of business impact. That gives you high spend, low trust, and poor incident response.
If you want monitoring to scale, design for four layers from day one: instrumentation, collection/storage, correlation/analysis, and visualization/action.

The Problem Worth Solving

Most teams still bolt monitoring on after the workload is already live. That is usually where the trouble starts.

This matters right now for a few reasons:

Azure estates are more distributed than they were even two years ago: App Service, Functions, containers, managed databases, AI endpoints, private networking, CI/CD automation, and external SaaS dependencies.
Teams are shipping faster. If your deployment frequency goes from weekly to daily, weak observability stops being an inconvenience and becomes a reliability risk.
Cost is no longer secondary. Azure Monitor Logs ingestion is often the biggest monitoring cost component, so bad telemetry design becomes a FinOps problem quickly.

This affects:

Cloud and AI architects designing multi-service workloads
CTOs who care about uptime, team efficiency, and cloud spend
Platform engineers standardizing telemetry across teams
App teams who need real root-cause analysis, not vanity dashboards

What breaks when you do this wrong?

You miss platform logs because diagnostic settings were never enabled.
You get flooded with alerts because every CPU spike pages someone.
You cannot correlate a failed API request to a dependency call, database issue, or network path.
You retain verbose logs forever and wonder why the Azure bill keeps climbing.
During incidents, the first 20 minutes are spent asking “which dashboard is the right one?”

Architecture Overview

Diagram img-1

The architecture I recommend has a simple principle: separate the workload plane from the observability plane, but connect them through consistent telemetry contracts.

At a high level:

Applications and services emit telemetry
- Application code emits traces, requests, dependencies, exceptions, and custom events through Application Insights and OpenTelemetry.
- Azure resources emit platform metrics and resource logs through Azure Monitor.
- VMs, AKS clusters, and network components emit deeper operational data through Azure Monitor Agent, Container Insights, VM Insights, Prometheus integration where needed, and Network Watcher. Microsoft’s Azure Monitor architecture explicitly positions Azure Monitor as the central data platform for logs, metrics, application telemetry, and Prometheus-style signals.
Telemetry lands in the right sink
- Near-real-time time-series signals go to Metrics
- Searchable operational and diagnostic records go to Log Analytics
- Application traces and request/dependency data go through Application Insights, ideally workspace-based
Correlation happens at query and trace level
- Operation IDs, trace IDs, request IDs, user/session identifiers, deployment metadata, and environment tags are attached to telemetry.
- Cross-service analysis happens through KQL in Log Analytics and Application Insights-backed logs.
Actioning and visualization sit on top
- Alerts route through Azure Monitor Alerts + Action Groups
- Investigative views live in Workbooks
- Executive or operational rollups live in Azure Dashboards
- Service-specific views come from VM Insights, Container Insights, and Network Watcher

The critical path is this:

Instrumentation quality -> diagnostic coverage -> correlation identifiers -> actionable alerts

If you get the first two right but skip correlation, you collect noise.
If you get correlation right but alert on the wrong signals, you still wake people up for nothing.

Here is the logical flow I use.

flowchart LR
    A[Azure Workloads] --> B[Application Telemetry]
    A --> C[Platform Metrics]
    A --> D[Resource Logs]
    A --> E[Network Diagnostics]

    B --> F[Application Insights]
    C --> G[Azure Monitor Metrics]
    D --> H[Log Analytics Workspace]
    E --> H

    F --> H
    G --> I[Alerts]
    H --> I
    H --> J[Workbooks]
    H --> K[Dashboards]
    F --> J

    I --> L[Action Groups]
    L --> M[Email SMS Webhook Function Logic App]

Deep Dive: Instrumentation, Correlation, and Telemetry Design

This is the part most teams underinvest in.

Tools do not create observability. Instrumentation does.

1) Instrumentation: start with contracts, not SDK defaults

My minimum telemetry contract per request is:

traceId
operationId
serviceName
serviceVersion
environment
tenantId or customer segment if applicable
correlationId from upstream gateway
business dimensions like orderId, workflowId, or caseId only when cardinality is controlled

What I do not log freely:

PII
raw prompts or payloads without redaction rules
high-cardinality junk like full URLs with random query strings
debug-level logs in production without time-bound purpose

A practical Node.js example with Azure Monitor OpenTelemetry:

npm install @azure/monitor-opentelemetry @opentelemetry/api express

import express from "express";
import { useAzureMonitor } from "@azure/monitor-opentelemetry";
import { context, trace } from "@opentelemetry/api";

useAzureMonitor({
  azureMonitorExporterOptions: {
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING
  },
  enableLiveMetrics: true
});

const app = express();

app.get("/healthz", (_req, res) => {
  res.status(200).send("ok");
});

app.get("/orders/:id", async (req, res) => {
  const span = trace.getTracer("orders-api").startSpan("get-order");
  try {
    const orderId = req.params.id;
    span.setAttribute("app.order_id", orderId);
    span.setAttribute("deployment.environment", process.env.NODE_ENV || "prod");

    // Simulate dependency work
    await new Promise(resolve => setTimeout(resolve, 120));

    res.json({ id: orderId, status: "processed" });
  } catch (err) {
    span.recordException(err);
    res.status(500).json({ error: "unexpected_failure" });
  } finally {
    span.end();
  }
});

app.listen(3000, () => {
  console.log("Orders API listening on port 3000");
});

The gotcha here is cardinality. If you stamp every telemetry item with user-level or request-level fields that explode uniqueness, your queries slow down and your data loses analytical value.

2) Collection: enable platform telemetry deliberately

A lot of Azure teams assume the portal overview equals observability. It does not.

For example, on App Service or other PaaS resources, I usually provision:

a Log Analytics workspace
a workspace-based Application Insights resource
diagnostic settings forwarding logs and metrics
naming conventions that reflect environment and landing zone

resourceGroup="rg-monitoring-prod"
location="eastus"
workspaceName="law-prod-eastus-01"
appInsightsName="appi-orders-prod-01"
webAppName="app-orders-prod-01"
subscriptionId=$(az account show --query id -o tsv)

workspaceId=$(az monitor log-analytics workspace create \
  --resource-group $resourceGroup \
  --workspace-name $workspaceName \
  --location $location \
  --query id -o tsv)

az monitor app-insights component create \
  --app $appInsightsName \
  --location $location \
  --resource-group $resourceGroup \
  --workspace $workspaceName

Then wire diagnostic settings:

az monitor diagnostic-settings create \
  --name "send-to-law" \
  --resource "/subscriptions/$subscriptionId/resourceGroups/$resourceGroup/providers/Microsoft.Web/sites/$webAppName" \
  --workspace $workspaceId \
  --logs '[
    {"category":"AppServiceHTTPLogs","enabled":true},
    {"category":"AppServiceConsoleLogs","enabled":true},
    {"category":"AppServiceAppLogs","enabled":true},
    {"category":"AppServiceAuditLogs","enabled":true}
  ]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

3) Correlation: where monitoring becomes useful

This is the layer that separates “we have logs” from “we can investigate.”

For distributed workloads, I want to answer these in one query session:

Which request path failed?
Which downstream dependency was slow?
Was there a deployment in the preceding 15 minutes?
Was the issue isolated to one node, one availability zone, one region, or one tenant?
Did network or platform symptoms precede the app failure?

A simple KQL query to correlate request failures with dependency pain:

let failedRequests =
    requests
    | where timestamp > ago(30m)
    | where success == false
    | project operation_Id, requestName=name, requestDuration=duration, resultCode, cloud_RoleName, timestamp;
let slowDependencies =
    dependencies
    | where timestamp > ago(30m)
    | where duration > 1000ms
    | project operation_Id, dependencyTarget=target, dependencyName=name, dependencyDuration=duration, dependencySuccess=success;
failedRequests
| join kind=leftouter slowDependencies on operation_Id
| order by timestamp desc

This is why I push teams to standardize operation_Id propagation across services. Without it, cross-tier analysis degrades into timestamp guesswork.

4) Visualization: build for two audiences

I usually create two classes of views:

Operational views

On-call dashboard
Service health workbook
Dependency failure workbook
Top error signatures
SLO burn or error budget approximation if the team is mature enough

Leadership views

Availability trend
P95/P99 latency
Incident count by service
Mean time to detect
Mean time to recover
Cost of telemetry by workload

What I Got Wrong the First Time

The first time I built a “complete” monitoring setup for a distributed workload, I over-collected and under-modeled.

I had metrics, logs, traces, dashboards, and alerts. On paper it looked mature. In practice, it was noisy and expensive. The root problem was that I had not made enough decisions up front about:

table plans
retention
which logs were operationally valuable
which custom dimensions were actually worth indexing
which alerts mapped to customer impact

How to detect this in production:

ingestion jumps week over week without a matching reliability benefit
workbooks take too long to load
on-call ignores alerts because too many are low signal
post-incident reviews keep saying “insufficient context”
teams export data to spreadsheets because native analysis is too cluttered

The fix was architectural, not operational:

Redefine telemetry classes:
- health signals
- investigative signals
- audit signals
- debug signals
Apply different retention and table strategies
Cut custom dimensions that had no query value
Move from per-resource alert sprawl to service-level alert design
Standardize workbook templates across teams
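
One way to make the telemetry classes enforceable is to write them down as data that pipelines and reviews can check. The sketch below is an assumption about how that could look: the four class names come from the list above, "Analytics" and "Basic" are Log Analytics table plans, and the retention numbers and the alertable flag are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryPolicy:
    table_plan: str      # Log Analytics table plan, e.g. "Analytics" or "Basic"
    retention_days: int  # placeholder value; set per compliance and cost needs
    alertable: bool      # whether alert rules may be built on this class


# Hypothetical mapping for the four telemetry classes.
POLICIES = {
    "health": TelemetryPolicy("Analytics", 90, True),
    "investigative": TelemetryPolicy("Analytics", 30, False),
    "audit": TelemetryPolicy("Analytics", 365, False),
    "debug": TelemetryPolicy("Basic", 30, False),
}


def policy_for(signal_class):
    """Look up the policy for a class, rejecting unclassified telemetry."""
    try:
        return POLICIES[signal_class]
    except KeyError:
        raise ValueError(f"unclassified telemetry: {signal_class!r}") from None
```

The useful property is the error path: a new log source that nobody has classified fails loudly instead of defaulting to the most expensive plan and the longest retention.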

Performance & Cost Considerations

This is the part architects often skip, and it is where bad designs get punished.

Baseline performance assumptions

For most Azure production workloads, I think in three telemetry lanes:

Metrics for low-latency health and alerting
Logs for diagnostics and forensic detail
Traces for request-path correlation

A practical baseline:

At 100 RPS, full tracing may still be manageable for a critical API if retention is controlled.
At 1,000+ RPS, capturing 100% of traces with no sampling is usually a bad default unless the workload is narrow and high-value.
For noisy container platforms, network flow logs can explode in volume quickly; Azure’s AKS network monitoring guidance explicitly warns about substantial log volume, throttling, and log loss if not tuned.
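
To see why the RPS thresholds above matter, a back-of-the-envelope estimate helps. This is a rough sketch: the five telemetry items per request and 1,000 bytes per item are assumed placeholder figures, and real values should come from measuring your own workspace.

```python
def daily_ingestion_gb(rps, items_per_request, bytes_per_item, sampling_rate=1.0):
    """Estimate daily telemetry volume in GB (10^9 bytes) for one service."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    items_per_day = rps * 86_400 * items_per_request * sampling_rate
    return items_per_day * bytes_per_item / 1e9


# Assumed shape: 5 telemetry items per request, 1,000 bytes each.
full_at_100_rps = daily_ingestion_gb(100, 5, 1000)
sampled_at_1000_rps = daily_ingestion_gb(1000, 5, 1000, sampling_rate=0.1)
```

Under these assumptions both cases land at roughly 43 GB per day: a 1,000 RPS service sampled at 10% ingests about as much as a 100 RPS service traced in full. Unsampled, the larger service would be about ten times that.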

Cost levers that actually matter

The biggest levers are usually:

Ingestion volume
Retention duration
Table plan choice
Sampling
Number of duplicated sinks
Verbose diagnostic categories enabled by default

My practical recommendations:

Keep one central workspace per environment or per compliance boundary, not one per resource.
Use separate workspaces only when compliance, tenancy isolation, or operational ownership justifies the trade-off.
Sample application traces intelligently.
Put noisy debugging tables on lower-cost plans where query patterns allow it.
Create daily budget alerts around the monitoring resource group, not just the product workload.
Review top-ingesting tables monthly.
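
On sampling traces intelligently: one common approach is to make the keep/drop decision a deterministic function of the trace ID, so every service keeps or drops the same traces and sampled requests remain fully correlated. The sketch below assumes that approach. The keep_trace name and the always-keep-errors rule are illustrative choices, not a description of how any specific SDK implements sampling.

```python
import hashlib


def keep_trace(trace_id, sampling_rate, is_error=False):
    """Decide whether to keep all telemetry belonging to a trace.

    The decision depends only on trace_id and the rate, so every service
    that applies the same rate keeps or drops the same traces. Errors are
    always kept (a policy assumption for this sketch).
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against a scaled threshold.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sampling_rate * 2**64)
```

One caveat: whether a request failed is often not known when the trace starts, so the error override only helps where the decision is made after the outcome is known.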

Example: create an action group and a scheduled query alert.

resourceGroup="rg-monitoring-prod"
workspaceResourceId="/subscriptions/<sub-id>/resourceGroups/rg-monitoring-prod/providers/Microsoft.OperationalInsights/workspaces/law-prod-eastus-01"

# Action group that emails the platform on-call alias.
az monitor action-group create \
  --name "ag-platform-oncall" \
  --resource-group "$resourceGroup" \
  --short-name "platcall" \
  --action email PlatformOnCall platform-oncall@contoso.com

# Log alert scoped to the workspace, so the query uses the workspace-based
# Application Insights table (AppRequests) and its column names.
# The rule counts the rows the query returns; --window-size bounds the time range.
az monitor scheduled-query create \
  --name "High 5xx Error Rate" \
  --resource-group "$resourceGroup" \
  --scopes "$workspaceResourceId" \
  --condition "count 'Failed5xx' > 25" \
  --condition-query Failed5xx="AppRequests | where toint(ResultCode) >= 500" \
  --description "Triggers when 5xx volume exceeds threshold in 5 minutes" \
  --evaluation-frequency "5m" \
  --window-size "5m" \
  --severity 2 \
  --action-groups "/subscriptions/<sub-id>/resourceGroups/rg-monitoring-prod/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

When does this architecture break?

It does not “break” all at once. It degrades in predictable ways:

At high scale: log ingestion cost climbs faster than value if you do not sample and prune.
Across regions: correlation becomes harder if naming, tagging, and workspace strategy are inconsistent.
Across teams: every team creates its own alerts and dashboards, and your platform loses standardization.
In AI-heavy workloads: prompt/response telemetry creates privacy, size, and cardinality problems if ungoverned.

When NOT to Use This

I recommend this architecture for most Azure-hosted enterprise workloads, but not universally.

Do not use the full stack if:

your app is a small internal tool with low criticality and one service boundary
your workload is ephemeral and cost sensitivity is extreme
your team does not have the operational maturity to maintain custom dashboards, queries, and alert tuning
you only need basic uptime checks and platform metrics

If your workload looks like this:

one App Service
one database
low traffic
one team
tolerant to short interruptions

Then don’t over-engineer it. Start with:

Azure Monitor metrics
basic Application Insights
two or three critical alerts
one workbook
a short retention period

The bigger anti-pattern I want readers to avoid is this: buying more monitoring features instead of designing a monitoring architecture.

Tools do not fix weak signal design.

Key Takeaways

Design monitoring as a separate architecture plane with its own data flows, storage, access controls, and cost model.
Start with four layers: instrumentation, collection/storage, correlation/analysis, and visualization/action.
My default Azure baseline is Azure Monitor + Log Analytics + Application Insights + diagnostic settings + alerts + Workbooks, then add VM, container, or network insights only when justified.
Correlation IDs, environment tags, service version metadata, and disciplined custom dimensions matter more than having 50 dashboards.
The fastest way to waste money is to ingest everything. The fastest way to waste engineering time is to alert on everything.
If you want a reliable monitoring system, optimize for investigation speed, signal quality, and cost control together.
