Microsoft Dev: Azure Cosmos DB Conf 2026 Recap: Lessons from Production

You provisioned Azure Cosmos DB with ample Request Units (RUs), yet your application's P99 latency is creeping up and throttling errors are becoming more frequent. Sound familiar? This isn't a capacity problem; it's a design problem. The Azure Cosmos DB Conference 2026 made one thing brutally clear: the platform puts a harsh spotlight on your data modeling and partition key choices.

The Unseen Bottleneck: Partition Keys and Skewed Distribution

The single most impactful decision you make for Cosmos DB is the partition key. Forget throwing more RUs at the problem; if your partition key leads to skewed distribution, you’re battling hot partitions. This results in 100% RU utilization on some physical partitions while others languish, leading to relentless throttling and unacceptable latency spikes, even if your aggregate RU usage appears low.
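To see why aggregate RU usage can look healthy while one partition burns, here is a minimal, self-contained Python sketch of a skewed workload. Plain MD5 hashing stands in for Cosmos DB's actual hash partitioning, and all names (tenants, partition counts) are illustrative:

```python
import hashlib
from collections import Counter

def physical_partition(key: str, partitions: int = 4) -> int:
    """Toy stand-in for Cosmos DB's hash partitioning: map a
    partition key value to one of N physical partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % partitions

# Simulated request log: one tenant generates 90% of the traffic.
requests = ["tenant-42"] * 900 + [f"tenant-{i}" for i in range(100)]

load = Counter(physical_partition(k) for k in requests)
for p in sorted(load):
    print(f"partition {p}: {load[p]} requests")
```

Only 4 partition slots exist here, but the shape of the problem is the same at any scale: the partition holding `tenant-42` absorbs the bulk of the requests and throttles at its RU share, while the others sit mostly idle.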

Diagnosing the Dreaded Hot Partition

Azure Monitor is your primary battlefield. Look for:

  • Normalized RU Consumption (%) By PartitionKeyRangeID: This metric is gold. High values on specific PartitionKeyRangeIDs scream “hot partition.”
  • PhysicalPartitionThroughput: This shows the actual RU/s being consumed by individual physical partitions.

The fundamental truth from the 2026 discussions is that your partition key MUST align with your most frequent access patterns and distribute load evenly. Common pitfalls include using a userId when one user generates a disproportionate share of the traffic, or ignoring the 20 GB size limit on each logical partition.
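One common mitigation for the hot-user pitfall is a synthetic partition key: append a bounded random suffix to the hot value so its writes spread across several logical partitions, at the cost of fanning reads out across the suffixes. A hedged sketch (the bucket count and naming are illustrative):

```python
import random

def synthetic_key(user_id: str, buckets: int = 10) -> str:
    """Spread one hot user's writes across several logical
    partitions by appending a bounded random suffix.
    Reads for that user must then query all `buckets` suffixes."""
    return f"{user_id}-{random.randrange(buckets)}"

# 1,000 writes for one hot user now land on up to 10 logical partitions.
keys = {synthetic_key("user-1") for _ in range(1000)}
print(sorted(keys))
```

The suffix must be bounded and known, so reads can enumerate it; a truly random unbounded suffix would make the user's data unqueryable by key.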

Strategic Partitioning: Beyond the Single Key

When a single partition key can't guarantee even distribution, or when dealing with massive tenants, Hierarchical Partition Keys are your savior. These allow up to three levels of subpartitioning; because the 20 GB limit applies to the full key path rather than the top-level value alone, data sharing a single prefix (say, one large tenant) can grow well past 20 GB, and you get finer-grained control over data distribution.
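As a rough illustration, hashing the full three-level path (rather than just the top-level value) is what lets one tenant's data spread across many partitions while prefix-scoped queries remain routable. This is a toy model, not Cosmos DB's actual MultiHash algorithm; with the Python SDK the real declaration is, to the best of my knowledge, along the lines of `PartitionKey(path=[...], kind="MultiHash")`:

```python
import hashlib

def route(tenant: str, user: str, session: str, partitions: int = 8) -> int:
    """Toy stand-in for hierarchical partitioning: hash the full
    three-part key path, so documents sharing a tenant (or even a
    tenant + user) prefix can still land on different partitions."""
    path = f"{tenant}/{user}/{session}"
    return int(hashlib.md5(path.encode()).hexdigest(), 16) % partitions

# One tenant, one user, many sessions: no single 20 GB ceiling.
targets = {route("contoso", "alice", f"s{i}") for i in range(100)}
print(f"one tenant's sessions span {len(targets)} of 8 partitions")
```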

Data Modeling: Embracing Denormalization and Avoiding Pitfalls

Cosmos DB thrives on flexibility, but that doesn’t mean a free-for-all.

  • Denormalize and Embed: For one-to-few or contained relationships, embed related data directly within the parent document. This minimizes costly cross-partition queries.
  • Avoid Unbounded Arrays: Large, unbounded arrays can lead to large documents and potential RU spikes during updates.
  • Multiple Containers for Specific Access Patterns: Don’t force one container to serve all purposes. Use the Change Feed to propagate data to specialized containers optimized for different query needs, drastically reducing cross-partition query overhead.
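The embed-vs-reference trade-off is easiest to see in the document shape itself. A sketch of a denormalized order document, with all field names illustrative: the bounded one-to-few relationship (line items) is embedded, so a single point read by `id` and partition key returns everything needed to render the order:

```python
# Embedded ("denormalized") shape: one point read fetches the order
# and its line items together, instead of a cross-partition query
# stitching separate item documents back onto the order.
order = {
    "id": "order-1001",
    "customerId": "cust-7",        # also the partition key value
    "status": "shipped",
    "items": [                     # bounded one-to-few: safe to embed
        {"sku": "A-100", "qty": 2, "price": 9.99},
        {"sku": "B-205", "qty": 1, "price": 24.50},
    ],
}

# Derived values come for free, with no extra request charge.
total = sum(i["qty"] * i["price"] for i in order["items"])
print(f"order total: {total:.2f}")
```

The same shape becomes an anti-pattern if `items` is unbounded (an audit log, say): the document grows without limit and every update rewrites and re-indexes the whole thing.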

Throughput Management Beyond Provisioning

Beyond basic RU provisioning, proactive management is key:

  • Monitor logical partition sizes and set alerts. When a logical partition approaches its 20 GB limit, it’s a precursor to failed writes, and splitting the data out after the fact is painful.
  • While not ideal, skewed workloads can sometimes be mitigated by manually redistributing throughput across physical partitions using PowerShell or the Azure CLI. This is a workaround, not a fix for bad design.
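The idea behind manual redistribution can be sketched in a few lines: split the container's provisioned RU/s across physical partitions in proportion to observed load, instead of the default even split. This is a hedged illustration of the arithmetic only; the real operation is performed through preview Azure CLI / PowerShell commands, and all names and numbers below are assumptions:

```python
def redistribute(total_rus: int, load: dict[str, int]) -> dict[str, int]:
    """Split a container's provisioned RU/s across physical partitions
    in proportion to each partition's observed request count."""
    observed = sum(load.values())
    return {p: round(total_rus * n / observed) for p, n in load.items()}

# Observed request counts per physical partition (skewed workload).
plan = redistribute(10_000, {"p0": 700, "p1": 200, "p2": 100})
print(plan)  # the hot partition p0 gets the lion's share of RU/s
```

Note what this does not fix: a single logical partition still cannot exceed the throughput of the one physical partition it lives on, which is why this is a mitigation rather than a cure.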

The Ecosystem and The Hard Truths

Discussions on platforms like Hacker News (even years later) reveal a mixed sentiment. Early criticisms often cited misleading “multi-model” marketing (Cosmos DB excels primarily as a document DB), sparse documentation, and cost concerns. Alternatives like Amazon DynamoDB, MongoDB Atlas, and Google Cloud Firestore are frequently mentioned.

The critical takeaway from production deployments is that the 20 GB logical partition and 10,000 RU/s per logical partition limits are non-negotiable hard limits. They necessitate careful design from day one.

When to Reconsider Cosmos DB

If your workload is inherently relational with complex, multi-table joins as a primary requirement, or if you’re seeking a bare-bones, cost-optimized NoSQL solution for extremely low-throughput scenarios, Cosmos DB might be overkill or a poor fit. It demands a deep understanding of your data and access patterns.

The honest verdict? Azure Cosmos DB is a powerhouse for globally distributed, high-scale, real-time, and AI-driven applications with flexible data. But its success is entirely dependent on disciplined data modeling and partition key design that perfectly matches your workload. Increasing RUs will only delay the inevitable confrontation with a flawed design. Cosmos DB doesn’t hide your design problems; it amplifies them. Choose your partition key wisely.

Next post

Cloudflare: Introducing Dynamic Workflows for Durable Execution