Tag Archives: DevOps

Azure APIM POST Request Caching: Cut Backend Load by 70% With a Custom Cache Key

Most engineers assume POST requests cannot be cached. That assumption is costing their teams unnecessary backend load, slower response times, and wasted infrastructure spend — every single day. When I was working on a high-traffic travel booking API on Azure API Management, our backend was getting hammered by identical POST requests with the same payload, dozens of times per minute. The backend processing was heavy. The response never changed for the same input. We were rebuilding the same result over and over.

The fix was custom POST request caching in Azure APIM — using cache-lookup-value and cache-store-value policies with a dynamically built cache key from the request body. After implementing it, backend calls for repeated payloads dropped by over 70%. Response times for cache hits went from 400ms to under 10ms.

In this guide, you will learn exactly how to build a complete Azure APIM POST request caching solution — handling all three real-world content types, building a reliable cache key, and knowing precisely when to use this pattern and when to avoid it.

Why POST Request Caching in Azure APIM Is Worth Your Attention

Azure APIM’s built-in cache-lookup and cache-store policies work out of the box for GET requests because they use the URL as the cache key. However, POST requests send their data in the request body, not the URL. As a result, APIM cannot cache POST responses automatically.

Furthermore, most tutorials only show caching for application/json content types. In real production environments, APIs receive POST requests in three different formats — and your caching policy needs to handle all of them reliably. Specifically, the three content types you will encounter are:

  • application/json — the most common modern API format. Payload arrives as a JSON body.
  • application/x-www-form-urlencoded — used by legacy clients, HTML forms, and many third-party integrations.
  • multipart/form-data — used when clients send both JSON data and file uploads in the same request.

💡 Key principle: APIM never compares request payloads directly. Instead, you extract the fields that drive your response, build a unique string called a cache key, and APIM uses that string as the lookup identifier. Same input fields = same key = cache hit.

When Should You Cache a POST Request in Azure APIM?

Before writing a single line of policy XML, confirm your POST endpoint actually qualifies for caching. Not every POST call should be cached — and caching the wrong one causes serious problems.

Cache these POST calls:

  • Read-heavy search or filter operations sent as POST because the query is too complex for a URL query string — for example, flight searches, product filters, or report generation.
  • Heavy computation endpoints where identical inputs always produce identical outputs — pricing calculations, eligibility checks, or recommendation engines.
  • Third-party API proxies where the upstream call is slow and expensive but the response is stable for a known time window.

Never cache these POST calls:

  • State-changing operations — anything that creates, updates, or deletes data. Caching these means repeated identical requests silently skip the backend, and your data is never written.
  • Payment or transaction endpoints: a repeated payment request answered from cache never reaches the backend, yet the client sees a success response, so the payment is silently missed.
  • Authentication or token endpoints — one-time tokens, OTPs, and session credentials must never be served from cache.
  • Real-time or per-user personalised responses — if the same payload returns different results for different users, a shared cache key will return the wrong user’s data.

⚠️ Rule of thumb: if you could safely call the same POST endpoint 100 times with the same payload and always expect the same response — it is safe to cache. If the 100th call should behave differently from the first — never cache it.

How Azure APIM POST Request Caching Works

Understanding the flow before writing the policy saves a lot of debugging time. Here is what happens on the first call (cache miss) and the second call (cache hit):

First call — cache miss: The request arrives at APIM. The inbound policy reads the request body, extracts the relevant fields, and builds a unique cacheKey string. Next, cache-lookup-value checks whether that key exists in the cache. Because it is the first call, nothing is found. Therefore, APIM forwards the request to the backend. The backend returns a response. In the outbound policy, cache-store-value saves the response body against the cacheKey for a set duration.

Second call — cache hit: The same payload arrives. APIM builds the identical cacheKey string. This time, cache-lookup-value finds the stored response. As a result, APIM immediately returns the cached response using return-response — without ever touching the backend. The entire round-trip is eliminated.
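
The miss-then-hit flow can be modelled outside APIM in a few lines. This is a Python illustration only, not APIM code: ValueCache, handle_post, and fake_backend are hypothetical names, and the in-memory dictionary merely stands in for whatever store sits behind cache-lookup-value and cache-store-value.

```python
import time


class ValueCache:
    """In-memory stand-in for the store behind cache-lookup-value / cache-store-value."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}            # key -> (expires_at, value)
        self._clock = clock

    def lookup(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]    # an expired entry counts as a miss
            return None
        return value

    def store(self, key, value, duration_seconds):
        self._entries[key] = (self._clock() + duration_seconds, value)


def handle_post(cache, cache_key, call_backend, duration_seconds=172800):
    """Return (body, served_from_cache) for a single request."""
    cached = cache.lookup(cache_key)
    if cached is not None:
        return cached, True                               # hit: backend is skipped
    status, body = call_backend()                         # miss: forward to backend
    if status == 200:
        cache.store(cache_key, body, duration_seconds)    # store successful responses only
    return body, False


backend_calls = []


def fake_backend():
    """Hypothetical backend that records each call it receives."""
    backend_calls.append(1)
    return 200, '{"result": "ok"}'


cache = ValueCache()
first = handle_post(cache, "v2_mycustomcache-example", fake_backend)
second = handle_post(cache, "v2_mycustomcache-example", fake_backend)
print(first, second, len(backend_calls))
```

The second call returns the stored body and the backend counter stays at one, which is the behaviour the policy aims for.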

Understanding the Cache Key Design

Your cache key is the most critical part of the entire solution. Get it wrong and you either serve stale data to the wrong callers or fragment your cache so badly that you never get a hit. Here is the key design used in the complete policy below:

cacheKey = "v2_mycustomcache-" + values of all JSON fields, ordered by field name, joined by "-"
// Example payload: { "Channel": "DESKTOP", "Type": "TRAIN", "Unit": "BU-UNIT" }
// Field names in sorted order: Channel, Type, Unit
// Result: v2_mycustomcache-DESKTOP-TRAIN-BU-UNIT

Three important design decisions are built into this key. First, the v2_ prefix acts as a version tag — when you need to invalidate all cached responses after a backend change, simply change this prefix and all existing keys instantly become misses. Second, properties are sorted alphabetically with .OrderBy(p => p.Name) — this ensures the key is identical regardless of the order fields arrive in the payload. Third, all field values are joined into a single flat string — making the key readable in logs and easy to debug.
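
As a rough model of that key design, the sketch below rebuilds the same logic in Python. It is not the policy itself: build_cache_key is a hypothetical helper, the flight-search payload is invented for illustration, and it assumes flat string values and plain ordinal sorting, which may order mixed-case field names differently from C#'s default string comparison.

```python
def build_cache_key(fields, prefix="v2_mycustomcache-"):
    """Join the values of `fields`, ordered by field name, behind a versioned prefix."""
    ordered_names = sorted(fields)                      # sort by field name, not by value
    values = ["" if fields[name] is None else str(fields[name]) for name in ordered_names]
    return prefix + "-".join(values)


# Hypothetical payloads: same fields, different arrival order.
payload_a = {"origin": "LHR", "destination": "JFK", "cabin": "ECONOMY"}
payload_b = {"cabin": "ECONOMY", "origin": "LHR", "destination": "JFK"}

key_a = build_cache_key(payload_a)
key_b = build_cache_key(payload_b)
key_v3 = build_cache_key(payload_a, prefix="v3_mycustomcache-")

print(key_a)
print(key_a == key_b)     # field order does not change the key
print(key_a == key_v3)    # a new prefix produces a different key
```

One property of this design is worth knowing: because only values are joined, two different payloads whose values contain hyphens can produce the same string (for example "x-y" followed by "z" versus "x" followed by "y-z"). If your values can contain the separator, including field names in the key or choosing a separator that cannot occur in the data are possible mitigations.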

The Complete Azure APIM POST Request Caching Policy

This policy handles all three content types in a single implementation. Copy it into your APIM policy editor and adjust the cache key prefix and duration to match your use case:

<policies>
    <inbound>
        <base />
        <!-- Step 1: Clear cookies to prevent cache pollution -->
        <set-header name="Cookie" exists-action="delete" />
        <!-- Step 2: Read and preserve the full request body -->
        <set-variable name="requestBodyRaw"
            value="@(context.Request.Body.As<string>(preserveContent: true))" />
        <set-variable name="isCacheable" value="@(false)" />
        <!-- Step 3: Parse body by Content-Type -->
        <choose>
            <!-- Handler 1: application/x-www-form-urlencoded -->
            <!-- Assumes a single form field whose value is URL-encoded JSON -->
            <when condition="@(context.Request.Headers
                    .GetValueOrDefault("Content-Type","")
                    .ToLower().Contains("application/x-www-form-urlencoded"))">
                <set-variable name="isCacheable" value="@(true)" />
                <set-variable name="jsonEncoded"
                    value="@(((string)context.Variables["requestBodyRaw"])
                        .Substring(((string)context.Variables["requestBodyRaw"])
                        .IndexOf('=') + 1))" />
                <set-variable name="jsonDecoded"
                    value="@(System.Net.WebUtility.UrlDecode(
                        (string)context.Variables["jsonEncoded"]))" />
                <set-variable name="innerRequest"
                    value="@((JObject)Newtonsoft.Json.JsonConvert
                        .DeserializeObject(
                            (string)context.Variables["jsonDecoded"]))" />
            </when>
            <!-- Handler 2: multipart/form-data -->
            <!-- Only the part named jsonRequest feeds the cache key -->
            <when condition="@(context.Request.Headers
                    .GetValueOrDefault("Content-Type","")
                    .ToLower().StartsWith("multipart/form-data"))">
                <set-variable name="isCacheable" value="@(true)" />
                <set-variable name="boundary" value="@{
                    string ct = context.Request.Headers
                        .GetValueOrDefault("Content-Type", "");
                    int idx = ct.IndexOf("boundary=");
                    if (idx >= 0)
                        return ct.Substring(idx + 9).Split(';')[0].Trim();
                    return "";
                }" />
                <set-variable name="jsonPart" value="@{
                    string body = (string)context.Variables["requestBodyRaw"];
                    string boundary = (string)context.Variables["boundary"];
                    if (string.IsNullOrEmpty(boundary)) return "";
                    string delim = "--" + boundary;
                    int start = body.IndexOf("name=\"jsonRequest\"");
                    if (start >= 0) {
                        int cs = body.IndexOf("\r\n\r\n", start) + 4;
                        int ce = body.IndexOf(delim, cs);
                        if (cs > 3 && ce > cs)
                            return body.Substring(cs, ce - cs).Trim();
                    }
                    return "";
                }" />
                <set-variable name="innerRequest" value="@{
                    string json = (string)context.Variables["jsonPart"];
                    if (!string.IsNullOrEmpty(json)) {
                        try { return (JObject)Newtonsoft.Json.JsonConvert
                                .DeserializeObject(json); }
                        catch { return new JObject(); }
                    }
                    return new JObject();
                }" />
            </when>
            <!-- Handler 3: application/json -->
            <!-- Expects the outer JSON to carry the inner request as a string field named jsonRequest -->
            <when condition="@(context.Request.Headers
                    .GetValueOrDefault("Content-Type","")
                    .ToLower().Contains("application/json"))">
                <set-variable name="isCacheable" value="@(true)" />
                <set-variable name="requestBody"
                    value="@(context.Request.Body
                        .As<JObject>(preserveContent: true))" />
                <set-variable name="jsonRequestString"
                    value="@((string)((JObject)context.Variables["requestBody"])
                        ["jsonRequest"] ?? "")" />
                <set-variable name="jsonRequestUnescaped"
                    value="@(System.Text.RegularExpressions.Regex
                        .Unescape((string)context.Variables
                            ["jsonRequestString"]))" />
                <set-variable name="innerRequest" value="@{
                    try { return (JObject)Newtonsoft.Json.JsonConvert
                            .DeserializeObject((string)context.Variables
                                ["jsonRequestUnescaped"]); }
                    catch { return new JObject(); }
                }" />
            </when>
        </choose>
        <!-- Step 4: Build cache key and look up cache -->
        <choose>
            <when condition="@((bool)context.Variables["isCacheable"])">
                <set-variable name="cacheKey" value="@("v2_mycustomcache-" +
                    string.Join("-",
                        ((JObject)context.Variables["innerRequest"])
                            ?.Properties()
                            .OrderBy(p => p.Name)
                            .Select(p => p.Value?.ToString() ?? "")
                        ?? new string[] { })
                )" />
                <cache-lookup-value
                    key="@((string)context.Variables["cacheKey"])"
                    variable-name="cacheResponse" />
                <choose>
                    <when condition="@(context.Variables
                            .ContainsKey("cacheResponse")
                            && context.Variables["cacheResponse"] != null)">
                        <return-response>
                            <set-header name="Content-Type"
                                exists-action="override">
                                <value>application/json</value>
                            </set-header>
                            <set-body>@((string)context.Variables["cacheResponse"])</set-body>
                        </return-response>
                    </when>
                </choose>
            </when>
        </choose>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
        <!-- Step 5: Store response in cache on 200 OK -->
        <choose>
            <when condition="@((bool)context.Variables["isCacheable"]
                    && context.Response != null
                    && context.Response.Body != null
                    && context.Response.StatusCode == 200)">
                <cache-store-value
                    key="@((string)context.Variables["cacheKey"])"
                    value="@(context.Response.Body
                        .As<string>(preserveContent: true))"
                    duration="172800" />
            </when>
        </choose>
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>

Policy Walkthrough: What Each Section Does

Step 1 — Clear the Cookie Header

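Cookies are cleared from the inbound request before any caching logic runs. The cache key is built only from body fields, so it cannot distinguish one session from another. If the backend varied its response by session cookie, that response would be stored under a shared key and served to other users. Clearing the header before the request is forwarded closes that gap.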

Step 2 — Read and Preserve the Request Body

The request body is read once and stored in requestBodyRaw with preserveContent: true. This is critical — in APIM, reading a request body consumes it by default. Without preserveContent: true, the backend receives an empty body. Furthermore, the isCacheable flag starts as false — it only becomes true if a supported Content-Type is detected.

Step 3 — Content-Type Detection and Azure APIM POST Caching Payload Parsing

The choose block detects which Content-Type was sent and parses the payload accordingly. Each handler extracts the same end result — a JObject named innerRequest containing the relevant fields. As a result, Step 4 (key building) works identically regardless of how the payload arrived.
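
To make the three parsing paths concrete, here is a rough Python equivalent of what the handlers do. It is a sketch under assumptions rather than a port of the policy: extract_inner_request is a hypothetical name, the field name jsonRequest is taken from the policy, the sample bodies are invented, and failures return an empty dict in every branch.

```python
import json
from urllib.parse import quote_plus, unquote_plus


def extract_inner_request(content_type, body, field="jsonRequest"):
    """Return the inner request as a dict, or {} when it cannot be extracted."""
    ct = content_type.lower()
    try:
        if "application/x-www-form-urlencoded" in ct:
            # Everything after the first "=" is treated as URL-encoded JSON.
            return json.loads(unquote_plus(body.split("=", 1)[1]))
        if ct.startswith("multipart/form-data"):
            idx = ct.find("boundary=")
            if idx < 0:
                return {}
            # Slice the original header: boundary values are case-sensitive.
            boundary = content_type[idx + len("boundary="):].split(";")[0].strip()
            start = body.find('name="%s"' % field)
            header_end = body.find("\r\n\r\n", start) if start >= 0 else -1
            if header_end < 0:
                return {}
            content_start = header_end + 4
            content_end = body.find("--" + boundary, content_start)
            if content_end < 0:
                return {}
            return json.loads(body[content_start:content_end].strip())
        if "application/json" in ct:
            # The outer JSON carries the inner request as a JSON-encoded string.
            return json.loads(json.loads(body)[field])
    except (ValueError, KeyError, IndexError, TypeError):
        return {}
    return {}


inner = {"Channel": "DESKTOP", "Type": "TRAIN", "Unit": "BU-UNIT"}
inner_json = json.dumps(inner)

form_body = "jsonRequest=" + quote_plus(inner_json)
multipart_body = (
    "--B1\r\n"
    'Content-Disposition: form-data; name="jsonRequest"\r\n'
    "\r\n" + inner_json + "\r\n--B1--\r\n"
)
json_body = json.dumps({"jsonRequest": inner_json})

from_form = extract_inner_request("application/x-www-form-urlencoded", form_body)
from_multipart = extract_inner_request("multipart/form-data; boundary=B1", multipart_body)
from_json = extract_inner_request("application/json; charset=utf-8", json_body)
print(from_form == from_multipart == from_json == inner)
```

All three calls yield the same dictionary, which is why a single key-building step can serve every content type. Note that for multipart requests only the jsonRequest part is inspected, so any uploaded file content does not influence the key.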

Step 4 — Build Cache Key and Azure APIM POST Cache Lookup

The cache key is built by sorting all properties of innerRequest alphabetically and joining their values with a hyphen. Next, cache-lookup-value checks whether this exact key string exists in the cache. If a hit is found, return-response immediately returns the cached body — the backend is never called.

Step 5 — Store the Response After a Backend Call

In the outbound policy, if the backend returned a 200 OK and the request was cacheable, cache-store-value saves the response body against the cache key for 172,800 seconds — that is 48 hours. Consequently, any identical request within the next 48 hours gets served from cache instantly.

💡 Adjust the duration value to match your data freshness requirements. For real-time pricing, use 300 seconds (5 minutes). For stable reference data, 172800 seconds (48 hours) or more is appropriate.

Common Mistakes That Break Azure APIM POST Request Caching

  • Not using preserveContent: true when reading the body. This is the most common mistake. Without it, APIM reads and discards the request body — the backend receives nothing and returns a 400 or 500 error. Always set preserveContent: true on every Body.As<>() call.
  • Building a cache key that is too broad. If your key only uses one field from a five-field payload, requests with different values for the other four fields will all hit the same cache entry and get wrong responses. Include every field that meaningfully changes the response.
  • Building a cache key that is too granular. Conversely, including timestamp or session ID fields in the key means you never get a cache hit. Only include stable, response-determining fields.
  • Caching without checking the status code. The outbound cache-store-value must check context.Response.StatusCode == 200 before storing. Otherwise, error responses get cached and served to subsequent callers — causing hard-to-diagnose intermittent failures.
  • Not versioning the cache key prefix. When your backend response schema changes, you need a way to invalidate all cached entries immediately. Without a version prefix like v2_, old cached responses continue being served. Incrementing the prefix instantly invalidates the entire cache without touching any infrastructure.
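
The "too broad" and "too granular" mistakes both come down to which fields feed the key. One way to keep that choice explicit is an allowlist, sketched below in Python. This is a variation on the article's policy, which joins every property; build_key_from_stable_fields, STABLE_FIELDS, and the Timestamp field are hypothetical.

```python
def build_key_from_stable_fields(payload, stable_fields, prefix="v2_mycustomcache-"):
    """Build the key only from the fields that determine the response."""
    missing = [name for name in stable_fields if name not in payload]
    if missing:
        # Refuse to build a key that silently ignores a response-determining field.
        raise KeyError("payload is missing key fields: " + ", ".join(missing))
    return prefix + "-".join(str(payload[name]) for name in sorted(stable_fields))


STABLE_FIELDS = ("Channel", "Type", "Unit")   # hypothetical allowlist

request_1 = {"Channel": "DESKTOP", "Type": "TRAIN", "Unit": "BU-UNIT",
             "Timestamp": "2024-01-01T10:00:00Z"}
request_2 = {"Channel": "DESKTOP", "Type": "TRAIN", "Unit": "BU-UNIT",
             "Timestamp": "2024-01-01T10:00:05Z"}
request_3 = {"Channel": "MOBILE", "Type": "TRAIN", "Unit": "BU-UNIT",
             "Timestamp": "2024-01-01T10:00:05Z"}

key_1 = build_key_from_stable_fields(request_1, STABLE_FIELDS)
key_2 = build_key_from_stable_fields(request_2, STABLE_FIELDS)
key_3 = build_key_from_stable_fields(request_3, STABLE_FIELDS)
print(key_1 == key_2)   # timestamp differs, key does not
print(key_1 == key_3)   # a response-determining field differs, so the key does too
```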

Internal vs External Redis Cache for Azure APIM POST Caching

The policy above works with both APIM’s built-in internal cache and an external Azure Cache for Redis. Here is when to use each:

  • Internal APIM cache — suitable for development, staging, and single-region deployments. Simple to set up with no extra infrastructure. However, in the classic APIM tiers, internal cache contents do not persist across service updates. The v2 tiers provide persistent built-in cache.
  • External Azure Cache for Redis — recommended for production, multi-region, or high-throughput deployments. Provides persistence, higher capacity, and cache sharing across multiple APIM units. For production workloads, connecting an external Azure Cache for Redis is usually the better option — create a Redis instance in the same region as your APIM instance, then go to APIM → External cache → Add and provide the Redis connection string.

Conclusion: Azure APIM POST Request Caching Done Right

Azure APIM POST request caching is one of the highest-impact performance optimisations available in the APIM policy toolkit — and one of the most underused. By combining careful payload parsing across all three content types with a well-designed cache key, you can transform heavy POST endpoints into near-instant responses for repeated inputs. Furthermore, the versioned key prefix gives you a clean invalidation mechanism whenever your backend schema changes.

In summary, the five things that make this work reliably are: always preserve the request body, handle all three content types, sort your key fields alphabetically, only store on 200 OK responses, and version your cache key prefix from day one. Apply this pattern only to idempotent read-heavy POST operations — and never to state-changing or security-sensitive endpoints.

Quick Reference

  • Read body: always use Body.As<string>(preserveContent: true)
  • Build key: extract fields → sort alphabetically → join with hyphen → prefix with version tag
  • Cache lookup: <cache-lookup-value key="..." variable-name="cacheResponse" />
  • Cache store: <cache-store-value key="..." value="..." duration="172800" />
  • Invalidate all: increment the version prefix in the cache key

Frequently Asked Questions

Q: Does APIM automatically compare the full POST payload to decide on a cache hit?
No. APIM never compares payload content directly. It only checks whether the exact cache key string you generate already exists in the cache store. Your key design entirely determines what counts as a cache hit or miss.

Q: What happens if a field in my payload changes by even one character?
A new cache key is generated, resulting in a cache miss. APIM forwards the request to the backend, gets a fresh response, and stores it under the new key. The old cached entry remains until it expires.

Q: How do I invalidate the cache immediately without waiting for the duration to expire?
The fastest way is to increment the version prefix in your cache key, for example by changing v2_mycustomcache- to v3_mycustomcache-. All existing v2 keys instantly become orphaned and new requests build fresh v3 entries. Alternatively, remove individual entries with the cache-remove-value policy, for example from a dedicated admin operation that receives the key to delete.

Q: Can I use this pattern with Azure Cache for Redis instead of the internal cache?
Yes. The cache-lookup-value and cache-store-value policies work with both APIM’s internal cache and an external Redis cache. No policy changes are needed: with the default caching-type of prefer-external, APIM uses the external cache when one is configured and the internal cache otherwise.

Q: The cached response is being returned but the Content-Type header is wrong. How do I fix it?
In the return-response block, always explicitly set the Content-Type header using set-header. When APIM returns a response directly from cache without hitting the backend, it does not automatically carry forward the original response headers — you must set them manually in the policy.


Your AKS Pod Says “Running” but Your App Is Dying: The PVC Disk Full Fix Takes 15 Minutes

An AKS PVC disk full condition is one of the most deceptive problems in Kubernetes operations. When a PersistentVolumeClaim (PVC) fills to 100% on Azure Kubernetes Service (AKS), the pod using it does not crash immediately — and that is what makes it so dangerous. The pod keeps showing Running in kubectl. No alert fires. Everything looks fine. Then suddenly your application starts throwing cryptic errors that seem completely unrelated to disk space.

This guide covers the fix for any workload running on AKS — whether you are running Solr, PostgreSQL, MongoDB, Elasticsearch, Redis, or any other stateful application that writes data to a PVC. The kubectl commands and the recovery steps are identical regardless of what is running inside the pod.

In our case, the workload was Sitecore with Solr on AKS. The disk-full condition showed up as “this IndexWriter is closed”, a misleading Lucene error that buried the real cause: java.io.IOException: No space left on device. But whether you see that error, a Postgres “could not write to file”, a MongoDB “No space left on device”, or an Elasticsearch “flood stage disk watermark exceeded”, the fix is exactly the same.

In this guide, you will learn how to confirm the root cause, safely expand your Solr PVC on AKS, recover the application, and rebuild the Sitecore index — all without losing any data. In my environment, Sitecore and Solr are deployed as custom Docker images running on AKS.

Prerequisites

  • kubectl configured and connected to your AKS cluster.
  • Permissions to manage PVCs, StatefulSets, Deployments, and pods in your Kubernetes namespace.
  • Solr deployed on AKS using the Azure managed-premium storage class.
  • Access to the Sitecore Control Panel for index management (for the Sitecore-specific recovery step).

Which Workloads Does This Affect?

Any stateful pod that writes data to a PVC can hit this problem. The AKS PVC disk full fix is identical for all of them. What changes is only the application-level error message and the final recovery step. Here are the most common workloads and the errors each one throws when disk space runs out:

  • Solr / Elasticsearch: java.io.IOException: No space left on device; this IndexWriter is closed
  • PostgreSQL: could not write to file base/pgsql_tmp: No space left on device
  • MongoDB: No space left on device: couldn't open file for writing
  • MySQL / MariaDB: ERROR 3 (HY000): Error writing file '/tmp/...' (Errcode: 28 - No space left on device)
  • Redis: MISCONF Redis is configured to save RDB snapshots, but it's currently unable to persist on disk
  • Any custom application pod: write failures, silent data loss, or application-level errors that bury the real No space left on device cause deep in the logs.

💡 The kubectl fix is identical for all workloads — expand the PVC, restart the pod, verify recovery. Only the final application-level recovery step differs per workload.

Why the Error Message Is Always Misleading

The most important thing to understand before you touch anything is this: the top-level error your application throws is almost never the real cause. It is a downstream symptom. Always scroll to the very bottom of the stack trace to find the true root cause.

For Sitecore with Solr, the index rebuild job logs show something like this:

Job started: Index_Update_IndexName=sitecore_jss_web_index
#Exception: System.Reflection.TargetInvocationException
---> SolrNet.Exceptions.SolrConnectionException:
  this IndexWriter is closed
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
Caused by: java.io.IOException: No space left on device

Here is the exact chain of events that causes this error. First, the Solr PVC fills to 100% capacity. Next, Lucene tries to merge index segments during the rebuild, a process that requires significant temporary extra disk space. Because no space is available, the write fails with the IOException. As a result, the IndexWriter closes itself as a safety measure to protect data integrity. After that, every further write hits the closed IndexWriter and Lucene throws AlreadyClosedException, which SolrNet surfaces to Sitecore as a SolrConnectionException. Finally, the rebuild job fails, and it keeps failing on every retry until the underlying disk problem is resolved.
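
Reading to the bottom of the trace is easy to automate. The helper below is a small Python sketch, not part of any Sitecore or Solr tooling: root_cause is a hypothetical name, and it assumes Java-style "Caused by:" lines like the ones in the log above.

```python
def root_cause(log_text):
    """Return the last 'Caused by:' message in a stack trace, or None if there is none."""
    causes = [
        line.split("Caused by:", 1)[1].strip()
        for line in log_text.splitlines()
        if "Caused by:" in line
    ]
    return causes[-1] if causes else None


sample_log = """Job started: Index_Update_IndexName=sitecore_jss_web_index
#Exception: System.Reflection.TargetInvocationException
---> SolrNet.Exceptions.SolrConnectionException:
  this IndexWriter is closed
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
Caused by: java.io.IOException: No space left on device"""

print(root_cause(sample_log))
```

For the sample trace this prints the IOException line, the one that actually points at disk space.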

⚠️ Until you fix the AKS PVC disk full condition, every single index rebuild will fail with the same error — no matter how many times you retry it from Sitecore.

Step-by-Step: AKS PVC Disk Full Fix

Step 1: Confirm the AKS PVC Disk Is Full

Before making any changes, confirm that disk exhaustion is actually the root cause. First, check your PersistentVolumeClaims in the relevant namespace:

kubectl get pvc -n solr

The output (abridged here) will show all PVCs in a Bound status, which looks completely healthy. However, Bound only means the claim is bound to a volume. It tells you nothing about how much space is actually used inside it:

NAME                         STATUS  CAPACITY  STORAGECLASS
solr-leader-disk-2023081617  Bound   10Gi      managed-premium

Next, exec directly into the pod to check actual disk usage:

kubectl exec -it <solr-pod-name> -n solr -- df -h

Look for the mount point where your application stores data — typically /var/solr for Solr, /var/lib/postgresql for Postgres, or /data/db for MongoDB. If you see Use% at 100%, you have confirmed the root cause. Furthermore, grep the pod logs directly for the IOException to be certain:

kubectl logs <solr-pod-name> -n solr | grep -i "no space"
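
If you check many pods, scanning df output by eye gets tedious. The following Python sketch flags mounts at or above a usage threshold. The function name full_mounts, the 95% default, and the sample output are all assumptions for illustration; it expects the usual six-column df -h layout.

```python
def full_mounts(df_output, threshold_percent=95):
    """Return (mount_point, use_percent) pairs at or above the threshold."""
    results = []
    for line in df_output.strip().splitlines()[1:]:       # skip the header row
        columns = line.split()
        if len(columns) < 6 or not columns[4].endswith("%"):
            continue                                        # ignore wrapped or malformed rows
        use_percent = int(columns[4].rstrip("%"))
        if use_percent >= threshold_percent:
            results.append((columns[5], use_percent))
    return results


# Hypothetical df -h output captured from a pod.
sample = """Filesystem      Size  Used Avail Use% Mounted on
overlay         124G   30G   95G  24% /
/dev/sdc        9.8G  9.8G     0 100% /var/solr
tmpfs            16G     0   16G   0% /dev/shm
"""

flagged = full_mounts(sample)
print(flagged)
```

The text passed in would typically be the captured output of the kubectl exec command shown above.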

Step 2: AKS PVC Disk Full Fix — Expand the PVC Online

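Azure’s managed-premium storage class has volume expansion enabled by default, and on current AKS versions that use the Azure Disk CSI driver the disk can be resized while it is attached. Consequently, you can grow the PVC without stopping the pod, without losing data, and without any application downtime. If the resize does not progress while the pod is running (older driver versions require the disk to be detached), scale the workload to zero replicas, let the resize complete, and scale back up. Note that you can only ever expand a PVC, because Kubernetes does not support shrinking.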

You have two methods to expand. Choose whichever fits your workflow:

Method 1: kubectl patch (faster — one command)

kubectl patch pvc solr-leader-disk-2023081617 -n solr \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Method 2: kubectl edit (visual — easier to verify)

kubectl edit pvc solr-leader-disk-2023081617 -n solr

In the editor, find the spec.resources.requests.storage field and change it from 10Gi to 20Gi:

spec:
  resources:
    requests:
      storage: 20Gi   # changed from 10Gi

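After saving, watch the PVC in real time. The STATUS column stays Bound throughout, and the resize is finished when the CAPACITY column changes to 20Gi. Until then, kubectl describe pvc may list a Resizing or FileSystemResizePending condition, which is completely normal: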

kubectl get pvc -n solr -w

The resize typically takes 5–10 minutes on Azure. The pod keeps running and all data remains accessible throughout the entire process.
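
When the resize is scripted, shell quoting of the JSON patch is a common source of errors. A minimal Python sketch that builds the same patch body and argument list is shown below. The helper names are hypothetical, and the sketch only constructs the command; running it (for example with subprocess.run) is left to the caller.

```python
import json


def pvc_resize_patch(new_size):
    """Return the JSON patch body that requests a new PVC size."""
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return json.dumps(patch, separators=(",", ":"))


def kubectl_patch_args(pvc_name, namespace, new_size):
    """Return the argument list for the equivalent `kubectl patch pvc` call."""
    return ["kubectl", "patch", "pvc", pvc_name, "-n", namespace,
            "-p", pvc_resize_patch(new_size)]


patch_body = pvc_resize_patch("20Gi")
args = kubectl_patch_args("solr-leader-disk-2023081617", "solr", "20Gi")
print(patch_body)
print(args)
```

Passing the arguments as a list avoids the shell entirely, so the braces and quotes in the patch need no escaping.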

💡 Pro tip: During a Solr index rebuild, Lucene can need up to 3× the current index size as temporary working space on top of the index itself. If your current index is 8Gi, the worst case is 8Gi + 24Gi = 32Gi, so plan for roughly 32Gi in total, not just 20Gi. Elasticsearch index merges need similar headroom, and PostgreSQL VACUUM FULL needs room for a full copy of the table it rewrites.
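
The sizing rule is simple arithmetic, sketched here in Python. The function name and the default factor of 3 are assumptions that follow the tip above (working space counted in addition to the index itself); adjust the factor to what you observe for your workload.

```python
import math


def recommended_pvc_gib(index_gib, working_space_factor=3.0, round_to_gib=1):
    """Index size plus worst-case temporary working space, rounded up."""
    needed = index_gib + working_space_factor * index_gib
    return math.ceil(needed / round_to_gib) * round_to_gib


worst_case = recommended_pvc_gib(8)                               # 8 + 3 * 8
moderate = recommended_pvc_gib(8, working_space_factor=2.0)       # 8 + 2 * 8
rounded = recommended_pvc_gib(8.5, round_to_gib=10)               # 34 rounded up to a multiple of 10
print(worst_case, moderate, rounded)
```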

Step 3: Restart the Pod to Complete Recovery

This is the step most guides skip — and it is the one that trips people up most often. Even after the AKS PVC disk is expanded, the application process inside the pod is still in a broken state from the original crash. Simply having more disk space does not automatically fix this. You must restart the pod so the application remounts the expanded volume cleanly and resets its internal error state.

For Solr running as a StatefulSet (the most common AKS setup):

kubectl rollout restart statefulset/solr -n solr

Alternatively, if your workload runs as a Deployment:

kubectl rollout restart deployment/solr -n solr

For other workloads, simply replace solr with your StatefulSet or Deployment name and adjust the namespace accordingly. The restart causes a brief interruption of around 1–2 minutes. As a result of the restart, three things happen automatically — the broken application state is cleared, the expanded volume is remounted with the full new capacity, and the application performs its own internal recovery on startup.

Step 4: Verify the Pod and Disk Are Healthy

Before touching the application layer, confirm the pod is fully healthy at the infrastructure level. First, check that all pods are back in a Running state:

kubectl get pods -n solr

Next, check the pod logs for clean startup output with no errors:

kubectl logs <solr-pod-name> -n solr --tail=50

In addition, confirm the new disk size is visible and showing healthy usage inside the pod:

kubectl exec -it <solr-pod-name> -n solr -- df -h

The mount point should now show 20Gi total with plenty of free space. If it still shows 10Gi, the filesystem resize has not completed yet, so wait a few more minutes and check again. If it still has not updated after 15 minutes, restart the pod once more: the filesystem resize is completed when the volume is remounted.

Step 5: Application-Level Recovery (Per Workload)

Once the pod is healthy at the infrastructure level, perform the application-specific recovery step for your workload. This step varies depending on what was running in the pod:

  • Sitecore + Solr: Log in to Sitecore → Control Panel → Indexing Manager → select sitecore_jss_web_index → click Rebuild. The job should now complete successfully with no IndexWriter errors.
  • Elasticsearch: Check cluster health with GET /_cluster/health. If any indices are red or yellow, trigger a manual shard allocation using the Cluster Reroute API.
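  • PostgreSQL: Run VACUUM ANALYZE on affected tables to reclaim dead rows and refresh planner statistics. If you run streaming replicas, check replication lag.
  • MongoDB: Check replica set status with rs.status(). If you suspect corruption, run db.collection.validate() on the affected collections. Note that db.repairDatabase() only exists before MongoDB 4.2; on later versions use mongod --repair.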
  • Redis: Verify persistence is working again with INFO persistence — confirm rdb_last_bgsave_status: ok.
  • Custom app pod: Trigger whatever write operation was failing before the disk was full. Check the application logs to confirm the error is gone.

How to Prevent AKS PVC Disk Full in the Future

The most frustrating thing about this problem is that it is completely preventable. Here is what we put in place after this incident — and what I recommend for every production AKS deployment running stateful workloads:

  • Set Azure Monitor alerts at 80% PVC usage. By the time you hit 100% it is already too late. An alert at 80% gives you comfortable time to expand before anything breaks. In Azure Portal, go to Monitor → Alerts and create a metric alert on Persistent Volume Used Bytes.
  • Use Prometheus and Grafana on AKS. The kubelet_volume_stats_used_bytes metric gives you real-time disk usage per PVC across every namespace. Pair it with a Grafana dashboard and a Slack alert — you will never be caught off guard again.
  • Start bigger for production. A 10Gi PVC is fine for development. In production, start at 50Gi or more for any write-heavy workload like Solr, Elasticsearch, or PostgreSQL. Storage is cheap — downtime is not.
  • Plan for 3× headroom during operations. Lucene index rebuilds, PostgreSQL VACUUM, and MongoDB compaction all need temporary space that can be 2–3× the current data size. Always leave enough headroom before triggering these operations.
  • Schedule regular maintenance operations. For Solr, run periodic OPTIMIZE commands to merge segments and reduce disk footprint. For Postgres, schedule regular VACUUM. These can reduce disk usage by 20–40% over time.
  • Verify allowVolumeExpansion on your storage class. Run kubectl describe storageclass managed-premium and confirm AllowVolumeExpansion: true is set. Azure managed-premium and managed-csi-premium both support it by default — but custom storage classes may not.
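
The 80% rule reduces to a ratio of two numbers. As a hedged illustration, the Python sketch below applies it to a used/capacity pair such as the values of kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes. The function names and the sample figures are hypothetical, and in practice the same comparison would normally live in your alerting rule rather than in application code.

```python
def pvc_usage_percent(used_bytes, capacity_bytes):
    """Return PVC usage as a percentage of capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


def should_alert(used_bytes, capacity_bytes, threshold_percent=80.0):
    """True when usage is at or above the alert threshold."""
    return pvc_usage_percent(used_bytes, capacity_bytes) >= threshold_percent


GIB = 1024 ** 3
high = should_alert(8.5 * GIB, 10 * GIB)    # 85% used
low = should_alert(4 * GIB, 10 * GIB)       # 40% used
print(high, low)
```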

Conclusion: AKS PVC Disk Full Fix in Under 15 Minutes

An AKS PVC disk full condition is one of the most deceptive problems in Kubernetes operations. The pod stays Running, no obvious alert fires, and the error your application throws is almost never the one that points to disk space. In our case it was this IndexWriter is closed for Sitecore Solr — but it could just as easily be a Postgres write failure, a MongoDB corruption error, or a Redis persistence warning.

In summary, the fix is always the same three steps — confirm the disk is full with df -h inside the pod, expand the PVC online using kubectl patch or edit, and restart the pod to clear the broken application state. Furthermore, the entire process takes under 15 minutes and preserves all your data completely. Most importantly, it is 100% preventable with the right Azure Monitor alerts in place before the next incident hits.

Quick Reference: Commands for AKS PVC Disk Full Fix

  • Check PVCs: kubectl get pvc -n <namespace>
  • Check disk inside pod: kubectl exec -it <pod> -n <namespace> -- df -h
  • Grep for disk error: kubectl logs <pod> -n <namespace> | grep -i "no space"
  • Expand PVC: kubectl patch pvc <name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
  • Watch resize: kubectl get pvc -n <namespace> -w
  • Restart StatefulSet: kubectl rollout restart statefulset/<name> -n <namespace>
  • Restart Deployment: kubectl rollout restart deployment/<name> -n <namespace>
  • Verify pod health: kubectl get pods -n <namespace>

Frequently Asked Questions

Q: Will expanding the PVC delete my data?
No. PVC expansion on AKS is completely non-destructive. Azure grows the underlying managed disk and extends the filesystem. Your data — whether it is Solr index segments, Postgres tables, or MongoDB collections — is fully preserved throughout.

Q: Does the application go down during PVC expansion?
No. The disk expansion itself causes zero downtime. The only brief interruption — around 1–2 minutes — happens when you restart the pod in Step 3. The application itself stays up during the disk resize.

Q: Can I shrink the PVC back after fixing the issue?
No. Kubernetes does not support PVC shrinking. Once expanded, the size is permanent. Plan your initial PVC sizes carefully — especially for production write-heavy workloads.

Q: The df -h still shows the old size after PVC expansion. Why?
The Kubernetes PVC resize and the filesystem resize inside the pod are two separate operations. If the filesystem has not caught up, wait a few more minutes and recheck. If it still shows the old size after 15 minutes, the pod restart in Step 3 will trigger the filesystem resize on the next mount.

Q: My storage class does not support volume expansion. What do I do?
Run kubectl describe storageclass <name> and check for AllowVolumeExpansion: true. If it is not set, you will need to provision a new larger PVC and migrate the data manually using a tool like kubectl cp or a pod-to-pod rsync job.

Q: I have multiple Solr leader and follower PVCs. Do I need to expand all of them?
Yes. Expand every PVC that is full. If only the leader PVC is full, start there — but check the follower PVCs too, as they often fill at a similar rate.
