Why do infrastructure failures in AI systems go unnoticed?

Because most of them fail green. A mount that writes to the wrong place still returns success to the application. A container nearing its memory limit looks healthy until it doesn't. Monitoring built around application errors never sees these, so the system reports fine while quietly losing data or degrading, until a human notices the symptom downstream.
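One way to make a fail-green write fail loudly is to verify the destination rather than trust the return code. The sketch below assumes a marker file (the hypothetical `.share-sentinel`) has been created once on the real share. If the marker is missing, the path is not backed by the share and the write is refused. The names `share_is_live` and `safe_write` are illustrative and not taken from any library.

```python
from pathlib import Path

# Hypothetical marker file, created once on the real share by an operator.
SENTINEL_NAME = ".share-sentinel"


def share_is_live(mount_dir: Path, sentinel_name: str = SENTINEL_NAME) -> bool:
    """Return True only if the marker file is visible under mount_dir."""
    return (mount_dir / sentinel_name).is_file()


def safe_write(mount_dir: Path, name: str, data: bytes) -> Path:
    """Write data under mount_dir, but refuse if the share is not really there."""
    if not share_is_live(mount_dir):
        # Without the marker we are probably writing to a plain directory,
        # so raise instead of reporting a misleading success.
        raise RuntimeError(f"{mount_dir} has no {SENTINEL_NAME}; refusing to write")
    target = mount_dir / name
    target.write_bytes(data)
    return target
```

The check costs one `stat` call per write, or it can run on a timer and feed an alert instead of raising.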

What is the rprivate vs rshared mount problem?

It's a container bind-mount propagation setting. If a mount is private (rprivate) when it needed to be shared (rshared), mounts made under that path on one side are not propagated to the other, so the container can keep writing into a plain directory, in its own filesystem layer or hidden beneath the mount point, instead of onto the shared volume. Files appear to be written successfully but never reach the real share. The app sees success; the data is gone on the next restart. A classic fail-green.
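The propagation state of a mount can be read from `/proc/self/mountinfo` on Linux, where the optional fields before the `-` separator carry tags such as `shared:N` or `master:N`. The following is a minimal sketch of such a check, not the setup described here. The function names are illustrative, and a mount carrying both tags is reported simply as `shared`.

```python
from typing import Optional


def mount_propagation(mountinfo_text: str, mount_point: str) -> Optional[str]:
    """Classify the propagation of mount_point from mountinfo-formatted text.

    Returns 'shared', 'slave', 'unbindable' or 'private', or None when the
    path is not a mount point at all (itself a warning sign).
    """
    state = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Field 5 (index 4) is the mount point; optional fields start at index 6.
        if len(fields) < 7 or fields[4] != mount_point:
            continue
        try:
            sep = fields.index("-", 6)  # separator that ends the optional fields
        except ValueError:
            continue  # malformed line, skip it
        tags = {field.split(":", 1)[0] for field in fields[6:sep]}
        if "shared" in tags:
            state = "shared"
        elif "master" in tags:
            state = "slave"
        elif "unbindable" in tags:
            state = "unbindable"
        else:
            state = "private"
        # Keep scanning: a later entry for the same path is mounted on top.
    return state


def live_propagation(mount_point: str) -> Optional[str]:
    """Same check against the running process's own mount table (Linux only)."""
    with open("/proc/self/mountinfo", encoding="utf-8") as handle:
        return mount_propagation(handle.read(), mount_point)
```

Run at container start, `live_propagation("/data")` (with `/data` standing in for the real path) returning `None` or `"private"` where a shared mount was expected is worth an alert before the first write.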

How do you stop a container from ballooning and starving the host?

Set explicit memory limits on every container, and alert on approach, not just on the kill. An AI workload that loads models or buffers data can grow far beyond its steady state. Without a cap, one container can consume the host's memory and take its neighbors down with it. With a cap and an alert, it fails loudly and alone.
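"Alert on approach" can be sketched as a small check inside the container. This version assumes cgroup v2, where `memory.current` and `memory.max` are exposed under `/sys/fs/cgroup`. The 0.8 warning ratio and the function names are illustrative choices, not values taken from this stack.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 unified hierarchy, as seen from inside the container.
CGROUP_DIR = Path("/sys/fs/cgroup")


def parse_limit(raw: str) -> Optional[int]:
    """Parse memory.max: the literal 'max' means no limit is set."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def memory_status(current: int, limit: Optional[int], warn_ratio: float = 0.8) -> str:
    """Classify usage as 'uncapped', 'warn' or 'ok'."""
    if limit is None:
        return "uncapped"  # no cap at all: the host-killer case
    if current >= limit * warn_ratio:
        return "warn"  # approaching the cap, alert before the OOM kill
    return "ok"


def read_status(cgroup_dir: Path = CGROUP_DIR, warn_ratio: float = 0.8) -> str:
    """Read the live cgroup files and classify current memory usage."""
    current = int((cgroup_dir / "memory.current").read_text())
    limit = parse_limit((cgroup_dir / "memory.max").read_text())
    return memory_status(current, limit, warn_ratio)
```

Treating `"uncapped"` as its own alert state matters as much as the ratio: a container with no limit never reaches `"warn"`, it just keeps growing.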

Is infrastructure really an AI problem?

For production AI, yes. Model-serving, vector stores, and document pipelines are heavier and stranger workloads than a typical web app, and they expose infra edge cases most teams never hit. The model gets the attention, but the system's reliability is decided by the boring layer underneath it.

tsukumo

Infra failures in a dockerized AI stack (real war stories) · tsukumo


Agency · 17 June 2026 · 4 min read

The infra failures nobody warns you about in a dockerized AI stack

The thing that takes down an AI system in production usually isn't the model or the app. It's the boring infrastructure underneath, and it fails green. Four real ones from a dockerized AI stack: a mount writing to the void, a runaway container, a tripped autoscaler, dead proxy routes.

The short answer

The failure that takes down a production AI system is rarely the model or the application. It is the boring infrastructure underneath, and it tends to fail green: a container mount that writes to the void while reporting success, a container that balloons until the host starves, an autoscaler that trips a circuit breaker, a proxy quietly serving dead routes. The fix is not a smarter model. It is treating the infra layer as a first-class part of the system you observe.



Short version: the thing that takes down a production AI system is almost never the model or the application code. It is the boring infrastructure underneath, and the worst of it fails green: it reports success while quietly losing data or starving the host. We run a dockerized AI stack for a fiduciary, and the failures that cost us real time were not exotic. They were mounts, memory limits, autoscalers, and proxy routes, each one happy to look healthy while doing the wrong thing.

Failure	Looked like	Actually was
Private mount	Successful writes	Data written to the void
Uncapped container	A healthy service	A host-killer in slow motion
Autoscaler / breaker	Resilience machinery	The cause of the outage
Dead proxy routes	A configured proxy	Requests resolving to nothing


Why do infra failures hide?

1. The mount that wrote to the void

2. The container that ate the host

3. The autoscaler that tripped a breaker

4. The proxy serving dead routes

The pattern across all four

What this means for your team
