State Is Harder Than Scale

Production Notes #01 — Many of the most expensive production failures have little to do with scale. They happen because reality refuses to stay synchronized.

Every engineering conference eventually arrives at the same topics.

Scaling to millions of users.

Distributed systems.

Event-driven architecture.

Microservices.

The latest AI models.

These are important problems. They're also highly visible ones. We benchmark them, build tools around them, write books about them, and celebrate the engineers who solve them.

Yet after years of building enterprise software, I've become convinced that many of the most expensive production failures have very little to do with scale.

They happen because reality refuses to stay synchronized.

When every system tells a different story

Imagine a fairly ordinary business process.

A customer submits an order.

The payment gateway authorizes the transaction.

Inventory is reserved.

The warehouse receives a fulfilment request.

The ERP records the sale.

The CRM updates the customer's profile.

The customer receives a confirmation email.

On a whiteboard, this is a clean sequence of events.

In production, it rarely unfolds that way.

The payment callback arrives twice because the first response timed out.

Inventory is reserved successfully, but the ERP is temporarily unavailable.

A warehouse operator manually adjusts stock before the synchronization job runs.

The customer refreshes the confirmation page because the browser appears to be stuck.

A retry mechanism, behaving exactly as designed, processes the same operation again.

Five minutes later, someone asks a perfectly reasonable question.

"Did the order succeed?"

The payment gateway says yes.

The warehouse says yes.

The ERP says no.

Inventory has been reduced.

The CRM still shows an abandoned cart.

The customer has already received an email confirming the purchase.

Nothing is technically broken.

Every system is simply describing a different version of reality.
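A minimal sketch of that situation, assuming each system can be asked a simple yes/no question about the order. The `Observation` type, the system names, and the `summarize` helper are all hypothetical; the point is only that "did the order succeed?" returns a set of answers, not one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    system: str         # which system is reporting (hypothetical names)
    order_exists: bool  # does this system currently believe the order succeeded?


def summarize(observations):
    """Group systems by what they currently believe about the order."""
    says_yes = sorted(o.system for o in observations if o.order_exists)
    says_no = sorted(o.system for o in observations if not o.order_exists)
    # The views are consistent only when every system gives the same answer.
    consistent = not (bool(says_yes) and bool(says_no))
    return {"says_yes": says_yes, "says_no": says_no, "consistent": consistent}


views = [
    Observation("payment_gateway", True),
    Observation("warehouse", True),
    Observation("erp", False),
    Observation("crm", False),
]
report = summarize(views)
print(report)
```

No individual observation here is wrong. The report is only useful because it shows the disagreement instead of hiding it behind a single status field.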

We spend too much time thinking about scale

When engineers discuss architecture, scale usually dominates the conversation.

Can the database handle another million rows?

Should this service be asynchronous?

Do we need a queue?

Would event sourcing help?

These are worthwhile discussions.

But they often distract us from a more fundamental question.

How many independent versions of reality does this system create?

Every new service.

Every webhook.

Every scheduled synchronization.

Every third-party API.

Every AI workflow.

Every integration introduces another observer of the same business process.

Each observer has its own timing.

Its own retry logic.

Its own failures.

Its own assumptions.

The complexity doesn't emerge because there are many requests.

It emerges because there are many opinions about what has happened.

There is rarely a single source of truth

One of the first questions teams ask during architecture discussions is:

"Which system is the source of truth?"

It's an understandable question.

It's also one that often oversimplifies the problem.

The payment provider is authoritative about payments.

The warehouse is authoritative about physical inventory.

The ERP owns financial records.

The CRM owns customer relationships.

The identity provider owns authentication.

No single system owns reality.

Each system owns one part of it.

Problems begin when software is designed around the assumption that one application can become the universal source of truth for every business decision.

That assumption survives until the first production incident.
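One way to make per-domain ownership explicit is a small lookup from each business fact to the system that is authoritative for it. This is a sketch under assumptions: the field names, system names, and the `resolve` helper are illustrative, not taken from any particular platform.

```python
# Hypothetical mapping from a business fact to the system that owns it.
OWNERS = {
    "payment_status": "payment_provider",
    "stock_on_hand": "warehouse",
    "invoice_total": "erp",
    "customer_segment": "crm",
    "login_identity": "identity_provider",
}


def resolve(field, claims):
    """Return the value reported by the owning system, ignoring other claims.

    `claims` maps a system name to the value that system reports for `field`.
    Raises KeyError if the field has no declared owner, and LookupError if
    the owner has not reported a value yet.
    """
    owner = OWNERS[field]
    if owner not in claims:
        raise LookupError(f"{owner} has not reported {field}")
    return claims[owner]


# Three systems hold a stock figure; only the warehouse's answer counts.
value = resolve("stock_on_hand", {"erp": 12, "warehouse": 9, "commerce": 12})
print(value)
```

The design choice worth noticing is that a majority does not win. Two systems agree on 12, and the sketch still returns the warehouse's 9, because ownership is decided per fact rather than per application.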

Every integration is really a negotiation

Integrations are often described as data movement.

One API sends information to another.

One webhook triggers a process.

One scheduled job copies records between databases.

But something more important is happening.

Each system is negotiating its understanding of reality.

Sometimes they agree immediately.

Sometimes they disagree for a few seconds.

Sometimes they disagree indefinitely, until a reconciliation process corrects them.

Those disagreements aren't edge cases.

They are normal operating conditions.

The question isn't whether they will occur.

The question is whether your architecture expects them.
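An architecture that expects disagreement usually has some form of reconciliation pass. The sketch below assumes the simplest case: one system is authoritative for a set of records, a downstream copy has drifted, and the job produces a list of corrections rather than applying them blindly. The function name, the action labels, and the sample data are hypothetical.

```python
def reconcile(authoritative, replica):
    """Compare the owner's records with a downstream copy.

    Returns a list of (action, key, value) corrections for the replica.
    """
    corrections = []
    for key, truth in authoritative.items():
        if key not in replica:
            corrections.append(("create", key, truth))
        elif replica[key] != truth:
            corrections.append(("update", key, truth))
    # Records the owner has never heard of are not deleted automatically;
    # they are flagged so a person can decide what happened.
    for key in sorted(replica.keys() - authoritative.keys()):
        corrections.append(("flag_for_review", key, replica[key]))
    return corrections


warehouse_stock = {"sku-1": 9, "sku-2": 0, "sku-3": 4}  # assumed owner
erp_stock = {"sku-1": 12, "sku-2": 0, "sku-4": 7}       # drifted copy
plan = reconcile(warehouse_stock, erp_stock)
for step in plan:
    print(step)
```

Returning a plan instead of mutating the replica keeps the disagreement visible and auditable, which matters when a manual adjustment is the reason the two systems diverged in the first place.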

This is the work behind legacy modernization and enterprise integrations — not replacing everything at once, but making disagreement survivable.

AI doesn't make this problem disappear

If anything, it makes it more important.

An AI system might retrieve customer information from a CRM, pricing from an ERP, inventory from a commerce platform, policy documents from a knowledge base, and operational metrics from an analytics system.

Every one of those sources changes independently.

Every one of them has different latency, ownership, and update cycles.

When an AI assistant produces an incorrect recommendation, the model is often blamed.

Sometimes that's justified.

Often the real issue is that the system asked the model to reason about inconsistent information.

Good AI depends on good state management.

The model is only one participant in a much larger system.
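A small illustration of state management sitting in front of a model: before any prompt is assembled, each source snapshot is checked against a freshness budget, and stale sources are reported instead of silently included. The budgets, source names, and `build_context` helper are assumptions made for this sketch; no specific AI framework is implied.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets: how old a snapshot may be before it
# should no longer be handed to the model as a current fact.
MAX_AGE = {
    "crm": timedelta(hours=24),
    "erp_pricing": timedelta(hours=1),
    "inventory": timedelta(minutes=5),
}


def build_context(snapshots, now):
    """Split source snapshots into usable facts and stale source names.

    `snapshots` maps a source name to a (value, fetched_at) pair.
    """
    usable, stale = {}, []
    for source, (value, fetched_at) in snapshots.items():
        if now - fetched_at <= MAX_AGE[source]:
            usable[source] = value
        else:
            stale.append(source)
    return usable, sorted(stale)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
usable, stale = build_context(
    {
        "crm": ({"tier": "gold"}, now - timedelta(hours=3)),
        "erp_pricing": ({"sku-1": 19.99}, now - timedelta(minutes=20)),
        "inventory": ({"sku-1": 9}, now - timedelta(minutes=30)),
    },
    now,
)
print(sorted(usable), stale)
```

In this example the inventory snapshot is thirty minutes old against a five-minute budget, so the caller can refetch it, or tell the model that stock levels are unknown, rather than letting it recommend a product that may no longer be available.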

That's why production AI delivery has to include the plumbing — data pipelines, governance, human review queues, and integration layers — not just the model headline.

Architecture is the art of managing disagreement

I've gradually stopped evaluating architecture by asking questions like:

"Is it scalable?"

or

"Is it modern?"

Instead, I ask different questions.

  • What happens if this message arrives twice?
  • Which system wins when two records conflict?
  • Can every operation be safely repeated?
  • How does the business recover after a partial failure?
  • How long can two systems disagree before someone notices?
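The first and third questions have a well-known partial answer: key each operation by an idempotency key and return the stored result when the same key shows up again. The sketch below is deliberately simplified and makes several assumptions: the class, method, and key names are hypothetical, and state lives in memory, whereas a real system would need a durable store and an atomic check-and-write.

```python
class OrderProcessor:
    """Toy processor that reserves stock at most once per idempotency key."""

    def __init__(self):
        self._results = {}           # idempotency key -> first result
        self.stock = {"sku-1": 10}   # assumed starting inventory

    def handle_payment_callback(self, idempotency_key, sku, quantity):
        # A repeated delivery returns the original outcome and changes nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        if self.stock.get(sku, 0) < quantity:
            result = {"status": "rejected", "reason": "insufficient_stock"}
        else:
            self.stock[sku] -= quantity
            result = {"status": "reserved", "sku": sku, "quantity": quantity}

        self._results[idempotency_key] = result
        return result


processor = OrderProcessor()
first = processor.handle_payment_callback("pay_123", "sku-1", 2)
second = processor.handle_payment_callback("pay_123", "sku-1", 2)  # duplicate
print(first == second, processor.stock["sku-1"])
```

The duplicate callback gets the same answer as the first one and stock is reduced only once. The remaining questions (which system wins, how the business recovers, how long disagreement lasts) are not solved by idempotency alone and still need explicit ownership rules and reconciliation.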

Those questions rarely appear in conference talks.

They are also the questions that determine whether a production system survives its first year.

The software is rarely confused. The business is.

Computers do exactly what we ask them to do.

The difficult part is that businesses aren't static.

People intervene manually.

External partners retry requests.

Networks fail.

Customers refresh pages.

Operations teams fix records.

Third-party systems change behaviour.

Reality keeps moving while software tries to keep up.

The challenge isn't writing correct code.

The challenge is building systems that continue making sensible decisions even when different parts of the business temporarily disagree.

A different way to think about complexity

For a long time, I believed complexity arrived with scale.

Larger databases.

More users.

More servers.

Today I think complexity arrives much earlier.

It arrives the moment multiple systems begin making decisions about the same business process.

That's why some of the hardest engineering problems appear in companies serving thousands of users rather than millions.

The volume isn't exceptional.

The number of competing truths is.


The longer I spend building production systems, the less I believe software is fundamentally about code.

Software is an attempt to model reality.

Reality is messy.

Reality changes.

Reality disagrees with itself.

The best architectures aren't the ones that eliminate that mess.

They're the ones that remain trustworthy in spite of it.

Because in enterprise software, the hardest problem isn't handling more requests.

It's handling more versions of the truth.


Originally published on LinkedIn as part of the Binary and Beyond newsletter.