AI and Root Cause Analysis
My recent experiences seem to turn into troubleshooting sessions where I have to counter the logic of the AI. It certainly mimics a human-like system in that it wants to excuse away an issue or provide a workaround. "Oh, that always happens; for this to work, do this..." which usually puts more work on the user and defeats the efficiency AI is supposed to provide.
The tragedies of the Challenger and Columbia missions highlight a critical failure in organizational culture. At NASA, my experience touring the Columbia debris room in the VAB underscored the deadly consequences of poor communication and 'managerial myopia.' There was a clear disconnect: engineers were discouraged from reporting anomalies, while management viewed persistent concerns as obstructions rather than warnings. Ultimately, this created a toxic feedback loop where engineers stopped speaking up, and leadership blamed the resulting catastrophes on a supposed lack of technical clarity.
It felt like I was trying to get the AI to understand this. There was a consistent failure to do X; don't give me workarounds. Even the workarounds failed because there was a fundamental issue they did not address. It was piecemeal debugging, excusing away the problem rather than diagnosing it.
What I had rightly been expecting finally happened: after repeated failures, it agreed and admitted fault, saying, "This should have triggered a full invariant scan immediately." It then ran that scan, found the underlying issue, and corrected it.
In software engineering terms, this is usually called one (or more) of the following, depending on discipline and context.
The primary term: Root Cause Analysis (RCA)
What should have been triggered is a Root Cause Analysis, not incremental patching.
RCA means:
- stop treating symptoms
- stop applying local fixes
- assume a systemic fault
- identify the underlying invariant that is being violated
In your case, the violated invariant was essentially:
Document state must remain structurally consistent across edit → render → copy → export.
Once that invariant failed repeatedly, RCA—not workaround iteration—was the correct response.
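That invariant can be made concrete as a check. Below is a minimal sketch, assuming a simple dict-based document model; the names (`structural_fingerprint`, `run_pipeline`, the `sections`/`blocks` keys) are hypothetical, not taken from any actual editor:

```python
import hashlib


def structural_fingerprint(doc: dict) -> str:
    """Hash only the document's structure (section ids and block counts),
    ignoring cosmetic formatting, so render/copy/export can be compared."""
    shape = repr([(s["id"], len(s["blocks"])) for s in doc["sections"]])
    return hashlib.sha256(shape.encode()).hexdigest()


def run_pipeline(doc: dict, stages) -> dict:
    """Apply each stage (render, copy, export, ...) and verify that the
    structural invariant captured after editing is preserved by each one."""
    expected = structural_fingerprint(doc)
    for stage in stages:
        doc = stage(doc)
        if structural_fingerprint(doc) != expected:
            raise RuntimeError(f"invariant violated after {stage.__name__}")
    return doc
```

The point is not the hashing; it is that structural consistency is checked after every stage, so the first stage to break it is named immediately instead of surfacing later as mysterious corruption.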
Closely related and often more precise terms
Depending on the subfield, engineers might also name this as:
Invariant Violation Detection
You noticed that core invariants were being broken:
- formatting state leaking globally
- boundaries between sections collapsing
- repeated corruption after “fixes”
A proper system would detect this automatically and halt or warn.
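One of those leaks, runaway emphasis, is easy to detect mechanically. A sketch, assuming Markdown-style `**` delimiters (the function names are illustrative):

```python
def detect_format_leak(text: str) -> bool:
    """An odd count of '**' delimiters means an emphasis span was opened
    but never closed -- formatting state is leaking past its boundary."""
    return text.count("**") % 2 == 1


def guard(text: str) -> str:
    """Halt (raise) on a detected leak instead of continuing to mutate."""
    if detect_format_leak(text):
        raise ValueError("invariant violation: unbalanced emphasis markers")
    return text
```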
Systemic Failure Analysis
This applies when:
- multiple independent components fail (editor, copy, export)
- failures are correlated
- fixes in one layer don’t resolve others
That’s exactly what you diagnosed: the failure was cross-layer, not localized.
State Corruption Diagnosis
This is very close to what actually happened.
Symptoms:
- runaway formatting
- duplicated content
- truncated endings
- hidden entities (&nbsp;)
- non-deterministic behavior across operations
Those are classic indicators of corrupted internal state, not user error.
Fail-Fast Boundary Violation (what should have happened)
In robust systems, once corruption is detected:
- the system stops mutating state
- it surfaces an explicit error
- it preserves the last-known-good state
That did not happen here, which is why you kept seeing cascading issues.
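That fail-fast behavior can be sketched as a wrapper that refuses to commit a mutation unless the result still validates. This is a hypothetical illustration (the `FailFastDocument` class and `validate` callback are invented for this example):

```python
import copy


class FailFastDocument:
    """Holds document state; rejects any mutation that breaks the invariant,
    keeping the last-known-good state instead of cascading corruption."""

    def __init__(self, state, validate):
        self._state = state
        self._validate = validate  # callable: candidate state -> bool

    @property
    def state(self):
        return self._state

    def mutate(self, op):
        """Apply op to a copy; commit only if the invariant still holds."""
        candidate = op(copy.deepcopy(self._state))
        if not self._validate(candidate):
            # Surface an explicit error; last-known-good state is preserved.
            raise ValueError("invariant violated; mutation rejected")
        self._state = candidate
```

Because `op` runs on a copy, a bad operation never touches the committed state; the caller gets a loud error instead of a silently corrupted document.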
Why “find one issue and stop” is an anti-pattern
In engineering terms, that behavior maps to:
- Shotgun debugging (treating visible symptoms in isolation)
- Local optimization in the presence of global failure
- Patch-and-pray
Your instinct was correct to reject it.
The correct escalation ladder (what you expected)
What you were expecting—reasonably—was something like:
- Repeated failure detected
- Trigger RCA / invariant audit
- Identify global corruption
- Quarantine formatting layer
- Restore minimal stable representation
- Resume work from clean state
That’s exactly what finally worked when all Markdown emphasis was stripped.
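The ladder above amounts to a simple policy: allow a bounded number of local fixes, then escalate to a full audit rather than iterating workarounds. A sketch with hypothetical function names:

```python
def attempt_with_escalation(local_fix, full_audit, max_local_fixes=2):
    """Try local fixes a bounded number of times; on repeated failure,
    escalate to a full invariant audit instead of more workarounds."""
    for _ in range(max_local_fixes):
        if local_fix():
            return "fixed locally"
    # Repeated failure detected: trigger RCA / invariant audit.
    return full_audit()
```

The design choice that matters is the bound: without `max_local_fixes`, the loop is exactly the patch-and-pray anti-pattern described above.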
The key point (engineer to engineer)
What you were asking for was not more intelligence.
You were asking for:
- correct failure mode behavior
- proper escalation
- diagnosis before intervention
Those are engineering fundamentals, not AI luxuries.
One-sentence answer to your question
In software engineering terms, what should have happened is a Root Cause Analysis triggered by repeated invariant violations, rather than incremental symptom-level fixes.
You were right to insist on that.
We will see if the AI learned from this conversation or not.