Report root cause when a stack update fails #3

Open
deorus wants to merge 2 commits into main from better-failure-reporting

deorus commented Jun 20, 2026

Addresses Doist/platform-backlog#734 — failed deploys currently surface only a single, often-wrong error line, leaving you to dig through the AWS Console. This action is the standard deploy step across ~30 Doist repos, so the improvement applies broadly, not just to Todoist.

What changed

  • On failure, scan the stack events for this update's ClientRequestToken and collect all genuinely-failed resources (any *_FAILED status, skipping cancellation cascades). Report the earliest failure as the root cause — the previous heuristic captured the newest failed event, which is usually a downstream symptom.
  • Under GitHub Actions, write a summary table to the job summary: Time (UTC) / Resource / Type / Status / Reason, plus a deep link to the stack's events tab in the AWS Console. Remaining failures are emitted as ::error:: annotations.
  • Falls back to a generic message + console link when no resource-level reason is available.
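
The selection rule in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: the `event` struct, the `isFailed` and `rootCause` helpers, and the sample data are hypothetical, and treating reasons that contain "cancelled" as cancellation cascades is an assumption about how such events are recognised.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
	"time"
)

// event is a hypothetical, trimmed-down stand-in for a CloudFormation stack event.
type event struct {
	Token     string // ClientRequestToken of the operation that produced the event
	LogicalID string
	Status    string
	Reason    string
	Time      time.Time
}

// isFailed reports whether e is a genuine failure. Skipping reasons that
// contain "cancelled" is an assumption made for this sketch.
func isFailed(e event) bool {
	return strings.HasSuffix(e.Status, "_FAILED") &&
		!strings.Contains(strings.ToLower(e.Reason), "cancelled")
}

// rootCause returns the earliest genuine failure that belongs to token.
func rootCause(events []event, token string) (event, bool) {
	var failed []event
	for _, e := range events {
		if e.Token == token && isFailed(e) {
			failed = append(failed, e)
		}
	}
	if len(failed) == 0 {
		return event{}, false
	}
	earliest := slices.MinFunc(failed, func(a, b event) int { return a.Time.Compare(b.Time) })
	return earliest, true
}

// sampleEvents returns made-up events, newest first.
func sampleEvents() []event {
	t0 := time.Date(2026, 6, 20, 10, 0, 0, 0, time.UTC)
	return []event{
		{"deploy-1", "Worker", "CREATE_FAILED", "Role is not authorized", t0.Add(15 * time.Second)},
		{"deploy-1", "Api", "UPDATE_FAILED", "Resource update cancelled", t0.Add(11 * time.Second)},
		{"deploy-1", "TaskDef", "UPDATE_FAILED", "Invalid container image", t0.Add(10 * time.Second)},
		{"deploy-0", "Bucket", "UPDATE_FAILED", "Access denied", t0.Add(-time.Hour)},
	}
}

func main() {
	if root, ok := rootCause(sampleEvents(), "deploy-1"); ok {
		fmt.Printf("root cause: %s %s: %s\n", root.LogicalID, root.Status, root.Reason)
	}
}
```

Picking the newest failure instead would report `Worker` here, which is the downstream-symptom problem the description mentions.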

Notes

  • No new IAM permission — cloudformation:DescribeStackEvents was already required.
  • Tests cover failure detection, root-cause ordering, Markdown-cell escaping, and the console URL.
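
As an illustration of what Markdown-cell escaping might involve, here is a small sketch. The `escapeCell` helper is hypothetical and not taken from the PR; it assumes that collapsing whitespace and escaping pipes is enough to keep a reason inside one table cell.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeCell is a hypothetical helper: it collapses every run of whitespace
// (including newlines) into one space and escapes pipes, so the text cannot
// end the current Markdown table cell or row.
func escapeCell(s string) string {
	s = strings.Join(strings.Fields(s), " ")
	return strings.ReplaceAll(s, "|", `\|`)
}

func main() {
	reason := "Invalid value for Mode: expected A|B,\n  got C"
	fmt.Printf("| %s |\n", escapeCell(reason))
}
```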

artyom and others added 2 commits March 30, 2026 12:07

While here, replace action context var with explicit value, because docker repository name must be lowercase.

On a failed deploy, scan the stack events for this update and surface the earliest failed resource as the error. Under GitHub Actions, write a summary table of all failed resources plus a console deep link to the job summary.
deorus requested a review from artyom June 22, 2026 06:42
deorus marked this pull request as ready for review June 22, 2026 06:42

doistbot left a comment

This PR improves CloudFormation deploy failure reporting by scanning stack events for the update's ClientRequestToken, identifying the earliest failed resource as the root cause, and writing a summary table with console links to the GitHub Actions job summary.

A few things worth tightening:

  • Stack-level failures get dropped: The continue statement that skips stack events also hides pre-flight failures (missing IAM capabilities, export overlaps, etc.) where no resource fails, resulting in a generic error instead of the real cause. Remove the continue so stack-level _FAILED events are still collected — since stack events are newer than resource failures, sortedFailures will still place the true root cause first.
  • Edge cases in the root-cause message: A *_FAILED event with no ResourceStatusReason produces a malformed message (… STATUS: — see stack events). Treat an empty root.reason as missing and fall back to the generic message + console link. Separately, the console URL hardcodes console.aws.amazon.com, which is wrong for GovCloud and China partitions — derive the host from the stack ARN/partition instead.
  • Nondeterministic root-cause selection: Events collected into a map then sorted only by timestamp can shuffle equal-timestamp failures, making the reported root cause vary run to run. Use EventId for dedup instead of the composite key, and add a deterministic tiebreaker or preserve scan order so failures[0] is stable.
  • Polling and testing gaps: On busy stacks, pagination now continues through the full one-hour cutoff even after rollback — break early once the matching UPDATE_IN_PROGRESS event for this token is found. Also, the two highest-value functions (reportFailure and writeStepSummary) are untested despite being pure functions with an existing test harness; adding coverage for both the populated and empty cases would guard against regressions in the core logic this PR introduces.

I also included a few optional follow-up notes in the details below.

Optional follow-up note (1)
  • [P3] main.go:146: Since the stack's terminal event is always newer than the resource failures that caused it, it will be encountered first in this newest-to-oldest stream. You can avoid redundantly populating the failures map on every tick during a rollback by checking if terminal != "" && isFailure(...). This safely defers all map operations and string parsing until the final tick when the map is actually used.

Comment thread main.go
types.ResourceStatusRollbackFailed:
terminal = evt.ResourceStatus
}
continue

🟠 P1 The continue statement skips all stack-level events from being evaluated by isFailure. While this hides the redundant "The following resource(s) failed" stack event during typical deployments, it also drops stack-level pre-flight failures (e.g., missing IAM capabilities, export overlaps, or unresolvable parameters) where no underlying resources fail. This results in the action returning a generic UPDATE_ROLLBACK_COMPLETE error and dropping the actual root cause, which is a regression from the previous behavior.

Remove continue so that stack-level _FAILED events are still collected. Because stack-level events occur after resource failures, sortedFailures will still correctly place the true resource root cause first when both exist.

Comment thread main.go
log.Printf("%s%s (%s) %s: %s", githubErrPrefix, e.logicalID, e.resType, e.status, oneLine(e.reason))
}
root := failures[0]
return fmt.Errorf("%s (%s) %s: %s — see stack events: %s", root.logicalID, root.resType, root.status, oneLine(root.reason), consoleURL)

🟡 P2 ResourceStatusReason is optional in CloudFormation events, so a *_FAILED event can legitimately arrive with no reason. In that case this now returns Resource (Type) STATUS: — see stack events, even though the PR description says to fall back to a generic message when no resource-level reason is available. Please treat an empty root.reason as missing here and either omit the : %s fragment or fall back to the terminal-status message + console link.
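
One way to handle the empty-reason case is sketched below. It is an illustration only: `rootCauseError` is a hypothetical name, the `failedEvent` fields are inferred from the snippet above, and the exact wording of the message is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// failedEvent mirrors the fields used in the snippet above; the struct in the
// PR may contain more.
type failedEvent struct {
	logicalID, resType, status, reason string
}

// rootCauseError formats the root cause and leaves out the ": reason"
// fragment when the reason is empty or only whitespace.
func rootCauseError(root failedEvent, consoleURL string) error {
	head := fmt.Sprintf("%s (%s) %s", root.logicalID, root.resType, root.status)
	if reason := strings.Join(strings.Fields(root.reason), " "); reason != "" {
		head += ": " + reason
	}
	return fmt.Errorf("%s; see stack events: %s", head, consoleURL)
}

func main() {
	const url = "https://example.invalid/events" // placeholder URL
	withReason := failedEvent{"Api", "AWS::Lambda::Function", "UPDATE_FAILED", "Code size\ntoo large"}
	noReason := failedEvent{"Api", "AWS::Lambda::Function", "UPDATE_FAILED", ""}
	fmt.Println(rootCauseError(withReason, url))
	fmt.Println(rootCauseError(noReason, url))
}
```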

Comment thread main.go
}

func eventsConsoleURL(region, stackID string) string {
	return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s",

This comment was marked as low quality.

Comment thread main.go
reason: unptr(evt.ResourceStatusReason),
timestamp: unptr(evt.Timestamp),
}
failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe

🟡 P2 Replace this synthetic composite key with the event's unique ID (unptr(evt.EventId)). CloudFormation guarantees a stable EventId for every stack event, which provides a safer, allocation-free way to deduplicate events without relying on stringified timestamps.

Decent observation, but there's no need to have a map at all, and the LLMs missed this.

Comment thread main.go

func sortedFailures(m map[string]failedEvent) []failedEvent {
	out := slices.Collect(maps.Values(m))
	slices.SortFunc(out, func(a, b failedEvent) int { return a.timestamp.Compare(b.timestamp) })

🟡 P2 This only sorts by timestamp, but the failures were first collected into a map, so any two events with the same timestamp end up in arbitrary order. Since reportFailure() treats failures[0] as the root cause, the reported root failure can change run to run for equal-timestamp events. Keep scan order in a slice (using a separate set only for dedupe if needed), or add a deterministic tiebreaker before selecting the first failure.

Comment thread main.go
return cmp.Or(likelyRootCause, fmt.Errorf("%v, see AWS CloudFormation Console for more details", evt.ResourceStatus))
case types.ResourceStatusUpdateComplete:
return nil
case types.ResourceStatusUpdateRollbackComplete,

🟡 P2 This change stops returning immediately on rollback and now keeps paginating DescribeStackEvents until the one-hour cutoff. On stacks with a lot of recent history, that means extra API pages for unrelated older operations right on the failure path. Since events are newest-first, consider breaking once you've reached the current update's start event for this ClientRequestToken (for example the matching stack UPDATE_IN_PROGRESS event) instead of scanning the whole cutoff window.
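
The early-exit idea can be sketched with a fake pager. Everything here is hypothetical (the `event` struct, `scan`, `fakeFetch`, and the data); a real check would probably also compare the logical ID with the stack name, since nested stacks share the `AWS::CloudFormation::Stack` resource type.

```go
package main

import "fmt"

// event is a hypothetical, reduced stack event.
type event struct {
	Token, ResourceType, Status, LogicalID string
}

// scan walks pages newest first and stops at the stack-level
// UPDATE_IN_PROGRESS event that carries token, so older operations are never
// fetched. It returns the events seen for token and the number of pages read.
func scan(fetch func(page int) ([]event, bool), token string) (matched []event, pages int) {
	for page := 0; ; page++ {
		events, more := fetch(page)
		pages++
		for _, e := range events {
			if e.Token != token {
				continue
			}
			matched = append(matched, e)
			if e.ResourceType == "AWS::CloudFormation::Stack" && e.Status == "UPDATE_IN_PROGRESS" {
				return matched, pages // reached the start of this update
			}
		}
		if !more {
			return matched, pages
		}
	}
}

// fakeFetch simulates three pages of events, newest first.
func fakeFetch() func(int) ([]event, bool) {
	pages := [][]event{
		{
			{"deploy-1", "AWS::CloudFormation::Stack", "UPDATE_ROLLBACK_COMPLETE", "my-stack"},
			{"deploy-1", "AWS::ECS::TaskDefinition", "UPDATE_FAILED", "TaskDef"},
		},
		{
			{"deploy-1", "AWS::CloudFormation::Stack", "UPDATE_IN_PROGRESS", "my-stack"},
			{"deploy-0", "AWS::CloudFormation::Stack", "UPDATE_COMPLETE", "my-stack"},
		},
		{
			{"deploy-0", "AWS::S3::Bucket", "UPDATE_COMPLETE", "Bucket"},
		},
	}
	return func(page int) ([]event, bool) {
		return pages[page], page < len(pages)-1
	}
}

func main() {
	matched, pages := scan(fakeFetch(), "deploy-1")
	fmt.Printf("matched %d events in %d pages\n", len(matched), pages)
}
```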

Comment thread main.go
// table plus per-resource error annotations — and returns the most likely root
// cause as an error. The earliest failure is the root cause; later ones usually
// cascade from it.
func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error {

🟡 P2 reportFailure is the core of this PR — it picks failures[0] as the root cause, formats the error message, logs failures[1:] as annotations, and falls back to a generic message when there are no resource-level failures. None of this is tested. A regression here (wrong element selected, broken fallback, malformed message) would directly reintroduce the wrong-root-cause problem this PR fixes. It's a pure function in the same package — with GITHUB_STEP_SUMMARY unset, a test can pass a []failedEvent and assert the returned error string for both the populated and empty cases.

Comment thread main.go
region, url.QueryEscape(stackID))
}

func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {

This comment was marked as low quality.

Comment thread main.go
// Re-scan events on every tick. The stack-level terminal event is the
// newest one, but the resource failures that caused it are older, so on
// failure we keep scanning the page to collect them all before reporting.
failures := make(map[string]failedEvent)

You don't need a map here, []failedEvent is enough.

Comment thread main.go
}
continue
}
if isFailure(evt.ResourceStatus, unptr(evt.ResourceStatusReason)) {

No reason to have a dedicated function since this is a single place where it's used. It can be inlined and still be a one-line condition.

Comment thread main.go
}

func eventsConsoleURL(region, stackID string) string {
	return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s",

You don't need the region at all.

func consoleURL(arn string) string {
	return "https://console.aws.amazon.com/go/view?arn=" + url.QueryEscape(arn)
}

Comment thread main.go
// cascade from it.
func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error {
	consoleURL := eventsConsoleURL(region, stackID)
	if path := os.Getenv("GITHUB_STEP_SUMMARY"); path != "" {

Bad pattern/code smell: functions should not have side effects like this.

Instead, reportFailure can return a dedicated error type that has a specific method taking a file name as the argument to write report to.
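
A rough sketch of that shape, under assumptions: the `failureError` type, its fields, the `report` helper and the `deploy` stub are all hypothetical names, and the report format is simplified to plain lines.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// failureError is a hypothetical error type: it carries the root cause and the
// remaining failures, and leaves it to the caller to write a report.
type failureError struct {
	root   string
	others []string
}

func (e *failureError) Error() string { return e.root }

// report renders the content that WriteReport puts on disk.
func (e *failureError) report() []byte {
	lines := append([]string{"Root cause: " + e.root}, e.others...)
	return []byte(strings.Join(lines, "\n") + "\n")
}

// WriteReport writes the report to filename; the file name is its only input.
func (e *failureError) WriteReport(filename string) error {
	return os.WriteFile(filename, e.report(), 0o666)
}

// deploy stands in for the real deploy loop and always fails in this sketch.
func deploy() error {
	return &failureError{
		root:   "TaskDef UPDATE_FAILED: invalid image",
		others: []string{"Service UPDATE_FAILED: dependency failed"},
	}
}

func main() {
	err := deploy()
	var fe *failureError
	// The environment lookup and the file write now live at the call site.
	if path := os.Getenv("GITHUB_STEP_SUMMARY"); path != "" && errors.As(err, &fe) {
		if werr := fe.WriteReport(path); werr != nil {
			fmt.Fprintln(os.Stderr, "cannot write step summary:", werr)
		}
	}
	fmt.Println("deploy failed:", err)
}
```

The function that builds the error then only computes a value, which also makes it straightforward to test without touching the environment or the file system.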

Comment thread main.go
region, url.QueryEscape(stackID))
}

func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {

This can be a method on a dedicated error type, and only take a single argument — file name where to write report to.

Comment thread main.go
}

func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY, 0o644)

You can build the content in a bytes.Buffer instead of the strings.Builder the code currently uses, and then write it like this:

return os.WriteFile(filename, b.Bytes(), 0666)

The GitHub step summary file is unique for each step, so you can write it without worrying about overwriting anything.
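
Spelled out, the suggestion might look like the sketch below. The `summary` and `writeSummary` names and the reduced three-column table are assumptions for illustration; the PR's table has more columns.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// failedEvent is reduced to three fields for this sketch.
type failedEvent struct{ logicalID, status, reason string }

// summary renders a Markdown table into a bytes.Buffer and returns its bytes.
func summary(failures []failedEvent) []byte {
	var b bytes.Buffer
	b.WriteString("| Resource | Status | Reason |\n|---|---|---|\n")
	for _, f := range failures {
		fmt.Fprintf(&b, "| %s | %s | %s |\n", f.logicalID, f.status, f.reason)
	}
	return b.Bytes()
}

// writeSummary writes the whole table in one call, replacing any previous
// content of filename.
func writeSummary(filename string, failures []failedEvent) error {
	return os.WriteFile(filename, summary(failures), 0o666)
}

func main() {
	dir, err := os.MkdirTemp("", "summary")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "summary.md")
	failures := []failedEvent{{"TaskDef", "UPDATE_FAILED", "invalid image"}}
	if err := writeSummary(path, failures); err != nil {
		panic(err)
	}
	got, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(got))
}
```

Note that `os.WriteFile` truncates an existing file, which is the trade-off the comment accepts on the grounds that each step gets its own summary file.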

Comment thread main.go

// oneLine collapses runs of whitespace (including newlines) into single spaces
// so a reason renders on a single log line or table cell.
func oneLine(s string) string { return strings.Join(strings.Fields(s), " ") }

A more efficient way to do this would be:

func oneLine(s string) string {
	var prevIsSpace bool
	f := func(r rune) rune {
		switch {
		case unicode.IsSpace(r):
			if prevIsSpace {
				return -1
			}
			prevIsSpace = true
			return ' '
		default:
			prevIsSpace = false
			return r
		}
	}
	return strings.TrimSpace(strings.Map(f, s))
}

Comment thread main_test.go
}
}

func Test_isFailure(t *testing.T) {

Useless test, can be dropped.

Comment thread main_test.go
}
}

func Test_eventsConsoleURL(t *testing.T) {

Useless test.

artyom commented Jul 1, 2026

Next time let's maybe briefly discuss the approach in a github issue first before unleashing LLMs on the task, otherwise it takes more time to understand what they did. 😅

artyom force-pushed the main branch 2 times, most recently from c3a9b03 to f48e46e Compare July 2, 2026 18:00