Report root cause when a stack update fails by deorus · Pull Request #3 · Doist/update-cloudformation-stack

deorus · 2026-06-20T10:50:44Z

Addresses Doist/platform-backlog#734 — failed deploys currently surface only a single, often-wrong error line, leaving you to dig through the AWS Console. This action is the standard deploy step across ~30 Doist repos, so the improvement applies broadly, not just to Todoist.

What changed

On failure, scan the stack events for this update's ClientRequestToken and collect all genuinely-failed resources (any *_FAILED status, skipping cancellation cascades). Report the earliest failure as the root cause — the previous heuristic captured the newest failed event, which is usually a downstream symptom.
Under GitHub Actions, write a summary table to the job summary: Time (UTC) / Resource / Type / Status / Reason, plus a deep link to the stack's events tab in the AWS Console. Remaining failures are emitted as ::error:: annotations.
Falls back to a generic message + console link when no resource-level reason is available.

Notes

No new IAM permission — cloudformation:DescribeStackEvents was already required.
Tests cover failure detection, root-cause ordering, Markdown-cell escaping, and the console URL.

While here, replace action context var with explicit value, because docker repository name must be lowercase.

On a failed deploy, scan the stack events for this update and surface the earliest failed resource as the error. Under GitHub Actions, write a summary table of all failed resources plus a console deep link to the job summary.

doistbot

This PR improves CloudFormation deploy failure reporting by scanning stack events for the update's ClientRequestToken, identifying the earliest failed resource as the root cause, and writing a summary table with console links to the GitHub Actions job summary.

Few things worth tightening:

Stack-level failures get dropped: The continue statement that skips stack events also hides pre-flight failures (missing IAM capabilities, export overlaps, etc.) where no resource fails, resulting in a generic error instead of the real cause. Remove the continue so stack-level _FAILED events are still collected — since stack events are newer than resource failures, sortedFailures will still place the true root cause first.
Edge cases in the root-cause message: A *_FAILED event with no ResourceStatusReason produces a malformed message (… STATUS: — see stack events). Treat an empty root.reason as missing and fall back to the generic message + console link. Separately, the console URL hardcodes console.aws.amazon.com, which is wrong for GovCloud and China partitions — derive the host from the stack ARN/partition instead.
Nondeterministic root-cause selection: Events collected into a map then sorted only by timestamp can shuffle equal-timestamp failures, making the reported root cause vary run to run. Use EventId for dedup instead of the composite key, and add a deterministic tiebreaker or preserve scan order so failures[0] is stable.
Polling and testing gaps: On busy stacks, pagination now continues through the full one-hour cutoff even after rollback — break early once the matching UPDATE_IN_PROGRESS event for this token is found. Also, the two highest-value functions (reportFailure and writeStepSummary) are untested despite being pure functions with an existing test harness; adding coverage for both the populated and empty cases would guard against regressions in the core logic this PR introduces.

I also included a few optional follow-up notes in the details below.

Optional follow-up note (1)

[P3] main.go:146: Since the stack's terminal event is always newer than the resource failures that caused it, it will be encountered first in this newest-to-oldest stream. You can avoid redundantly populating the failures map on every tick during a rollback by checking if terminal != "" && isFailure(...). This safely defers all map operations and string parsing until the final tick when the map is actually used.

_{Share Feedback • Review Logs}

doistbot · 2026-06-22T06:48:50Z

+						types.ResourceStatusRollbackFailed:
+						terminal = evt.ResourceStatus
 					}
+					continue


🟠 P1 The continue statement skips all stack-level events from being evaluated by isFailure. While this hides the redundant "The following resource(s) failed" stack event during typical deployments, it also drops stack-level pre-flight failures (e.g., missing IAM capabilities, export overlaps, or unresolvable parameters) where no underlying resources fail. This results in the action returning a generic UPDATE_ROLLBACK_COMPLETE error and dropping the actual root cause, which is a regression from the previous behavior.

Remove continue so that stack-level _FAILED events are still collected. Because stack-level events occur after resource failures, sortedFailures will still correctly place the true resource root cause first when both exist.

doistbot · 2026-06-22T06:48:50Z

+		log.Printf("%s%s (%s) %s: %s", githubErrPrefix, e.logicalID, e.resType, e.status, oneLine(e.reason))
+	}
+	root := failures[0]
+	return fmt.Errorf("%s (%s) %s: %s — see stack events: %s", root.logicalID, root.resType, root.status, oneLine(root.reason), consoleURL)


🟡 P2 ResourceStatusReason is optional in CloudFormation events, so a *_FAILED event can legitimately arrive with no reason. In that case this now returns Resource (Type) STATUS: — see stack events, even though the PR description says to fall back to a generic message when no resource-level reason is available. Please treat an empty root.reason as missing here and either omit the : %s fragment or fall back to the terminal-status message + console link.

Sign in to view

+}
+
+func eventsConsoleURL(region, stackID string) string {
+	return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s",


doistbot · 2026-06-22T06:48:50Z

+						reason:    unptr(evt.ResourceStatusReason),
+						timestamp: unptr(evt.Timestamp),
+					}
+					failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe


🟡 P2 Replace this synthetic composite key with the event's unique ID (unptr(evt.EventId)). CloudFormation guarantees a stable EventId for every stack event, which provides a safer, allocation-free way to deduplicate events without relying on stringified timestamps.

Decent observation, but there's no need to have a map at all, and llms missed this.

doistbot · 2026-06-22T06:48:50Z

+
+func sortedFailures(m map[string]failedEvent) []failedEvent {
+	out := slices.Collect(maps.Values(m))
+	slices.SortFunc(out, func(a, b failedEvent) int { return a.timestamp.Compare(b.timestamp) })


🟡 P2 This only sorts by timestamp, but the failures were first collected into a map, so any two events with the same timestamp end up in arbitrary order. Since reportFailure() treats failures[0] as the root cause, the reported root failure can change run to run for equal-timestamp events. Keep scan order in a slice (using a separate set only for dedupe if needed), or add a deterministic tiebreaker before selecting the first failure.

doistbot · 2026-06-22T06:48:50Z

-						return cmp.Or(likelyRootCause, fmt.Errorf("%v, see AWS CloudFormation Console for more details", evt.ResourceStatus))
 					case types.ResourceStatusUpdateComplete:
 						return nil
+					case types.ResourceStatusUpdateRollbackComplete,


🟡 P2 This change stops returning immediately on rollback and now keeps paginating DescribeStackEvents until the one-hour cutoff. On stacks with a lot of recent history, that means extra API pages for unrelated older operations right on the failure path. Since events are newest-first, consider breaking once you've reached the current update's start event for this ClientRequestToken (for example the matching stack UPDATE_IN_PROGRESS event) instead of scanning the whole cutoff window.

doistbot · 2026-06-22T06:48:50Z

+// table plus per-resource error annotations — and returns the most likely root
+// cause as an error. The earliest failure is the root cause; later ones usually
+// cascade from it.
+func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error {


🟡 P2 reportFailure is the core of this PR — it picks failures[0] as the root cause, formats the error message, logs failures[1:] as annotations, and falls back to a generic message when there are no resource-level failures. None of this is tested. A regression here (wrong element selected, broken fallback, malformed message) would directly reintroduce the wrong-root-cause problem this PR fixes. It's a pure function in the same package — with GITHUB_STEP_SUMMARY unset, a test can pass a []failedEvent and assert the returned error string for both the populated and empty cases.

Sign in to view

+		region, url.QueryEscape(stackID))
+}
+
+func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {


artyom · 2026-07-01T15:44:47Z

+		// Re-scan events on every tick. The stack-level terminal event is the
+		// newest one, but the resource failures that caused it are older, so on
+		// failure we keep scanning the page to collect them all before reporting.
+		failures := make(map[string]failedEvent)


You don't need a map here, []failedEvent is enough.

artyom · 2026-07-01T15:47:03Z

 					}
+					continue
+				}
+				if isFailure(evt.ResourceStatus, unptr(evt.ResourceStatusReason)) {


No reason to have a dedicated function since this is a single place where it's used. It can be inlined and still be a one-line condition.

artyom · 2026-07-01T15:48:11Z

+						reason:    unptr(evt.ResourceStatusReason),
+						timestamp: unptr(evt.Timestamp),
+					}
+					failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe


Decent observation, but there's no need to have a map at all, and llms missed this.

artyom · 2026-07-01T15:52:27Z

+}
+
+func eventsConsoleURL(region, stackID string) string {
+	return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s",


You don't need the region at all.

func consoleURL(arn string) string { return "https://console.aws.amazon.com/go/view?arn="+url.QueryEscape(arn) }

artyom · 2026-07-01T15:57:49Z

+// cascade from it.
+func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error {
+	consoleURL := eventsConsoleURL(region, stackID)
+	if path := os.Getenv("GITHUB_STEP_SUMMARY"); path != "" {


Bad pattern/code smell — functions should now have side effects like this.

Instead, reportFailure can return a dedicated error type that has a specific method taking a file name as the argument to write report to.

artyom · 2026-07-01T15:59:23Z

+		region, url.QueryEscape(stackID))
+}
+
+func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {


This can be a method on a dedicated error type, and only take a single argument — file name where to write report to.

artyom · 2026-07-01T16:02:59Z

+}
+
+func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error {
+	f, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY, 0o644)


You can just build the content into bytes.Buffer instead of strings.Builder as the code already does, and then write it like this:

return os.WriteFile(filename, b.Bytes(), 0666)

the github step summary is unique for each step, so you can just write it without being afraid about overwriting it.

artyom · 2026-07-01T16:20:02Z

+
+// oneLine collapses runs of whitespace (including newlines) into single spaces
+// so a reason renders on a single log line or table cell.
+func oneLine(s string) string { return strings.Join(strings.Fields(s), " ") }


A more efficient way to do this would be:

func oneLine(s string) string { var prevIsSpace bool f := func(r rune) rune { switch { case unicode.IsSpace(r): if prevIsSpace { return -1 } prevIsSpace = true return ' ' default: prevIsSpace = false return r } } return strings.TrimSpace(strings.Map(f, s)) }

artyom · 2026-07-01T16:27:16Z

 	}
 }
+
+func Test_isFailure(t *testing.T) {


Useless test, can be dropped.

artyom · 2026-07-01T16:28:04Z

+	}
+}
+
+func Test_eventsConsoleURL(t *testing.T) {


Useless test.

artyom · 2026-07-01T16:32:49Z

Next time let's maybe briefly discuss the approach in a github issue first before unleashing LLMs on the task, otherwise it takes more time to understand what they did. 😅

artyom and others added 2 commits March 30, 2026 12:07

Point forked action to org-local ghcr.io repository

250c1a5

While here, replace action context var with explicit value, because docker repository name must be lowercase.

Report root cause when a stack update fails

413a10e

On a failed deploy, scan the stack events for this update and surface the earliest failed resource as the error. Under GitHub Actions, write a summary table of all failed resources plus a console deep link to the job summary.

deorus requested a review from artyom June 22, 2026 06:42

deorus marked this pull request as ready for review June 22, 2026 06:42

doistbot reviewed Jun 22, 2026

View reviewed changes

artyom requested changes Jul 1, 2026

View reviewed changes

artyom force-pushed the main branch 2 times, most recently from c3a9b03 to f48e46e Compare July 2, 2026 18:00

Uh oh!

Conversation

deorus commented Jun 20, 2026

What changed

Notes

Uh oh!

doistbot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as low quality.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as low quality.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

artyom commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants