Report root cause when a stack update fails#3
Conversation
While here, replace action context var with explicit value, because docker repository name must be lowercase.
On a failed deploy, scan the stack events for this update and surface the earliest failed resource as the error. Under GitHub Actions, write a summary table of all failed resources plus a console deep link to the job summary.
doistbot
left a comment
There was a problem hiding this comment.
This PR improves CloudFormation deploy failure reporting by scanning stack events for the update's ClientRequestToken, identifying the earliest failed resource as the root cause, and writing a summary table with console links to the GitHub Actions job summary.
Few things worth tightening:
- Stack-level failures get dropped: The
continuestatement that skips stack events also hides pre-flight failures (missing IAM capabilities, export overlaps, etc.) where no resource fails, resulting in a generic error instead of the real cause. Remove thecontinueso stack-level_FAILEDevents are still collected — since stack events are newer than resource failures,sortedFailureswill still place the true root cause first. - Edge cases in the root-cause message: A
*_FAILEDevent with noResourceStatusReasonproduces a malformed message (… STATUS: — see stack events). Treat an emptyroot.reasonas missing and fall back to the generic message + console link. Separately, the console URL hardcodesconsole.aws.amazon.com, which is wrong for GovCloud and China partitions — derive the host from the stack ARN/partition instead. - Nondeterministic root-cause selection: Events collected into a map then sorted only by timestamp can shuffle equal-timestamp failures, making the reported root cause vary run to run. Use
EventIdfor dedup instead of the composite key, and add a deterministic tiebreaker or preserve scan order sofailures[0]is stable. - Polling and testing gaps: On busy stacks, pagination now continues through the full one-hour cutoff even after rollback — break early once the matching
UPDATE_IN_PROGRESSevent for this token is found. Also, the two highest-value functions (reportFailureandwriteStepSummary) are untested despite being pure functions with an existing test harness; adding coverage for both the populated and empty cases would guard against regressions in the core logic this PR introduces.
I also included a few optional follow-up notes in the details below.
Optional follow-up note (1)
- [P3] main.go:146: Since the stack's terminal event is always newer than the resource failures that caused it, it will be encountered first in this newest-to-oldest stream. You can avoid redundantly populating the
failuresmap on every tick during a rollback by checkingif terminal != "" && isFailure(...). This safely defers all map operations and string parsing until the final tick when the map is actually used.
| types.ResourceStatusRollbackFailed: | ||
| terminal = evt.ResourceStatus | ||
| } | ||
| continue |
There was a problem hiding this comment.
🟠 P1 The continue statement skips all stack-level events from being evaluated by isFailure. While this hides the redundant "The following resource(s) failed" stack event during typical deployments, it also drops stack-level pre-flight failures (e.g., missing IAM capabilities, export overlaps, or unresolvable parameters) where no underlying resources fail. This results in the action returning a generic UPDATE_ROLLBACK_COMPLETE error and dropping the actual root cause, which is a regression from the previous behavior.
Remove continue so that stack-level _FAILED events are still collected. Because stack-level events occur after resource failures, sortedFailures will still correctly place the true resource root cause first when both exist.
| log.Printf("%s%s (%s) %s: %s", githubErrPrefix, e.logicalID, e.resType, e.status, oneLine(e.reason)) | ||
| } | ||
| root := failures[0] | ||
| return fmt.Errorf("%s (%s) %s: %s — see stack events: %s", root.logicalID, root.resType, root.status, oneLine(root.reason), consoleURL) |
There was a problem hiding this comment.
🟡 P2 ResourceStatusReason is optional in CloudFormation events, so a *_FAILED event can legitimately arrive with no reason. In that case this now returns Resource (Type) STATUS: — see stack events, even though the PR description says to fall back to a generic message when no resource-level reason is available. Please treat an empty root.reason as missing here and either omit the : %s fragment or fall back to the terminal-status message + console link.
| } | ||
|
|
||
| func eventsConsoleURL(region, stackID string) string { | ||
| return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s", |
This comment was marked as low quality.
This comment was marked as low quality.
Sorry, something went wrong.
| reason: unptr(evt.ResourceStatusReason), | ||
| timestamp: unptr(evt.Timestamp), | ||
| } | ||
| failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe |
There was a problem hiding this comment.
🟡 P2 Replace this synthetic composite key with the event's unique ID (unptr(evt.EventId)). CloudFormation guarantees a stable EventId for every stack event, which provides a safer, allocation-free way to deduplicate events without relying on stringified timestamps.
There was a problem hiding this comment.
Decent observation, but there's no need to have a map at all, and llms missed this.
|
|
||
| func sortedFailures(m map[string]failedEvent) []failedEvent { | ||
| out := slices.Collect(maps.Values(m)) | ||
| slices.SortFunc(out, func(a, b failedEvent) int { return a.timestamp.Compare(b.timestamp) }) |
There was a problem hiding this comment.
🟡 P2 This only sorts by timestamp, but the failures were first collected into a map, so any two events with the same timestamp end up in arbitrary order. Since reportFailure() treats failures[0] as the root cause, the reported root failure can change run to run for equal-timestamp events. Keep scan order in a slice (using a separate set only for dedupe if needed), or add a deterministic tiebreaker before selecting the first failure.
| return cmp.Or(likelyRootCause, fmt.Errorf("%v, see AWS CloudFormation Console for more details", evt.ResourceStatus)) | ||
| case types.ResourceStatusUpdateComplete: | ||
| return nil | ||
| case types.ResourceStatusUpdateRollbackComplete, |
There was a problem hiding this comment.
🟡 P2 This change stops returning immediately on rollback and now keeps paginating DescribeStackEvents until the one-hour cutoff. On stacks with a lot of recent history, that means extra API pages for unrelated older operations right on the failure path. Since events are newest-first, consider breaking once you've reached the current update's start event for this ClientRequestToken (for example the matching stack UPDATE_IN_PROGRESS event) instead of scanning the whole cutoff window.
| // table plus per-resource error annotations — and returns the most likely root | ||
| // cause as an error. The earliest failure is the root cause; later ones usually | ||
| // cascade from it. | ||
| func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error { |
There was a problem hiding this comment.
🟡 P2 reportFailure is the core of this PR — it picks failures[0] as the root cause, formats the error message, logs failures[1:] as annotations, and falls back to a generic message when there are no resource-level failures. None of this is tested. A regression here (wrong element selected, broken fallback, malformed message) would directly reintroduce the wrong-root-cause problem this PR fixes. It's a pure function in the same package — with GITHUB_STEP_SUMMARY unset, a test can pass a []failedEvent and assert the returned error string for both the populated and empty cases.
| region, url.QueryEscape(stackID)) | ||
| } | ||
|
|
||
| func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error { |
This comment was marked as low quality.
This comment was marked as low quality.
Sorry, something went wrong.
| // Re-scan events on every tick. The stack-level terminal event is the | ||
| // newest one, but the resource failures that caused it are older, so on | ||
| // failure we keep scanning the page to collect them all before reporting. | ||
| failures := make(map[string]failedEvent) |
There was a problem hiding this comment.
You don't need a map here, []failedEvent is enough.
| } | ||
| continue | ||
| } | ||
| if isFailure(evt.ResourceStatus, unptr(evt.ResourceStatusReason)) { |
There was a problem hiding this comment.
No reason to have a dedicated function since this is a single place where it's used. It can be inlined and still be a one-line condition.
| reason: unptr(evt.ResourceStatusReason), | ||
| timestamp: unptr(evt.Timestamp), | ||
| } | ||
| failures[fe.logicalID+"\x00"+fe.timestamp.String()] = fe |
There was a problem hiding this comment.
Decent observation, but there's no need to have a map at all, and llms missed this.
| } | ||
|
|
||
| func eventsConsoleURL(region, stackID string) string { | ||
| return fmt.Sprintf("https://%[1]s.console.aws.amazon.com/cloudformation/home?region=%[1]s#/stacks/events?stackId=%s", |
There was a problem hiding this comment.
You don't need the region at all.
func consoleURL(arn string) string {
return "https://console.aws.amazon.com/go/view?arn="+url.QueryEscape(arn)
}| // cascade from it. | ||
| func reportFailure(region, stackID string, terminal types.ResourceStatus, failures []failedEvent) error { | ||
| consoleURL := eventsConsoleURL(region, stackID) | ||
| if path := os.Getenv("GITHUB_STEP_SUMMARY"); path != "" { |
There was a problem hiding this comment.
Bad pattern/code smell — functions should now have side effects like this.
Instead, reportFailure can return a dedicated error type that has a specific method taking a file name as the argument to write report to.
| region, url.QueryEscape(stackID)) | ||
| } | ||
|
|
||
| func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error { |
There was a problem hiding this comment.
This can be a method on a dedicated error type, and only take a single argument — file name where to write report to.
| } | ||
|
|
||
| func writeStepSummary(path, consoleURL string, terminal types.ResourceStatus, failures []failedEvent) error { | ||
| f, err := os.OpenFile(path, os.O_APPEND|os.O_WRONLY, 0o644) |
There was a problem hiding this comment.
You can just build the content into bytes.Buffer instead of strings.Builder as the code already does, and then write it like this:
return os.WriteFile(filename, b.Bytes(), 0666)
the github step summary is unique for each step, so you can just write it without being afraid about overwriting it.
|
|
||
| // oneLine collapses runs of whitespace (including newlines) into single spaces | ||
| // so a reason renders on a single log line or table cell. | ||
| func oneLine(s string) string { return strings.Join(strings.Fields(s), " ") } |
There was a problem hiding this comment.
A more efficient way to do this would be:
func oneLine(s string) string {
var prevIsSpace bool
f := func(r rune) rune {
switch {
case unicode.IsSpace(r):
if prevIsSpace {
return -1
}
prevIsSpace = true
return ' '
default:
prevIsSpace = false
return r
}
}
return strings.TrimSpace(strings.Map(f, s))
}| } | ||
| } | ||
|
|
||
| func Test_isFailure(t *testing.T) { |
| } | ||
| } | ||
|
|
||
| func Test_eventsConsoleURL(t *testing.T) { |
|
Next time let's maybe briefly discuss the approach in a github issue first before unleashing LLMs on the task, otherwise it takes more time to understand what they did. 😅 |
c3a9b03 to
f48e46e
Compare
Addresses Doist/platform-backlog#734 — failed deploys currently surface only a single, often-wrong error line, leaving you to dig through the AWS Console. This action is the standard deploy step across ~30 Doist repos, so the improvement applies broadly, not just to Todoist.
What changed
ClientRequestTokenand collect all genuinely-failed resources (any*_FAILEDstatus, skipping cancellation cascades). Report the earliest failure as the root cause — the previous heuristic captured the newest failed event, which is usually a downstream symptom.::error::annotations.Notes
cloudformation:DescribeStackEventswas already required.