Prevent duplicate actions email #35215

Open
wants to merge 13 commits into
base: main

NorthRealm
Contributor

Trying to prevent duplicate action emails by adding an extra check on job status.


Reproducing the issue:

  • Spin up a local instance with email notifications enabled and the mailer set up. Set the log level to Trace.
  • Create a repository.
  • Add a workflow with many jobs.
  • Run the workflow (no need to register runner) and then cancel it manually.
  • Observe trace log and mailbox.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Aug 5, 2025
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Aug 5, 2025
@NorthRealm
Contributor Author

I was trying to answer @inxcts's question when I came across this issue. After applying the patch, I no longer see the duplicate email (the one sent while some jobs are unfinished but the run has already concluded as done). However, when I redo the reproduction steps, I still observe one duplicate attempt to send an email in the trace log, which means two events pass the new check right after the run is cancelled manually. I notice MailActionsTrigger is triggered multiple times.
cc @ChristopherHX, could you please elaborate on how the event is triggered? Is there anywhere that sends the same event twice?

@ChristopherHX
Contributor

ChristopherHX commented Aug 5, 2025

The event sources are scattered. The "show call hierarchy" feature of a Go IDE, applied to the event function, should show every possible source of the event.

Maybe we should restructure this to trigger only completed events from the caller that is able to atomically update the database struct to success/failure/etc...
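
A minimal sketch of that idea, assuming an in-memory run guarded by a mutex in place of the real database update. `Run`, `Finish`, `Status`, and `countCompletedEvents` are hypothetical names for illustration, not Gitea's API. Only the caller whose update actually performs the transition to a terminal status emits the completed event:

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical stand-in for a run status; it is not Gitea's type.
type Status int

const (
	StatusRunning Status = iota
	StatusCancelled
	StatusSuccess
)

// IsDone reports whether the status is terminal.
func (s Status) IsDone() bool { return s != StatusRunning }

// Run is a hypothetical in-memory run. The mutex stands in for a database
// transaction or a conditional UPDATE on the run row.
type Run struct {
	mu     sync.Mutex
	status Status
}

// Finish moves the run to a terminal status and reports whether this call
// performed the transition. Only a caller that gets true should notify.
func (r *Run) Finish(to Status) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status.IsDone() || !to.IsDone() {
		return false
	}
	r.status = to
	return true
}

// countCompletedEvents races n callers to finish the same run and returns
// how many completed events were emitted.
func countCompletedEvents(n int) int {
	run := &Run{}
	var wg sync.WaitGroup
	var mu sync.Mutex
	events := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if run.Finish(StatusCancelled) {
				mu.Lock()
				events++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return events
}

func main() {
	fmt.Println("completed events:", countCompletedEvents(10))
}
```

With this shape, callers that lose the race do not notify at all, so a second code path reaching the same run cannot produce a second completed event.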


Add a workflow with many jobs

How many did you test? My unit tests have a small number of jobs, which could be why this went undetected.

Sounds like I need to audit the event sources with bigger workflows; the workflow_run action trigger and webhook would also fire twice if this is true.


I thought about adding temporary traces that dump stack traces, then looking at the stack traces of the doubled event.
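
A temporary trace of that kind could look roughly like the following, using `runtime/debug.Stack` from the Go standard library. `traceEvent` and `cancelHandler` are hypothetical names for illustration; in practice the call would sit inside the event function and write to the trace log:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// traceEvent returns a labelled stack trace showing who fired the event.
// It is a temporary diagnostic helper, not part of any real API.
func traceEvent(name string) string {
	return fmt.Sprintf("event %s fired from:\n%s", name, debug.Stack())
}

// cancelHandler is a hypothetical caller standing in for one event source.
//
//go:noinline
func cancelHandler() string {
	return traceEvent("workflow_run")
}

func main() {
	// Comparing the traces of two deliveries of the same event shows
	// which call paths produced them.
	fmt.Print(cancelHandler())
}
```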

@NorthRealm
Contributor Author

@ChristopherHX Previously I had only tested the feature with like 1, 2 or 3 dummy jobs. 🤦‍♂️

on:
  push:
  workflow_dispatch:

jobs:
  test: 
    runs-on: ubuntu-latest
    steps:
      - run: exit 0

  test2: 
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - run: exit 0

  test3: 
    needs: [test, test2]
    runs-on: ubuntu-latest
    steps:
      - run: exit 0
  
  test4: 
    needs: [test, test2, test3]
    runs-on: ubuntu-latest
    steps:
      - run: exit 0
  
  test5: 
    needs: [test, test2, test4]
    runs-on: ubuntu-latest
    steps:
      - run: exit 0
  
  test6:
    strategy:
      matrix:
        os: [ubuntu-20.04, ubuntu-22.04, ubuntu-24.04] 
    needs: [test, test2, test3]
    runs-on: ${{ matrix.os }}
    steps:
      - run: exit 0
  
  test7: 
    needs: test6
    runs-on: ubuntu-latest
    steps:
      - run: exit 0
  
  test8: 
    runs-on: ubuntu-latest
    steps:
      - run: exit 0
  
  test9: 
    strategy:
      matrix:
        os: [ubuntu-20.04, ubuntu-22.04, ubuntu-24.04, ubuntu-25.04, windows-2022, windows-2025, macos-13, macos-14, macos-15]
    runs-on: ${{ matrix.os }}
    steps:
      - run: exit 0

  test10: 
    runs-on: ubuntu-latest
    steps:
      - run: exit 0

Here's what I'm currently using to triage.

@NorthRealm
Contributor Author

@ChristopherHX In routers/web/repo/actions/view.go, in Approve and Cancel, why use both NotifyWorkflowRunStatusUpdateWithReload and WorkflowRunStatusUpdate?

@ChristopherHX
Contributor

Thanks for taking a look :)

@ChristopherHX In routers/web/repo/actions/view.go, in Approve and Cancel, why use both NotifyWorkflowRunStatusUpdateWithReload and WorkflowRunStatusUpdate?

Very likely a bug that found its way into my PR during changes by reviewers. I might have missed writing a test for the cancel/approve flows, so this was not detected in CI.

The cases where a WorkflowJobStatus update follows sound more correct.

@NorthRealm
Contributor Author

NorthRealm commented Aug 5, 2025

@ChristopherHX I applied cdb1e80 yet still see the same single duplicate attempt. I cannot figure out the cause at the moment, so feel free to propose changes or push directly to the branch.

@ChristopherHX
Contributor

Yes I can take a look at this as well.


Notice:

The approve workflow run notification is expected to be triggered before the jobs run; for cancelling, the order is to cancel the jobs and then notify the workflow run.

NorthRealm and others added 2 commits August 5, 2025 19:51
@ChristopherHX
Contributor

In routers/web/repo/actions/view.go, in Approve and Cancel, why use both NotifyWorkflowRunStatusUpdateWithReload and WorkflowRunStatusUpdate?

I have added a test, and this is a clear event duplication that is reproducible in tests until the duplicated WorkflowRunStatusUpdate is removed.
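
As a rough illustration of how such a duplication can be caught, the sketch below counts events with a fake notifier. It is not the test added in this PR: `recorder`, `cancelRun`, and `eventsAfterCancel` are hypothetical, and the `notifyTwice` flag only mimics the duplicated call being discussed:

```go
package main

import "fmt"

// recorder is a hypothetical fake notifier that counts events per run ID.
type recorder struct {
	seen map[int64]int
}

// WorkflowRunStatusUpdate records one delivered event for the given run.
func (r *recorder) WorkflowRunStatusUpdate(runID int64) {
	r.seen[runID]++
}

// cancelRun is a hypothetical cancel handler. With notifyTwice set it mimics
// a handler that notifies through two code paths; otherwise it notifies once.
func cancelRun(n *recorder, runID int64, notifyTwice bool) {
	n.WorkflowRunStatusUpdate(runID)
	if notifyTwice {
		n.WorkflowRunStatusUpdate(runID)
	}
}

// eventsAfterCancel cancels run 1 and returns how many events it received.
func eventsAfterCancel(notifyTwice bool) int {
	rec := &recorder{seen: map[int64]int{}}
	cancelRun(rec, 1, notifyTwice)
	return rec.seen[1]
}

func main() {
	fmt.Println("with duplicated call:", eventsAfterCancel(true))
	fmt.Println("after removing it:", eventsAfterCancel(false))
}
```

An assertion that the count equals one fails for the first case and passes for the second, which is the property a regression test for the cancel flow would pin down.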

The approve flow is harder to test in my opinion, and I have skipped it for now.

Now I wonder what other duplication scenario you saw. Is this the same workflow? My test has no runners. Should I add runners?

@NorthRealm
Contributor Author

Quick manual test: I can't reproduce it now. I am not entirely sure about it, as I do not understand the difference. Yesterday in cdb1e80 I applied the same patch, removing the unneeded WorkflowRunStatusUpdate calls, but still observed the issue. After your bc3a467, the issue seems to be eliminated.

@NorthRealm
Contributor Author

@ChristopherHX My manual test has no actual runner. I simply observe the events that trigger sending emails. You might want it anyway.

@NorthRealm NorthRealm marked this pull request as ready for review August 8, 2025 13:53
Contributor

@ChristopherHX ChristopherHX left a comment

Thanks for finding two workflow_run delivery problems

Comment on lines +42 to +47
for _, job := range jobs {
	if !job.Status.IsDone() {
		log.Trace("composeAndSendActionsWorkflowRunStatusEmail: A job is not done. Will not compose and send actions email.")
		return
	}
}
Contributor

This is now a no-op, but it is still useful for diagnosing other undetected faults, as an alternative to adding a workflow_run webhook and looking at the past deliveries.

Member

Contributor Author

@lunny Based on how job status is aggregated, that check is not 100% reliable. Before the patch I got this erroneous email:
[image]

Member

What will happen if a waiting status is considered IsDone?

Also, it’s quite strange that there are three different places checking whether the jobs should be sent.
[image]

Contributor Author

Intentional. Do you have a better solution?

Contributor

@ChristopherHX ChristopherHX Aug 9, 2025

Before the patch I got this erroneous email:
[image]

How to reproduce this bug? This should never send a completed workflow run event.

IMO this should be fixed in the workflow_run event itself: the event should be sent when the run is completed, not when only some jobs are completed (except if you spam reruns and cancellations of random jobs to force inconsistency).

Other valid events are before starting any job

Contributor Author

I switched back to the main branch at c4c1a4b and reproduced the bug by starting a run manually then immediately canceling it. The trace log shows 2 email attempts.

Contributor Author

How does Gitea handle mailer failure? I forgot to turn on the mailbox at first that day, and Gitea printed errors in the background. Do emails that fail to send just go up in smoke?

Contributor

I switched back to the main branch at c4c1a4b and reproduced the bug by starting a run manually then immediately canceling it. The trace log shows 2 email attempts.

Yes, 2 email attempts, but that one is fixed here. What I am writing about is that my automated test here cannot detect the situation where not all jobs are completed when the run completion event is seen.

by starting a run manually then immediately canceling it.

This is literally what the test I added here does, but if I add this assert, log.Fatal is never run for me, even if I run it over and over again. From my point of view there must be some detail other than just cancelling directly after triggering the run without runners.

I placed this code directly in notify.go in WorkflowRunStatusUpdate

	if run.Status.IsDone() {
		jobs, err := actions_model.GetRunJobsByRunID(ctx, run.ID)
		if err != nil {
			log.Error("GetRunJobsByRunID: %v", err)
			return
		}
		for _, job := range jobs {
			if !job.Status.IsDone() {
				log.Fatal("WorkflowRunStatusUpdate: A job is not done. Will not notify workflow run status update.")
				return
			}
		}
	}

Do I have to do manual testing to see this? Even if I revert the duplicated event delivery, I only get a duplicated event instead of an event before all jobs are finished.

Contributor

CancelAbandonedJobs is broken and may send workflow_run events.
Rerun of multiple jobs is called multiple times, so it creates multiple events (these should be filtered on the email side by checking that the run is done).
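
The filtering discussed in this thread can be summarized as a single predicate. This is a sketch under assumptions, not the code in this PR: `Status`, its constants, and `shouldSendEmail` are hypothetical names, and which statuses count as done is an assumption of the example:

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a run or job status.
type Status int

const (
	StatusWaiting Status = iota
	StatusRunning
	StatusSuccess
	StatusCancelled
)

// IsDone reports whether the status is terminal (an assumption of this sketch).
func (s Status) IsDone() bool {
	return s == StatusSuccess || s == StatusCancelled
}

// shouldSendEmail returns true only when the run is done and every job is
// done, so events fired mid-rerun or before all jobs settle are ignored.
func shouldSendEmail(run Status, jobs []Status) bool {
	if !run.IsDone() {
		return false
	}
	for _, job := range jobs {
		if !job.IsDone() {
			return false
		}
	}
	return true
}

func main() {
	// Event fired while a rerun is still in progress.
	fmt.Println(shouldSendEmail(StatusRunning, []Status{StatusSuccess}))
	// Run marked done while one job is still waiting.
	fmt.Println(shouldSendEmail(StatusCancelled, []Status{StatusCancelled, StatusWaiting}))
	// Run and all jobs done.
	fmt.Println(shouldSendEmail(StatusCancelled, []Status{StatusCancelled, StatusSuccess}))
}
```

Keeping both conditions in one place is one way to address the concern about the check being spread over several locations; it does not by itself remove duplicate deliveries of a completed event.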

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Aug 8, 2025
@@ -208,8 +208,5 @@ func (m *mailNotifier) RepoPendingTransfer(ctx context.Context, doer, newOwner *
}

func (m *mailNotifier) WorkflowRunStatusUpdate(ctx context.Context, repo *repo_model.Repository, sender *user_model.User, run *actions_model.ActionRun) {
	if !run.Status.IsDone() {
		return
	}
	MailActionsTrigger(ctx, sender, repo, run)
Member

One last thing from me: the function MailActionsTrigger is unnecessary now; all of its code could be moved into this function.

Labels
lgtm/need 1 This PR needs approval from one additional maintainer to be merged. modifies/go Pull requests that update Go code
4 participants