[ENG-8409] fix: better handle errors in celery result backend #881

Open · wants to merge 1 commit into CenterForOpenScience:develop from aaxelb:fix/8409-worker-lost-error
Conversation

@aaxelb (Contributor) commented Jul 28, 2025

- use celery's own "retry on result-backend error" logic (a minimal sketch follows this list)
    - remove the `die_on_unhandled` hack
    - new env var: `CELERY_RESULT_BACKEND_MAX_RETRIES` (default 173, why not)
    - retry on errors caused by `get_or_create` in overlapping transactions (with less-than-strictest transaction isolation levels)
- port fixes from celery's `BaseKeyValueStoreBackend` (see the second sketch below):
    - avoid clobbering successes with "worker lost" or other errors
    - avoid erroring when fetching non-existent task results
- use celery's `BaseBackend` instead of `BaseDictBackend` (an equivalent alias kept for back-compat; let's use the better name)
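For context, a minimal sketch of what wiring up celery's built-in retry machinery looks like. The config keys and the `exception_safe_to_retry` hook are celery's real (5.x) API; the backend class name and app wiring below are illustrative stand-ins, not this PR's actual code:

```python
import os

from celery import Celery
from celery.backends.base import BaseBackend
from django.db import IntegrityError, OperationalError

app = Celery('example')  # stand-in for the project's celery app

# Turn on celery's own retry-on-result-backend-error loop; it consults
# exception_safe_to_retry() to decide whether a given failure is transient.
app.conf.result_backend_always_retry = True
app.conf.result_backend_max_retries = int(
    os.environ.get('CELERY_RESULT_BACKEND_MAX_RETRIES', '173'))


class DjangoResultBackend(BaseBackend):  # hypothetical name
    # _store_result() / _get_task_meta_for() omitted for brevity.

    def exception_safe_to_retry(self, exc):
        # A get_or_create racing an overlapping transaction (under
        # less-than-strictest isolation levels) surfaces as an
        # IntegrityError; by the time we retry, the row exists.
        return isinstance(exc, (IntegrityError, OperationalError))
```

With `result_backend_always_retry` left at its default (off), celery re-raises backend errors immediately instead of retrying.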

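The two ported `BaseKeyValueStoreBackend` fixes boil down to a guard in `_store_result` and a fallback in `_get_task_meta_for`, mirroring how celery's own key-value backend handles both cases. A rough sketch of the shape, with an in-memory dict standing in for the real Django-backed storage (the class name and storage are illustrative):

```python
from celery import states
from celery.backends.base import BaseBackend


class GuardedResultBackend(BaseBackend):  # illustrative, not the PR's class
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._store = {}  # stand-in for the actual Django model storage

    def _get_task_meta_for(self, task_id):
        # Fix: a result that was never stored reads as PENDING
        # instead of raising a does-not-exist error.
        try:
            return self._store[task_id]
        except KeyError:
            return {'status': states.PENDING, 'result': None}

    def _store_result(self, task_id, result, state,
                      traceback=None, request=None, **kwargs):
        # Fix: never clobber an already-recorded success with a later
        # "worker lost" (or any other) failure for the same task id.
        if self._get_task_meta_for(task_id)['status'] == states.SUCCESS:
            return result
        self._store[task_id] = self._get_result_meta(
            result=result, state=state, traceback=traceback, request=request)
        return result
```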
@aaxelb aaxelb marked this pull request as ready for review July 28, 2025 14:23
@coveralls

Coverage Status

coverage: 81.802% (+0.03%) from 81.773%
when pulling 5c89bb7 on aaxelb:fix/8409-worker-lost-error
into 150997a on CenterForOpenScience:develop.

@aaxelb aaxelb requested a review from felliott July 29, 2025 14:27
@felliott (Member) left a comment


Seems sensible to me. Only question: is there still a possibility that it could die with an unhandled error? If so, is there somewhere you could hang the sentry_sdk.capture_exception() call? Not a deal breaker, feel free to ignore if it doesn't make sense or isn't easy.
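For what that could look like: `sentry_sdk.capture_exception()` is Sentry's real API, but where it's hung below (around the whole `store_result` call, after celery's retry loop has given up) is just one hypothetical placement:

```python
import sentry_sdk
from celery.backends.base import BaseBackend


class ReportingResultBackend(BaseBackend):  # hypothetical
    def store_result(self, *args, **kwargs):
        try:
            # store_result() runs celery's retry loop around _store_result();
            # anything escaping it here is a genuinely unhandled error.
            return super().store_result(*args, **kwargs)
        except Exception as exc:
            sentry_sdk.capture_exception(exc)  # report, then re-raise
            raise
```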
