ducktape: fix bad test case with acks=1 #26878


Merged: 1 commit, Jul 22, 2025

Conversation

@wdberkeley (Contributor) commented on Jul 17, 2025

test_relaxed_acks had one parameterization running with acks=1, but it expected that the consumer would see exactly the set of offsets that were acked to the producers, even with injected node failures. This is not always the case:

  1. If the leader acks a message but doesn't have time to replicate it before failing, and a new leader is elected while the original leader is down, the original leader will truncate the acked message from its log and no consumer will see it.
  2. If the leader acks a message but doesn't have time to replicate or flush it to disk before failing, and the leader is reelected when it restarts, it may lose the message and no consumer will see it.

Situation 2 was recently seen in CI.

This change adjusts the validation to allow for messages to go missing when acks=1.
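The relaxed check described above can be sketched as follows. This is a minimal illustrative sketch, not the actual ducktape test code; `validate_offsets` and its signature are hypothetical:

```python
# Hypothetical sketch of the relaxed validation described in this PR.
# With acks=1, an acked message may be lost on leader failover or an
# unflushed restart, so the consumed set only needs to be a subset of
# the acked set. With acks=-1, every acked offset must be consumed.

def validate_offsets(acked, consumed, acks):
    """Return True if the consumed offsets are consistent with `acks`."""
    if acks == 1:
        # Acked-but-lost messages are tolerated, but consumers must never
        # see an offset that was not acked (no phantom reads).
        return set(consumed) <= set(acked)
    # acks=-1: acked and consumed offsets must match exactly.
    return set(consumed) == set(acked)
```

The key design point is that the check is asymmetric: message loss is permitted under acks=1, but a consumed offset that was never acked would still indicate a bug.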

Fixes CORE-12793

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

@andrwng (Contributor) previously approved these changes on Jul 17, 2025:

LGTM

@vbotbuildovich (Collaborator) commented on Jul 17, 2025

CI test results

test results on build#69205

  • RandomNodeOperationsTest.test_node_operations, arguments {"cloud_storage_type": 2, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} (ducktape): FLAKY, 19/21 passed. Upstream reliability is '99.65397923875432'; current run reliability is '90.47619047619048'. Drift is 9.17779 and the allowed drift is set to 50. The test should PASS. Job: https://buildkite.com/redpanda/redpanda/builds/69205#01981979-8d63-4be0-a41f-e61f0c4391e2
  • RandomNodeOperationsTest.test_node_operations, arguments {"cloud_storage_type": 1, "compaction_mode": "adjacent_merge", "enable_failures": true, "mixed_versions": false, "with_iceberg": true} (ducktape): FLAKY, 20/21 passed. Upstream reliability is '97.5609756097561'; current run reliability is '95.23809523809523'. Drift is 2.32288 and the allowed drift is set to 50. The test should PASS. Job: https://buildkite.com/redpanda/redpanda/builds/69205#01981979-8d5c-4856-9bea-fc43efcf56a3
  • RpkDebugBundleTest.test_debug_bundle (ducktape): FLAKY, 20/21 passed. Upstream reliability is '100.0'; current run reliability is '95.23809523809523'. Drift is 4.7619 and the allowed drift is set to 50. The test should PASS. Job: https://buildkite.com/redpanda/redpanda/builds/69205#01981979-8d5d-4f22-9913-82ee50342b7e

test results on build#69384

  • RandomNodeOperationsTest.test_node_operations, arguments {"cloud_storage_type": 2, "compaction_mode": "chunked_sliding_window", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} (ducktape): FLAKY, 14/21 passed. Upstream reliability is '100.0'; current run reliability is '66.66666666666666'. Drift is 33.33333 and the allowed drift is set to 50. The test should PASS. Job: https://buildkite.com/redpanda/redpanda/builds/69384#01982ef3-51f8-4ba5-b447-94dbaf88348f

@@ -88,7 +88,7 @@ def test_consumer_interruption(self):
 @parametrize(write_caching=True, disable_batch_cache=False)
 @parametrize(write_caching=True, disable_batch_cache=True)
 @parametrize(write_caching=False)
-def test_relaxed_acks(self, write_caching, disable_batch_cache=False):
+def test_acks(self, write_caching, disable_batch_cache=False):
A reviewer (Member) commented:

Can we keep the test but just change the validation? It is important to test Redpanda with relaxed acks; with this change we are losing coverage.

@wdberkeley (Contributor, Author) replied:
Yes, I can do that. Is it important to set write_caching=false with relaxed acks, as this test does? Should we test relaxed acks with write caching on and off?
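The combinations under discussion can be enumerated as a small parameter matrix. This is an illustrative sketch only; the names and the choice of combinations are assumptions, not the test's actual ducktape parameterization:

```python
from itertools import product

# Illustrative parameter matrix: relaxed acks (acks=1) exercised with
# write caching both on and off, alongside the acks=-1 cases.
ACKS = (-1, 1)
WRITE_CACHING = (False, True)

def parameter_matrix():
    """Enumerate (acks, write_caching) combinations to parameterize over."""
    return list(product(ACKS, WRITE_CACHING))
```

Each tuple would become one `@parametrize` case, so relaxed acks is covered under both caching configurations rather than only `write_caching=False`.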

The original commit message (superseded by the force-push) had concluded:

This change removes the `acks=1` case and renames the test, because it's no longer testing relaxed acks, just default `acks=-1`, with different levels of caching enabled.
@wdberkeley (Contributor, Author) commented:
Force-push to address Michal's feedback: keep the test case, just relax the validation when acks=1.

@wdberkeley wdberkeley merged commit b78df99 into redpanda-data:dev Jul 22, 2025
17 checks passed
@wdberkeley wdberkeley deleted the too-relaxed-acks branch July 22, 2025 14:56