Description
I'm testing a failure scenario by intentionally deploying a broken version of my application. According to my metric template and canary configuration, I expect Flagger to:
- Detect the high error rates (visible in Datadog query results)
- Fail the canary analysis checks
- Halt the rollout of the broken version
Current Behavior:
Despite clear evidence in Datadog showing error rates significantly above my configured threshold (1.1),
Flagger:
- Shows incorrect metric values (flagger_canary_metric_analysis = 1)
- Proceeds with deploying the broken version
- Provides no visibility into how it arrived at this decision (no query results in controller logs)
Debugging Observations:
- Datadog Verification: Raw query results show values up to 20 (far exceeding the 1.1 threshold)
- Flagger Metrics: Internal metric shows 1 (which doesn't match Datadog observations)
- Logging Gap: Controller logs show the executed query but not the actual returned values
- Behavior: Canary progresses when it should fail
I don't see any query result in the Flagger controller logs that would explain whether the check passed or failed; the controller only prints the query it runs against Datadog.
{"level":"debug","ts":"2025-07-29T11:38:08.241Z","caller":"controller/scheduler_metrics.go:309","msg":"Metric template error-rate.buy query: clamp_min(\n sum:istio.mesh.request.count.total{env:development AND reporter:destination AND destination_app:giftcard AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /\n sum:istio.mesh.request.count.total{env:development,reporter:destination,destination_app:giftcard}.as_count(),\n 0.05\n) / clamp_min(\n sum:istio.mesh.request.count.total{env:development AND reporter:destination AND destination_app:giftcard-primary AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /\n sum:istio.mesh.request.count.total{env:development,reporter:destination,destination_app:giftcard-primary}.as_count(),\n 0.05\n)","canary":"giftcard.buy"}
- Canary analysis configuration. The metric template is expected to return 1 when canary and primary behave alike, so if the canary's error rate is more than 10% higher than the primary's, I expect the check to fail.
analysis:
  interval: 3m
  threshold: 5
  maxWeight: 50
  stepWeights: [20, 50]
  metrics:
  - name: "error-rate"
    templateRef:
      name: "error-rate"
      namespace: buy
    interval: 3m
    thresholdRange:
      max: 1.1 # 10% higher than 1
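To make the expectation concrete, here is a minimal Python sketch of how I understand a `thresholdRange` with only `max` set. It illustrates the expected semantics and is not Flagger's actual code; `within_threshold_range` is a hypothetical helper name.

```python
from typing import Optional


def within_threshold_range(
    value: float,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> bool:
    """Return True when value lies inside the optional [min, max] bounds.

    Hypothetical helper: a missing bound is treated as unbounded.
    """
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


# A ratio of 1 is inside the range, so the check should pass.
print(within_threshold_range(1.0, max_value=1.1))   # True
# A ratio of 20 is above max, so the check should fail.
print(within_threshold_range(20.0, max_value=1.1))  # False
```

With `max: 1.1`, a value of 20 as seen in Datadog should count as a failed check, and five failed checks (`threshold: 5`) should halt the rollout.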
- Metric template. The query computes canary_error_rate / primary_error_rate. clamp_min floors each error rate at 0.05, so when both canary and primary are below 5% errors, both values become 0.05 and the ratio is always 1, which allows promotion.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: datadog
    address: "https://api.datadoghq.eu"
    secretRef:
      name: datadog
  query: |-
    clamp_min(
      sum:istio.mesh.request.count.total{env:${ENV} AND reporter:destination AND destination_app:{{ target }} AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /
      sum:istio.mesh.request.count.total{env:${ENV},reporter:destination,destination_app:{{ target }}}.as_count(),
      0.05
    ) / clamp_min(
      sum:istio.mesh.request.count.total{env:${ENV} AND reporter:destination AND destination_app:{{ target }}-primary AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /
      sum:istio.mesh.request.count.total{env:${ENV},reporter:destination,destination_app:{{ target }}-primary}.as_count(),
      0.05
    )
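The arithmetic of this query can be sketched in Python as below. This is only an illustration of the formula, assuming each error rate is a fraction between 0 and 1; `clamp_min` mirrors the Datadog function of the same name, and `error_rate_ratio` is a hypothetical helper name.

```python
def clamp_min(value: float, floor: float) -> float:
    """Return value, but never less than floor."""
    return max(value, floor)


def error_rate_ratio(canary_rate: float, primary_rate: float, floor: float = 0.05) -> float:
    """Ratio of the floored canary error rate to the floored primary error rate."""
    return clamp_min(canary_rate, floor) / clamp_min(primary_rate, floor)


# Both rates under 5%: both are floored to 0.05, so the ratio is 1.
print(f"{error_rate_ratio(0.01, 0.02):.2f}")  # 1.00
# Canary at 10% errors, primary at or below 5%: ratio of about 2.
print(f"{error_rate_ratio(0.10, 0.03):.2f}")  # 2.00
# Canary failing every request, healthy primary: ratio of about 20.
print(f"{error_rate_ratio(1.0, 0.0):.2f}")    # 20.00
```

Under these assumptions the ratio cannot go above 1 / 0.05 = 20. That matches the peak I see in Datadog and would mean the canary fails almost every request while the primary stays under 5% errors.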
- The query result in Datadog reaches values up to 20.

- flagger_canary_metric_analysis has the value 1. I'm not sure why it is 1, which differs from the query result.
