Flagger ignores Datadog error thresholds, deploys failing versions #1815

@GurayCetin


I'm testing a failure scenario by intentionally deploying a broken version of my application. According to my metric template and canary configuration, I expect Flagger to:

  • Detect the high error rates (visible in Datadog query results)
  • Fail the canary analysis checks
  • Halt the rollout of the broken version

Current Behavior:
Despite clear evidence in Datadog showing error rates significantly above my configured threshold (1.1), Flagger:

  • Shows incorrect metric values (flagger_canary_metric_analysis = 1)
  • Proceeds with deploying the broken version
  • Provides no visibility into how it arrived at this decision (no query results in controller logs)

Debugging Observations:

  • Datadog Verification: Raw query results show values up to 20 (far exceeding the 1.1 threshold)
  • Flagger Metrics: Internal metric shows 1 (which doesn't match Datadog observations)
  • Logging Gap: Controller logs show the executed query but not the actual returned values
  • Behavior: Canary progresses when it should fail

I don't see any query result in the Flagger controller logs that would explain the pass/fail decision; the controller only prints the query it runs against Datadog.

{"level":"debug","ts":"2025-07-29T11:38:08.241Z","caller":"controller/scheduler_metrics.go:309","msg":"Metric template error-rate.buy query: clamp_min(\n sum:istio.mesh.request.count.total{env:development AND reporter:destination AND destination_app:giftcard AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /\n sum:istio.mesh.request.count.total{env:development,reporter:destination,destination_app:giftcard}.as_count(),\n 0.05\n) / clamp_min(\n sum:istio.mesh.request.count.total{env:development AND reporter:destination AND destination_app:giftcard-primary AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /\n sum:istio.mesh.request.count.total{env:development,reporter:destination,destination_app:giftcard-primary}.as_count(),\n 0.05\n)","canary":"giftcard.buy"}

  • Canary configuration for analysis. The metric template is expected to return 1 when the canary is as healthy as the primary; if the canary's error rate is more than 10% higher than the primary's, I expect the check to fail.
  analysis:
    interval: 3m
    threshold: 5
    maxWeight: 50
    stepWeights: [20, 50]
    metrics:
      - name: "error-rate"
        templateRef:
          name: "error-rate"
          namespace: buy
        interval: 3m
        thresholdRange:
          max: 1.1  # 10% higher than 1
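
For reference, the check I expect from the analysis block above can be sketched in a few lines of Python. This is not Flagger's code: the function names are hypothetical, and the semantics (a sample fails when it falls outside `thresholdRange`, and the rollout is rolled back once the number of failed checks reaches `threshold`) are my reading of the configuration, stated here as an assumption.

```python
from typing import Iterable, Optional


def within_threshold_range(
    value: float,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> bool:
    """Return True when value lies inside the optional [min, max] range."""
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


def should_roll_back(
    samples: Iterable[float], max_value: float, failure_threshold: int
) -> bool:
    """Count failed checks; assume rollback once the count reaches the threshold."""
    failed = 0
    for value in samples:
        if not within_threshold_range(value, max_value=max_value):
            failed += 1
            if failed >= failure_threshold:
                return True
    return False


if __name__ == "__main__":
    # A healthy sample passes, a sample like the one seen in Datadog does not.
    print(within_threshold_range(1.0, max_value=1.1))
    print(within_threshold_range(20.0, max_value=1.1))
    # Five consecutive bad samples with threshold: 5 should stop the rollout.
    print(should_roll_back([20.0] * 5, max_value=1.1, failure_threshold=5))
```

With the values Datadog shows (up to 20) and `max: 1.1`, every such sample should count as a failed check under these assumptions.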
  • Metric template:

The query computes canary_error_rate / primary_error_rate. Each error rate is floored at 0.05 with clamp_min, so when both canary and primary are below 5% errors, both sides equal 0.05 and the result is always 1, which allows promotion.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: datadog
    address: "https://api.datadoghq.eu"
    secretRef:
      name: datadog
  query: |-
    clamp_min(
      sum:istio.mesh.request.count.total{env:${ENV} AND reporter:destination AND destination_app:{{ target }} AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /
      sum:istio.mesh.request.count.total{env:${ENV},reporter:destination,destination_app:{{ target }}}.as_count(),
      0.05
    ) / clamp_min(
      sum:istio.mesh.request.count.total{env:${ENV} AND reporter:destination AND destination_app:{{ target }}-primary AND (response_code:5* OR grpc_response_status IN (2,4,12,13,14,15))}.as_count() /
      sum:istio.mesh.request.count.total{env:${ENV},reporter:destination,destination_app:{{ target }}-primary}.as_count(),
      0.05
    )
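
The arithmetic of the query above can be sketched outside Datadog. The helper names below are hypothetical and the input rates are illustrative, not measured values; the sketch only mirrors the clamp_min(canary, 0.05) / clamp_min(primary, 0.05) structure of the template.

```python
def clamp_min(value: float, floor: float) -> float:
    """Mirror the clamp_min idea: never return less than floor."""
    return max(value, floor)


def error_rate_ratio(
    canary_rate: float, primary_rate: float, floor: float = 0.05
) -> float:
    """Ratio of floored canary error rate to floored primary error rate."""
    return clamp_min(canary_rate, floor) / clamp_min(primary_rate, floor)


if __name__ == "__main__":
    # Both below the 5% floor: the ratio collapses to 1 and the check passes.
    print(f"{error_rate_ratio(0.01, 0.02):.2f}")
    # Canary fully failing, primary healthy: 1.0 / 0.05, the largest possible ratio.
    print(f"{error_rate_ratio(1.0, 0.0):.2f}")
    # Canary at 10% errors, primary healthy: 0.10 / 0.05.
    print(f"{error_rate_ratio(0.10, 0.0):.2f}")
```

Under this formula the ratio tops out at 1 / 0.05 = 20, which is consistent with the peak value seen in Datadog, so a canary returning errors on every request should produce a value far above the 1.1 limit.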
  • In Datadog, the query returns values up to 20.
  • flagger_canary_metric_analysis reports 1. I'm not sure why it is 1, which differs from the query result.
