[Feature/Optimization] Optimize kmesh-daemon CPU usage during massive xDS configuration updates #1549

## What would you like to be added:

I would like to propose optimization mechanisms that let the kmesh-daemon handle large-scale xDS updates more efficiently.

Specifically, the following improvements should be considered to mitigate resource contention during high-volume configuration pushes:

- Support On-Demand xDS Loading: Implement a mechanism where Kmesh fetches or processes xDS configuration lazily. Instead of loading the full cluster state upfront, the daemon would request or process configuration for a service only when traffic towards that service is first initiated. This would significantly reduce the processing burden during global updates, because changes to services that no workload on the node talks to would not need to be translated into eBPF map entries.

- Batch Processing for Syscalls: Batch eBPF map updates (for example with the kernel's batched map operations such as `BPF_MAP_UPDATE_BATCH`, available since Linux 5.6) and coalesce repeated updates to the same key, so that the daemon issues far fewer system calls per push.

- Flow Control / Rate Limiting: Introduce a mechanism to throttle or queue xDS updates within the daemon so that burst scenarios cannot starve co-located workloads of CPU.
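The batching and flow-control ideas above could be combined in one component. The Go sketch below is a minimal illustration under stated assumptions, not Kmesh's actual code: `UpdateBatcher`, `NewUpdateBatcher`, `FlushFunc`, and the use of string keys with byte-slice values are all hypothetical, and the flush hook stands in for whatever batched eBPF map write the daemon would really perform. Pending updates are coalesced by key (last write wins), drained in chunks of at most `maxBatch` entries, and flushed at most once per tick.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// FlushFunc applies one batch of map updates. It is a hypothetical hook: in a
// real daemon it might wrap a single batched eBPF map update syscall.
type FlushFunc func(keys []string, values [][]byte) error

// UpdateBatcher coalesces pending map updates by key (last write wins) and
// flushes them in bounded batches instead of one syscall per update.
type UpdateBatcher struct {
	mu       sync.Mutex
	pending  map[string][]byte
	maxBatch int
	flush    FlushFunc
}

func NewUpdateBatcher(maxBatch int, flush FlushFunc) *UpdateBatcher {
	if maxBatch <= 0 {
		maxBatch = 1
	}
	return &UpdateBatcher{pending: make(map[string][]byte), maxBatch: maxBatch, flush: flush}
}

// Add records an update; a pending value for the same key is overwritten.
func (b *UpdateBatcher) Add(key string, value []byte) {
	b.mu.Lock()
	b.pending[key] = value
	b.mu.Unlock()
}

// Flush drains all pending updates, calling the FlushFunc once per chunk of at
// most maxBatch entries, and returns the number of successful calls.
func (b *UpdateBatcher) Flush() (int, error) {
	b.mu.Lock()
	snapshot := b.pending
	b.pending = make(map[string][]byte)
	b.mu.Unlock()

	keys := make([]string, 0, len(snapshot))
	for k := range snapshot {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic flush order

	calls := 0
	for start := 0; start < len(keys); start += b.maxBatch {
		end := start + b.maxBatch
		if end > len(keys) {
			end = len(keys)
		}
		chunk := keys[start:end]
		values := make([][]byte, len(chunk))
		for i, k := range chunk {
			values[i] = snapshot[k]
		}
		if err := b.flush(chunk, values); err != nil {
			b.requeue(keys[start:], snapshot) // retry the unapplied entries later
			return calls, err
		}
		calls++
	}
	return calls, nil
}

// requeue puts unflushed entries back unless a newer update for the same key
// arrived while the flush was in progress.
func (b *UpdateBatcher) requeue(keys []string, snapshot map[string][]byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, k := range keys {
		if _, newer := b.pending[k]; !newer {
			b.pending[k] = snapshot[k]
		}
	}
}

// Run flushes at most once per interval until stop is closed, so updates that
// arrive between ticks are coalesced. Failed entries are requeued by Flush and
// retried on the next tick.
func (b *UpdateBatcher) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_, _ = b.Flush()
		case <-stop:
			_, _ = b.Flush() // best-effort drain before exiting
			return
		}
	}
}
```

For example, the xDS handler could call `Add` for every translated map entry while a goroutine runs `b.Run(50*time.Millisecond, stop)`; the interval and batch size here are illustrative and would need tuning. Coalescing by key means a key updated many times within one interval costs a single write, and the tick interval bounds how often the daemon enters the kernel during a burst.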

## Why is this needed:

As cloud-native architectures grow larger and more dynamic, service mesh performance becomes critical. Kmesh, leveraging eBPF and programmable kernel technology, eliminates the user-kernel context-switching overhead found in traditional sidecar architectures, significantly reducing forwarding latency.

However, while this architecture offers performance benefits, it introduces stability challenges in large-scale scenarios:

- The Trigger: During massive cluster changes (such as service rolling updates, node failure recovery, or large-scale restarts), Istiod pushes a large volume of xDS configuration to the data plane.

- The Bottleneck: The kmesh-daemon, deployed on every node, is responsible for receiving this configuration and converting it into kernel eBPF map state. In high-churn scenarios, Istiod sends State of the World (SotW) pushes, in which the full set of resources of a given type is resent rather than only the changed ones.

- The Consequence: The daemon is overwhelmed by heavy Protobuf deserialization and by a storm of high-frequency system calls (syscalls) required to update the eBPF maps.

- The Impact: This results in a sudden, severe spike in CPU load on the node. The resource contention can be severe enough to starve business containers, potentially leading to node-level denial of service or instability.

Addressing this is crucial for Kmesh to be production-ready in large-scale, dynamic Kubernetes environments.
