Description
What would you like to be added:
I would like to propose an optimization mechanism for the kmesh-daemon to handle large-scale xDS updates more efficiently.
Specifically, the following improvements should be considered to mitigate resource contention during high-traffic configuration pushes:
- Support On-Demand xDS Loading: Implement a mechanism where Kmesh fetches or processes xDS configurations lazily. Instead of loading the full cluster state upfront, the daemon should only request or process configuration for a service when traffic is actually initiated towards it. This would significantly reduce the processing burden during global updates.
- Batch Processing for Syscalls: Implement batching for eBPF Map updates to reduce the frequency of system calls.
- Flow Control / Rate Limiting: Introduce a mechanism to throttle or queue xDS updates within the daemon to prevent CPU starvation during burst scenarios.
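
To make the batching and flow-control items more concrete, here is a minimal Go sketch of one possible shape. It is not taken from the Kmesh codebase: `Update`, `MapWriter`, `WriterFunc`, and `Coalescer` are hypothetical names invented for illustration, and `WriteBatch` stands in for whatever batched map write the daemon would use (on recent kernels this could plausibly be backed by the `BPF_MAP_UPDATE_BATCH` command, but that is an assumption). The sketch coalesces repeated updates to the same key and writes at most `maxBatch` entries per tick, which bounds both the number of write calls and the write rate.

```go
package main

import (
	"sync"
	"time"
)

// Update is a hypothetical pending write to an eBPF map entry.
type Update struct {
	Key   string
	Value []byte
}

// MapWriter is a hypothetical abstraction over one batched map write.
type MapWriter interface {
	WriteBatch(batch []Update) error
}

// WriterFunc adapts a plain function to the MapWriter interface.
type WriterFunc func(batch []Update) error

func (f WriterFunc) WriteBatch(batch []Update) error { return f(batch) }

// Coalescer queues updates by key so that several updates to the same key
// between two flushes collapse into a single write of the latest value.
type Coalescer struct {
	mu       sync.Mutex
	pending  map[string]Update // latest update per key
	order    []string          // keys in first-enqueued order
	maxBatch int
	writer   MapWriter
}

func NewCoalescer(w MapWriter, maxBatch int) *Coalescer {
	if maxBatch < 1 {
		maxBatch = 1
	}
	return &Coalescer{pending: make(map[string]Update), maxBatch: maxBatch, writer: w}
}

// Enqueue records an update; a newer update for a queued key replaces the older one.
func (c *Coalescer) Enqueue(u Update) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, queued := c.pending[u.Key]; !queued {
		c.order = append(c.order, u.Key)
	}
	c.pending[u.Key] = u
}

// Flush writes up to maxBatch queued updates in one WriteBatch call and
// returns how many were written. On error the batch stays queued.
// The lock is held during the write for simplicity; a real daemon may
// prefer not to block Enqueue while writing.
func (c *Coalescer) Flush() (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := len(c.order)
	if n > c.maxBatch {
		n = c.maxBatch
	}
	if n == 0 {
		return 0, nil
	}
	batch := make([]Update, 0, n)
	for _, k := range c.order[:n] {
		batch = append(batch, c.pending[k])
	}
	if err := c.writer.WriteBatch(batch); err != nil {
		return 0, err
	}
	for _, k := range c.order[:n] {
		delete(c.pending, k)
	}
	c.order = c.order[n:]
	return n, nil
}

// Run flushes one batch per tick until stop is closed, capping the write
// rate at maxBatch entries per interval.
func (c *Coalescer) Run(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			_, _ = c.Flush() // a failed batch is retried on the next tick
		case <-stop:
			return
		}
	}
}
```

In this sketch the xDS handler would call `Enqueue` for each converted entry, while a single goroutine runs `Run` with an interval and batch size tuned to the node (both values would need measurement). During a burst the queue grows, but the daemon's write rate stays bounded. On-demand loading is a separate concern and is not shown here.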
Why is this needed:
As cloud-native architectures evolve toward larger scale and more dynamic workloads, service mesh performance becomes critical. Kmesh, leveraging eBPF and programmable kernel technology, eliminates the user-kernel context switching overhead found in traditional sidecar architectures, significantly reducing forwarding latency.
However, while this architecture offers performance benefits, it introduces stability challenges in large-scale scenarios:
- The Trigger: During large cluster changes (such as service rolling updates, node failure recovery, or large-scale restarts), Istiod pushes a large volume of xDS configuration to the data plane.
- The Bottleneck: The kmesh-daemon, deployed on every node, is responsible for receiving these configurations and converting them into kernel eBPF Map state. In high-churn scenarios, the changes arrive as State of the World (SotW) pushes, in which each push carries the full set of resources rather than only the changes.
- The Consequence: The daemon is overwhelmed by heavy Protobuf deserialization and a storm of high-frequency system calls required to update the eBPF maps.
- The Impact: This results in a sudden, severe spike in CPU load on the node. The resource contention can be severe enough to starve business containers, potentially leading to node-level service denial or instability.
Addressing this is crucial for Kmesh to be production-ready in large-scale, dynamic Kubernetes environments.