
docs: Add Managed Kafka Connect terraform sample for Clusters #876


Merged: 11 commits into main on Aug 7, 2025

Conversation

@salmany (Contributor) commented on Jul 23, 2025:

  • Add sample terraform code for creating a Managed Kafka Connect cluster

Description

Fixes https://b.corp.google.com/issues/430087669

Note: If you are not associated with Google, open an issue for discussion before submitting a pull request.

Checklist

Readiness

  • Yes, merge this PR after it is approved
  • No, don't merge this PR after it is approved

Style

Testing

Intended location

API enablement

  • If the sample needs an API enabled to pass testing, I have added the service to the Test setup file

Review

  • If this sample adds a new directory, I have added codeowners to the CODEOWNERS file

Commit: Add Managed Kafka Connect terraform sample for Clusters

* Add sample code for creating a Kafka Connect cluster
@salmany salmany marked this pull request as ready for review July 23, 2025 19:47
@salmany salmany requested review from a team as code owners July 23, 2025 19:47

snippet-bot bot commented Jul 23, 2025

Here is the summary of changes.

You are about to add 3 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.

Commit: Corrected tags.
@salmany salmany enabled auto-merge (squash) July 23, 2025 19:54
Review thread on the following lines of the sample:

    gcp_config {
      access_config {
        network_configs {
          subnet = "projects/${data.google_project.default.number}/regions/us-central1/subnetworks/default"
        }
      }
    }
Contributor:

Should this value be linked to a subnet resource? The hardcoding of the region and subnet name is suboptimal.

salmany (author):

Although I acknowledge this is suboptimal, I'm hardcoding the subnet path due to a current limitation with deleting network attachments associated with Managed Kafka Connect clusters. I had previously attempted creating a Subnet resource as part of the code example, but it was causing test failures. (For more information, please see here.)

However, hardcoding the subnet path also no longer seems to work, although it initially passed the test cases. This could be due to the test setup changing across runs. Do you have any suggestions on how we may proceed to pass the tests given this limitation?
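For context, the subnet-as-a-resource approach that was attempted looks roughly like the sketch below. This is illustrative only, not the committed code; the resource names and CIDR range are placeholders.

    # Sketch: manage the network and subnet in the same config and reference
    # them, instead of hardcoding a path. Names and CIDR are illustrative.
    resource "google_compute_network" "default" {
      name                    = "kafka-connect-network"
      auto_create_subnetworks = false
    }

    resource "google_compute_subnetwork" "default" {
      name          = "kafka-connect-subnet"
      ip_cidr_range = "10.0.0.0/24"
      region        = "us-central1"
      network       = google_compute_network.default.id
    }

    # The Connect cluster would then reference the managed subnet:
    #   subnet = google_compute_subnetwork.default.id
    # On destroy, this is the arrangement that hits the leaked
    # network-attachment issue described later in this thread.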

salmany (author):

I've committed a workaround using a provisioner for deleting leaked resources: 612a5ac

It seems to pass the test cases. Marking as resolved.
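For reference, the workaround has roughly the following shape. This is a sketch, not the exact contents of 612a5ac; the attachment name is a placeholder, and it assumes the hashicorp/null provider and the gcloud network-attachments command surface.

    # Destroy-time local-exec that force-deletes the leaked network
    # attachment so the subnet can be cleaned up.
    resource "null_resource" "cleanup_leaked_network_attachment" {
      triggers = {
        region     = "us-central1"
        attachment = "leaked-attachment-name"   # placeholder
      }

      provisioner "local-exec" {
        when    = destroy
        command = "gcloud compute network-attachments delete ${self.triggers.attachment} --region=${self.triggers.region} --quiet || true"
      }
    }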

Contributor:

(moved my review comment to this thread)

We do not support local-exec in samples. We do not want to encourage our users to use this functionality.

If there's an upstream issue, it needs to be fixed at the resource level.

If the sample can't be tested (either now, or in general), then we can disable the terraform apply/destroy part of tests.

Separately: the google_managed_kafka_cluster takes 30 minutes to create and the google_managed_kafka_connect_cluster takes 20 minutes; I'm unsure if this is expected.

salmany (author) commented Aug 1, 2025:

Understood. Given that there is a known issue which prevents resource deletion, I believe the best path forward for now would be to disable the terraform apply/destroy part of the tests as you have suggested. Once the issue is resolved, we can update the examples to enable terraform apply/destroy.
Could you please advise on how to disable the terraform apply/destroy part of the tests?

(Also, it is indeed expected that the google_managed_kafka_cluster and google_managed_kafka_connect_cluster take 30 & 20 minutes to create, respectively. For reference, see here.)

Comment:

@glasnt:

To give you some context, there's an issue with GKE networking that leaks Network Attachment resources in a Subnet even after a ConnectCluster is deleted. This means that if we add a Subnet resource to the same Terraform config, the Subnet's deletion will hang on a TF destroy, because a dangling Network Attachment is still present. The leak eventually goes away, but the timing is indefinite: perhaps hours, perhaps days. The leak issue is complex to fix and isn't staffed yet, so a fix may not land for months or quarters.

I think the best way forward is to include the subnet resources hardcoded as strings in the Terraform samples. It allows us to exercise the apply/delete logic for the ConnectCluster, which is what we care about here.

For most customers, network resources are managed in a separate file, often by separate teams altogether. And the lifetime of Network/Subnet resources is typically longer than that of consuming resources like ConnectClusters. So our customers may still be fine, even with this unfortunate leak preventing them from putting network and ConnectCluster resources in the same file.

How does that sound?
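Concretely, that split leaves only a path reference on the sample side, along the lines of this sketch (file name and layout are illustrative):

    # connect_cluster.tf -- the sample side. The subnet is referenced purely
    # by path; the network and subnet resources live in a separate,
    # longer-lived configuration, often owned by another team.
    data "google_project" "default" {}

    # Inside the google_managed_kafka_connect_cluster resource:
    #   gcp_config {
    #     access_config {
    #       network_configs {
    #         subnet = "projects/${data.google_project.default.number}/regions/us-central1/subnetworks/default"
    #       }
    #     }
    #   }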

Contributor:

Generally, our samples should reflect best practices in real-world applications. We have some samples where the configuration required to make the sample run cleanly is only useful in a CI testing instance, and not code we would suggest a user configure. In those cases we can disable the terraform apply/destroy (among other steps), and I believe that is the best solution here.

You can disable this by adding a test.yaml file with skip: true; see this example. (The metadata.name is roughly the concatenation of the sample's folder hierarchy.)
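A sketch of such a file, assuming the repository uses the Cloud Foundation Toolkit BlueprintTest metadata format; the name here is a guess at the folder-hierarchy concatenation:

    # test.yaml -- skips the terraform apply/destroy stages for this sample.
    apiVersion: blueprints.cloud.google.com/v1alpha1
    kind: BlueprintTest
    metadata:
      # Roughly the concatenation of the sample's folder hierarchy (hypothetical).
      name: managedkafka_create_connect_cluster
    spec:
      skip: true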

Comment:

If the best-practice sample is not going to run without modification (i.e., applying the Network resources separately from the ConnectCluster resource), is it still okay to recommend it? Should we add a comment noting the limitation? We don't want to mislead users by showing them a sample that won't work out of the box.

Contributor:

If that staggered application is required, you can describe it in the devsite page. You can also use additional # [START ...] and # [END ...] tags to separate out these resources so you can describe them in context (example).
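For example, the network pieces could be wrapped in their own region tags; the tag name below is illustrative:

    # [START managedkafka_create_connect_cluster_subnet]
    resource "google_compute_subnetwork" "default" {
      name          = "kafka-connect-subnet"
      ip_cidr_range = "10.0.0.0/24"
      region        = "us-central1"
      network       = "default"
    }
    # [END managedkafka_create_connect_cluster_subnet]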

Comment:

OK, that sounds good. @salmany, please follow the previous comment. Thank you @glasnt!

glasnt (Contributor) left a comment:

(apologies for the single comments instead of review comments, but please address the code review comments)

@glasnt glasnt added the waiting-response Waiting for issue author to respond. label Jul 27, 2025
@salmany salmany requested review from rQ-Qrr and glasnt July 30, 2025 19:54
@glasnt glasnt added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 31, 2025
@glasnt glasnt removed the waiting-response Waiting for issue author to respond. label Aug 6, 2025
@glasnt glasnt removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Aug 6, 2025
@glasnt glasnt merged commit c60aa2b into main Aug 7, 2025
10 checks passed