[SPARK-52853][SDP] Prevent imperative PySpark methods in declarative pipelines #51590


Closed

Conversation


@JiaqiWang18 JiaqiWang18 commented Jul 21, 2025

What changes were proposed in this pull request?

This PR adds a context manager, block_imperative_construct(), that prevents the execution of imperative Spark operations within declarative pipeline definitions. When a blocked method is called, the user receives a clear error message with guidance on the declarative alternative (see the sketch after the list below).

Blocked Methods

Configuration Management
  • spark.conf.set() → Use pipeline spec or spark_conf decorator parameter
Catalog Management
  • spark.catalog.setCurrentCatalog() → Set via pipeline spec or dataset decorator name argument
  • spark.catalog.setCurrentDatabase() → Set via pipeline spec or dataset decorator name argument
Temporary View Management
  • spark.catalog.dropTempView() → Remove temporary view definition directly
  • spark.catalog.dropGlobalTempView() → Remove temporary view definition directly
  • DataFrame.createTempView() → Use @temporary_view decorator
  • DataFrame.createOrReplaceTempView() → Use @temporary_view decorator
  • DataFrame.createGlobalTempView() → Use @temporary_view decorator
  • DataFrame.createOrReplaceGlobalTempView() → Use @temporary_view decorator
UDF Registration
  • spark.udf.register() → Define and register UDFs before pipeline execution
  • spark.udf.registerJavaFunction() → Define and register Java UDFs before pipeline execution
  • spark.udf.registerJavaUDAF() → Define and register Java UDAFs before pipeline execution
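To illustrate the mechanism, here is a minimal single-method sketch of the blocking pattern. It is not the PR's implementation (which covers every method above and raises a PySparkException with per-method error classes); it assumes the Spark Connect RuntimeConf class referenced later in this thread:

```python
from contextlib import contextmanager
from typing import Generator

# Assumption: SDP runs over Spark Connect, so the Connect RuntimeConf is patched.
from pyspark.sql.connect.conf import RuntimeConf


@contextmanager
def block_imperative_construct() -> Generator[None, None, None]:
    """Sketch: block spark.conf.set() while the context is active."""
    original_set = RuntimeConf.set

    def _blocked_set(self, *args, **kwargs):
        # The real PR raises a PySparkException with an error class; a plain
        # RuntimeError keeps this sketch self-contained.
        raise RuntimeError(
            "spark.conf.set() is not allowed in declarative pipelines. "
            "Instead set configuration via the pipeline spec or the "
            "'spark_conf' decorator parameter."
        )

    RuntimeConf.set = _blocked_set
    try:
        yield
    finally:
        # Restore the original method even if the pipeline code raised.
        RuntimeConf.set = original_set
```

Inside the `with` block, any `spark.conf.set(...)` call fails fast with the guidance text; outside it, session behavior is unchanged.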

Why are the changes needed?

These are imperative constructs that can cause friction and unexpected behavior when used inside a pipeline declaration. For example, they make pipeline behavior sensitive to the order in which Python files are imported, which can be unpredictable. Mechanisms already exist for setting Spark confs for pipelines: the pipeline spec and the spark_conf decorator parameter, sketched below.
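For example, a hedged sketch of the decorator-based alternative; the pyspark.pipelines import path and the spark_conf parameter name follow the suggestions in the list above but should be treated as assumptions:

```python
from pyspark import pipelines as dp


# The configuration travels with the dataset definition instead of
# mutating the shared session via spark.conf.set().
@dp.table(spark_conf={"spark.sql.shuffle.partitions": "10"})
def filtered_events():
    # 'spark' is provided to pipeline definition files by the runtime (assumption).
    return spark.read.table("events").where("status = 'active'")
```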

Does this PR introduce any user-facing change?

Yes. It blocks imperative session mutations, such as setting Spark confs, inside pipeline definition files.

How was this patch tested?

Created a new test suite verifying that the context manager behaves as expected (see the sketch below), and ran the spark-pipelines CLI manually.
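A hedged sketch of the kind of check such a suite might contain; the test and class names are illustrative, and block_imperative_construct refers to the sketch in the description above, not the PR's actual test code:

```python
import unittest

from pyspark.sql.connect.conf import RuntimeConf


class BlockImperativeConstructTest(unittest.TestCase):
    def test_conf_set_patched_and_restored(self):
        original = RuntimeConf.set
        with block_imperative_construct():
            # Inside the block, the method has been swapped for the raising stub.
            self.assertNotEqual(RuntimeConf.set, original)
        # On exit, the original method is restored.
        self.assertEqual(RuntimeConf.set, original)
```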

Was this patch authored or co-authored using generative AI tooling?

No

@JiaqiWang18 JiaqiWang18 marked this pull request as ready for review July 21, 2025 18:22
@JiaqiWang18 (Contributor Author) commented:

@sryza @anishm-db

@JiaqiWang18 JiaqiWang18 marked this pull request as draft July 21, 2025 20:04
@JiaqiWang18 (Contributor Author) commented:

addressing discussion

@JiaqiWang18 JiaqiWang18 changed the title [SPARK-52853][SDP] Prevent imperative conf set in declarative pipelines [SPARK-52853][SDP] Prevent imperative PySpark methods in declarative pipelines Jul 21, 2025
@JiaqiWang18 (Contributor Author) commented:

@sryza @anishm-db Ready for review. Re-scoped the PR to only block imperative Python methods on the client side, via a context manager.

@JiaqiWang18 JiaqiWang18 marked this pull request as ready for review July 21, 2025 21:51
@github-actions github-actions bot added the BUILD label Jul 21, 2025


@contextmanager
def block_imperative_construct() -> Generator[None, None, None]:
A reviewer commented with a suggested change:

- def block_imperative_construct() -> Generator[None, None, None]:
+ def block_imperative_constructs() -> Generator[None, None, None]:

@JiaqiWang18 (Contributor Author) replied:

Renamed to block_session_mutations to be clearer.
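For context, a hedged sketch of how a pipeline runner could apply the renamed manager while executing user definition files; none of these names besides block_session_mutations come from this PR:

```python
definition_file_paths = ["transformations/my_pipeline.py"]  # hypothetical paths

with block_session_mutations():
    for path in definition_file_paths:
        with open(path) as f:
            source = f.read()
        # Any blocked call made while the user's file executes raises here,
        # before the pipeline graph is ever run.
        exec(compile(source, path, "exec"), {"__name__": "__main__"})
```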

{
    "class": RuntimeConf,
    "method": "set",
    "suggestion": "Instead set configuration via the pipeline spec "
A reviewer commented:

It would be better to have this text inside of the error-conditions.json – that way it's in a central place that can be internationalized more easily. Thoughts on having a sub-error code for each of these? E.g. SET_CURRENT_CATALOG?

@JiaqiWang18 (Contributor Author) replied:

Makes sense, added sub-classes for each method in error-conditions.json.

@JiaqiWang18 JiaqiWang18 requested a review from sryza July 23, 2025 22:11
@sryza left a comment:

One tiny more nitpick – then LGTM!

"Session mutation <method> is not allowed in declarative pipelines."
],
"sub_class": {
"RUNTIME_CONF_SET": {
A reviewer commented:
Nitpick: should the SET be on the other side of RUNTIME_CONF for consistency?

@JiaqiWang18 (Contributor Author) replied:
good catch!

@JiaqiWang18 JiaqiWang18 requested a review from sryza July 23, 2025 23:35
@sryza left a comment:
LGTM! Will merge once build is green.

@sryza sryza closed this in dc687d4 Jul 24, 2025