[SPARK-52853][SDP] Prevent imperative PySpark methods in declarative pipelines #51590


Closed

Conversation


@JiaqiWang18 JiaqiWang18 commented Jul 21, 2025

What changes were proposed in this pull request?

This PR adds a context manager, block_imperative_construct(), that prevents the execution of imperative Spark operations within declarative pipeline definitions. When a blocked method is called, the user receives a clear error message with guidance on the declarative alternative (see the sketch after the list below).

Blocked Methods

Configuration Management
  • spark.conf.set() → Use pipeline spec or spark_conf decorator parameter
Catalog Management
  • spark.catalog.setCurrentCatalog() → Set via pipeline spec or dataset decorator name argument
  • spark.catalog.setCurrentDatabase() → Set via pipeline spec or dataset decorator name argument
Temporary View Management
  • spark.catalog.dropTempView() → Remove temporary view definition directly
  • spark.catalog.dropGlobalTempView() → Remove temporary view definition directly
  • DataFrame.createTempView() → Use @temporary_view decorator
  • DataFrame.createOrReplaceTempView() → Use @temporary_view decorator
  • DataFrame.createGlobalTempView() → Use @temporary_view decorator
  • DataFrame.createOrReplaceGlobalTempView() → Use @temporary_view decorator
UDF Registration
  • spark.udf.register() → Define and register UDFs before pipeline execution
  • spark.udf.registerJavaFunction() → Define and register Java UDFs before pipeline execution
  • spark.udf.registerJavaUDAF() → Define and register Java UDAFs before pipeline execution
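To illustrate the mechanism, here is a minimal single-method sketch of the blocking pattern. It is not the PR's implementation (which covers every method above and raises a PySparkException with per-method error classes); it assumes the Spark Connect RuntimeConf class referenced later in this thread:

```python
from contextlib import contextmanager
from typing import Generator

# Assumption: SDP runs over Spark Connect, so the Connect RuntimeConf is patched.
from pyspark.sql.connect.conf import RuntimeConf


@contextmanager
def block_imperative_construct() -> Generator[None, None, None]:
    """Sketch: block spark.conf.set() while the context is active."""
    original_set = RuntimeConf.set

    def _blocked_set(self, *args, **kwargs):
        # The real PR raises a PySparkException with an error class; a plain
        # RuntimeError keeps this sketch self-contained.
        raise RuntimeError(
            "spark.conf.set() is not allowed in declarative pipelines. "
            "Instead set configuration via the pipeline spec or the "
            "'spark_conf' decorator parameter."
        )

    RuntimeConf.set = _blocked_set
    try:
        yield
    finally:
        # Restore the original method even if the pipeline code raised.
        RuntimeConf.set = original_set
```

Inside the `with` block, any `spark.conf.set(...)` call fails fast with the guidance text; outside it, session behavior is unchanged.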

Why are the changes needed?

These are imperative constructs that can cause friction and unexpected behavior when used inside a pipeline declaration. For example, they make pipeline behavior sensitive to the order in which Python files are imported, which can be unpredictable. Mechanisms already exist for setting Spark confs for pipelines: the pipeline spec and the spark_conf decorator parameter, sketched below.
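For example, a hedged sketch of the decorator-based alternative; the pyspark.pipelines import path and the spark_conf parameter name follow the suggestions in the list above but should be treated as assumptions:

```python
from pyspark import pipelines as dp


# The configuration travels with the dataset definition instead of
# mutating the shared session via spark.conf.set().
@dp.table(spark_conf={"spark.sql.shuffle.partitions": "10"})
def filtered_events():
    # 'spark' is provided to pipeline definition files by the runtime (assumption).
    return spark.read.table("events").where("status = 'active'")
```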

Does this PR introduce any user-facing change?

Yes. It blocks imperative session mutations, such as setting Spark confs, inside pipeline definition files.

How was this patch tested?

Created a new test suite verifying that the context manager behaves as expected (see the sketch below), and ran the spark-pipelines CLI manually.
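A hedged sketch of the kind of check such a suite might contain; the test and class names are illustrative, and block_imperative_construct refers to the sketch in the description above, not the PR's actual test code:

```python
import unittest

from pyspark.sql.connect.conf import RuntimeConf


class BlockImperativeConstructTest(unittest.TestCase):
    def test_conf_set_patched_and_restored(self):
        original = RuntimeConf.set
        with block_imperative_construct():
            # Inside the block, the method has been swapped for the raising stub.
            self.assertNotEqual(RuntimeConf.set, original)
        # On exit, the original method is restored.
        self.assertEqual(RuntimeConf.set, original)
```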

Was this patch authored or co-authored using generative AI tooling?

No

@JiaqiWang18 JiaqiWang18 marked this pull request as ready for review July 21, 2025 18:22
@JiaqiWang18 (Contributor Author) commented:

@sryza @anishm-db

@JiaqiWang18 JiaqiWang18 marked this pull request as draft July 21, 2025 20:04
@JiaqiWang18 (Contributor Author) commented:

addressing discussion

@JiaqiWang18 JiaqiWang18 changed the title [SPARK-52853][SDP] Prevent imperative conf set in declarative pipelines [SPARK-52853][SDP] Prevent imperative PySpark methods in declarative pipelines Jul 21, 2025
@JiaqiWang18 (Contributor Author) commented:

@sryza @anishm-db Ready for review. Re-scoped the PR to only block imperative Python methods on the client side, via a context manager.

@JiaqiWang18 JiaqiWang18 marked this pull request as ready for review July 21, 2025 21:51
@github-actions github-actions bot added the BUILD label Jul 21, 2025


@contextmanager
def block_imperative_construct() -> Generator[None, None, None]:
A reviewer commented with a suggested change:

- def block_imperative_construct() -> Generator[None, None, None]:
+ def block_imperative_constructs() -> Generator[None, None, None]:

@JiaqiWang18 (Contributor Author) replied:

Renamed to block_session_mutations to be clearer.
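For context, a hedged sketch of how a pipeline runner could apply the renamed manager while executing user definition files; none of these names besides block_session_mutations come from this PR:

```python
definition_file_paths = ["transformations/my_pipeline.py"]  # hypothetical paths

with block_session_mutations():
    for path in definition_file_paths:
        with open(path) as f:
            source = f.read()
        # Any blocked call made while the user's file executes raises here,
        # before the pipeline graph is ever run.
        exec(compile(source, path, "exec"), {"__name__": "__main__"})
```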

{
    "class": RuntimeConf,
    "method": "set",
    "suggestion": "Instead set configuration via the pipeline spec "
A reviewer commented:

It would be better to have this text inside of the error-conditions.json – that way it's in a central place that can be internationalized more easily. Thoughts on having a sub-error code for each of these? E.g. SET_CURRENT_CATALOG?

@JiaqiWang18 (Contributor Author) replied:

Makes sense, added sub-classes for each method in error-conditions.json.

@JiaqiWang18 JiaqiWang18 requested a review from sryza July 23, 2025 22:11
@sryza left a comment:

One tiny more nitpick – then LGTM!

"Session mutation <method> is not allowed in declarative pipelines."
],
"sub_class": {
"RUNTIME_CONF_SET": {
A reviewer commented:
Nitpick: should the SET be on the other side of RUNTIME_CONF for consistency?

@JiaqiWang18 (Contributor Author) replied:
good catch!

@JiaqiWang18 JiaqiWang18 requested a review from sryza July 23, 2025 23:35
@sryza left a comment:
LGTM! Will merge once build is green.

@sryza sryza closed this in dc687d4 Jul 24, 2025