We explore Databricks AAD passthrough authentication and look at the best options for replicating this behavior in a Unity Catalog-enabled workspace.


Before Unity Catalog, Databricks supported a feature called AAD credential passthrough. You defined permissions once at the source, on the storage account, and they were automatically applied in the background whenever a notebook accessed the data. Each user only saw the data they had access to when they ran the notebook. And since each user's own credentials were used to access the data, you could also easily audit which data each user was accessing.
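
For example, on a cluster with credential passthrough enabled (we'll set one up below), a notebook can read an abfss:// path directly, with no keys, SAS tokens or service principals in the code. This is just a minimal sketch; the storage account, container and folder names are placeholders:

# Minimal sketch, assuming a cluster with credential passthrough enabled.
# Account/container/folder names are placeholders - replace with your own.
# The request to the storage account is made with the AAD identity of the
# user running the notebook, so Azure RBAC on the data lake decides what they see.
df = spark.read.parquet("abfss://data@adlsxxxxxx.dfs.core.windows.net/taxidata/")
display(df.limit(10))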

With the introduction of Unity Catalog, permissions and access auditing have been pushed into the Databricks application, where user access is managed across workspaces. This could be a great option if you're mainly using Databricks. If the same data is being accessed from multiple Azure services, however, you'll need to manage permissions in both Azure and Unity Catalog and ensure that they stay in sync.

Let's have a look at how AAD credential passthrough works and then see whether we can replicate the same behavior in a Unity Catalog-enabled workspace.

AAD credential passthrough

Create data lake and assign permissions

First, create a data lake to hold the data. Open the Access Control (IAM) blade on the storage account and select 'Add role assignment' from the '+ Add' drop-down menu. [screenshot: storage access control]

Choose the Storage Blob Data Contributor role from the Roles tab to give read/write access to users. [screenshot: add role assignment]

On the Members tab, select the group, user or managed identity to assign the role to. [screenshot: select members]

Create Databricks workspace and cluster

Next, create a Databricks workspace with a hive metastore. Once it has been created, open the workspace and create a compute cluster. [screenshot: create compute]

Select "Shared" for the access mode. shared access mode

Under Advanced options at the bottom of the configuration page, check the box for "Enable credential passthrough for user-level data access". You'll see a notice that AAD credential passthrough is deprecated as of runtime version 15. [screenshot: enable passthrough]

For this demo, we're going to use a pre-version-15 runtime. [screenshot: runtime]

Note: both the shared access mode and enable credential passthrough settings are required. If you forget either of these, notebook execution will fail with a permissions error.

Run notebook in Databricks

Download the demo notebook from this repo.

Navigate to your user workspace in the workspace menu and right-click on it to import the notebook from file. [screenshot: import notebook]

Open the notebook, attach it to your cluster and enter the name of your data lake and container in the code cell at the top. [screenshot: attach notebook]

Finally, click Run All at the top. Your user credentials will be used to write the NYC yellow taxi data to your data lake. [screenshot: Run All]
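
Under the hood, the write boils down to something like the following. This is only a sketch (the repo's notebook may load the taxi data differently), and the account and container names are placeholders:

# Sample NYC yellow taxi data that ships with Databricks under /databricks-datasets.
taxi_df = (spark.read
           .option("header", "true")
           .csv("dbfs:/databricks-datasets/nyctaxi/tripdata/yellow/"))

datalake_name = "adlsxxxxxx"   # placeholder - your data lake name
container_name = "data"        # placeholder - your container name

# The write runs with the notebook user's AAD credentials, so it only succeeds
# if that user holds Storage Blob Data Contributor on the storage account.
(taxi_df.write
 .mode("overwrite")
 .parquet(f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/taxidata/"))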

Parameterize notebook

We can add parameters with defaults to the notebook to make it a bit more user-friendly.

Let's add parameters to the notebook for data lake name, container and taxi data year. We can do this with Databricks widgets.

From the Edit menu in the notebook, select Add Parameter. [screenshot: add parameters]

Add a parameter for each of the variables in the code cell at the top of the notebook. [screenshot: parameters]

Get each parameter value and assign it to the respective variable in the cell. [screenshot: assign to variables]

Alternatively, you can add the parameters in a code cell instead.

dbutils.widgets.text("datalake_name", "adlsxxxxxx")
dbutils.widgets.text("container_name", "data")
dbutils.widgets.text("folder_name", "taxidata")
dbutils.widgets.text("year", "-1", "-1 for all years or a specific year between 2009 - 2018")

datalake_name = dbutils.widgets.get("datalake_name")
container_name = dbutils.widgets.get("container_name")
folder_name = dbutils.widgets.get("folder_name")
year = int(dbutils.widgets.get("year"))

Click Run All to ensure the notebook still works after the changes. You can find the complete parameterized notebook here.
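
Once parameterized, the notebook can also be called from another notebook or a job with explicit values via dbutils.notebook.run. The path and parameter values below are hypothetical; adjust them to your workspace:

# Hypothetical notebook path and parameter values - adjust to your environment.
dbutils.notebook.run(
    "/Users/someone@example.com/passthrough-demo",  # placeholder notebook path
    600,                                            # timeout in seconds
    {
        "datalake_name": "adlsxxxxxx",
        "container_name": "data",
        "folder_name": "taxidata",
        "year": "-1",   # -1 for all years, or a specific year between 2009 and 2018
    },
)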

Can we pass through Data Factory Managed Identity Credentials too?

Great, we've got AAD credential passthrough working and we've even parameterized our notebook. What's next? Well, you're probably thinking: why not create a Data Factory and grant its managed identity permissions on the data lake? That way, we could manage all permissions directly on the data source. Then, regardless of whether it's a user or a managed identity running the notebook in a pipeline, those permissions should automatically be used to access the data, right? Sadly, this doesn't work. Sure, you can execute the notebook with the Data Factory's managed identity, but those AAD credentials aren't passed through to the storage account. Not with a job cluster (which you'd usually use to run notebook jobs in a pipeline), and not with the same shared cluster we used to run the notebook in our workspace either. Credential passthrough from Data Factory simply isn't supported, as you can see below.

First, create a Data Factory instance. [screenshot: create data factory]

Then, on the storage account, assign the Storage Blob Data Contributor role to the Data Factory's managed identity. [screenshots: MSI, managed identity]

On the Databricks workspace resource, assign the contributor role to the Data Factory managed identity. This will allow the Data Factory to access the clusters.

[screenshots: assign DBX contributor, contributor role, assign to ADF]

Now, navigate to the Data Factory studio and create a Databricks linked service with managed identity authentication. [screenshots: linked service, databricks LS]

In the dialog box that pops up, we'll enable interactive authoring so that we can configure and test the connection. Click on the pen icon next to the integration runtime drop-down. [screenshot: enable interactive]

Switch to the virtual network tab, enable interactive authoring and then hit apply. It will take a few minutes for a debug cluster to spin up in the background.

[screenshot: enable and apply]

Once interactive authoring has been activated, choose your Databricks workspace. Select "job" as the cluster type and choose managed identity authentication. [screenshot: choose workspace]

Choose the cluster runtime version, cluster node VM SKU and Python version to use. When you're finished configuring everything, test the connection. [screenshot: configure cluster]

Next, inside the Data Factory studio, create a pipeline with a Databricks notebook activity. [screenshot: create pipeline]

Find the Notebook tile under Databricks on the left and drag it onto the canvas. [screenshot: notebook activity]

Click on the activity inside the canvas if it's not already selected. Then, on the Azure Databricks tab in the window below, select the linked service you created earlier. [screenshot: select linked service]

Then, on the Settings tab, browse for your notebook in the notebook path field. [screenshot: browse for notebook]

Add parameter values to the activity. [screenshot: add parameters to activity]

Click Publish All to save your changes. [screenshot: publish all]

Click Debug in the taskbar above the canvas to run the pipeline. [screenshot: debug to test]

The pipeline runs and executes the notebook, but the managed identity's credentials aren't passed through to the data lake. Neither with the job cluster: [screenshot: job cluster fails]

Nor with the shared cluster we used in the Databricks workspace earlier to successfully pass through the user's credentials to the storage account: [screenshot: shared cluster fails]

Why didn't it work?

Credential passthrough only works on a shared cluster where passthrough authentication has been enabled, and it unfortunately isn't supported from Data Factory. You might be thinking, "well, maybe I can just authenticate using the azure-identity library inside the Python notebook and get it to use the Data Factory managed identity to connect to the storage account that way." The problem is that when you authenticate inside the notebook, it tries to use the Databricks managed identity, which doesn't exist pre-Unity Catalog. It simply isn't possible to pass the Data Factory managed identity credentials to the Databricks cluster. [screenshot: no databricks workspace identity]
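
For illustration, such an attempt would look roughly like the sketch below. ManagedIdentityCredential and the token scope are standard azure-identity building blocks rather than anything from the repo's notebook; on a pre-Unity Catalog cluster there is no Databricks workspace identity for the credential to pick up, and the Data Factory's identity is never forwarded to the cluster, so the token request doesn't get you anything that can reach the data lake.

from azure.identity import ManagedIdentityCredential

# Asks the cluster environment / instance metadata endpoint for a managed identity token.
# Pre-Unity Catalog there is no Databricks workspace identity to hand back, and the
# Data Factory identity isn't forwarded to the cluster, so this fails to produce a
# token with access to the storage account.
credential = ManagedIdentityCredential()
token = credential.get_token("https://storage.azure.com/.default")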

What options do we have to replicate AAD credential passthrough in a Unity Catalog-enabled Databricks workspace?

As we saw above, AAD credential passthrough worked for local notebook users connecting to storage accounts. What are our best options for mimicking this behavior in a Unity Catalog-enabled workspace?

Ideally, we want:

  • Entra ID user credentials to be transparently passed to the data source in the background, without the user having to manage this explicitly inside the notebook
  • RBAC permissions on the data source to be used, so that the same set of permissions is applied consistently across all Azure services accessing the data
  • support for any data source, not just storage accounts

Unity Catalog authentication

If you're mainly using Databricks, Unity Catalog could be a good option.

Unity Catalog uses storage credentials and external locations to manage access to storage accounts. Once they're set up, notebook users can focus on business logic and don't have to worry about authentication themselves. The downside is that these permissions only exist inside Unity Catalog. From the storage account's perspective, it's the Databricks Access Connector's system-assigned managed identity that's accessing the data, not the user. If other Azure services access the same data, you'll need to manage permissions in both Azure and Unity Catalog and keep them in sync.
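
Once an admin has created the storage credential and external location, reading the data from a notebook is straightforward. The catalog, schema, table and path names below are placeholders:

# Query a Unity Catalog table defined on the external location (placeholder names).
# Access checks happen inside Unity Catalog; the storage account only ever sees
# the Access Connector's managed identity.
df = spark.table("main.demo.taxidata")

# Reading the external location path directly also works, provided the user has
# been granted the necessary privileges on it in Unity Catalog.
raw = spark.read.parquet("abfss://data@adlsxxxxxx.dfs.core.windows.net/taxidata/")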

Reading from databases is also possible using Unity Catalog Lakehouse Federation. You can set up connections to Azure databases like Azure SQL DB and Postgres Flexible Server and mirror their metadata in a foreign catalog. Data is then fetched at runtime from the remote database. Writing isn't currently supported.

Connections to Azure SQL DB support OAuth user-to-machine authentication. The credentials of the user who set up the connection will be used to query the database (not the notebook user's credentials). Postgres Flexible Server connections can currently only use basic authentication with local database users. Row-level security and column masking, for example based on Entra group membership, can be applied in Unity Catalog.
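
Once the connection and foreign catalog exist, the remote tables can be queried like any other catalog object. The catalog, schema and table names here are hypothetical:

# Hypothetical foreign catalog mirroring an Azure SQL DB.
trips = spark.sql("SELECT * FROM azuresql_catalog.dbo.trips LIMIT 10")
display(trips)

# Foreign catalogs are read-only - writes back to the remote database aren't supported.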

See this demo for a detailed walk-through of Unity Catalog authentication.

Device code authentication

Device code authentication allows you to pass the current user's credentials to an external resource, but it has the drawback of requiring user interaction. It is also considered high risk and can be abused in phishing attacks.

The user needs to manage authentication inside the notebook. The authentication library generates a code that the user has to enter (interactively) in a browser tab. This works in both legacy hive metastore workspaces and Unity Catalog workspaces and isn't limited to just storage accounts either. The code snippet below shows how to use device code authentication to authenticate to an Azure REST API. See the full solution here.

from azure.identity import DeviceCodeCredential
from azure.mgmt.resource import SubscriptionClient
import requests

# Device code flow: the credential prints a code and URL and the user completes
# the sign-in interactively in a browser tab.
credential = DeviceCodeCredential()

# List the subscriptions visible to the signed-in user
subscription_client = SubscriptionClient(credential)
subs = list(subscription_client.subscriptions.list())

if subs:
    subscriptionId = subs[0].subscription_id
else:
    raise RuntimeError("No subscriptions found. Does the security principal have the Reader role for at least 1 subscription?")

# Call the Azure REST API with a bearer token issued for the signed-in user
url = f"https://management.azure.com/subscriptions/{subscriptionId}/locations?api-version=2022-12-01"

# Set the authorization header
headers = {
    "Authorization": f"Bearer {credential.get_token('https://management.azure.com/.default').token}",
    "Content-Type": "application/json"
}

response = requests.get(url, headers=headers)
data = response.json()
Conclusion

None of the options we looked at currently provide a perfect solution for automatically passing Databricks notebook user credentials to remote data sources.

Unity Catalog can be a good option if Databricks makes up the bulk of your solution. However, it manages permissions in the application layer, which makes it difficult to apply a consistent, centrally managed set of permissions across multiple Azure services.

Device code authentication allows users to pass their credentials to any remote Azure data source but authentication requires user interaction, carries phishing risk and must be managed inside the notebook.

Users need to weigh the pros and cons of the different options and choose the best fit. In practice, this may mean mixing Unity Catalog, workspace managed identity, service principal and device code authentication, depending on the use case.
