We explore Databricks AAD passthrough authentication and look at the best options for replicating this behavior in a Unity Catalog-enabled workspace.


Before Unity Catalog, Databricks supported a feature called AAD credential passthrough. You defined permissions once at the source, on the storage account, and they were automatically applied in the background whenever a notebook accessed the data. Each user only saw the data they had access to when they ran the notebook. And since each user's own credentials were used to access the data, you could also easily audit which data each user was accessing.
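
For example, on a cluster with credential passthrough enabled (we'll set one up below), a notebook can read an abfss:// path directly, with no keys, SAS tokens or service principals in the code. This is just a minimal sketch; the storage account, container and folder names are placeholders:

# Minimal sketch, assuming a cluster with credential passthrough enabled.
# Account/container/folder names are placeholders - replace with your own.
# The request to the storage account is made with the AAD identity of the
# user running the notebook, so Azure RBAC on the data lake decides what they see.
df = spark.read.parquet("abfss://data@adlsxxxxxx.dfs.core.windows.net/taxidata/")
display(df.limit(10))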

With the introduction of Unity Catalog, permissions and access auditing have been pushed into the Databricks application, where user access is managed across workspaces. This could be a great option if you're mainly using Databricks. If the same data is being accessed from multiple Azure services, however, you'll need to manage permissions in both Azure and Unity Catalog and ensure that they stay in sync.

Let's have a look at how AAD credential passthrough works and then see whether we can replicate the same behavior in a Unity Catalog-enabled workspace.

AAD credential passthrough

Create data lake and assign permissions

First, create a data lake to hold the data. Open the Access Control (IAM) blade on the storage account and select 'Add role assignment' from the '+ Add' drop-down menu. [screenshot: storage access control]

Choose the Storage Blob Data Contributor role from the Roles tab to give read/write access to users. [screenshot: add role assignment]

On the Members tab, select the group, user or managed identity to assign the role to. [screenshot: select members]

Create Databricks workspace and cluster

Next, create a Databricks workspace with a hive metastore. Once it has been created, open the workspace and create a compute cluster. [screenshot: create compute]

Select "Shared" for the access mode. shared access mode

Under Advanced options at the bottom of the configuration page, check the box for "Enable credential passthrough for user-level data access". You'll see a notice that AAD credential passthrough is deprecated as of runtime version 15. [screenshot: enable passthrough]

For this demo, we're going to use a pre-version-15 runtime. [screenshot: runtime]

Note: both the shared access mode and enable credential passthrough settings are required. If you forget either of these, notebook execution will fail with a permissions error.

Run notebook in Databricks

Download the demo notebook from this repo.

Navigate to your user workspace in the workspace menu and right-click on it to import the notebook from file. [screenshot: import notebook]

Open the notebook, attach it to your cluster and enter the name of your data lake and container in the code cell at the top. [screenshot: attach notebook]

Finally, click Run All at the top. Your user credentials will be used to write the NYC yellow taxi data to your data lake. [screenshot: Run All]
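
Under the hood, the write boils down to something like the following. This is only a sketch (the repo's notebook may load the taxi data differently), and the account and container names are placeholders:

# Sample NYC yellow taxi data that ships with Databricks under /databricks-datasets.
taxi_df = (spark.read
           .option("header", "true")
           .csv("dbfs:/databricks-datasets/nyctaxi/tripdata/yellow/"))

datalake_name = "adlsxxxxxx"   # placeholder - your data lake name
container_name = "data"        # placeholder - your container name

# The write runs with the notebook user's AAD credentials, so it only succeeds
# if that user holds Storage Blob Data Contributor on the storage account.
(taxi_df.write
 .mode("overwrite")
 .parquet(f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/taxidata/"))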

Parameterize notebook

We can add parameters with defaults to the notebook to make it a bit more user-friendly.

Let's add parameters to the notebook for data lake name, container and taxi data year. We can do this with Databricks widgets.

From the Edit menu in the notebook, select Add Parameter. [screenshot: add parameters]

Add a parameter for each of the variables in the code cell at the top of the notebook. [screenshot: parameters]

Get each parameter value and assign it to the respective variable in the cell. [screenshot: assign to variables]

Alternatively, you can add the parameters in a code cell instead.

dbutils.widgets.text("datalake_name", "adlsxxxxxx")
dbutils.widgets.text("container_name", "data")
dbutils.widgets.text("folder_name", "taxidata")
dbutils.widgets.text("year", "-1", "-1 for all years or a specific year between 2009 - 2018")

datalake_name = dbutils.widgets.get("datalake_name")
container_name = dbutils.widgets.get("container_name")
folder_name = dbutils.widgets.get("folder_name")
year = int(dbutils.widgets.get("year"))

Click Run All to ensure the notebook still works after the changes. You can find the complete parameterized notebook here.
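
Once parameterized, the notebook can also be called from another notebook or a job with explicit values via dbutils.notebook.run. The path and parameter values below are hypothetical; adjust them to your workspace:

# Hypothetical notebook path and parameter values - adjust to your environment.
dbutils.notebook.run(
    "/Users/someone@example.com/passthrough-demo",  # placeholder notebook path
    600,                                            # timeout in seconds
    {
        "datalake_name": "adlsxxxxxx",
        "container_name": "data",
        "folder_name": "taxidata",
        "year": "-1",   # -1 for all years, or a specific year between 2009 and 2018
    },
)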

Can we pass through Data Factory Managed Identity Credentials too?

Great, we've got AAD credential passthrough working and we've even parameterized our notebook. What's next? Well, you're probably thinking: why not create a Data Factory and grant its managed identity permissions on the data lake? That way, we could manage all permissions directly on the data source. Then, regardless of whether it's a user or a managed identity running the notebook in a pipeline, those permissions should automatically be used to access the data, right? Sadly, this doesn't work. Sure, you can execute the notebook with the Data Factory's managed identity, but those AAD credentials aren't passed through to the storage account. Not with a job cluster (which you'd usually use to run notebook jobs in a pipeline), and not with the same shared cluster we used to run the notebook in our workspace either. Credential passthrough from Data Factory simply isn't supported, as you can see below.

First, create a Data Factory instance. [screenshot: create data factory]

Then, on the storage account, assign the Storage Blob Data Contributor role to the Data Factory's managed identity. [screenshots: MSI, managed identity]

On the Databricks workspace resource, assign the contributor role to the Data Factory managed identity. This will allow the Data Factory to access the clusters.

[screenshots: assign DBX contributor, contributor role, assign to ADF]

Now, navigate to the Data Factory studio and create a Databricks linked service with managed identity authentication. [screenshots: linked service, databricks LS]

In the dialog box that pops up, we'll enable interactive authoring so that we can configure and test the connection. Click on the pen icon next to the integration runtime drop-down. [screenshot: enable interactive]

Switch to the virtual network tab, enable interactive authoring and then hit apply. It will take a few minutes for a debug cluster to spin up in the background.

[screenshot: enable and apply]

Once interactive authoring has been activated, choose your Databricks workspace. Select "job" as the cluster type and choose managed identity authentication. [screenshot: choose workspace]

Choose the cluster runtime version, cluster node VM SKU and Python version to use. When you're finished configuring everything, test the connection. [screenshot: configure cluster]

Next, inside the Data Factory studio, create a pipeline with a Databricks notebook activity. [screenshot: create pipeline]

Find the Notebook tile under Databricks on the left and drag it onto the canvas. [screenshot: notebook activity]

Click on the activity inside the canvas if it's not already selected. Then, on the Azure Databricks tab in the window below, select the linked service you created earlier. [screenshot: select linked service]

Then, on the Settings tab, browse for your notebook in the notebook path field. [screenshot: browse for notebook]

Add parameter values to the activity. [screenshot: add parameters to activity]

Click Publish All to save your changes. [screenshot: publish all]

Click Debug in the taskbar above the canvas to run the pipeline. [screenshot: debug to test]

The pipeline runs and executes the notebook, but the managed identity's credentials aren't passed through to the data lake. Neither with the job cluster: [screenshot: job cluster fails]

Nor with the shared cluster we used in the Databricks workspace earlier to successfully pass through the user's credentials to the storage account: [screenshot: shared cluster fails]

Why didn't it work?

Credential passthrough only works on a shared cluster where passthrough authentication has been enabled, and it unfortunately isn't supported from Data Factory. You might be thinking, "well, maybe I can just authenticate using the azure-identity library inside the Python notebook and get it to use the Data Factory managed identity to connect to the storage account that way." The problem is that when you authenticate inside the notebook, it tries to use the Databricks managed identity, which doesn't exist pre-Unity Catalog. It simply isn't possible to pass the Data Factory managed identity credentials to the Databricks cluster. [screenshot: no databricks workspace identity]
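
For illustration, such an attempt would look roughly like the sketch below. ManagedIdentityCredential and the token scope are standard azure-identity building blocks rather than anything from the repo's notebook; on a pre-Unity Catalog cluster there is no Databricks workspace identity for the credential to pick up, and the Data Factory's identity is never forwarded to the cluster, so the token request doesn't get you anything that can reach the data lake.

from azure.identity import ManagedIdentityCredential

# Asks the cluster environment / instance metadata endpoint for a managed identity token.
# Pre-Unity Catalog there is no Databricks workspace identity to hand back, and the
# Data Factory identity isn't forwarded to the cluster, so this fails to produce a
# token with access to the storage account.
credential = ManagedIdentityCredential()
token = credential.get_token("https://storage.azure.com/.default")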

What options do we have to replicate AAD credential passthrough in a Unity Catalog-enabled Databricks workspace?

As we saw above, AAD credential passthrough worked for local notebook users connecting to storage accounts. What are our best options for mimicking this behavior in a Unity Catalog-enabled workspace?

Ideally, we want:

  • Entra ID user credentials to be transparently passed to the data source in the background, without the user having to manage this explicitly inside the notebook
  • RBAC permissions on the data source to be used, so that the same set of permissions is applied consistently across all Azure services accessing the data
  • support for any data source, not just storage accounts

Unity Catalog authentication

If you're mainly using Databricks, Unity Catalog could be a good option.

Unity Catalog uses storage credentials and external locations to manage access to storage accounts. Once they're set up, notebook users can focus on business logic and don't have to worry about authentication themselves. The downside is that these permissions only exist inside Unity Catalog. From the storage account's perspective, it's the Databricks Access Connector's system-assigned managed identity that's accessing the data, not the user. If other Azure services access the same data, you'll need to manage permissions in both Azure and Unity Catalog and keep them in sync.
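
Once an admin has created the storage credential and external location, reading the data from a notebook is straightforward. The catalog, schema, table and path names below are placeholders:

# Query a Unity Catalog table defined on the external location (placeholder names).
# Access checks happen inside Unity Catalog; the storage account only ever sees
# the Access Connector's managed identity.
df = spark.table("main.demo.taxidata")

# Reading the external location path directly also works, provided the user has
# been granted the necessary privileges on it in Unity Catalog.
raw = spark.read.parquet("abfss://data@adlsxxxxxx.dfs.core.windows.net/taxidata/")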

Reading from databases is also possible using Unity Catalog Lakehouse Federation. You can set up connections to Azure databases like Azure SQL DB and Postgres Flexible Server and mirror their metadata in a foreign catalog. Data is then fetched at runtime from the remote database. Writing isn't currently supported.

Connections to Azure SQL DB support OAuth user-to-machine authentication. The credentials of the user who set up the connection will be used to query the database (not the notebook user's credentials). Postgres Flexible Server connections can currently only use basic authentication with local database users. Row-level security and column masking, for example based on Entra group membership, can be applied in Unity Catalog.
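
Once the connection and foreign catalog exist, the remote tables can be queried like any other catalog object. The catalog, schema and table names here are hypothetical:

# Hypothetical foreign catalog mirroring an Azure SQL DB.
trips = spark.sql("SELECT * FROM azuresql_catalog.dbo.trips LIMIT 10")
display(trips)

# Foreign catalogs are read-only - writes back to the remote database aren't supported.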

See this demo for a detailed walk-through of Unity Catalog authentication.

Device code authentication

Device code authentication allows you to pass the current user's credentials to an external resource, but it has the drawback of requiring user interaction. It is also considered high risk and can be abused in phishing attacks.

The user needs to manage authentication inside the notebook. The authentication library generates a code that the user has to enter (interactively) in a browser tab. This works in both legacy hive metastore workspaces and Unity Catalog workspaces and isn't limited to just storage accounts either. The code snippet below shows how to use device code authentication to authenticate to an Azure REST API. See the full solution here.

from azure.identity import DeviceCodeCredential
from azure.mgmt.resource import SubscriptionClient
import requests

# Device code flow: the credential prints a code and URL and the user completes
# the sign-in interactively in a browser tab.
credential = DeviceCodeCredential()

# List the subscriptions visible to the signed-in user
subscription_client = SubscriptionClient(credential)
subs = list(subscription_client.subscriptions.list())

if subs:
    subscriptionId = subs[0].subscription_id
else:
    raise RuntimeError("No subscriptions found. Does the security principal have the Reader role for at least 1 subscription?")

# Call the Azure REST API with a bearer token issued for the signed-in user
url = f"https://management.azure.com/subscriptions/{subscriptionId}/locations?api-version=2022-12-01"

# Set the authorization header
headers = {
    "Authorization": f"Bearer {credential.get_token('https://management.azure.com/.default').token}",
    "Content-Type": "application/json"
}

response = requests.get(url, headers=headers)
data = response.json()
Conclusion

None of the options we looked at currently provide a perfect solution for automatically passing Databricks notebook user credentials to remote data sources.

Unity Catalog can be a good option if Databricks makes up the bulk of your solution. However, it manages permissions in the application layer, which makes it difficult to apply a consistent, centrally managed set of permissions across multiple Azure services.

Device code authentication allows users to pass their credentials to any remote Azure data source but authentication requires user interaction, carries phishing risk and must be managed inside the notebook.

Users need to weigh the pros and cons of the different options and choose the best fit. In practice, this may mean mixing Unity Catalog, workspace managed identity, service principal and device code authentication, depending on the use case.
