Commit cb5cfe8

ericvergnaud and nfx authored

Added static code analysis results to assessment dashboard (#2696)
## Changes

Add job/query problem widgets to the dashboard.
Add directfs access widget to the dashboard.

### Linked issues

Resolves #2595

### Functionality

None

### Tests

- [x] added integration tests using mock data
- [x] manually tested widgets, see below:

https://github.com/user-attachments/assets/3684c30f-761a-4de6-bc67-de650c5d5353

---------

Co-authored-by: Eric Vergnaud <eric.vergnaud@databricks.com>
Co-authored-by: Serge Smertin <259697+nfx@users.noreply.github.com>
1 parent 564e504 commit cb5cfe8

15 files changed: +344 / -27 lines changed

README.md

Lines changed: 12 additions & 4 deletions
@@ -396,6 +396,9 @@ which can be used for further analysis and decision-making through the [assessme
 9. `assess_pipelines`: This task scans through all the Pipelines and identifies those pipelines that have Azure Service Principals embedded in their configurations. A list of all the pipelines with matching configurations is stored in the `$inventory.pipelines` table.
 10. `assess_azure_service_principals`: This task scans through all the clusters configurations, cluster policies, job cluster configurations, Pipeline configurations, and Warehouse configuration and identifies all the Azure Service Principals who have been given access to the Azure storage accounts via spark configurations referred in those entities. The list of all the Azure Service Principals referred in those configurations is saved in the `$inventory.azure_service_principals` table.
 11. `assess_global_init_scripts`: This task scans through all the global init scripts and identifies if there is an Azure Service Principal who has been given access to the Azure storage accounts via spark configurations referred in those scripts.
+12. `assess_dashboards`: This task scans through all the dashboards and analyzes embedded queries for migration problems. It also collects direct filesystem access patterns that require attention.
+13. `assess_workflows`: This task scans through all the jobs and tasks and analyzes notebooks and files for migration problems. It also collects direct filesystem access patterns that require attention.
 
 ![report](docs/assessment-report.png)

@@ -711,11 +714,16 @@ in the Migration dashboard.
 
 > Please note that this is an experimental workflow.
 
-The `experimental-workflow-linter` workflow lints accessible code belonging to all workflows/jobs present in the
-workspace. The linting emits problems indicating what to resolve for making the code Unity Catalog compatible.
+The `experimental-workflow-linter` workflow lints accessible code from two sources:
+- all workflows/jobs present in the workspace
+- all dashboards/queries present in the workspace
+The linting emits problems indicating what to resolve to make the code Unity Catalog compatible.
+The linting also locates direct filesystem accesses that need to be migrated.
 
-Once the workflow completes, the output will be stored in `$inventory_database.workflow_problems` table, and displayed
-in the Migration dashboard.
+Once the workflow completes:
+- problems are stored in the `$inventory_database.workflow_problems`/`$inventory_database.query_problems` tables
+- direct filesystem accesses are stored in the `$inventory_database.directfs_in_paths`/`$inventory_database.directfs_in_queries` tables
+- all of the above are displayed in the Migration dashboard.
 
 ![code compatibility problems](docs/code_compatibility_problems.png)

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+## Code compatibility problems
+
+The tables below assist with verifying whether workflows and dashboards are Unity Catalog compatible. They can be filtered on the path,
+problem code and workflow name.
+Each row:
+- Points to a problem detected in the code using the code path, query or workflow & task reference and start/end line & column;
+- Explains the problem with a human-readable message and a code.
Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
+/*
+--title 'Workflow migration problems'
+--width 6
+--overrides '{"spec":{
+  "encodings":{
+    "columns": [
+      {"fieldName": "path", "booleanValues": ["false", "true"], "linkUrlTemplate": "/#workspace/{{ link }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "path"},
+      {"fieldName": "code", "booleanValues": ["false", "true"], "type": "string", "displayAs": "string", "title": "code"},
+      {"fieldName": "message", "booleanValues": ["false", "true"], "type": "string", "displayAs": "string", "title": "message"},
+      {"fieldName": "workflow_name", "booleanValues": ["false", "true"], "linkUrlTemplate": "/jobs/{{ workflow_id }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "workflow_name"},
+      {"fieldName": "task_key", "booleanValues": ["false", "true"], "imageUrlTemplate": "{{ @ }}", "linkUrlTemplate": "/jobs/{{ workflow_id }}/tasks/{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "task_key"},
+      {"fieldName": "start_line", "booleanValues": ["false", "true"], "type": "integer", "displayAs": "number", "title": "start_line"},
+      {"fieldName": "start_col", "booleanValues": ["false", "true"], "type": "integer", "displayAs": "number", "title": "start_col"},
+      {"fieldName": "end_line", "booleanValues": ["false", "true"], "type": "integer", "displayAs": "number", "title": "end_line"},
+      {"fieldName": "end_col", "booleanValues": ["false", "true"], "type": "integer", "displayAs": "number", "title": "end_col"}
+    ]},
+  "invisibleColumns": [
+    {"name": "link", "booleanValues": ["false", "true"], "linkUrlTemplate": "{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "link"},
+    {"name": "workflow_id", "booleanValues": ["false", "true"], "linkUrlTemplate": "{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "workflow_id"}
+  ]
+}}'
+*/
+SELECT
+  substring_index(path, '@databricks.com/', -1) as path,
+  path as link,
+  code,
+  message,
+  job_id AS workflow_id,
+  job_name AS workflow_name,
+  task_key,
+  start_line,
+  start_col,
+  end_line,
+  end_col
+FROM inventory.workflow_problems
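The `substring_index(path, '@databricks.com/', -1)` projection above shortens workspace paths by dropping everything up to and including the user-email segment, while the full path is kept in the invisible `link` column for the URL template. A minimal Python sketch of that behavior (the function name is illustrative, not part of the PR):

```python
def short_path(path: str) -> str:
    """Mimic Spark SQL's substring_index(path, '@databricks.com/', -1)."""
    marker = "@databricks.com/"
    # A negative count keeps everything after the LAST occurrence of the
    # delimiter; if the delimiter is absent, the whole string is returned.
    idx = path.rfind(marker)
    return path if idx == -1 else path[idx + len(marker):]

print(short_path("/Users/jane.doe@databricks.com/etl/notebook.py"))  # etl/notebook.py
print(short_path("dbfs:/tmp/data.csv"))                              # dbfs:/tmp/data.csv
```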
Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+/*
+--title 'Dashboard compatibility problems'
+--width 6
+--overrides '{"spec":{
+  "encodings":{
+    "columns": [
+      {"fieldName": "code", "booleanValues": ["false", "true"], "type": "string", "displayAs": "string", "title": "code"},
+      {"fieldName": "message", "booleanValues": ["false", "true"], "type": "string", "displayAs": "string", "title": "message"},
+      {"fieldName": "dashboard_name", "booleanValues": ["false", "true"], "linkUrlTemplate": "/sql/dashboards/{{ dashboard_id }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "Dashboard"},
+      {"fieldName": "query_name", "booleanValues": ["false", "true"], "linkUrlTemplate": "/sql/editor/{{ query_id }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "Query"}
+    ]},
+  "invisibleColumns": [
+    {"name": "dashboard_parent", "booleanValues": ["false", "true"], "linkUrlTemplate": "{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "dashboard_parent"},
+    {"name": "dashboard_id", "booleanValues": ["false", "true"], "linkUrlTemplate": "{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "dashboard_id"},
+    {"name": "query_parent", "booleanValues": ["false", "true"], "linkUrlTemplate": "{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "query_parent"},
+    {"name": "query_id", "booleanValues": ["false", "true"], "linkUrlTemplate": "{{ @ }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "type": "string", "displayAs": "link", "title": "query_id"}
+  ]
+}}'
+*/
+SELECT
+  dashboard_id,
+  dashboard_parent,
+  dashboard_name,
+  query_id,
+  query_parent,
+  query_name,
+  code,
+  message
+FROM inventory.query_problems
Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+---
+height: 4
+---
+
+# Direct filesystem access problems
+
+The table below assists with verifying whether workflows and dashboards require direct filesystem access.
+As a reminder, `dbfs:/` is not supported in Unity Catalog, and more generally direct filesystem access is discouraged.
+Rather, data should be accessed via Unity tables.
+
+Each row:
+- Points to a direct filesystem access detected in the code using the code path, query or workflow & task reference and start/end line & column;
+- Provides the _lineage_, i.e. which `workflow -> task -> notebook...` execution sequence leads to that access.
Lines changed: 62 additions & 0 deletions

@@ -0,0 +1,62 @@
+/*
+--title 'Direct filesystem access problems'
+--width 6
+--overrides '{"spec":{
+  "encodings":{
+    "columns": [
+      {"fieldName": "location", "title": "location", "type": "string", "displayAs": "string", "booleanValues": ["false", "true"]},
+      {"fieldName": "is_read", "title": "is_read", "type": "boolean", "displayAs": "boolean", "booleanValues": ["false", "true"]},
+      {"fieldName": "is_write", "title": "is_write", "type": "boolean", "displayAs": "boolean", "booleanValues": ["false", "true"]},
+      {"fieldName": "source", "title": "source", "type": "string", "displayAs": "link", "linkUrlTemplate": "{{ source_link }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "booleanValues": ["false", "true"]},
+      {"fieldName": "timestamp", "title": "last_modified", "type": "datetime", "displayAs": "datetime", "dateTimeFormat": "ll LTS (z)", "booleanValues": ["false", "true"]},
+      {"fieldName": "lineage", "title": "lineage", "type": "string", "displayAs": "link", "linkUrlTemplate": "{{ lineage_link }}", "linkTextTemplate": "{{ @ }}", "linkTitleTemplate": "{{ @ }}", "linkOpenInNewTab": true, "booleanValues": ["false", "true"]},
+      {"fieldName": "lineage_data", "title": "lineage_data", "type": "complex", "displayAs": "json", "booleanValues": ["false", "true"]},
+      {"fieldName": "assessment_start", "title": "assessment_start", "type": "datetime", "displayAs": "datetime", "dateTimeFormat": "ll LTS (z)", "booleanValues": ["false", "true"]},
+      {"fieldName": "assessment_end", "title": "assessment_end", "type": "datetime", "displayAs": "datetime", "dateTimeFormat": "ll LTS (z)", "booleanValues": ["false", "true"]}
+    ]},
+  "invisibleColumns": [
+    {"fieldName": "source_link", "title": "source_link", "type": "string", "displayAs": "string", "booleanValues": ["false", "true"]},
+    {"fieldName": "lineage_type", "title": "lineage_type", "type": "string", "displayAs": "string", "booleanValues": ["false", "true"]},
+    {"fieldName": "lineage_id", "title": "lineage_id", "type": "string", "displayAs": "string", "booleanValues": ["false", "true"]},
+    {"fieldName": "lineage_link", "title": "lineage_link", "type": "string", "displayAs": "string", "booleanValues": ["false", "true"]}
+  ]
+}}'
+*/
+SELECT
+  path as location,
+  is_read,
+  is_write,
+  if(startswith(source_id, '/'), substring_index(source_id, '@databricks.com/', -1), split_part(source_id, '/', 2)) as source,
+  if(startswith(source_id, '/'), concat('/#workspace/', source_id), concat('/sql/editor/', split_part(source_id, '/', 2))) as source_link,
+  source_timestamp as `timestamp`,
+  case
+    when lineage.object_type = 'WORKFLOW' then concat('Workflow: ', lineage.other.name)
+    when lineage.object_type = 'TASK' then concat('Task: ', split_part(lineage.object_id, '/', 2))
+    when lineage.object_type = 'NOTEBOOK' then concat('Notebook: ', substring_index(lineage.object_id, '@databricks.com/', -1))
+    when lineage.object_type = 'FILE' then concat('File: ', substring_index(lineage.object_id, '@databricks.com/', -1))
+    when lineage.object_type = 'DASHBOARD' then concat('Dashboard: ', lineage.other.name)
+    when lineage.object_type = 'QUERY' then concat('Query: ', lineage.other.name)
+  end as lineage,
+  lineage.object_type as lineage_type,
+  lineage.object_id as lineage_id,
+  case
+    when lineage.object_type = 'WORKFLOW' then concat('/jobs/', lineage.object_id)
+    when lineage.object_type = 'TASK' then concat('/jobs/', split_part(lineage.object_id, '/', 1), '/tasks/', split_part(lineage.object_id, '/', 2))
+    when lineage.object_type = 'NOTEBOOK' then concat('/#workspace/', lineage.object_id)
+    when lineage.object_type = 'FILE' then concat('/#workspace/', lineage.object_id)
+    when lineage.object_type = 'DASHBOARD' then concat('/sql/dashboards/', lineage.object_id)
+    when lineage.object_type = 'QUERY' then concat('/sql/editor/', split_part(lineage.object_id, '/', 2))
+  end as lineage_link,
+  lineage.other as lineage_data,
+  assessment_start,
+  assessment_end
+from (
+  SELECT
+    path,
+    is_read,
+    is_write,
+    source_id,
+    source_timestamp,
+    explode(source_lineage) as lineage,
+    assessment_start_timestamp as assessment_start,
+    assessment_end_timestamp as assessment_end
+  FROM inventory.directfs)

src/databricks/labs/ucx/source_code/graph.py

Lines changed: 2 additions & 1 deletion

@@ -337,7 +337,8 @@ def __repr__(self):
 
     @property
     def lineage(self) -> list[LineageAtom]:
-        return [LineageAtom(object_type="PATH", object_id=str(self.path))]
+        object_type = "NOTEBOOK" if is_a_notebook(self.path) else "FILE"
+        return [LineageAtom(object_type=object_type, object_id=str(self.path))]
 
 
 class SourceContainer(abc.ABC):
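The `graph.py` change replaces the generic `PATH` lineage type with `NOTEBOOK` or `FILE`, the two types the dashboard widgets match on. A self-contained sketch, with `LineageAtom` and `is_a_notebook` stubbed for illustration (the real `is_a_notebook` in ucx inspects the file itself; the suffix check here is only an assumption for the demo):

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class LineageAtom:
    object_type: str
    object_id: str
    other: dict = field(default_factory=dict)

def is_a_notebook(path: Path) -> bool:
    # Stub for this sketch only: treat .ipynb files as notebooks.
    return path.suffix == ".ipynb"

def lineage(path: Path) -> list[LineageAtom]:
    # The PR's change: report NOTEBOOK or FILE instead of the generic PATH.
    object_type = "NOTEBOOK" if is_a_notebook(path) else "FILE"
    return [LineageAtom(object_type=object_type, object_id=str(path))]

print(lineage(Path("/Workspace/etl/clean.ipynb"))[0].object_type)  # NOTEBOOK
print(lineage(Path("/Workspace/etl/util.sql"))[0].object_type)     # FILE
```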

src/databricks/labs/ucx/source_code/jobs.py

Lines changed: 4 additions & 4 deletions

@@ -86,8 +86,8 @@ def __repr__(self):
     @property
     def lineage(self) -> list[LineageAtom]:
         job_name = (None if self._job.settings is None else self._job.settings.name) or "unknown job"
-        job_lineage = LineageAtom("JOB", str(self._job.job_id), {"name": job_name})
-        task_lineage = LineageAtom("TASK", self._task.task_key)
+        job_lineage = LineageAtom("WORKFLOW", str(self._job.job_id), {"name": job_name})
+        task_lineage = LineageAtom("TASK", f"{self._job.job_id}/{self._task.task_key}")
         return [job_lineage, task_lineage]
 
 
@@ -469,8 +469,8 @@ def _collect_task_dfsas(
         job_name = job.settings.name if job.settings and job.settings.name else "<anonymous>"
         for dfsa in DfsaCollectorWalker(graph, set(), self._path_lookup, session_state):
             atoms = [
-                LineageAtom(object_type="JOB", object_id=job_id, other={"name": job_name}),
-                LineageAtom(object_type="TASK", object_id=task.task_key),
+                LineageAtom(object_type="WORKFLOW", object_id=job_id, other={"name": job_name}),
+                LineageAtom(object_type="TASK", object_id=f"{job_id}/{task.task_key}"),
            ]
            yield dataclasses.replace(dfsa, source_lineage=atoms + dfsa.source_lineage)
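Two things change in `jobs.py`: the job atom's type is renamed from `JOB` to `WORKFLOW` (matching the widget's CASE expressions), and the task atom's `object_id` now embeds the job id. The second matters because the dashboard must build a `/jobs/<job_id>/tasks/<task_key>` link from the atom alone. A hedged sketch of that round trip (helper names are illustrative, not from the PR):

```python
def task_object_id(job_id: int, task_key: str) -> str:
    # The PR's encoding: the TASK atom carries '<job_id>/<task_key>'.
    return f"{job_id}/{task_key}"

def task_link(object_id: str) -> str:
    # What the widget SQL does with split_part(object_id, '/', 1) and '/', 2).
    job_id, task_key = object_id.split("/", 1)
    return f"/jobs/{job_id}/tasks/{task_key}"

print(task_link(task_object_id(123, "ingest")))  # /jobs/123/tasks/ingest
```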

src/databricks/labs/ucx/source_code/queries.py

Lines changed: 4 additions & 4 deletions

@@ -66,7 +66,7 @@ def refresh_report(self, sql_backend: SqlBackend, inventory_database: str):
             linted_queries.add(query.id)
             problems = self.lint_query(query)
             all_problems.extend(problems)
-            dfsas = self.collect_dfsas_from_query(query)
+            dfsas = self.collect_dfsas_from_query("no-dashboard-id", query)
             all_dfsas.extend(dfsas)
         # dump problems
         logger.info(f"Saving {len(all_problems)} linting problems...")
@@ -123,7 +123,7 @@ def _lint_and_collect_from_dashboard(
                     dashboard_name=dashboard_name,
                 )
             )
-            dfsas = self.collect_dfsas_from_query(query)
+            dfsas = self.collect_dfsas_from_query(dashboard_id, query)
             for dfsa in dfsas:
                 atom = LineageAtom(
                     object_type="DASHBOARD",
@@ -155,11 +155,11 @@ def lint_query(self, query: LegacyQuery) -> Iterable[QueryProblem]:
         )
 
     @classmethod
-    def collect_dfsas_from_query(cls, query: LegacyQuery) -> Iterable[DirectFsAccess]:
+    def collect_dfsas_from_query(cls, dashboard_id: str, query: LegacyQuery) -> Iterable[DirectFsAccess]:
         if query.query is None:
             return
         linter = DirectFsAccessSqlLinter()
-        source_id = query.id or "no id"
+        source_id = f"{dashboard_id}/{query.id}"
         source_name = query.name or "<anonymous>"
         source_timestamp = cls._read_timestamp(query.updated_at)
         source_lineage = [LineageAtom(object_type="QUERY", object_id=source_id, other={"name": source_name})]
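With this change a query's `source_id` is always `<dashboard_id>/<query_id>`, using the literal sentinel `"no-dashboard-id"` for queries linted outside a dashboard, so the directfs widget can recover the query id with `split_part(source_id, '/', 2)`. A sketch of the encoding and the link it enables (helper names are illustrative):

```python
def make_source_id(query_id: str, dashboard_id: str = "no-dashboard-id") -> str:
    # Matches collect_dfsas_from_query: f"{dashboard_id}/{query.id}".
    return f"{dashboard_id}/{query_id}"

def editor_link(source_id: str) -> str:
    # Matches the widget SQL: concat('/sql/editor/', split_part(source_id, '/', 2)).
    return "/sql/editor/" + source_id.split("/", 1)[1]

print(editor_link(make_source_id("q42", "abc123")))  # /sql/editor/q42
print(make_source_id("q42"))                         # no-dashboard-id/q42
```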
