Limit crawl workflows task in assessment to workflows that ran in the last 30 days. #3963

Merged
merged 2 commits into main from fix/limit-workflow-crawler on Apr 30, 2025

Conversation

FastLee
Contributor

@FastLee FastLee commented Mar 26, 2025

Fix #3960

@FastLee FastLee changed the title from "Addressed issues, added tests." to "Limit crawl workflows task in assessment to workflows that ran in the last 30 days." Apr 1, 2025
@FastLee FastLee force-pushed the fix/limit-workflow-crawler branch from 840eb7e to 00f21cc on April 28, 2025 14:05
@FastLee FastLee marked this pull request as ready for review April 28, 2025 14:09
@FastLee FastLee requested a review from a team as a code owner April 28, 2025 14:09

✅ 87/87 passed, 17 skipped, 2h17m4s total

Running from acceptance #8579

@FastLee FastLee enabled auto-merge April 28, 2025 15:30
@@ -181,7 +181,7 @@ def assess_workflows(self, ctx: RuntimeContext):
         """Scans all jobs for migration issues in notebooks.
         Also stores direct filesystem accesses for display in the migration dashboard."""
         # TODO: Ensure these are captured in the history log.
-        ctx.workflow_linter.refresh_report(ctx.sql_backend, ctx.inventory_database)
+        ctx.workflow_linter.refresh_report(ctx.sql_backend, ctx.inventory_database, last_run_days=30)
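For context, a minimal standalone sketch of what passing `last_run_days=30` amounts to: jobs whose most recent run falls outside a 30-day look-back window, or that never ran, are skipped during assessment. `JobRunRecord` and `ran_within_window` are illustrative names, not UCX APIs; the actual filtering lives inside `WorkflowLinter.refresh_report`.

```python
# Illustrative only: the 30-day cutoff applied by refresh_report(..., last_run_days=30).
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class JobRunRecord:
    job_id: int
    last_run: datetime | None  # timestamp of the most recent run, if any


def ran_within_window(record: JobRunRecord, last_run_days: int = 30) -> bool:
    """Return True when the job ran inside the look-back window."""
    if record.last_run is None:
        return False  # never ran: skip it during assessment
    cutoff = datetime.now(timezone.utc) - timedelta(days=last_run_days)
    return record.last_run >= cutoff


# Example: only jobs that ran in the last 30 days are linted.
records = [
    JobRunRecord(job_id=101, last_run=datetime.now(timezone.utc) - timedelta(days=3)),
    JobRunRecord(job_id=202, last_run=datetime.now(timezone.utc) - timedelta(days=90)),
    JobRunRecord(job_id=303, last_run=None),
]
to_lint = [r.job_id for r in records if ran_within_window(r, last_run_days=30)]
print(to_lint)  # [101]
```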
We should make `last_run_days` configurable and supplied by the customer, in case they have different requirements, e.g. the last 2-3 months.
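A hedged sketch of that suggestion, assuming a look-back window supplied at install time; `AssessmentConfig` and `linter_last_run_days` are hypothetical names, not existing UCX configuration keys.

```python
# Hypothetical sketch: read the look-back window from installer-supplied
# configuration instead of hard-coding 30 days.
from dataclasses import dataclass


@dataclass
class AssessmentConfig:
    # Customers with different requirements (e.g. the last 2-3 months)
    # could override this at install time.
    linter_last_run_days: int = 30


def assess_workflows(config: AssessmentConfig, workflow_linter, sql_backend, inventory_database) -> None:
    # The hard-coded value becomes a pass-through of the configured one.
    workflow_linter.refresh_report(
        sql_backend,
        inventory_database,
        last_run_days=config.linter_last_run_days,
    )
```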

@FastLee FastLee added this pull request to the merge queue Apr 30, 2025
@FastLee FastLee requested a review from pritishpai April 30, 2025 14:18
Merged via the queue into main with commit 6161f99 Apr 30, 2025
9 checks passed
@FastLee FastLee deleted the fix/limit-workflow-crawler branch April 30, 2025 14:21
gueniai added a commit that referenced this pull request May 6, 2025
* Adds requirement for matching account groups to be created before assessment to the docs ([#4017](#4017)). The account group setup requirements have been clarified to ensure successful assessment and group migration workflows, mandating that account groups matching workspace local groups are created beforehand, which can be achieved manually or programmatically via various methods. The assessment workflow has been enhanced to retrieve workspace assets and securable objects from the Hive metastore for compatibility assessment with UC, storing the results in the inventory database for further analysis. Additionally, the documentation now stresses the necessity of running the `validate-groups-membership` command prior to initiating the group migration workflow, and recommends running the `create-account-groups` command beforehand if the required account groups do not already exist, to guarantee a seamless execution of the assessment and migration processes.
* Fixed Service Principal instructions for installation ([#3967](#3967)). The installation requirements for UCX have been updated to reflect changes in Service Principal support, where it is no longer supported for workspace installations, but may be supported for account-level installations. As a result, account-level identity setup now requires connection via Service Principal with Account Admin and Workspace Admin privileges in all workspaces. All other installation requirements remain unchanged, including the need for a Databricks Premium or Enterprise workspace, network access to the Databricks Workspace and the Internet, a created Unity Catalog Metastore, and a PRO or Serverless SQL Warehouse for rendering reports. Additionally, users with external Hive Metastores, such as AWS Glue, must consult the relevant guide for specific instructions to ensure proper setup.
* Fixed migrate tables when default catalog is set ([#4012](#4012)). The handling of the default catalog in the Hive metastore has been enhanced to ensure correct behavior when the default catalog is set. Specifically, the `DESCRIBE SCHEMA EXTENDED` and `SHOW TBLPROPERTIES` queries have been updated to include the `hive_metastore` prefix when fetching database descriptions and constructing table identifiers, respectively, unless the table is located in a mount point, in which case the `delta` prefix is used. This change addresses a previously reported issue with migrating tables when the default catalog is set, ensuring that table properties are correctly fetched and tables are properly identified. The update has been applied to multiple test cases, including those for skipping tables, upgraded tables, and mapping tables, to guarantee correct execution of queries with the default catalog name, which is essential when the default catalog is set to `hive_metastore`.
* Limit crawl workflows task in assessment to workflows that ran in the last 30 days ([#3963](#3963)). The `JobInfo` class has been enhanced with a new `last_run` attribute that stores the timestamp of the job's most recent run, allowing for better monitoring and assessment, and the `from_job` method has been updated to initialize this attribute consistently (a minimal sketch follows this list). The `assess_workflows` method now filters workflows to only include those that have run within the last 30 days, achieved through a new `last_run_days` parameter on the `refresh_report` method. This parameter enables time-based filtering of job runs, and a new inner function `lint_job_limited` handles the filtering logic. The `lint_job` method has also been updated to accept the `last_run_days` parameter and check whether a job has run within the specified time frame. Furthermore, a new test, `test_workflow_linter_refresh_report_time_bound`, verifies that the `WorkflowLinter` produces the expected results and writes to the correct tables when limited to recent workflow runs.
* Pause migration progress workflow schedule ([#3995](#3995)). The migration progress workflow schedule is now paused by default, with its `pause_status` set to `PAUSED`, to prevent automatic execution and potential failures due to missing prerequisites. This change is driven by the experimental nature of the workflow, which may fail if a UCX catalog has not been created by the customer. To ensure successful execution, users are advised to unpause the workflow after running the `create-ucx-catalog` command, allowing them to control when the workflow runs and verify that necessary prerequisites are in place.
* Skip integration tests with legacy dashboard creation as it is deprecated ([#4009](#4009)). The configuration has been updated with a new section for `mypy_extensions` containing an empty list, in addition to the existing `mypy-extensions` section, introducing redundancy. Furthermore, several integration tests related to legacy dashboard creation have been skipped due to deprecation, as indicated by Databricks no longer supporting this feature. The skipped tests include those for dashboard creation, migration, query linter functionality, and job progress encoder failures, all of which have been marked with a skip reason citing the deprecation of legacy dashboard creation. These changes are temporary and will be revisited in collaboration with the product team to potentially enable these tests only for Labs workspaces, with the ultimate goal of accommodating the deprecation of legacy dashboard creation and ensuring compatibility with future developments.
* Warns instead of an error while finding an acc group in workspace ([#4016](#4016)). The behavior of the account group reflection functionality has been updated to handle duplicate groups more robustly. When encountering a group that already exists in the workspace, the function now logs a warning instead of an error, allowing it to continue executing uninterrupted. This change accommodates the introduction of nested account groups from workspace local groups, which can lead to groups being present in the workspace that are also being migrated. The warning message clearly indicates that the group is being skipped due to its existing presence in the workspace, providing transparency into the reflection process.
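To make the `JobInfo`/`last_run` change in the [#3963](#3963) entry above concrete, here is a simplified, hypothetical stand-in; the field set and the `from_job` signature are illustrative, not the exact UCX definitions.

```python
# Simplified, hypothetical stand-in for the JobInfo change described above.
# Field names other than `last_run` are illustrative, not the exact UCX schema.
from dataclasses import dataclass


@dataclass
class JobInfo:
    job_id: str
    success: int
    failures: str
    job_name: str | None = None
    creator: str | None = None
    last_run: int | None = None  # epoch milliseconds of the most recent run, if known

    @classmethod
    def from_job(cls, job, last_run: int | None = None) -> "JobInfo":
        # `job` is assumed to be a Databricks SDK BaseJob-like object;
        # the caller supplies the latest run timestamp it has already looked up.
        settings = getattr(job, "settings", None)
        return cls(
            job_id=str(job.job_id),
            success=1,
            failures="[]",
            job_name=getattr(settings, "name", None),
            creator=getattr(job, "creator_user_name", None),
            last_run=last_run,
        )
```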
@gueniai gueniai mentioned this pull request May 6, 2025
gueniai added a commit that referenced this pull request May 6, 2025
@pritishpai pritishpai mentioned this pull request May 7, 2025
1 task
github-merge-queue bot pushed a commit that referenced this pull request May 8, 2025
## Changes
Added missing documentation for #3963

### Linked issues
#3960 

### Functionality

- [x] added relevant user documentation

Co-authored-by: Liran Bareket <liran.bareket@databricks.com>
gueniai added a commit that referenced this pull request Jul 3, 2025
* Added documentation for [#3963](#3963) ([#4020](#4020)). The workflow assessment functionality has been enhanced with an experimental task that analyzes recently executed workflows for migration problems, providing links to relevant documentation and recommendations for addressing identified issues. This task, labeled as experimental, now only runs for workflows executed within the last 30 days, but users can opt to analyze all workflows by running a specific workflow. The assessment findings, including any migration problems detected, can be viewed in the assessment dashboard, offering a centralized location for monitoring and addressing potential issues, and helping users to ensure a smoother migration process.
* Enhance documentation for UCX ([#4024](#4024)). The UCX documentation has undergone significant enhancements to improve user experience and provide comprehensive guidance for contributors and users. The main page has been revamped with a simplified and accelerated Unity Catalog migration process, featuring a prominent call-to-action and key features such as comprehensive assessment, automated migrations, and detailed reporting. Additional pages, including a `Getting Started` section, have been added to guide users through installation, running, and operating the toolkit, with links to relevant sections such as installation, running, and reference materials. The contributing section has been updated for consistency, and a new `How to Contribute` section has been introduced, providing clear resources for submitting issues, pull requests, and contributing to the documentation. The documentation structure has been reorganized, with updated sidebar positions and revised descriptions to better reflect the content and purpose of each section, ultimately aiming to provide better user documentation, clarity, and a more intuitive interface for navigating and utilizing the UCX toolkit.
* Fixes fmt and unit test failures from new blueprint release ([#4048](#4048)). The dependency on the databricks-labs-blueprint has been updated to a version range of 0.11.0 or higher but less than 0.12.0, incorporating new features and bug fixes from the latest blueprint release. To ensure compatibility with this updated version, the codebase has been updated to address breaking changes introduced in the recent blueprint release, including the addition of type hints to MockInstallation and the DEFAULT_CONFIG variable, which is now defined as a dictionary with string keys and RootJsonValue values. Furthermore, a previously failing unit test has been fixed, and the test_installation_recovers_invalid_dashboard function has been refactored into two separate test functions to verify the recovery of invalid dashboards due to InvalidParameterValue and NotFound exceptions, utilizing the MockInstallation class and caplog fixture to capture and verify log messages. These changes aim to resolve issues with the new blueprint release, enable previously failing acceptance tests, and improve the overall robustness of the installation process.
* Updated for Databricks SDK 0.56+ ([#4178](#4178)). The project's dependencies have been updated to support Databricks SDK version 0.56 and above, with the upper bound set to less than 0.58.0, to ensure compatibility with the evolving SDK. This update includes breaking changes, and as a result, various modifications have been made to the code, such as adding type hints to functions to improve linting, replacing `PermissionsList` with `GetPermissionsResponse`, and accessing `SecurableType` enum values using the `value` attribute. Additionally, several test functions have been updated to reflect these changes, including the addition of return type hints and the use of `create_autospec` to create mock objects. These updates aim to maintain the project's functionality and ensure seamless compatibility with the latest Databricks SDK version, while also improving code quality and test coverage. The changes affect various aspects of the code, including grants management, permissions retrieval, and test cases for different scenarios, such as migrating managed tables, external tables, and tables in mounts.
* Workaround for acceptance with dependabot PR ([#4029](#4029)). The library's dependency on a key SDK has been updated to support a broader version range, now compatible with versions from 0.44.0 up to but not including 0.54.0, enhancing flexibility and potentially allowing for the incorporation of new features or bug fixes introduced in these versions. Additionally, an internal function responsible for setting default catalog settings has been refined to handle `NotFound` exceptions more robustly. Specifically, the function now checks for the presence of metadata before attempting to retrieve an etag, preventing potential errors that could occur when metadata is missing, thereby improving the overall stability and reliability of the library.
@gueniai gueniai mentioned this pull request Jul 3, 2025
gueniai added a commit that referenced this pull request Jul 3, 2025
Development

Successfully merging this pull request may close these issues.

assess_workflows task fails frequently when the workspace has a large number of workflows.