Skip to content

Optimize resolve_dependencies for large workflows #126

Open
@incorvia

Description

@incorvia

Description

The current implementation of the resolve_dependencies method in the workflow is inefficient for workflows with a large number of jobs and dependencies. Specifically, the method uses find_job, which performs a linear search over the jobs array for every dependency. This results in a time complexity of O(n * d), where n is the number of jobs and d is the number of dependencies.

For workflows with thousands or hundreds of thousands of jobs and dependencies, this becomes a significant bottleneck, causing high execution times during workflow creation.


Steps to Reproduce

  1. Create a workflow with a large number of jobs (e.g., 100,000).
  2. Define dependencies between these jobs.
  3. Measure the time taken for the resolve_dependencies method to complete.

Impact

  • Performance degrades significantly as the number of jobs and dependencies increases.
  • For workflows with 100,000 jobs and a proportional number of dependencies, this can take tens of seconds or longer, impacting user experience and system throughput.
  • In fact, I was unable to get a workflow with 100k dependencies to resolve within the bounds of my patience!

Proposed Solution

To optimize the resolve_dependencies method, we introduce a precomputed lookup table (@job_lookup) stored in the workflow's state. This lookup table allows all calls to find_job to perform O(1) lookups, rather than performing O(n) linear searches through the jobs array. The original regex-based fallback logic is preserved.

Updated Implementation

def resolve_dependencies
  @dependencies.each do |dependency|
    from = find_job(dependency[:from])
    to   = find_job(dependency[:to])

    to.incoming << dependency[:from]
    from.outgoing << dependency[:to]
  end
end

# Precompute the lookup table
def build_job_lookup
  @job_lookup = {
    klass: jobs.each_with_object({}) { |job, lookup| lookup[job.klass.to_s] = job },
    name: jobs.each_with_object({}) { |job, lookup| lookup[job.name.to_s] = job }
  }
end

# Optimized find_job with regex-based fallback
def find_job(identifier)
  # Build the lookup table if it doesn't exist
  build_job_lookup unless @job_lookup

  # Use regex to determine which lookup to use
  match_data = /(?<klass>\w*[^-])-(?<identifier>.*)/.match(identifier.to_s)

  if match_data.nil?
    # Lookup by klass if the pattern doesn't match
    @job_lookup[:klass][identifier.to_s]
  else
    # Lookup by name if the pattern matches
    @job_lookup[:name][identifier.to_s]
  end
end

Changes

  1. Precomputed Lookup Table:

    • The build_job_lookup method creates a hash (@job_lookup) with two sub-hashes:
      • :klass: Maps job.klass.to_s to the job object.
      • :name: Maps job.name.to_s to the job object.
    • The lookup table is built on-demand and ensures O(1) lookups.
  2. Regex-Based Fallback:

    • The logic in find_job uses a regex to determine whether to look up the job by name or klass, preserving the behavior of the original implementation.
  3. Automatic Rebuild:

    • The lookup table is rebuilt if it doesn’t exist (@job_lookup is nil).
    • To ensure consistency, the lookup table should be invalidated when jobs is modified (e.g., by setting @job_lookup = nil).
  4. Improved Performance:

    • Eliminates linear searches over the jobs array in find_job.
    • Reduces the time complexity of resolve_dependencies to O(d).

Performance Comparison

Workflow Size Old Time Complexity Old Execution Time New Time Complexity New Execution Time
1,000 Jobs O(n * d) ~1 second O(d) <0.1 second
10,000 Jobs O(n * d) ~10 seconds O(d) ~1 second
100,000 Jobs O(n * d) Couldn’t measure O(d) ~20 seconds

In testing, workflows with 100,000 jobs and dependencies were reduced from unmanageable execution times to approximately 20 seconds.


Testing

  1. Functional Tests:

    • Verify that find_job continues to resolve jobs correctly for both klass and name.
  2. Performance Tests:

    • Compare the execution time of resolve_dependencies before and after the optimization for large workflows.
  3. Edge Cases:

    • Test scenarios where:
      • A dependency matches klass (e.g., from: "JobA").
      • A dependency matches name (e.g., from: "JobA-123").
      • A dependency is invalid (e.g., from: "NonexistentJob").

Advantages

  1. Backward Compatibility:

    • The functionality remains identical to the original implementation.
    • No breaking changes are introduced.
  2. Consistent Performance:

    • find_job now consistently performs O(1) lookups, even when called multiple times.
  3. Scalability:

    • Handles workflows with 100,000+ jobs and dependencies efficiently.
  4. Simplified Logic:

    • Centralized lookup logic in find_job ensures all calls benefit from the optimization.

Conclusion

This optimization drastically improves the performance of resolve_dependencies for large workflows, while preserving the exact behavior of the original implementation. It should be a backward-compatible change with no impact on functionality, only improving execution time.

Would you be open to a PR implementing this change? I’d be happy to contribute! 🚀

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions