Evaluator hangs while training #589

@jiqiujia

Environment:

  • Python version 3.7
  • Spark version 2.4
  • TensorFlow version 2.5
  • TensorFlowOnSpark version 2.2.3
  • Cluster version hadoop

Describe the bug:
The evaluator node stops working after some time, while the training nodes keep working fine and the cluster as a whole doesn't crash. Training runs for 80000 steps in total, but the evaluator only evaluates up to a little over 10000 steps. After that, it outputs no more logs.
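
One way to narrow down where the evaluator is stuck, as a minimal sketch only: Python's standard `faulthandler` module can periodically dump the stack of every thread in a process to a file. The helper names `enable_hang_dump` and `disable_hang_dump` and the 600-second interval are assumptions made for illustration; they are not part of TensorFlowOnSpark, and this has not been run on the cluster described above.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=600, log_file=None):
    """Dump the stack of every thread each `timeout_s` seconds.

    Hypothetical helper: call it once at the start of the evaluator's
    main function. `log_file` must be a real file with a file descriptor;
    it defaults to stderr so the dumps land in the executor log.
    """
    target = log_file if log_file is not None else sys.stderr
    # repeat=True keeps dumping at every interval, so a hang that begins
    # late in training (e.g. after step 10000) is still captured.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disable_hang_dump():
    """Stop the periodic dumps, e.g. once evaluation finishes normally."""
    faulthandler.cancel_dump_traceback_later()
```

If the dumps keep showing the same frame after the evaluator's logs stop, that frame is a likely candidate for where it is blocked.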