TensorBoard files get deleted, Profiler returns 0 ms for communication time! #550

@orwa-te

Environment:

  • Python version [3.7.7]
  • Spark version [3.0.0]
  • TensorFlow version [2.3.0]
  • TensorFlowOnSpark version [2.2.2]
  • Cluster version [Standalone]

Describe the bug:
I have two TensorBoard issues when training my model on 2 worker nodes:

1- As soon as training completes, the TensorBoard files are deleted on worker 1, while they are kept on worker 0. While training is still running, I can use TensorBoard on them without problems.
2- I am profiling batches 3 to 5 of the training run and checking the time breakdown on the Profiler page, but the communication times, specifically Device Collective Communication and Device to Device Time, are reported as 0 ms. The Average Step Time, however, shows plausible values such as 19368.9 ms.
From the Hosts drop-down list I can see that only one host is detected in the cluster, not 2. Why does this happen?
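
For context, a profiling window such as batches 3 to 5 is usually requested through the `profile_batch` argument of `tf.keras.callbacks.TensorBoard`. The sketch below is an assumption about how such a setup could look, not the actual contents of `train_file.py`: the helper names and the `/tmp/tb_logs` path are hypothetical, and it assumes a TensorFlow 2.x release that accepts a (start, stop) pair for `profile_batch`.

```python
def tensorboard_kwargs(log_dir, first_batch=3, last_batch=5):
    """Build keyword arguments for tf.keras.callbacks.TensorBoard.

    Hypothetical helper: it only assembles the arguments, so it can be
    inspected without TensorFlow installed.
    """
    if not 0 < first_batch <= last_batch:
        raise ValueError("expected 0 < first_batch <= last_batch")
    return {
        "log_dir": log_dir,
        # Profile the batches from first_batch to last_batch.
        "profile_batch": (first_batch, last_batch),
    }


def make_tensorboard_callback(log_dir):
    """Create the Keras callback; TensorFlow is imported only when called."""
    import tensorflow as tf  # lazy import so the helper above stays standalone

    return tf.keras.callbacks.TensorBoard(**tensorboard_kwargs(log_dir))


if __name__ == "__main__":
    # Hypothetical log directory, used only for illustration.
    print(tensorboard_kwargs("/tmp/tb_logs"))
```

The callback returned by `make_tensorboard_callback` would then be passed to `model.fit(..., callbacks=[...])` on each worker. Each worker writes its own profile data, which may be relevant when only one host appears in the Profiler.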

image

Logs:
If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.

Spark Submit Command Line:
spark-submit --master spark://master:7077 train_file.py --cluster_size 2 --epochs 1
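
The two script flags in the command above could be read with `argparse`, as sketched here. This is an assumption for illustration only: the real `train_file.py` is not shown in this report, and the defaults and help strings are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that appear on the spark-submit command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster_size", type=int, default=1,
                        help="number of nodes in the cluster")
    parser.add_argument("--epochs", type=int, default=1,
                        help="number of training epochs")
    return parser.parse_args(argv)


# Same values as in the reported command line.
args = parse_args(["--cluster_size", "2", "--epochs", "1"])
print(args.cluster_size, args.epochs)
```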
