Description
Environment:
- Python version [3.7.7]
- Spark version [3.0.0]
- TensorFlow version [2.3.0]
- TensorFlowOnSpark version [2.2.2]
- Cluster version [Standalone]
Describe the bug:
I have 2 issues regarding TensorBoard when training my model on 2 worker nodes (a minimal sketch of the relevant callback setup follows this list):
1- After the training process completes, the TensorBoard files on worker 1 are deleted immediately, while the files on worker 0 are kept, even though I can use TensorBoard to check details while training is still running.
2- I am trying to profile the model on the Profiler page to see the time consumed for batches 3 to 5, but I get 0 ms for communication time, specifically for Device Collective Communication and Device to Device Time. However, the Average Step Time shows reasonable values like 19368.9 ms. Also, the Hosts drop-down list shows only one detected host in the cluster, not 2. Why does this happen?
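For reference, here is a minimal sketch of the kind of callback configuration involved, assuming a Keras model and the standard tf.keras.callbacks.TensorBoard callback; the HDFS log path and the build_callbacks helper are placeholders of mine, not code from the actual train_file.py:

```python
# Minimal sketch (not the actual train_file.py): the HDFS path and the
# build_callbacks() helper are placeholders for illustration only.
import tensorflow as tf

def build_callbacks(worker_index):
    # Hypothetical shared location; writing event files to shared storage
    # keeps them around even if an executor's local temp dir is cleaned up
    # when the job ends.
    log_dir = "hdfs://namenode:9000/user/me/tensorboard/worker{}".format(worker_index)
    return [
        tf.keras.callbacks.TensorBoard(
            log_dir=log_dir,
            histogram_freq=1,
            profile_batch=(3, 5),  # profile batches 3 through 5 (TF >= 2.3)
        )
    ]
```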
Logs:
If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.
Spark Submit Command Line:
spark-submit --master spark://master:7077 train_file.py --cluster_size 2 --epochs 1
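For context, a rough sketch of how such a script is typically wired to TensorFlowOnSpark; only the TFCluster API itself comes from the library, while main_fun and the argument handling here are assumptions about what train_file.py looks like:

```python
# Sketch of a TensorFlowOnSpark launcher matching the spark-submit call above.
# main_fun and the argument defaults are illustrative assumptions.
import argparse
from pyspark.sql import SparkSession
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # ctx.worker_num identifies this executor (0, 1, ...), e.g. for per-worker
    # TensorBoard log directories; the actual training code is omitted here.
    pass

if __name__ == "__main__":
    spark = SparkSession.builder.appName("train_file").getOrCreate()
    sc = spark.sparkContext

    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster_size", type=int, default=2)
    parser.add_argument("--epochs", type=int, default=1)
    args = parser.parse_args()

    # Launch one TF node per executor; tensorboard=True starts TensorBoard on a worker.
    cluster = TFCluster.run(sc, main_fun, args, args.cluster_size,
                            num_ps=0, tensorboard=True,
                            input_mode=TFCluster.InputMode.TENSORFLOW,
                            master_node='chief')
    cluster.shutdown()
```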