Feb 03, 2022
Troubleshooting¶
Error Retrieving or Displaying Information¶
Symptom¶
This covers any problem where the UI fails to retrieve information from DKube, or fails to display or update that information.
Cause/Action¶
This problem can occur when there is a temporary miscommunication between the UI and the DKube application, or when the network is not accessible. Refreshing the browser window will typically fix this issue.
Problems Accessing DKube URL¶
Symptom¶
The DKube URL is not accessible or is not behaving properly.
Cause/Action¶
The Chrome and Firefox browsers are supported. Please ensure that you are using one of these, and that it is up-to-date.
Notebook or Job Loading Slowly¶
Symptom¶
After submitting a Notebook or Run, it stays in the Queued, Waiting for GPUs, or Starting status for a long time.
Cause/Action¶
The first time a Run or Notebook is submitted with a new version of TensorFlow after installation, the image must be pulled, which takes additional time. The pull time depends on the network speed, but on average it takes about 2 minutes.
Subsequent submissions that use the same version of the framework will start more quickly.
Error Status for Notebook or Run¶
Symptom¶
After submission, a Notebook or Run goes into the Error state.
Cause/Action¶
The Run has identified an internal problem that prevents proper operation. The first step is to clone and re-submit the Run.
GPUs are Not Available on Master Node¶
Symptom¶
GPUs are installed on the Master Node, but are not showing up in DKube.
Cause/Action¶
The ability to schedule GPUs on the Master Node is an installation option. Check with the cluster administrator to determine if DKube was installed with this capability.
Cause/Action¶
The “DKube” Notebook is launched in the stopped state. You must start it in order to use it. Once it has been started, it will remain in the running state until stopped.
Jupyter Notebook Takes a Long Time to Load¶
Symptom¶
The JupyterLab Notebook does not load immediately after its icon is selected on the Notebook screen.
Cause/Action¶
This is expected behavior. It can take as long as a minute for JupyterLab to load.
404 Error Message from Notebook or Run Screen¶
Symptom¶
A 404 error message appears when:
Selecting the TensorBoard icon on an instance from the Training Run screen
Selecting the Notebook icon on an instance from the Notebook screen
Cause/Action¶
This can happen if either icon is selected too soon after starting the Notebook or Job. It can take several minutes before the selections are active. Wait for several minutes and retry the activity.
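The wait-and-retry advice above can be sketched as a small polling loop. This is only an illustration using the Python standard library: the function names, the timeout and interval values, and the URL are all hypothetical, it is not part of DKube, and it does not handle authentication.

```python
import time
import urllib.error
import urllib.request


def _http_status(url):
    """Return the HTTP status code for a GET of `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except urllib.error.URLError:
        # No answer at all (for example, the network is down).
        return None


def wait_until_available(url, timeout_s=300.0, interval_s=15.0,
                         fetch_status=_http_status, sleep=time.sleep):
    """Poll `url` until it answers with something other than 404.

    Returns True once a non-404 status is seen, or False if `timeout_s`
    seconds of waiting pass first. `fetch_status` and `sleep` can be
    replaced, which makes the loop easy to try without a real server.
    """
    waited = 0.0
    while True:
        status = fetch_status(url)
        if status is not None and status != 404:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

For example, `wait_until_available("https://dkube.example.invalid/notebook")` (a placeholder address) would check every 15 seconds for up to 5 minutes before giving up.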
Insufficient GPU Message Even Though the Pool is Large Enough¶
Symptom¶
An error message appears when a Run is submitted, showing that there are not enough GPUs available. The number of GPUs requested is lower than the number of GPUs in the Pool, so the Job should have been queued instead.
Cause/Action¶
If this happens in a clustered system, it means that no single node has enough GPUs to satisfy the submission. The Advanced option should be used when submitting the Run. This behavior is described in section Clustered Pools.
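The distinction between "fits in the Pool" and "fits on a single node" can be illustrated with a short sketch. The function name, its return strings, and the node names are hypothetical and chosen only for this example; the code does not call DKube and is not how DKube itself schedules Runs.

```python
def diagnose_gpu_request(requested, gpus_per_node):
    """Explain how a GPU request relates to a clustered Pool.

    `gpus_per_node` maps a node name to the number of GPUs that node
    contributes to the Pool (hypothetical input, not a DKube API).
    """
    pool_total = sum(gpus_per_node.values())
    largest_node = max(gpus_per_node.values(), default=0)
    if requested > pool_total:
        return "exceeds pool"
    if requested > largest_node:
        # The case described above: the Pool is large enough overall,
        # but no single node can satisfy the request on its own.
        return "fits pool but not a single node"
    return "fits a single node"


# Two nodes with 4 GPUs each: a request for 6 GPUs is below the
# Pool total of 8, yet larger than any one node can provide.
print(diagnose_gpu_request(6, {"node-a": 4, "node-b": 4}))
```

In this illustration, the request for 6 GPUs is the situation in which the insufficient GPU message appears even though the Pool holds 8 GPUs.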
GitHub Users Do Not Appear from On-Board Popup¶
Symptom¶
In the User On-Board Popup on the Operator screen, the user list for the organization does not appear.
Cause/Action¶
This can occur if the GitHub authorization credentials have been activated or changed. Refresh the User list as explained in section Add (On-Board) User.