Feb 03, 2022

Troubleshooting

Error Retrieving or Displaying Information

Symptom

This describes any problem where the UI fails to retrieve information from DKube, or does not show or update the information.

Cause/Action

This problem can occur when there is a temporary miscommunication between the UI and the DKube application, or when the network is not accessible. Refreshing the browser window will typically fix this issue.

Problems Accessing the DKube URL

Symptom

The DKube URL is not accessible or is not behaving properly.

Cause/Action

DKube supports the Chrome and Firefox browsers. Ensure that you are using one of these and that it is up to date.


Notebook or Job Loading Slowly

Symptom

After submitting a Notebook or Run, it stays in the Queued, Waiting for GPUs, or Starting status for a long time.

Cause/Action

The first time that a Run or Notebook is submitted with a given version of TensorFlow after installation, the image for that version must be pulled, which takes additional time. The pull time depends on the network speed, but on average it takes about 2 minutes.

Subsequent submissions that use the same version of the framework will start more quickly.


Error Status for Notebook or Run

Symptom

After submission, a Notebook or Run goes into the Error state.

Cause/Action

The Run has identified an internal problem that prevents proper operation. The first step is to clone and resubmit the Run.


GPUs are Not Available on Master Node

Symptom

GPUs are installed on the Master Node, but are not showing up in DKube.

Cause/Action

The ability to schedule GPUs on the Master Node is an installation option. Check with the cluster administrator to determine if DKube was installed with this capability.

Cause/Action

The “DKube” Notebook is launched in the stopped state. You must start it in order to use it. Once it has been started, it will remain in the running state until stopped.


Jupyter Notebook Takes a Long Time to Load

Symptom

A JupyterLab Notebook does not load immediately after its icon is selected on the Notebook screen.

Cause/Action

This is expected behavior. It can take as long as a minute for JupyterLab to load.


404 Error Message from Notebook or Run Screen

Symptom

A 404 error message appears when:

  • Selecting the TensorBoard icon on an instance from the Training Run screen

  • Selecting the Notebook icon on an instance from the Notebook screen

Cause/Action

This can happen if either icon is selected too soon after starting the Notebook or Job. It can take several minutes before the selections are active. Wait several minutes and try again.
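The wait-and-retry pattern described above can be sketched in Python. This is only an illustration, not a DKube API: `wait_until_ready`, `get_status`, and the default attempt count and delay are hypothetical names and values chosen for the example. `get_status` stands for any function you supply that returns the HTTP status code of the page being opened.

```python
import time


def wait_until_ready(get_status, attempts=10, delay_seconds=30, sleep=time.sleep):
    """Call get_status() until it returns something other than 404.

    get_status is a caller-supplied function (hypothetical here) that
    returns an HTTP status code. Returns the first non-404 status, or
    404 if every attempt still reports that the page is not found.
    """
    for attempt in range(attempts):
        status = get_status()
        if status != 404:
            return status
        # Pause before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return 404


# Simulated service that becomes available on the third check.
responses = iter([404, 404, 200])
result = wait_until_ready(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

In the simulated run, the delay is replaced by a no-op so that the example finishes immediately. In practice the default `time.sleep` would be used.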


Insufficient GPU Message Even Though the Pool is Large Enough

Symptom

An error message appears when a Run is submitted, showing that there are not enough GPUs available. The number of GPUs requested is lower than the number of GPUs in the Pool, so the Job should have been queued instead.

Cause/Action

If this happens in a clustered system, it means that no single node has enough GPUs to satisfy the submission. Use the Advanced option when submitting the Run. This behavior is described in the section Clustered Pools.
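The difference between the Pool total and the per-node limit can be illustrated with a short Python sketch. It does not represent DKube's scheduler. The function names and the three-node Pool are assumptions made up for the example.

```python
def fits_in_pool(gpus_per_node, requested):
    """True if the Pool as a whole has at least the requested number of GPUs."""
    return sum(gpus_per_node) >= requested


def fits_on_single_node(gpus_per_node, requested):
    """True if at least one node alone has the requested number of GPUs."""
    return any(count >= requested for count in gpus_per_node)


# Hypothetical clustered Pool: three nodes with 2 GPUs each (6 GPUs in total).
pool = [2, 2, 2]
requested = 4

print(fits_in_pool(pool, requested))         # True: 6 GPUs in the Pool
print(fits_on_single_node(pool, requested))  # False: no node has 4 GPUs
```

In this example the request is smaller than the Pool, yet no single node can satisfy it, which is the situation that produces the insufficient GPU message.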


GitHub Users Do Not Appear from On-Board Popup

Symptom

In the User On-Board Popup on the Operator screen, the user list for the organization does not appear.

Cause/Action

This can occur if the GitHub authorization credentials have been activated or changed. Refresh the User list as explained in the section Add (On-Board) User.


Data Science View Selection is Not Available after an Upgrade

Symptom

After upgrading DKube, only the Operator view is available. The normal Data Science toggle selection is not shown.

Cause/Action

Log out and back in again. The normal view selection will then be visible.