Training an ML Model in DKubeX / Running Ray Jobs


You can train ML models in DKubeX by running Ray jobs on Ray clusters that you have created. Follow the steps below to create a Ray job in DKubeX.

Prerequisites

  • You need a Ray cluster that has been created and is in the Ready state. To learn how to create a Ray cluster in DKubeX, refer to Creating a Ray Cluster in DKubeX.

  • For this tutorial, the training script must be available in your DKubeX workspace. To download it, use the following steps:

    1. On the My Workspace page of your DKubeX setup, click on the Terminal app to open the Terminal CLI.

    2. In the Terminal CLI, run the following command to download the dkubex-examples repo containing the example files for DKubeX to be used in this user guide:

      git clone https://github.com/dkubeio/dkubex-examples.git
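The startup command used later in this tutorial refers to the training script by absolute path, so it can help to confirm where the clone placed it. The following is a minimal Python sketch, assuming the repository was cloned into your home directory under /home/<your-username> (the layout the startup command below relies on). The helper names `training_script_path` and `script_is_present` are hypothetical and not part of DKubeX.

```python
from pathlib import Path, PurePosixPath


def training_script_path(username: str) -> PurePosixPath:
    """Return the path where this tutorial expects the training script.

    Assumption: the repo was cloned into /home/<username>. Adjust the
    base directory if you cloned it somewhere else.
    """
    return (
        PurePosixPath("/home")
        / username
        / "dkubex-examples"
        / "ray"
        / "torch_fashion_mnist_example.py"
    )


def script_is_present(username: str) -> bool:
    """Check on the local filesystem whether the script file exists."""
    return Path(training_script_path(username)).is_file()
```

Running `script_is_present("<your-username>")` from a Python session in the workspace terminal should return True once the clone has finished, provided the assumption about the clone location holds.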
      

Once both prerequisites are met, you can train an ML model using the following steps:

Creating a Ray Job

  • Open the Jobs page on your DKubeX workspace. This page lists all the Ray jobs that are currently running or have been executed previously.

  • To create a new Ray job, click the “Create Job” button (shown as a “+” button in the top-left corner of the Jobs page).

    Create Job Button (+)


  • On the General page, provide the following details:

    General Page – Job Creation


    • In the General section, select Ray as the type of job to be launched and provide the following details:

      Field | Description
      Name | Provide a unique name for the Ray job. For this tutorial, provide mnistrayjob.
      Cluster | Select the Ray cluster on which this job will run. For this tutorial, select the mnistraycluster Ray cluster created in the Creating a Ray Cluster in DKubeX tutorial.

    • Once done, click the Next button to proceed to the Configuration page.

  • On the Configuration page, provide the following details:

    Configuration Page – Job Creation


    • In the Setup Creation section, provide the following details:

      Field | Description
      Startup Commands | Provide the command that runs the training script. For this tutorial, provide the following command, replacing the <your-username> placeholder with your actual DKubeX username: python /home/<your-username>/dkubex-examples/ray/torch_fashion_mnist_example.py
      Pip Packages | (Optional) Provide the pip packages required to run the training script. Separate the package names with commas (,) and do not include any spaces. To pin the version of a particular package, use ==. For this tutorial, provide the following packages: tensorflow,torch,torchvision

    • Optionally, in the Environment Variables section, provide the following details:

      Field | Description
      Key | Provide the key/name of the environment variable.
      Value | Provide the value for the environment variable.

      You can add multiple environment variables by clicking on the + Add button.

    • Optionally, in the Entrypoint section, provide the following details:

      Field | Description
      CPUs | Number of CPUs to be allocated to the Ray job (Optional).
      Memory | Amount of memory (in GB) to be allocated to the Ray job (Optional).
      GPUs | Number of GPUs to be allocated to the Ray job (Optional).
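The Pip Packages field expects a comma-separated list with no spaces, which is easy to get wrong when copying from a requirements file. The sketch below shows one way to build that string and the startup command before pasting them into the form. The helper names are hypothetical, and the validation pattern is a deliberate simplification that only accepts `name` or `name==version`, not the full range of pip requirement syntax.

```python
import re

# Loose pattern for "name" or "name==version". This is an illustrative
# simplification, not the complete pip requirement grammar.
_PACKAGE_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(==[A-Za-z0-9.*+!_-]+)?$")


def format_pip_packages(packages: list[str]) -> str:
    """Join package specs with commas and no spaces, as the form expects."""
    cleaned = []
    for raw in packages:
        spec = raw.replace(" ", "")  # drop any stray spaces
        if not _PACKAGE_RE.match(spec):
            raise ValueError(f"Unsupported package spec: {raw!r}")
        cleaned.append(spec)
    return ",".join(cleaned)


def startup_command(username: str) -> str:
    """Build the tutorial's startup command for a given DKubeX username."""
    return (
        f"python /home/{username}/dkubex-examples/ray/"
        "torch_fashion_mnist_example.py"
    )
```

For example, `format_pip_packages(["tensorflow", "torch", "torchvision"])` returns `tensorflow,torch,torchvision`, the value used in this tutorial.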

  • Once done, click the Submit button to launch the training job on the selected Ray cluster.

    Submitted Ray Job


    • To see the details of the submitted Ray job, click on the job name (mnistrayjob in this tutorial) from the Jobs page.

    • To access the details and logs of the job from the Ray dashboard, click on the Ray Dashboard button on the right side of that job entry.

      Ray Dashboard for Ray Job


  • Once the job run is finished, the status of the job will change to Succeeded on the Jobs page.

    Succeeded Ray Job

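For readers who want to see how the form fields relate to Ray itself, the sketch below expresses the same job with Ray's Python Job Submission SDK. It is an illustration under assumptions, not a description of what DKubeX does internally. It assumes that the form fields map onto `runtime_env` and the `entrypoint_*` arguments of `JobSubmissionClient.submit_job`, that the Ray version is recent enough to accept `entrypoint_memory`, and that "GB" can be treated as GiB. The dashboard address is a placeholder you would have to supply, and the helper names are hypothetical.

```python
from typing import Any, Optional

GIB = 1024 ** 3  # bytes per GiB; treating the form's "GB" as GiB is an assumption


def build_submit_kwargs(
    entrypoint: str,
    pip_packages: list[str],
    env_vars: Optional[dict[str, str]] = None,
    cpus: Optional[float] = None,
    gpus: Optional[float] = None,
    memory_gb: Optional[float] = None,
) -> dict[str, Any]:
    """Translate the job form's fields into submit_job keyword arguments."""
    runtime_env: dict[str, Any] = {"pip": list(pip_packages)}
    if env_vars:
        runtime_env["env_vars"] = dict(env_vars)
    kwargs: dict[str, Any] = {"entrypoint": entrypoint, "runtime_env": runtime_env}
    # Optional resource fields are only passed when the user set them.
    if cpus is not None:
        kwargs["entrypoint_num_cpus"] = cpus
    if gpus is not None:
        kwargs["entrypoint_num_gpus"] = gpus
    if memory_gb is not None:
        kwargs["entrypoint_memory"] = int(memory_gb * GIB)  # Ray expects bytes
    return kwargs


def submit_and_wait(address: str, kwargs: dict[str, Any], poll_seconds: float = 5.0) -> str:
    """Submit the job and poll until it reaches a terminal status."""
    import time

    # Imported lazily so build_submit_kwargs works without Ray installed.
    from ray.job_submission import JobStatus, JobSubmissionClient

    client = JobSubmissionClient(address)  # address of the Ray dashboard
    submission_id = client.submit_job(**kwargs)
    terminal = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while True:
        status = client.get_job_status(submission_id)
        if status in terminal:
            return status.value
        time.sleep(poll_seconds)
```

With this tutorial's values, the call would look like `build_submit_kwargs("python /home/<your-username>/dkubex-examples/ray/torch_fashion_mnist_example.py", ["tensorflow", "torch", "torchvision"])`, and the result would be passed to `submit_and_wait` together with the dashboard address of your Ray cluster.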

Registering the Trained ML Model on MLFlow

Once the training job has finished successfully, you can register the trained ML model on MLFlow on DKubeX. This is required to use the trained model for deployment and inferencing. The steps to register the trained model on MLFlow are provided below:

  • Click on the MLFlow button on the right side of the job entry on the Jobs page. This will open the MLFlow experiment record page of the training run.

    Ray Job Experiment Page on MLFlow


  • In the top-right corner of the experiment page, click the Register Model button to register the trained model. The Register Model dialog box will open.

    Register Model Dialog Box


  • From the Model dropdown, select the + Create New Model option and provide a unique name for the model to be registered. For this tutorial, provide mnistmodel as the model name. Once done, click the Register button. The trained model will be registered on MLFlow as version 1, and the model page with the version details will open on MLFlow.

    Registered Model Page on MLFlow

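The registration steps above can also be expressed with the MLflow Python API, which may be useful for scripting. This is a hedged sketch rather than the DKubeX-recommended method. It assumes the MLflow tracking URI is already configured in your environment, that you have copied the run ID from the experiment page, and that the training run logged its model under the artifact path `model`. That path and the helper names are assumptions, so check the run's Artifacts view for the actual path.

```python
def model_uri_for_run(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI pointing at a model logged by a given run.

    The default artifact path "model" is an assumption for illustration.
    """
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


def register_trained_model(
    run_id: str, name: str = "mnistmodel", artifact_path: str = "model"
):
    """Register the run's model under the given name and return the version."""
    # Imported lazily so the URI helper works without MLflow installed.
    import mlflow

    return mlflow.register_model(
        model_uri=model_uri_for_run(run_id, artifact_path),
        name=name,
    )
```

Calling `register_trained_model("<run-id>")` creates the registered model if the name does not exist yet and adds a new version to it, which corresponds to choosing + Create New Model in the dialog box.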

Your ML model is now trained on DKubeX and registered on MLFlow. You can now deploy the trained model for inferencing by following the steps in the Deploying and Inferencing a ML Model in DKubeX tutorial.