Installing DKubeX on an RKE2 Cluster on Azure

Prerequisites

  • The minimum hardware requirements for installing DKubeX on an RKE2 cluster on Azure are as follows:

    Hardware Requirement                             Version/Details
    CPU                                              12-16 cores
    RAM                                              128GB
    Disk                                             512GB
    NFS (if an external NFS server is being used)    1TB, v4.0/4.1

    Note

    If you are using an Azure NFS server, the NFS version will be v4.0.

  • The minimum software requirements for installing DKubeX on an RKE2 cluster on Azure are as follows:

    Software Requirement    Version/Details
    RKE2 with Rancher       v2.7.6
    K8S                     v1.26 or higher
    OS Version              Ubuntu 22.04
    Network Provider        canal

  • You need to open the following ports to successfully install and access Rancher and DKubeX.

    Port           Description
    6443           Kubernetes API server
    22             SSH
    32443          DKubeX UI port
    30000-32767    NodePort range

  • You need to install helm & kubectl on your CPU (installer) node.

    • To install helm, run the commands on the CPU node terminal:

      sudo apt install curl -y
      curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
      chmod 700 get_helm.sh
      ./get_helm.sh
      helm version
      
    • To install kubectl, run the commands on the CPU node terminal:

      • Download the latest kubectl release with the command:

        curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
        
      • To validate the binary, use the following commands:

        curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
        echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
        

        If the validation is successful, it should show the following:

        kubectl: OK
        

        If the check fails, sha256sum exits with a nonzero status and prints output similar to:

        kubectl: FAILED
        sha256sum: WARNING: 1 computed checksum did NOT match
        
      • Install kubectl using the following command:

        sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
        
      • Test to ensure the version you installed is up-to-date:

        kubectl version --client
        

      Important

      For more information regarding the installation of kubectl, refer to Install and Set Up kubectl on Linux. For more information regarding the installation of helm, refer to Installing Helm.
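
The hardware minimums and tool requirements listed in this section can be checked on a node before installation. The following is a minimal sketch, not part of the DKubeX tooling: the function names (meets_minimum, report_line, has_tool, preflight_report) are hypothetical, the thresholds are copied from the tables above (with 12 taken as the lower bound of the 12-16 core range), and it assumes a Linux node where nproc, /proc/meminfo, and df are available. It only prints a report and changes nothing on the node.

```shell
#!/bin/sh
# Preflight sketch: compare this node against the documented minimums.
# All function names are hypothetical; thresholds come from the tables above.

# meets_minimum ACTUAL REQUIRED -> status 0 if ACTUAL >= REQUIRED
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# report_line LABEL ACTUAL REQUIRED UNIT -> prints one OK/WARN line
report_line() {
    if meets_minimum "$2" "$3"; then
        echo "OK    $1: $2 $4 (minimum $3 $4)"
    else
        echo "WARN  $1: $2 $4 (minimum $3 $4)"
    fi
}

# has_tool NAME -> prints whether NAME is on PATH
has_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "OK    $1 found"
    else
        echo "WARN  $1 not found on PATH"
    fi
}

preflight_report() {
    cores=$(nproc)
    # MemTotal is reported in kB and excludes memory reserved by the kernel,
    # so a nominal 128GB node can report slightly less than 128.
    ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    # Total size of the root filesystem in GB (df -Pk prints 1024-byte blocks).
    disk_gb=$(df -Pk / | awk 'NR == 2 {print int($2 / 1024 / 1024)}')

    report_line "CPU cores" "$cores" 12 "cores"
    report_line "RAM" "$ram_gb" 128 "GB"
    report_line "Disk (/)" "$disk_gb" 512 "GB"
    has_tool helm
    has_tool kubectl
}

preflight_report
```

Because the memory figure is usually a little below the nominal size, a WARN on the RAM line is a prompt to look at the number, not proof that the node is undersized.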

Creating an RKE2 Cluster

Important

For detailed information regarding installing RKE2, refer to Rancher Kubernetes: A Quick Installation Guide for RKE2.

To create an RKE2 cluster on your installer node, use the following steps:

  • Log into the root environment by running the following command:

    sudo su
    
  • Run the installer service for the RKE2 server by running the following command.

    curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.12+rke2r1 sh -
    

    Note

    You can provide any other version of RKE2 (higher than v1.26.12) by changing the version in the command.

  • Enable and start the RKE2 server service by running the following command.

    systemctl enable rke2-server.service
    systemctl start rke2-server.service
    

    Note

    You can check the RKE2 server logs by running the following command in another terminal on the installer node.

    journalctl -u rke2-server -f
    
  • Enable kubectl by running the following commands. This copies the kubeconfig file generated by RKE2 to the default kubectl location (~/.kube/config).

    mkdir -p ~/.kube
    cp /etc/rancher/rke2/rke2.yaml ~/.kube/config
    
  • Get the server (installer) node IP and token by running the following commands. These will be necessary later if you are going to add a new worker (agent) node to the RKE2 cluster.

    echo "Master Node IP: $(hostname -I | awk '{print $1}')"
    echo "Node Token: $(cat /var/lib/rancher/rke2/server/node-token)"
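
After the RKE2 server service is started, the Kubernetes API may take some time to respond, so the first kubectl call can fail even though nothing is wrong. The following is a minimal sketch of a polling helper for that situation. The name wait_for and its argument order are hypothetical, and the attempt count and delay in the example are arbitrary values, not figures from the RKE2 or DKubeX documentation.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# The body uses ( ) so the helper's variables stay local to a subshell.
wait_for() (
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            exit 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "gave up after $attempts attempt(s)" >&2
    exit 1
)

# Example, commented out so that the snippet itself has no side effects:
# wait_for 60 5 kubectl get nodes
```

The helper only reports whether the command eventually succeeded. To see why the server is slow to come up, the journalctl command shown earlier in this section is still the place to look.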
    

Adding a worker node to the RKE2 cluster

To add a worker (agent) node to the RKE2 cluster, use the steps provided below.

  • On the worker node terminal, log into the root environment by running the following command:

    sudo su
    
  • Run the installer service for the RKE2 agent by running the following command.

    curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.12+rke2r1 INSTALL_RKE2_TYPE="agent" sh -
    

    Note

    You can provide any other version of RKE2 (higher than v1.26.12) by changing the version in the command. Make sure to use the same version as the RKE2 server.

  • Enable the RKE2 agent service by running the following command.

    systemctl enable rke2-agent.service
    
  • Export the installer node details that you got earlier by running the following commands. Replace <master node ip> with the installer node IP and <master node token> with the node token of the installer node.

    export MASTER_NODE_SERVER_IP="<master node ip>"
    export NODE_TOKEN="<master node token>"
    
  • Create and configure the agent by running the following commands:

    mkdir -p /etc/rancher/rke2/
    cat << EOF > /etc/rancher/rke2/config.yaml
    server: https://${MASTER_NODE_SERVER_IP}:9345
    token: ${NODE_TOKEN}
    EOF
    
  • Add the worker node to the RKE2 cluster by starting the agent service with the following command:

    systemctl start rke2-agent.service
    

    Note

    • You can check the RKE2 server logs by running the following command in another terminal on the installer node.

      journalctl -u rke2-server -f
      
    • You can check the RKE2 agent logs by running the following command in the worker node terminal.

      journalctl -u rke2-agent -f
      
  • Check the node status by running the following command. Once the node status is Ready, the worker node is ready to use.

    kubectl get nodes
    
  • At this point the worker node has no role assigned to it. To assign it the worker role, add a Kubernetes label to the node by running the following command. Replace <worker node name> with the name of the worker node, as shown in the output of the previous command.

    kubectl label nodes <worker node name> node-role.kubernetes.io/worker=worker
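
The agent configuration step above silently produces an unusable config.yaml if either exported variable is empty. The following is a minimal sketch of a helper that writes the same two keys and refuses to continue when a value is missing. The function name write_agent_config is hypothetical. The port 9345 and the key names are taken from the step above. Restricting the file permissions is an added precaution, not something the original steps require.

```shell
#!/bin/sh
# write_agent_config CONFIG_DIR SERVER_IP TOKEN
# Writes CONFIG_DIR/config.yaml with the server URL and token.
# The body uses ( ) so the helper's variables stay local to a subshell.
write_agent_config() (
    config_dir=$1
    server_ip=$2
    token=$3
    if [ -z "$server_ip" ] || [ -z "$token" ]; then
        echo "server IP and token must both be set" >&2
        exit 1
    fi
    mkdir -p "$config_dir" || exit 1
    # 9345 is the port used for the server URL in the steps above.
    printf 'server: https://%s:9345\ntoken: %s\n' "$server_ip" "$token" \
        > "$config_dir/config.yaml" || exit 1
    # The token is a credential, so restrict the file to its owner.
    chmod 600 "$config_dir/config.yaml"
)

# Example (needs root), commented out so the snippet has no side effects:
# write_agent_config /etc/rancher/rke2 "$MASTER_NODE_SERVER_IP" "$NODE_TOKEN"
```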
    

Attention

The following step should be done at this point only if you are adding GPU nodes to the RKE2 cluster after installing DKubeX on it. If you are installing DKubeX on an RKE2 cluster without GPU nodes, you can skip this step and proceed with the remaining prerequisites.

  • For RKE2, the gpu-operator must be installed explicitly on the CPU node using the commands below. Wait until the GPU operator reaches the Running state.

    Note

    This is based on the references given under the Rancher Kubernetes Engine 2 section of the following page: Getting Started- NVIDIA GPU Operator 23.9.0 documentation.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    
    helm install gpu-operator --wait -n gpu-operator \
    --create-namespace nvidia/gpu-operator \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true
    
  • Before proceeding with DKubeX installation on the cluster, make sure the following steps are done on ALL the nodes of the RKE2 cluster.

Note

Also follow these steps if you are adding new worker nodes after DKubeX installation.

  • Install the nfs-common package to avoid NFS mount failures during DKubeX installation.

    sudo apt install nfs-common -y
    

    Note

    This step follows the discussion in the Rancher GitHub issue referenced below.

    rancher/rancher- failed to mount volume backed by NFS in imported K3s cluster #25169

  • To avoid running out of file descriptors and inotify watches (which can surface as "no space left on device" errors), follow the steps provided below and execute them on ALL the nodes.

    • Open the sysctl.conf file using the following command:

      sudo vim /etc/sysctl.conf
      
    • Update the file-descriptor settings by adding the following four lines to the end of /etc/sysctl.conf, and save the file:

      fs.file-max = 2097152
      fs.inotify.max_user_instances=2097152
      fs.inotify.max_user_watches=1048576
      user.max_user_namespaces=286334455
      
    • Load these new settings by executing the command:

      sudo sysctl -p
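
Because the sysctl step has to be repeated on every node, and again whenever a node is added, it can help to apply it from a script that is safe to re-run. The following is a minimal sketch under that assumption. The function names ensure_sysctl_setting and apply_dkubex_sysctl_settings are hypothetical. The keys and values are the four lines listed above. The helper removes any existing uncommented assignment of a key before appending the new one, so repeated runs do not create duplicates.

```shell
#!/bin/sh
# ensure_sysctl_setting FILE KEY VALUE
# Makes FILE contain exactly one "KEY = VALUE" assignment for KEY.
# The body uses ( ) so the helper's variables stay local to a subshell.
ensure_sysctl_setting() (
    file=$1
    key=$2
    value=$3
    touch "$file" || exit 1
    tmp=$(mktemp) || exit 1
    # Copy every line except existing assignments of KEY. Commented-out
    # lines are kept, because their name starts with "#" and never matches.
    awk -v key="$key" '
        {
            split($0, parts, "=")
            name = parts[1]
            gsub(/[ \t]/, "", name)
            if (name != key) print $0
        }
    ' "$file" > "$tmp"
    echo "$key = $value" >> "$tmp"
    # Overwrite in place so the original file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)

# Applies the four settings listed in the step above to FILE.
apply_dkubex_sysctl_settings() {
    ensure_sysctl_setting "$1" fs.file-max 2097152 &&
    ensure_sysctl_setting "$1" fs.inotify.max_user_instances 2097152 &&
    ensure_sysctl_setting "$1" fs.inotify.max_user_watches 1048576 &&
    ensure_sysctl_setting "$1" user.max_user_namespaces 286334455
}

# Example (needs root), commented out so the snippet has no side effects:
# apply_dkubex_sysctl_settings /etc/sysctl.conf && sysctl -p
```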
      

Installing DKubeX

  • Add the dkubex-helm repository on your system.

    helm repo add dkubex-helm https://oneconvergence.github.io/dkubex-helm --insecure-skip-tls-verify
    
  • Update the helm repository.

    helm repo update
    
  • From the Helm repository, get the values.yaml file on your local system by using the following command.

    helm show values dkubex-helm/dkubex > values.yaml
    
    • You need to provide details regarding the version of DKubeX you are going to install and its components in the format provided in the values.yaml file. Open the editor by using the following command:

      vim values.yaml
      
    • You need to fill the following fields (if required) on the values.yaml file:

      Field

      Description

      image_tag

      The version of DKubeX you are going to install. For example, v0.8.6.3.

      admin_password

      The password of the admin user of DKubeX.

      registry -> name

      The name of the docker registry repository you are going to use for DKubeX installation.

      registry -> username

      The username for the docker registry container you are going to use for DKubeX installation.

      registry -> password

      The password for the docker registry container you are going to use for DKubeX installation.

      storage_type

      The type of storage you are going to use for DKubeX. For example, nfs.

      nfs -> create_server

      To create a new local NFS server, set to true. If you are using an external NFS server, set to false.

      nfs -> storage_path

      The host path for creating NFS server. For external NFS server, provide the mount path as mentioned in the last step in Installation: Creating an NFS Server on Azure Console.

      nfs -> server

      The IP address of the NFS server. For external NFS server, provide the NFS share mount URL as mentioned in the last step in Installation: Creating an NFS Server on Azure Console.

      nfs -> version

      The version of NFS you are going to use. Currently supported versions are 4.1 and 4.0. If you are using an external NFS server on Azure, provide 4.0.

      hostpath -> storage_path

      The host path for creating local NFS server. (For Mac, use this instead of nfs)

      user_home_nfs_server_version

      NFS version for user home. Currently supported versions are 4.1 and 4.0. If you are using an external NFS server on Azure, provide 4.0.

      wipe_data

      To delete data from the storage, set to true, else false.

      gpu_operator -> enabled

      Set to false, because the gpu-operator was installed explicitly in an earlier step.

      kubeflow -> enabled

      (Optional) To install Kubeflow for DKubeX, set to true, else false.

      auth -> enabled

      To enable OAuth for DKubeX, set to true, else false. You can also set OAuth post-installation. For more information, refer to Setting up Authentication.

      auth -> provider

      The OAuth provider you are going to use. Currently 6 OAuth providers are supported: ADFS, Azure, GitHub, Google, Keycloak and Okta.

      auth -> issuer_url

      The issuer URL of the OAuth provider you are going to use. (Not required for GitHub OAuth App)

      auth -> client_id

      The client ID of the OAuth application you are going to use.

      auth -> client_secret

      The client secret of the OAuth application you are going to use.

      auth -> redirect_url

      The callback URL of the OAuth application you are going to use.

      auth -> organization

      The name of the organization whose users are allowed to authenticate.

      auth -> email_domain

      The email domain of the users who are allowed to be authenticated.

      auth -> azure_tenant

      (Only for Azure) The tenant ID of the Azure OAuth application you are going to use.

      auth -> realm

      (Only for Keycloak) The realm name of the Keycloak OAuth application you are going to use.

      auth -> allowed_role

      (Only for Keycloak) The role name of the Keycloak OAuth application you are going to use.

      auth -> allowed_group

      (Only for Keycloak) The group name of the Keycloak OAuth application you are going to use.

      mlflow

      (Optional) Provide the details regarding the MLflow server you are going to use.

      flyte -> enabled

      (Optional) To install Flyte for DKubeX, set to true, else false. If set to true, you need to provide the details regarding the Flyte account you are going to use.

      node_selector_label

      The custom kubernetes label key that will be added to the worker nodes of the RKE2 cluster. Provide a custom label key and label your GPU nodes using this. Example: node.kubernetes.io/nodetype

  • You need to add Kubernetes labels to all the worker nodes in the RKE2 cluster on which you are going to install DKubeX. You can do this by following the steps below. In the label command, replace <node-name> with the name of the node you are labeling, <key> with the key you set as node_selector_label in values.yaml, and <value> with the node type.

    Note

    Use the following steps also if you are adding new worker nodes post DKubeX installation.

    • List all the nodes in the cluster by running the following command on your terminal. Check for the node role column which will tell which nodes have only worker roles.

      kubectl get nodes
      
    • Label each worker node with its type by running the following command on your terminal, using the node_selector_label key set in values.yaml earlier.

      kubectl label node <node-name> <key>=<value>
      

    Note

    The value of the node_selector_label can be used as an input to -t or --type while creating a ray cluster/deploying/finetuning.

  • Run the Helm installation of DKubeX on your setup by using the following command. Replace <release-name> with the name you want to give the Helm release, for example the version of DKubeX you are going to install.

    helm install -f values.yaml <release-name> dkubex-helm/dkubex --timeout 1500s
    
  • You can follow the installation logs by running the following command on your terminal.

    kubectl logs -l job-name=dkubex-installer --follow --tail=-1
    
  • You can access your DKubeX setup by going to the following URL on your browser. Replace the <node-ip> part with the IP address of the node on which you have installed DKubeX.

    URL                        Example
    https://<node-ip>:32443    https://123.45.67.89:32443

  • If you have not added OAuth configuration during installation, going to the previous URL opens the setup with a default user workspace.
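
The node-labeling step in this section has to be repeated for every worker node, so it can be convenient to generate the label commands from the node listing and review them before running anything. The following is a minimal sketch. The function names worker_node_names and label_commands are hypothetical, and it assumes the default column layout of kubectl get nodes (NAME, STATUS, ROLES, AGE, VERSION) and that worker nodes show either worker or <none> in the ROLES column. It only prints commands and does not modify the cluster.

```shell
#!/bin/sh
# worker_node_names: reads `kubectl get nodes` output on stdin and prints
# the NAME of every node whose ROLES column is exactly "worker" or "<none>".
worker_node_names() {
    awk 'NR > 1 && ($3 == "worker" || $3 == "<none>") {print $1}'
}

# label_commands KEY VALUE: reads node names on stdin and prints one
# `kubectl label node` command per name, without running anything.
label_commands() {
    while read -r node; do
        if [ -n "$node" ]; then
            echo "kubectl label node $node $1=$2"
        fi
    done
}

# Example, commented out so that the snippet itself has no side effects.
# The key is the example from the values.yaml table; <value> is the node type.
# kubectl get nodes | worker_node_names | label_commands node.kubernetes.io/nodetype <value>
```

Once the printed commands look right, they can be run by hand or piped to sh. Keeping the review step separate makes it harder to label a control-plane node by accident.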

Setting up Authentication

Note

This is an optional step provided only in case you have set auth -> enabled to false in the values.yaml file during installation. You can skip this step if you don’t want to set up authentication for your DKubeX setup.

If you have not set up the authentication for your DKubeX setup during installation, you can do it on the DKubeX Admin page by following the steps provided in the following page: Auth.

Note

For more information regarding the admin page, refer to Admin Guide.

  • You need to have a pre-created OAuth application.

    Note

    Currently, DKubeX supports OAuth applications from the following providers: ADFS, Azure, GitHub, Google, Keycloak and Okta.

  • Open the admin page of your DKubeX setup by going to the following URL on your browser. Replace the <node-ip> part with the IP address of the node on which you have installed DKubeX.

    https://<node-ip>:32443/admin
    

Upgrading DKubeX

To upgrade the setup, use the following steps:

  • Update the Helm repository.

    helm repo update
    
  • Get the name of the deployed release by running the following command.

    helm list -a
    
  • Save the values of the currently deployed release to a .yaml file by running the following command. Replace <deployed-release-name> with the release name you got in the previous step. After that, edit values-upgrade.yaml to provide the details of the DKubeX version you are going to upgrade to and its components, following the format already used in the file.

    helm get values <deployed-release-name> --all > values-upgrade.yaml
    
  • Run the Helm upgrade job to upgrade the DKubeX version you are using by running the following command. Replace <deployed-release-name> with the current release name, and <new-dkubex-version> with the version you are going to upgrade your DKubeX setup to.

    helm upgrade -f values-upgrade.yaml <deployed-release-name> dkubex-helm/dkubex --set image_tag=<new-dkubex-version> --timeout 1500s
    
  • You can follow the upgrade logs by running the following command on your terminal.

    kubectl logs -l job-name=dkubex-upgrade-hook --follow --tail=-1
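
A common mistake in the upgrade step is running the command with a placeholder still in it. The following is a minimal sketch of a helper that builds the upgrade command from a release name and a version, and refuses to do so if either argument is empty or still looks like a <...> placeholder. The function names is_placeholder and upgrade_command are hypothetical, and the release name dkubex in the example is an assumption, so use the name reported by helm list -a. The helper prints the command for review and does not run it.

```shell
#!/bin/sh
# is_placeholder VALUE -> status 0 if VALUE is empty or still looks like <...>
is_placeholder() {
    case "$1" in
        "" | "<"*">") return 0 ;;
        *) return 1 ;;
    esac
}

# upgrade_command RELEASE_NAME NEW_VERSION
# Prints the helm upgrade command from the step above with both values
# filled in, or fails if either value has not been provided.
upgrade_command() {
    if is_placeholder "$1" || is_placeholder "$2"; then
        echo "release name and version must be filled in" >&2
        return 1
    fi
    echo "helm upgrade -f values-upgrade.yaml $1 dkubex-helm/dkubex --set image_tag=$2 --timeout 1500s"
}

# Example with a hypothetical release name and the example version tag
# from the values.yaml table; this prints the command without running it.
upgrade_command dkubex v0.8.6.3
```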
    

Uninstalling DKubeX

To uninstall the setup, use the following steps:

  • Get the name of the deployed release by running the following command.

    helm list -a
    
  • Run the following command to uninstall the currently deployed DKubeX setup. Replace the <deployed-release-name> part with the current release name.

    helm uninstall <deployed-release-name> --timeout 900s
    
  • You can follow the uninstallation logs by running the following command on your terminal.

    kubectl logs -l job-name=dkubex-uninstaller-hook --follow --tail=-1