Installing DKubeX on Tanzu Cluster

Prerequisites

The minimum hardware requirements for installing DKubeX on a Tanzu cluster are as follows:

Hardware Requirement	Version/Details
CPU	12-16 cores
RAM	128GB
Disk	512GB
NFS (If external NFS server is being used)	1TB, v4.0/4.1

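
As a quick pre-flight check, the hardware minimums above can be compared against what a node reports. The following is a minimal sketch and not part of the DKubeX tooling: the function names `meets_minimum`, `report`, and `preflight` are hypothetical, the thresholds simply mirror the table (12 cores, 128GB RAM, 512GB disk), and it assumes a Linux node with GNU coreutils (`nproc`, `df --output`).

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the measured value ($1) is at least the minimum ($2).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Hypothetical helper: prints one result line.
# $1 = label, $2 = measured value, $3 = required minimum
report() {
    if meets_minimum "$2" "$3"; then
        echo "OK: $1 = $2 (minimum $3)"
    else
        echo "LOW: $1 = $2 (minimum $3)"
    fi
}

# Hypothetical helper: measures the local node and reports against the table above.
preflight() {
    cores=$(nproc)
    # MemTotal is reported in kB; convert to whole GB, rounded down.
    ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    # Size of the root filesystem in GB; change "/" if the data lives elsewhere.
    disk_gb=$(df -BG --output=size / | tail -n 1 | tr -dc '0-9')
    report "CPU cores" "$cores" 12
    report "RAM in GB" "$ram_gb" 128
    report "Disk in GB" "$disk_gb" 512
}
```

Source the file and call `preflight` on each node. A machine sold as 128GB usually reports slightly less usable memory, so a `LOW` result that is close to the threshold is informational rather than a hard failure.
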
The minimum software requirements for installing DKubeX on a Tanzu cluster are as follows:

Software Requirement	Version/Details
RKE2 with Rancher	v2.10.1
K8S	v1.28.15 or higher
OS Version	Ubuntu 22.04
Network Provider	canal

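
The "v1.28.15 or higher" requirement can be checked with a version-aware comparison. This is a sketch under assumptions: `version_ge` is a hypothetical helper, it relies on `sort -V` from GNU coreutils (present on Ubuntu), and the version string in the example is made up; on a real cluster it would come from the VERSION column of `kubectl get nodes`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Uses `sort -V` (version sort), so 1.28.9 correctly sorts before 1.28.15.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example with an illustrative version string (not taken from a real cluster).
if version_ge "v1.30.4" "v1.28.15"; then
    echo "Kubernetes version is supported"
else
    echo "Kubernetes version is too old"
fi
```

A plain string comparison would get this wrong, because "v1.28.9" sorts after "v1.28.15" lexically; that is the reason for using a version sort here.
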
You need to open the following range of ports to successfully install and access Rancher and DKubeX.

Port	Description
`6443`	Kubernetes API server
`443`	Rancher UI
`22`	SSH
`32443`	DKubeX UI port
`30000-32767`	NodePorts range

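
Whether the listed ports are reachable can be probed from another machine. The sketch below is illustrative only: `is_nodeport` and `check_port` are hypothetical helpers, the `NODE_IP` variable is an assumption of this example, and it assumes netcat (`nc`) is installed. A port only answers once the service behind it is running, so `6443` and `32443` will show as closed before the cluster and DKubeX exist.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a port lies in the NodePort range 30000-32767.
is_nodeport() {
    [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

# Hypothetical helper: probes one TCP port with netcat.
# $1 = host, $2 = port; -z only tests the connection, -w 3 sets a 3 second timeout.
check_port() {
    if nc -z -w 3 "$1" "$2" 2>/dev/null; then
        echo "open: $1:$2"
    else
        echo "closed: $1:$2"
    fi
}

# Probe the individually listed ports when a target is supplied, for example:
#   NODE_IP=203.0.113.10 sh check_ports.sh
if [ -n "${NODE_IP:-}" ]; then
    for port in 6443 443 22 32443; do
        check_port "$NODE_IP" "$port"
    done
fi
```

Note that the DKubeX UI port `32443` already falls inside the NodePort range, which `is_nodeport 32443` confirms.
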
You need to install helm & kubectl on your CPU (installer) node.
- To install helm, run the commands on the CPU node terminal:
```
sudo apt install curl -y
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version
```
- To install kubectl, run the commands on the CPU node terminal:
  - Download the latest kubectl release with the command:
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - To validate the binary, download the checksum file and verify the binary against it:
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
    echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
    If the validation is successful, the output is:
    kubectl: OK
    If the check fails, sha256sum exits with a nonzero status and prints output similar to:
    kubectl: FAILED
    sha256sum: WARNING: 1 computed checksum did NOT match
  - Install kubectl using the following command:
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
  - Test to ensure the version you installed is up-to-date:
    kubectl version --client
  Important
  
  For more information regarding the installation of kubectl, refer to Install and Set Up kubectl on Linux. For more information regarding the installation of helm, refer to Installing Helm.
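
Before moving on, it can help to confirm that the tools installed above are actually on the `PATH`. This is a small sketch; `require_cmd` is a hypothetical helper name and the tool list is taken from the steps in this section.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a command is available on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in curl helm kubectl; do
    require_cmd "$tool" || missing=1
done
[ "$missing" -eq 0 ] || echo "Install the missing tools before continuing."
```
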

Use the following steps to install Docker and Rancher on your CPU node:

Uninstall all conflicting packages using the command:

for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done

Update the package lists for available software packages by using the following command:
```
sudo apt-get update
```
Install ca-certificates, curl, and gnupg using the following command:
```
sudo apt-get install ca-certificates curl gnupg -y
```

To set up the docker apt repository, use the following commands in order:

Note

You only need to do this the first time you install Docker on a host machine.

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

Use the following commands to add the repository to the apt sources:

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

Install the latest versions of the Docker packages using the command:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

To test whether the installation is successful, use the following command to run the hello-world image.

Note

Running this command will download a test image. The image will then be run in a container followed by a printed confirmation message.
```
sudo docker run hello-world
```

Creating a Rancher Cluster

Once the Docker installation has been completed on the installer node, install Rancher using the command:

sudo docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged rancher/rancher:v2.10.1

Once the Docker and Rancher installation has been completed on the CPU node, you need to log into the Rancher UI and create a Rancher cluster.
- To log into the Rancher UI, use the following steps:
  - Copy the public IP of the CPU node, paste and open it on your browser. This opens the Rancher configuration page for that node.
    
    https://<CPU_Node_IP>
  - To find your container ID, use the command on CPU node terminal window:
    sudo docker container ps
  - Copy the container ID, replace <Container_ID> in the following command with it, and run the command. This prints the bootstrap password for Rancher on that node.
    sudo docker logs <Container_ID> 2>&1 | grep "Bootstrap Password:"
  - Once you obtain the Bootstrap Password, copy it and paste it in the space provided for the password in the Rancher page.
  - Create a new password on the next screen and check the boxes provided below the Server URL space & click on the Continue button. This will open the Rancher dashboard for the node.
- Once you are logged into the Rancher UI, follow the steps provided below to create a RKE2 cluster.
  - In the Rancher dashboard, click on the Create option to create an RKE2 Cluster.
  - Select the Custom option provided at the bottom of the list to create a custom cluster.
  - Provide the Cluster with a name. Optionally you can provide a description for the cluster.
  - Choose the Networking option in the Cluster Configuration list of options provided on the left side of the page.
    - On the Addressing section, provide cluster CIDR from a private network.
    Note
    
    You can choose the cluster CIDR from any of the following private address ranges.
    - 192.168.0.0 to 192.168.255.255 (Class C)
    - 172.16.0.0 to 172.31.255.255 (Class B)
    - 10.0.0.0 to 10.255.255.255 (Class A)
  - Choose the Advanced option in the Cluster Configuration list of options provided on the left of the page.
  - Click on the Add Argument option then copy the line provided below & paste it in the space provided. Once you have done so, click on the Create option provided on the bottom right corner of the page. This will start provisioning the cluster.
    max-pods=250
- On the dashboard page of the newly created cluster, click and open the Registration tab.
- You need to register all the nodes (including the CPU/installer node) to the cluster. Follow the steps given below to do so.
  - For the CPU (installer) node:
    - On Step 1- Node Role, make sure all the checkboxes (etcd, Control Plane, Worker) are checked.
    - On Step 2- Registration Command, click on the insecure checkbox.
    - Copy the registration command provided and run it on the CPU node terminal. This will start the registration process for the node.
  - For each worker node:
    - On Step 1- Node Role, make sure that only the Worker checkbox is checked.
    - On Step 2- Registration Command, click on the insecure checkbox.
    - Copy the registration command provided and run it on the worker node terminal. This will start the registration process for the node.
  Note
  
  Click on the Provisioning Log option to view more details regarding the provisioning process.
- Wait until the cluster status (shown beside the cluster name) changes to Active. This will take a few minutes. Once it happens, your Tanzu cluster is ready to use.
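
The bootstrap-password lookup described in the login steps above can be combined into a single pipeline. The sketch below assumes the Rancher container was started from the `rancher/rancher:v2.10.1` image as shown earlier; `extract_bootstrap_password` is a hypothetical helper, and the log line in the demonstration is made up rather than real Rancher output.

```shell
#!/bin/sh
# Hypothetical helper: reads container logs on stdin and prints only the text
# that follows the "Bootstrap Password:" marker (first match only).
extract_bootstrap_password() {
    sed -n 's/.*Bootstrap Password: *//p' | head -n 1
}

# On the CPU node this would be fed from the Rancher container, for example:
#   cid=$(sudo docker ps -q --filter ancestor=rancher/rancher:v2.10.1)
#   sudo docker logs "$cid" 2>&1 | extract_bootstrap_password

# Demonstration with a made-up log line:
echo "2024/01/01 00:00:00 [INFO] Bootstrap Password: example-password-123" \
    | extract_bootstrap_password
```
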
You have to add the new kubernetes context for your cluster to your CPU node.
- Run the following commands on the CPU node terminal:
```
mkdir -p ~/.kube
cd ~/.kube
vim config
```
- On the Rancher Dashboard, in the Clusters option, select the three-dot menu button to the right of the Config option. Then select the Copy KubeConfig to Clipboard option from the dropdown. Paste the configuration into the config file on your terminal, save the file, and go back to the home directory by running cd.
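
Because the kubeconfig contains cluster credentials, it is reasonable to restrict its permissions and then confirm that kubectl can reach the cluster. This is a sketch, not a required step from the DKubeX documentation; `secure_kubeconfig` is a hypothetical helper and the default path `~/.kube/config` matches the file created above.

```shell
#!/bin/sh
# Hypothetical helper: makes a kubeconfig file readable and writable by its owner only.
secure_kubeconfig() {
    chmod 600 "$1"
}

# On the CPU node, after saving the file:
#   secure_kubeconfig "$HOME/.kube/config"
#   kubectl config current-context   # prints the active context name
#   kubectl get nodes                # lists the registered nodes
```

If `kubectl get nodes` lists the nodes you registered in Rancher, the new context is working.
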

Attention

The following step should be done at this point only if you are adding GPU nodes to the Tanzu cluster after installing DKubeX on it. If you are installing DKubeX on a Tanzu cluster without GPU nodes, then you can skip this step and proceed with completing the other prerequisites.

For Tanzu, the gpu-operator must be installed explicitly on the CPU node using the commands below. Wait until the GPU operator pods reach the Running state.

Note

This is based on the references given under Rancher Kubernetes Engine 2 section in the following link: Getting Started- NVIDIA GPU Operator 23.9.0 documentation.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator --wait -n gpu-operator \
--create-namespace nvidia/gpu-operator \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true
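
To see whether the GPU operator has settled, the pod list in the `gpu-operator` namespace can be summarized. The sketch below is illustrative: `count_unready_pods` is a hypothetical helper, it assumes the default `kubectl get pods` column layout (STATUS in the third column), and the pod names in the demonstration are made up.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin and
# prints how many pods are neither Running nor Completed.
count_unready_pods() {
    awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# On the CPU node:
#   kubectl get pods -n gpu-operator --no-headers | count_unready_pods

# Demonstration with made-up pod names:
printf '%s\n' \
    'example-operator-abc   1/1   Running     0   5m' \
    'example-validator-def  0/1   Completed   0   4m' \
    'example-driver-ghi     0/1   Init:0/1    0   1m' \
    | count_unready_pods
```

A result of `0` on the real cluster suggests that every pod in the namespace is either running or has finished.
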

Before proceeding with DKubeX installation on the cluster, we need to make sure the below steps are done on ALL the nodes of the Tanzu cluster.
- Install the nfs-common package to avoid the mount failure issues we might encounter during DKubeX installation.
```
sudo apt install nfs-common -y
```
  Note
  
  This step is as per the discussion in the github rancher threads. Below is the reference for the same.
  
  rancher/rancher- failed to mount volume backed by NFS in imported K3s cluster #25169
- To avoid space issues on the device, follow the steps provided below and execute them in ALL the nodes.
  - Open the sysctl.conf file using the following command:
    sudo vim /etc/sysctl.conf
  - Update the file-descriptor, inotify, and user-namespace limits by adding the following 4 lines to the end of /etc/sysctl.conf, and save the file:
    fs.file-max = 2097152
    fs.inotify.max_user_instances=2097152
    fs.inotify.max_user_watches=1048576
    user.max_user_namespaces=286334455
  - Load these new settings by executing the command:
    sudo sysctl -p
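
Since these settings have to be applied on every node, a non-interactive and repeatable alternative to editing the file in vim may be convenient. This is a sketch under assumptions: `ensure_setting` and `apply_dkubex_sysctl` are hypothetical helpers, the values are the four listed above, and writing to `/etc/sysctl.conf` requires root.

```shell
#!/bin/sh
# Hypothetical helper: appends "key=value" to a file unless the key is already set.
# $1 = file, $2 = key, $3 = value
ensure_setting() {
    # awk compares the text before "=" (spaces removed) with the key exactly.
    if awk -F= -v k="$2" '{ gsub(/[ \t]/, "", $1); if ($1 == k) found = 1 }
                          END { exit !found }' "$1" 2>/dev/null; then
        echo "already set: $2"
    else
        echo "$2=$3" >> "$1"
        echo "added: $2"
    fi
}

# Hypothetical helper: applies the four settings from this section to a file.
# $1 = target file, for example /etc/sysctl.conf
apply_dkubex_sysctl() {
    ensure_setting "$1" fs.file-max 2097152
    ensure_setting "$1" fs.inotify.max_user_instances 2097152
    ensure_setting "$1" fs.inotify.max_user_watches 1048576
    ensure_setting "$1" user.max_user_namespaces 286334455
}
```

Run `apply_dkubex_sysctl /etc/sysctl.conf` as root, then load the settings with `sudo sysctl -p` as described above. Running it a second time leaves the file unchanged, and an existing key is never overwritten, so a node that already sets one of these keys to a different value needs a manual review.
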

Attention

If you are installing DKubeX on a Tanzu cluster without GPU/worker nodes, and are going to add GPU nodes later, then please follow the steps given in the Adding GPU Nodes to Tanzu Cluster section after registering the worker node to the Tanzu cluster.

Installing DKubeX

Add the dkubex-helm repository on your system.

helm repo add dkubex-helm https://oneconvergence.github.io/dkubex-helm --insecure-skip-tls-verify

Update the helm repository.
```
helm repo update
```

From the Helm repository, get the values.yaml file on your local system by using the following command.

helm show values dkubex-helm/dkubex > values.yaml

You need to provide details regarding the version of DKubeX you are going to install and its components in the format provided in the values.yaml file. Open the editor by using the following command:
```
vim values.yaml
```

You need to fill the following fields (if required) on the values.yaml file:

Field	Description
image_tag	The version of DKubeX you are going to install. For example, v0.8.4.1.
admin_password	The password of the admin user of DKubeX.
registry -> name	The name of the docker registry repository you are going to use for DKubeX installation.
registry -> username	The username for the docker registry container you are going to use for DKubeX installation.
registry -> password	The password for the docker registry container you are going to use for DKubeX installation.
storage_type	The type of storage you are going to use for DKubeX. For example, nfs.
nfs -> create_server	To create a new local NFS server, set to true, else false.
nfs -> nfs_server	The NFS server to use, in the format <server ip>:<path>.
nfs -> storage_path	The host path for creating local NFS server.
nfs -> version	The version of NFS you are going to use. Currently supported versions are 4.1 and 4.0.
hostpath -> storage_path	The host path to use when storage_type is hostpath. (For Mac, use this instead of nfs)
wipe_data	To delete data from the storage, set to true, else false.
gpu_operator -> enabled	Set as false as we are installing the gpu-operator explicitly earlier.
kubeflow -> enabled	(Optional) To install Kubeflow for DKubeX, set to true, else false.
auth -> enabled	To enable OAuth for DKubeX, set to true, else false. You can also set OAuth post-installation. For more information, refer to Setting up Authentication.
auth -> provider	The OAuth provider you are going to use. Currently 6 OAuth providers are supported: ADFS, Azure, GitHub, Google, Keycloak and Okta.
auth -> issuer_url	The issuer URL of the OAuth provider you are going to use. (Not required for GitHub OAuth App)
auth -> client_id	The client ID of the OAuth application you are going to use.
auth -> client_secret	The client secret of the OAuth application you are going to use.
auth -> redirect_url	The callback URL of the OAuth application you are going to use.
auth -> organisation	The name of the organization whose users are allowed to authenticate.
auth -> email_domain	The email domain of the users who are allowed to be authenticated.
auth -> azure_tenant	(Only for Azure) The tenant ID of the Azure OAuth application you are going to use.
auth -> realm	(Only for Keycloak) The realm name of the Keycloak OAuth application you are going to use.
auth -> allowed_role	(Only for Keycloak) The role name of the Keycloak OAuth application you are going to use.
auth -> allowed_group	(Only for Keycloak) The group name of the Keycloak OAuth application you are going to use.
mlflow	(Optional) Provide the details regarding the MLflow server you are going to use.
flyte -> enabled	(Optional) To install Flyte for DKubeX, set to true, else false. If set to true, you need to provide the details regarding the Flyte account you are going to use.
node_selector_label	The kubernetes label key that will be added to the worker nodes of the RKE2 cluster.

# Default values for dkubex.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

app_namespace: "d3x"
image_tag: 0.8.4.1
admin_password: "adminpass123"
provider : "dkubex" # dkubex/eks
eks:
    autoscaler_arn : ""
    cluster_name: ""

# Docker registry for installation
registry:
    # Format: registry/[repo]
    name: "docker.io/dkubex123"

    # Container registry username
    username: "dkubex123"

    # Container registry password
    password: "Abc@xyz123"

# nfs or hostpath
storage_type: "nfs"
nfs:
    # true for creating local nfs server
    create_server: true
    # specify host path for creating internal nfs
    storage_path: "/var/dkubex"
    internal_nfs_node_selector:
        node_selector_key: "kubernetes.io/os"
        node_selector_value: "linux"
        taint : ""
    # Format: <server ip>:<path>
    nfs_server: "kubernetes:/"
    # specify nfs version supported are 4.1 and 4.0
    version: "4.1"
hostpath:
    # specify  host path
    storage_path: "/var/dkubex"

#specify the home_nfs_server if the user home is different
user_home_nfs_server: ""
# specify nfs version for user home supported are 4.1 and 4.0
user_home_nfs_server_version: "4.1"

wipe_data: true #delete data from the storage
# Format: dialect+driver://username:password@host:port/database
database_url: ""
image_pull_policy: "Always"
loadbalancer:
    enabled: false
    eks:
        name: "dkubex"
        #aws cert arn
        cert_arn : ""
        cross_zone_lb: false
        internal_lb: true
        scheme: ""
        subnets: ""
        tags: ""

# docker registry prefix
reg_prefix : ""

# enables gpu for dkubex
gpu_operator:
    enabled: false
    driver: false #install driver
    toolkit: false #install toolkit

# enables datadog  for dkubex
datadog:
  enabled: false
  site: ""                #Datadog site
  key: ""                 #datadog key
  cluster_name: ""        #Set a unique cluster name to allow scoping hosts and Cluster Checks easily

# enables fm for dkubex
fm:
    enabled: true
    s3:
        enabled : false
        bucket_name: ""
        weaviate_backup_path: ""
        aws_access_key_id: ""
        aws_access_secret: ""

# enables kubeflow for dkubex
kubeflow:
    enabled: true

#enables sssd integration with dkubex can use sssd/ldap/local_ldap
sssd:
    enabled: true
    type: ldap #sssd/ldap
    ldap_server:
        #create local ldap server for user management
        enabled: true
        auto_user_add: true #this will add authenticated user to the ldap if user does not exist.

    #if ldap_server is not enabled details for ldap and sssd
    #details of Ad when type is sssd and ldap server is not enabled
    ad: ""
    bind_password: ""
    ou: ""
    bind_user: ""
    #details of ldap when type is ldap and ldap server is not enabled
    ldap_url: ""
    ladap_search_base: ""


#enables Oauth for dkubex
auth:
    enabled: true
    provider: "github" #okta/github
    issuer_url: ""
    client_id: "32879e2efcd6jf65lga2"
    client_secret: "9b7457993ffajhlj5376ec05fb2ae2c0b0c11f9"
    redirect_url: "https://123.45.67.890:32443/oauth2/callback"
    organisation: "oneconvergence"
    email_domain: "oneconvergence.com"
    azure_tenant: "" #tenant id for azure ad
    realm: "" # realm for keycloak
    allowed_role: "" #keycloak allowed role name
    allowed_group: "" # keycloak allowed group name

mlflow:
    database_url: ""
    artifacts_destination: ""
    aws_access_key_id: ""
    aws_access_secret: ""
flyte:
    enabled: false
    accountNumber: #aws-account number
    accountRegion: #aws-region
    bucketName: #s3bucket name
    d3xUrl: #external url to access d3x
    cert_arn : ""
node_selector_label: "node.kubernetes.io/nodetype"
pod_security_enforce: ""
controller:
    node_selector: "kubernetes.io/os"
    node_selector_value: "linux"
    node_taint: ""

weka:
    enabled: false
    file_system_name: ""
    ips: ""
    username: ""
    password : ""
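
Before installing, it may be worth reading back a few top-level settings from the edited file, in particular `wipe_data`, which deletes data from the storage when set to true. The sketch below is a plain text match and not a YAML parser: `top_level_value` is a hypothetical helper that only handles simple top-level `key: value` lines like the ones in the sample above and ignores nested keys.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of a top-level scalar key from a YAML file.
# $1 = file, $2 = key. Trailing "# comments" and surrounding double quotes are removed.
top_level_value() {
    sed -n "s/^$2[[:space:]]*:[[:space:]]*//p" "$1" \
        | sed -e 's/[[:space:]]*#.*$//' -e 's/^"\(.*\)"$/\1/' \
        | head -n 1
}

# Print a short summary when values.yaml is present in the current directory.
if [ -f values.yaml ]; then
    for key in image_tag storage_type wipe_data; do
        echo "$key = $(top_level_value values.yaml "$key")"
    done
fi
```

For anything beyond a quick look, a real YAML tool is the safer choice, because this helper would truncate a value that itself contains a `#` character.
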

You need to add Kubernetes labels to all the worker nodes in the RKE2 cluster on which you are going to install DKubeX. In the label command below, replace <node-name> with the name of the node you are labeling, <key> with the key set as node_selector_label in values.yaml, and <value> with the node type.
- List all the nodes in the cluster by running the following command on your terminal. Check for the node role column which will tell which nodes have only worker roles.
```
kubectl get nodes
```
- Label each worker node with its type by running the following command on your terminal, using the node_selector_label key set in the values.yaml earlier.
  kubectl label node <node-name> <key>=<value>
  For example:
  kubectl label node ip-172-31-1-132 node.kubernetes.io/nodetype=a10
Note

The value of the node_selector_label can be used as an input to -t or --type while creating a ray cluster/deploying/finetuning.
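
When there are several worker nodes, the two labeling steps above can be chained. This is a sketch under assumptions: `worker_only_nodes` and `print_label_commands` are hypothetical helpers, the ROLES column is assumed to read exactly `worker` for worker-only nodes, and the node names in the demonstration are made up. The commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin and
# prints the names of nodes whose ROLES column (column 3) is exactly "worker".
worker_only_nodes() {
    awk '$3 == "worker" { print $1 }'
}

# Hypothetical helper: prints one label command per node name read from stdin.
# $1 = label key, $2 = label value
print_label_commands() {
    while read -r node; do
        echo "kubectl label node $node $1=$2"
    done
}

# On the CPU node:
#   kubectl get nodes --no-headers | worker_only_nodes \
#       | print_label_commands node.kubernetes.io/nodetype a10

# Demonstration with made-up node names:
printf '%s\n' \
    'node-a   Ready   control-plane,etcd,master,worker   10d   v1.28.15' \
    'node-b   Ready   worker                             10d   v1.28.15' \
    | worker_only_nodes | print_label_commands node.kubernetes.io/nodetype a10
```

The same value is applied to every node here; nodes of different types need separate runs with their own value.
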
Run the Helm installation of DKubeX on your setup by using the following command. Replace <release-name> with a name for the Helm release (for example, a name that identifies the DKubeX version you are installing).
```
helm install -f values.yaml --version 0.1.37 <release-name> dkubex-helm/dkubex --timeout 1500s
```
You can follow the installation logs by running the following command on your terminal.
```
kubectl logs -l job-name=dkubex-installer --follow --tail=-1
```
You can access your DKubeX setup by going to the following URL on your browser. Replace the <node-ip> part with the IP address of the node on which you have installed DKubeX.

URL	Example
`https://<node-ip>:32443`	`https://123.45.67.890:32443`

If you have not added OAuth configuration during installation, going to the previous URL opens the setup with a default user workspace.
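
Whether the UI is answering can also be checked from a terminal. The sketch below is illustrative: `dkubex_url` and `probe_dkubex` are hypothetical helpers, the address used is from a documentation-only range, and the `-k` flag assumes the certificate served on port 32443 may not be trusted by the machine running the check.

```shell
#!/bin/sh
# Hypothetical helper: builds the DKubeX UI URL for a node IP (port 32443).
dkubex_url() {
    echo "https://$1:32443"
}

# Hypothetical helper: prints the HTTP status code returned by the UI.
# -k skips certificate verification; --max-time bounds the wait to 10 seconds.
probe_dkubex() {
    curl -k -s -o /dev/null -w '%{http_code}\n' --max-time 10 "$(dkubex_url "$1")"
}

# Example (replace the placeholder address with your node IP):
#   probe_dkubex 203.0.113.10
dkubex_url 203.0.113.10
```
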


Setting up Authentication

Note

This is an optional step provided only in case you have set auth -> enabled to false in the values.yaml file during installation. You can skip this step if you don’t want to set up authentication for your DKubeX setup.

If you have not set up the authentication for your DKubeX setup during installation, you can do it on the DKubeX Admin page by following the steps provided in the following page: Auth.

Note

For more information regarding the admin page, refer to Admin Guide.

You need to have a pre-created OAuth application.

Note

Currently DKubeX supports OAuth applications from the following providers: ADFS, Azure, GitHub, Google, Keycloak and Okta.

Open the admin page of your DKubeX setup by going to the following URL on your browser. Replace <node-ip> with the IP address of the node on which you have installed DKubeX.
https://<node-ip>:32443/admin
For example:
https://123.45.67.890:32443/admin

Adding GPU Nodes to Tanzu Cluster

Attention

Please use the steps provided in this section only if you are adding GPU nodes to the Tanzu cluster after installing DKubeX on it.
Make sure you have registered the worker node to the cluster as specified in the Prerequisites section.

You need to install helm & kubectl on your CPU (installer) node.
- To install helm, run the commands on the CPU node terminal:
```
sudo apt install curl -y
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version
```
- To install kubectl, run the commands on the CPU node terminal:
  - Download the latest kubectl release with the command:
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - To validate the binary, download the checksum file and verify the binary against it:
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
    echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
    If the validation is successful, the output is:
    kubectl: OK
    If the check fails, sha256sum exits with a nonzero status and prints output similar to:
    kubectl: FAILED
    sha256sum: WARNING: 1 computed checksum did NOT match
  - Install kubectl using the following command:
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
  - Test to ensure the version you installed is up-to-date:
    kubectl version --client
  Important
  
  For more information regarding the installation of kubectl, refer to Install and Set Up kubectl on Linux. For more information regarding the installation of helm, refer to Installing Helm.
Before proceeding with DKubeX installation on the cluster, we need to make sure the below steps are done on the worker node.
- Install the nfs-common package to avoid the mount failure issues we might encounter during DKubeX installation.
```
sudo apt install nfs-common -y
```
  Note
  
  This step is as per the discussion in the github rancher threads. Below is the reference for the same.
  
  rancher/rancher- failed to mount volume backed by NFS in imported K3s cluster #25169
- To avoid space issues on the device, follow the steps provided below and execute them on the worker node terminal.
  - Open the sysctl.conf file using the following command:
    sudo vim /etc/sysctl.conf
  - Update the file-descriptor, inotify, and user-namespace limits by adding the following 4 lines to the end of /etc/sysctl.conf, and save the file:
    fs.file-max = 2097152
    fs.inotify.max_user_instances=2097152
    fs.inotify.max_user_watches=1048576
    user.max_user_namespaces=286334455
  - Load these new settings by executing the command:
    sudo sysctl -p
You need to add kubernetes labels to all the worker nodes in the RKE cluster on which you are going to install DKubeX. You can do this by running the following command on your terminal. Replace the <node-name> part with the name of the node you are going to add the label to, <key> as the key that you are going to use during DKubeX installation, and <value> as the node type.
- List all the nodes in the cluster by running the following command on your terminal. Check for the node role column which will tell which nodes have only worker roles.
```
kubectl get nodes
```
- Label each worker node with its type by running the following command on your terminal, using the node_selector_label key set in the values.yaml earlier.
  kubectl label node <node-name> <key>=<value>
  For example:
  kubectl label node ip-172-31-1-132 node.kubernetes.io/nodetype=a10
Note

The value of the node_selector_label can be used as an input to -t or --type while creating a ray cluster/deploying/finetuning.

Upgrading DKubeX

To upgrade the setup, use the following steps:

Update the Helm repository.
```
helm repo update
```
Get the name of the deployed release by running the following command.
```
helm list -a
```
Export the values of the currently deployed release to a .yaml file by running the following command. Replace <deployed-release-name> with the release name you got in the previous step. After that, edit the values-upgrade.yaml file to provide the version of DKubeX you are going to upgrade to and the details of its components, keeping the format of the file.
```
helm get values <deployed-release-name> --all  > values-upgrade.yaml
```
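
Keeping an untouched copy of the exported values makes it easy to review what changed before running the upgrade. This is an optional sketch and not a step from the DKubeX documentation; `backup_file` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: copies a file to "<name>.bak" next to it, preserving its
# timestamps and mode, and prints the name of the copy.
backup_file() {
    cp -p "$1" "$1.bak" && echo "$1.bak"
}

# After exporting the values and before editing them:
#   backup_file values-upgrade.yaml
#   vim values-upgrade.yaml
#   diff values-upgrade.yaml.bak values-upgrade.yaml
```

The `diff` output then shows exactly which fields differ from the currently deployed release.
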
Run the Helm upgrade job to upgrade the DKubeX version you are using by running the following command. Replace the <deployed-release-name> part with the current release name, and the <new-dkubex-version> with the version you are going to upgrade your DKubeX setup to.
```
helm upgrade -f values-upgrade.yaml <deployed-release-name> dkubex-helm/dkubex --set image_tag=<new-dkubex-version> --timeout 1500s
```
You can follow the upgrade logs by running the following command on your terminal.
```
kubectl logs -l job-name=dkubex-upgrade-hook --follow --tail=-1
```

Uninstalling DKubeX

To uninstall the setup, use the following steps:

Get the name of the deployed release by running the following command.
```
helm list -a
```
Run the following command to uninstall the currently deployed DKubeX setup. Replace the <deployed-release-name> part with the current release name.
```
helm uninstall <deployed-release-name> --timeout 900s
```
You can follow the uninstallation logs by running the following command on your terminal.
```
kubectl logs -l job-name=dkubex-uninstaller-hook --follow --tail=-1
```
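
Once the uninstall job has finished, the release should no longer appear in the Helm release list. The sketch below is one way to check that; `release_present` is a hypothetical helper, the release names in the demonstration are made up, and it assumes the release lives in the namespace that `helm list` queries by default, as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: reads release names on stdin (one per line, as printed by
# `helm list -a -q`) and succeeds if the given name is among them.
release_present() {
    grep -Fxq -- "$1"
}

# On the CPU node:
#   if helm list -a -q | release_present <deployed-release-name>; then
#       echo "release is still listed"
#   fi

# Demonstration with made-up release names:
if printf '%s\n' 'example-release' 'another-release' | release_present 'example-release'; then
    echo "release is still listed"
else
    echo "release is gone"
fi
```
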