Installing DKubeX on RKE2 Cluster on Azure¶
Prerequisites¶
The minimum hardware requirements for installing DKubeX on an Azure cluster are as follows:
| Hardware Requirement | Version/Details |
| --- | --- |
| CPU | 12-16 cores |
| RAM | 128 GB |
| Disk | 512 GB |
| NFS (if an external NFS server is being used) | 1 TB, NFS v4.0/4.1 |
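The hardware minimums can be checked on a node before starting. The following is a minimal sketch, not part of DKubeX: the function names are hypothetical, and it assumes a Linux node with GNU coreutils and that DKubeX data will live on the root filesystem. Note that the kernel usually reports slightly less memory than the nominal VM size, so a near miss on RAM is worth a second look rather than an outright rejection.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper; the thresholds are the documented minimums.
MIN_CPU_CORES=12
MIN_RAM_GB=128
MIN_DISK_GB=512

# meets_minimum NAME ACTUAL REQUIRED
# Prints a verdict and returns 0 only if ACTUAL >= REQUIRED.
meets_minimum() {
  local name="$1" actual="$2" required="$3"
  if [ "$actual" -ge "$required" ]; then
    echo "OK   $name: $actual (minimum $required)"
  else
    echo "FAIL $name: $actual (minimum $required)"
    return 1
  fi
}

# check_node: collect the values for the local node and compare them.
check_node() {
  local cores ram_gb disk_gb status=0
  cores=$(nproc)
  # MemTotal is reported in kB; convert to whole GiB.
  ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  # Size in GiB of the filesystem that holds / (assumption: data lives there).
  disk_gb=$(df -BG --output=size / | tail -n 1 | tr -dc '0-9')
  meets_minimum "CPU cores" "$cores" "$MIN_CPU_CORES" || status=1
  meets_minimum "RAM (GB)" "$ram_gb" "$MIN_RAM_GB" || status=1
  meets_minimum "Disk (GB)" "$disk_gb" "$MIN_DISK_GB" || status=1
  return "$status"
}
```

After sourcing the file, run `check_node` on each node; it returns a non-zero status if any value is below its minimum.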
The minimum software requirements for installing DKubeX on an Azure cluster are as follows:
| Software Requirement | Version/Details |
| --- | --- |
| RKE2 with Rancher | v2.7.6 |
| K8S | v1.26.8 or higher |
| OS Version | Ubuntu 22.04 |
| Network Provider | canal |
You need to open the following range of ports to successfully install and access Rancher and DKubeX.
| Port | Description |
| --- | --- |
| 6443 | Kubernetes API server |
| 22 | SSH |
| 32443 | DKubeX UI port |
| 30000-32767 | NodePorts range |
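Once services are running, reachability of these ports can be probed from another machine. The sketch below is illustrative only: the function names are hypothetical, and it relies on bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command. A port with no listener behind it is also reported as unreachable, so a failure before installation does not necessarily mean that the Azure network rules are wrong.

```shell
#!/usr/bin/env bash
# Individual ports from the table above; the NodePort range is kept as bounds.
REQUIRED_PORTS="6443 22 32443"
NODEPORT_MIN=30000
NODEPORT_MAX=32767

# in_nodeport_range PORT: succeed if PORT lies inside the NodePort range.
in_nodeport_range() {
  [ "$1" -ge "$NODEPORT_MIN" ] && [ "$1" -le "$NODEPORT_MAX" ]
}

# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_ports HOST: report the state of each required port on HOST.
check_ports() {
  local host="$1" port
  for port in $REQUIRED_PORTS; do
    if port_open "$host" "$port"; then
      echo "$port reachable"
    else
      echo "$port not reachable (blocked, or nothing is listening)"
    fi
  done
}
```

For example, `check_ports <node-ip>` probes the three individual ports, and `in_nodeport_range 32443` succeeds, which shows that the DKubeX UI port is itself a NodePort.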
You need to install helm & kubectl on your CPU (installer) node.
To install helm, run the commands on the CPU node terminal:
sudo apt install curl -y
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version
To install kubectl, run the commands on the CPU node terminal:
Download the latest kubectl release with the command:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
To validate the binary, use the following commands:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
If the validation is successful, it should show the following:
kubectl: OK
If the check fails, sha256sum exits with a nonzero status and prints output similar to:
kubectl: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
Install kubectl using the following command:
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
Test to ensure the version you installed is up-to-date:
kubectl version --client
Important
For more information regarding the installation of kubectl, refer to Install and Set Up kubectl on Linux. For more information regarding the installation of helm, refer to Installing Helm.
If you are using an external NFS server on Azure, visit the Installation: Creating an NFS Server on Azure Console page and follow the instructions provided to create an external NFS server.
Installing Rancher and Creating a Rancher Cluster¶
Important
For detailed information regarding installing Rancher CLI, refer to Rancher Kubernetes: A Quick Installation Guide for RKE2.
To install Rancher and create a Rancher cluster on your installer node, use the following steps:
Log into the root environment by running the following command:
sudo su
Run the installer service for the RKE2 server by running the following command.
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.12+rke2r1 sh -
Enable the RKE2 server service by running the following command.
systemctl enable rke2-server.service
Start the RKE2 server service by running the following command. Once the command finishes running, your RKE2 service will be up.
systemctl start rke2-server.service
Note
You can check the RKE2 server logs by running the following command in another installer node terminal.
journalctl -u rke2-server -f
Enable kubectl by running the following commands. This copies the kubeconfig file generated by the RKE2 service into the server environment.
mkdir -p ~/.kube
cp /etc/rancher/rke2/rke2.yaml ~/.kube/config
Get the server (installer) node token by running the following command. This token will be necessary later if you are going to add a new worker (agent) node to the RKE2 cluster.
cat /var/lib/rancher/rke2/server/node-token
Adding a worker node to the RKE2 cluster¶
To add a worker (agent) node to the RKE2 cluster, use the steps provided below.
On the worker node terminal, log into the root environment by running the following command:
sudo su
Run the installer service for the RKE2 agent by running the following command.
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
Enable the RKE2 agent service by running the following command.
systemctl enable rke2-agent.service
To connect an agent node to the server, a config.yaml file is required on the agent node. This file contains the server address and the secret token used to set up the connection. Create the configuration directory and open the file in an editor:
mkdir -p /etc/rancher/rke2/
vim /etc/rancher/rke2/config.yaml
Add the details provided below to the file. Replace <server> with the IP address of the installer node and <server node token> with the server (installer) node token which you got earlier. Once done, save the file.
server: https://<server>:9345
token: <server node token>
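If you prefer not to edit the file interactively, the same two-line file can be written from the shell. This is a minimal sketch under that assumption; `render_agent_config` is a hypothetical helper name, and restricting the file permissions is a suggestion made here because the token is a secret.

```shell
#!/usr/bin/env bash
# render_agent_config SERVER_IP TOKEN
# Prints the contents of the agent config.yaml described above.
render_agent_config() {
  local server_ip="$1" token="$2"
  cat <<EOF
server: https://${server_ip}:9345
token: ${token}
EOF
}

# Example (run as root on the worker node):
#   mkdir -p /etc/rancher/rke2/
#   render_agent_config "<server>" "<server node token>" > /etc/rancher/rke2/config.yaml
#   chmod 600 /etc/rancher/rke2/config.yaml
```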
Add the worker node to the RKE2 cluster by starting the agent service by running the following command:
systemctl start rke2-agent.service
Note
You can check the RKE2 server logs by running the following command in another installer node terminal.
journalctl -u rke2-server -f
Check the node status by running the following command. Once the node status is Ready, the worker node is ready to use.
kubectl get nodes
At this point the worker node does not have any role assigned to it. To assign the worker role to the node, add a Kubernetes label to it by running the following command. Replace <worker node name> with the name of the worker node, which you can get by listing the nodes with the previous command.
kubectl label nodes <worker node name> node-role.kubernetes.io/worker=worker
For example:
kubectl label nodes dkubex-worker-node node-role.kubernetes.io/worker=worker
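Waiting for a newly joined worker to reach the Ready state can also be scripted instead of re-running `kubectl get nodes` by hand. The sketch below is one possible approach with hypothetical function names; it treats any status other than exactly `Ready` (for example `Ready,SchedulingDisabled`) as not ready. The built-in `kubectl wait --for=condition=Ready node/<worker node name> --timeout=300s` is an alternative for a single node.

```shell
#!/usr/bin/env bash
# not_ready_nodes: reads `kubectl get nodes --no-headers` output on stdin and
# prints the names of nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk 'NF && $2 != "Ready" {print $1}'
}

# wait_for_nodes: polls every 10 seconds, for up to about 5 minutes,
# until every node in the cluster reports Ready.
wait_for_nodes() {
  local attempt out pending
  for attempt in $(seq 1 30); do
    out=$(kubectl get nodes --no-headers) || return 1
    pending=$(printf '%s\n' "$out" | not_ready_nodes)
    if [ -z "$pending" ]; then
      return 0
    fi
    echo "Attempt $attempt: still waiting for: $pending"
    sleep 10
  done
  return 1
}
```

Run `wait_for_nodes` on the installer node after starting the agent service; it returns 0 once all nodes are Ready.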
Installing DKubeX¶
Add the dkubex-helm repository on your system.
helm repo add dkubex-helm https://oneconvergence.github.io/dkubex-helm --insecure-skip-tls-verify
Update the helm repository.
helm repo update
From the Helm repository, get the values.yaml file on your local system by using the following command.
helm show values dkubex-helm/dkubex > values.yaml
You need to provide details regarding the version of DKubeX you are going to install and its components in the format provided in the values.yaml file. Open the editor by using the following command:
vim values.yaml
You need to fill in the following fields (if required) in the values.yaml file:
| Field | Description |
| --- | --- |
| image_tag | The version of DKubeX you are going to install. For example, v0.8.5.4.1. |
| admin_password | The password of the admin user of DKubeX. |
| registry -> name | The name of the Docker registry repository you are going to use for DKubeX installation. |
| registry -> username | The username for the Docker registry you are going to use for DKubeX installation. |
| registry -> password | The password for the Docker registry you are going to use for DKubeX installation. |
| storage_type | The type of storage you are going to use for DKubeX. For example, nfs. |
| nfs -> create_server | To create a new local NFS server, set to true. If you are using an external NFS server, set to false. |
| nfs -> storage_path | The host path for creating the NFS server. For an external NFS server, provide the mount path as mentioned in the last step in Installation: Creating an NFS Server on Azure Console. |
| nfs -> nfs_server | The IP address of the NFS server. For an external NFS server, provide the NFS share mount URL as mentioned in the last step in Installation: Creating an NFS Server on Azure Console. |
| nfs -> version | The version of NFS you are going to use. Currently supported versions are 4.1 and 4.0. If you are using an external NFS server on Azure, provide 4.0. |
| hostpath -> storage_path | The host path for creating the local NFS server. (For Mac, use this instead of nfs.) |
| user_home_nfs_server_version | NFS version for user home. Currently supported versions are 4.1 and 4.0. If you are using an external NFS server on Azure, provide 4.0. |
| wipe_data | To delete data from the storage, set to true, else false. |
| gpu_operator -> enabled | Set to false, because the gpu-operator is installed explicitly beforehand. |
| kubeflow -> enabled | (Optional) To install Kubeflow for DKubeX, set to true, else false. |
| auth -> enabled | To enable OAuth for DKubeX, set to true, else false. You can also set up OAuth post-installation. For more information, refer to Setting up Authentication. |
| auth -> provider | The OAuth provider you are going to use. Currently 6 OAuth providers are supported: ADFS, Azure, GitHub, Google, Keycloak and Okta. |
| auth -> issuer_url | The issuer URL of the OAuth provider you are going to use. (Not required for GitHub OAuth App) |
| auth -> client_id | The client ID of the OAuth application you are going to use. |
| auth -> client_secret | The client secret of the OAuth application you are going to use. |
| auth -> redirect_url | The callback URL of the OAuth application you are going to use. |
| auth -> organisation | The name of the organization whose users are going to be authenticated. |
| auth -> email_domain | The email domain of the users who are allowed to be authenticated. |
| auth -> azure_tenant | (Only for Azure) The tenant ID of the Azure OAuth application you are going to use. |
| auth -> realm | (Only for Keycloak) The realm name of the Keycloak OAuth application you are going to use. |
| auth -> allowed_role | (Only for Keycloak) The role name of the Keycloak OAuth application you are going to use. |
| auth -> allowed_group | (Only for Keycloak) The group name of the Keycloak OAuth application you are going to use. |
| mlflow | (Optional) Provide the details regarding the MLflow server you are going to use. |
| flyte -> enabled | (Optional) To install Flyte for DKubeX, set to true, else false. If set to true, you need to provide the details regarding the Flyte account you are going to use. |
| node_selector_label | The custom Kubernetes label key that will be added to the worker nodes of the RKE2 cluster. Provide a custom label key and label your GPU nodes using this. Example: node.kubernetes.io/nodetype |
# Default values for dkubex.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

app_namespace: "d3x"
image_tag: "0.8.5.4.1"
admin_password: "admipass123"

provider: "dkubex" # dkubex/eks
eks:
  autoscaler_arn: ""
  cluster_name: ""

# Docker registry for installation
registry:
  # Format: registry/[repo]
  name: "docker.io/dkubex123"
  # Container registry username
  username: "dkubex123"
  # Container registry password
  password: "Abc@xyz123"

# nfs or hostpath
storage_type: "nfs"
nfs:
  # true for creating internal nfs server
  create_server: true
  # specify host path for creating internal nfs
  storage_path: "/var/dkubex"
  internal_nfs_node_selector:
    node_selector_key: "kubernetes.io/os"
    node_selector_value: "linux"
    taint: ""
  # Format: <server ip>:<path>
  nfs_server: "kubernetes:/"
  # specify nfs version supported are 4.1 and 4.0
  version: "4.1"
hostpath:
  # specify host path
  storage_path: "/var/dkubex"

#specify the home_nfs_server if the user home is different
user_home_nfs_server: ""
# specify nfs version for user home supported are 4.1 and 4.0
user_home_nfs_server_version: "4.1"

wipe_data: true #delete data from the storage

# Format: dialect+driver://username:password@host:port/database
database_url: ""

image_pull_policy: "Always"

loadbalancer:
  enabled: false
  eks:
    name: "dkubex"
    #aws cert arn
    cert_arn: ""
    cross_zone_lb: false
    internal_lb: true
    scheme: ""
    subnets: ""
    tags: ""

# docker registry prifix
reg_prefix: ""

# enables gpu for dkubex
gpu_operator:
  enabled: false
  driver: false #install driver
  toolkit: false #install toolkit

# enables datadog for dkubex
datadog:
  enabled: false
  site: "" #Datadog site
  key: "" #datadog key
  cluster_name: "" #Set a unique cluster name to allow scoping hosts and Cluster Checks easily

# enables fm for dkubex
fm:
  enabled: true
  s3:
    enabled: false
    bucket_name: "" # Bucket is used for both s3-mount on fm-controller and weaviate s3 backup
    weaviate_backup_path: ""
    aws_access_key_id: ""
    aws_access_secret: ""

# enables kubeflow for dkubex
kubeflow:
  enabled: false

#enables sssd integration with dkubex can use sssd/ldap/local_ldap
sssd:
  enabled: true
  type: ldap #sssd/ldap
  ldap_server: #create local ldap server for user management
    enabled: true
    auto_user_add: true #this will add authenticated user to the ldap if user doesnot exist.
  #if ldap_server is not enabled details for ldap and sssd
  #details of Ad when type is sssd and ldap server is not enabled
  ad: ""
  bind_password: ""
  ou: ""
  bind_user: ""
  #details of ldap when type is ldap and ldap server is not enabled
  ldap_url: ""
  ladap_search_base: ""

#enables Oauth for dkubex, user can configure this from admin ui after installation too
auth:
  enabled: true
  provider: "github" #okta/github
  issuer_url: ""
  client_id: "32879e2efcd6jf65lga2"
  client_secret: "9b7457993ffajhlj5376ec05fb2ae2c0b0c11f9"
  redirect_url: "https://123.45.67.890:32443/oauth2/callback"
  organisation: "oneconvergence"
  email_domain: "oneconvergence.com"
  azure_tenant: "" #tenent id for azure ad
  realm: "" # realm for keycload
  allowed_role: "" #keycload allowed role name
  allowed_group: "" # keycload allowed group name

mlflow:
  replica_count: 1
  database_url: ""
  artifacts_destination: ""
  aws_access_key_id: ""
  aws_access_secret: ""

flyte:
  enabled: false
  accountNumber: #aws-account number
  accountRegion: #aws-region
  bucketName: #s3bucket name
  d3xUrl: #external url to access d3x
  cert_arn: ""

node_selector_label: "node.kubernetes.io/nodetype"
pod_security_enforce: ""

# Dkubex components/controllers will be scheduled onto control-plane nodes.
control_plane:
  node_selector: "kubernetes.io/os"
  node_selector_value: "linux"
  node_taint: "dkubex/controlplane=true:NoSchedule"
  enabled: false

mlflow_controller:
  node_selector: "kubernetes.io/os"
  node_selector_value: "linux"
  node_taint: ""

weka:
  enabled: false
  file_system_name: ""
  ips: ""
  username: ""
  password: ""

workspace:
  enabled: true
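For an external NFS server, one option is to keep the NFS settings in a small override file rather than editing the full values.yaml. The sketch below only illustrates that idea: `render_external_nfs_overrides` is a hypothetical helper, the key names are taken from the sample values.yaml above (including its nesting, which is an assumption), and the version is set to 4.0 as described for external NFS servers on Azure. Helm accepts several `-f` files, with later files taking precedence.

```shell
#!/usr/bin/env bash
# render_external_nfs_overrides NFS_SERVER STORAGE_PATH
# NFS_SERVER uses the "<server ip>:<path>" format noted in values.yaml.
render_external_nfs_overrides() {
  local nfs_server="$1" storage_path="$2"
  cat <<EOF
storage_type: "nfs"
nfs:
  create_server: false
  storage_path: "${storage_path}"
  nfs_server: "${nfs_server}"
  version: "4.0"
user_home_nfs_server_version: "4.0"
EOF
}

# Example:
#   render_external_nfs_overrides "<server ip>:<path>" "<mount path>" > nfs-overrides.yaml
#   helm install -f values.yaml -f nfs-overrides.yaml <release-name> dkubex-helm/dkubex --timeout 1500s
```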
You need to add Kubernetes labels to all the worker nodes in the RKE2 cluster on which you are going to install DKubeX, using the steps below. In the label command, replace <node-name> with the name of the node you are labeling, <key> with the node_selector_label key that you set in values.yaml, and <value> with the node type.
Note
Use the following steps also if you are adding new worker nodes post DKubeX installation.
List all the nodes in the cluster by running the following command on your terminal. The ROLES column shows which nodes have only the worker role.
kubectl get nodes
Label each worker node with its type by running the following command on your terminal, using the node_selector_label key set in values.yaml earlier.
kubectl label node <node-name> <key>=<value>
kubectl label node ip-172-31-1-132 node.kubernetes.io/nodetype=a10
Note
The value of the node_selector_label can be used as an input to -t or --type while creating a ray cluster/deploying/finetuning.
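When several worker nodes share the same type, the label command can be applied in a loop. The following is a sketch with a hypothetical `label_nodes` helper; the `DRY_RUN` switch is an addition made here so that the commands can be reviewed before they are run.

```shell
#!/usr/bin/env bash
# label_nodes KEY VALUE NODE...
# Applies KEY=VALUE to every listed node. With DRY_RUN=1 the kubectl
# commands are printed instead of being executed.
label_nodes() {
  local key="$1" value="$2" node
  shift 2
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "kubectl label node $node $key=$value"
    else
      kubectl label node "$node" "$key=$value"
    fi
  done
}

# Example:
#   DRY_RUN=1 label_nodes node.kubernetes.io/nodetype a10 <node-name-1> <node-name-2>
```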
Run Helm installation of DKubeX on your setup by using the following command. Replace the <release-name> part with the version of DKubeX you are going to install.
helm install -f values.yaml <release-name> dkubex-helm/dkubex --timeout 1500s
You can follow the installation logs by running the following command on your terminal.
kubectl logs -l job-name=dkubex-installer --follow --tail=-1
You can access your DKubeX setup by going to the following URL on your browser. Replace the <node-ip> part with the IP address of the node on which you have installed DKubeX.
| URL | Example |
| --- | --- |
| https://<node-ip>:32443 | https://123.45.67.890:32443 |
If you have not added OAuth configuration during installation, going to the previous URL opens the setup with a default user workspace.
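To confirm from a terminal that the UI endpoint responds, a quick probe with curl is one option. This is a sketch with hypothetical function names; the `-k` flag skips certificate verification on the assumption that the setup uses a self-signed certificate, so drop it if yours has a trusted one.

```shell
#!/usr/bin/env bash
# dkubex_url NODE_IP: prints the DKubeX UI URL for the given node.
dkubex_url() {
  echo "https://$1:32443"
}

# ui_http_status NODE_IP: prints the HTTP status code returned by the UI URL.
# -k skips TLS verification (assumption: self-signed certificate),
# -m 10 gives up after 10 seconds.
ui_http_status() {
  curl -k -s -o /dev/null -m 10 -w '%{http_code}' "$(dkubex_url "$1")"
}

# Example:
#   ui_http_status <node-ip>
```

A status in the 2xx or 3xx range suggests that the UI is being served; a value of 000 typically means that no HTTP response was received at all.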
Setting up Authentication¶
Note
This is an optional step provided only in case you have set auth -> enabled to false in the values.yaml file during installation. You can skip this step if you don’t want to set up authentication for your DKubeX setup.
If you have not set up the authentication for your DKubeX setup during installation, you can do it on the DKubeX Admin page by following the steps provided in the following page: Auth.
Note
For more information regarding the admin page, refer to Admin Guide.
You need to have a pre-created OAuth application.
Note
Currently DKubeX supports OAuth App by ADFS, Azure, GitHub, Google, Keycloak and Okta OAuth providers.
Open the admin page of your DKubeX setup by going to the following URL on your browser. Replace the <node-ip> part with the IP address of the node on which you have installed DKubeX.
https://<node-ip>:32443/admin
https://123.45.67.890:32443/admin
Upgrading DKubeX¶
To upgrade the setup, use the following steps:
Update the Helm repository.
helm repo update
Get the name of the deployed release by running the following command.
helm list -a
Get the values regarding the current deployed release on a .yaml file by running the following command. Replace the <deployed-release-name> part with the release name you got in the previous step. After that, you need to provide details regarding the version of DKubeX you are going to upgrade to and its components in the format provided in the values-upgrade.yaml file.
helm get values <deployed-release-name> --all > values-upgrade.yaml
Run the Helm upgrade job to upgrade the DKubeX version you are using by running the following command. Replace the <deployed-release-name> part with the current release name, and the <new-dkubex-version> with the version you are going to upgrade your DKubeX setup to.
helm upgrade -f values-upgrade.yaml <deployed-release-name> dkubex-helm/dkubex --set image_tag=<new-dkubex-version> --timeout 1500s
You can follow the upgrade logs by running the following command on your terminal.
kubectl logs -l job-name=dkubex-upgrade-hook --follow --tail=-1
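Before running an upgrade, it can help to confirm which version the saved values file currently records. The sketch below uses hypothetical helper names and assumes that image_tag is a top-level key in values-upgrade.yaml, as it is in the sample values.yaml.

```shell
#!/usr/bin/env bash
# current_image_tag FILE: prints the value of the top-level image_tag key,
# with any surrounding double quotes removed.
current_image_tag() {
  awk -F': *' '/^image_tag:/ { gsub(/"/, "", $2); print $2; exit }' "$1"
}

# needs_upgrade FILE TARGET: succeeds if FILE records a version other than TARGET.
needs_upgrade() {
  [ "$(current_image_tag "$1")" != "$2" ]
}

# Example:
#   current_image_tag values-upgrade.yaml
#   needs_upgrade values-upgrade.yaml <new-dkubex-version> && echo "upgrade required"
```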
Uninstalling DKubeX¶
To uninstall the setup, use the following steps:
Get the name of the deployed release by running the following command.
helm list -a
Run the following command to uninstall the currently deployed DKubeX setup. Replace the <deployed-release-name> part with the current release name.
helm uninstall <deployed-release-name> --timeout 900s
You can follow the uninstallation logs by running the following command on your terminal.
kubectl logs -l job-name=dkubex-uninstaller-hook --follow --tail=-1