Using SkyPilot¶
SkyPilot is a framework for running Large Language Model (LLM) workloads across cloud platforms. It aims to reduce cost, improve GPU availability, and simplify job management by provisioning each task on the most suitable resources available.
Prerequisites¶
Note
Currently, SkyPilot supports VM-based clusters such as EKS and RKE, but not physical or local clusters.
Onboarding user¶
You need an account with a cloud provider (for example, AWS or GCP) to configure SkyPilot.
If you belong to an organization, you need an IAM user account to set up the SkyPilot policy for that account.
For example, if you have an AWS IAM user account, use the following URL:
If you use an AWS account, an Access Key ID and a Secret Access Key are also required to access the account's resources.
To obtain the access key and secret access key, follow the steps on the following page:
Setting up SkyPilot¶
In your terminal, find the sky controller pod using the following command, and confirm that the pod is running.
kubectl get po -A | grep sky
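To make the "pod is running" check explicit, the listing can be filtered on its STATUS column. The following is a minimal sketch and not part of DKubeX: `sky_pods_not_running` is a hypothetical helper name, and it assumes the default `kubectl get po -A` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `kubectl get po -A` output on stdin and prints
# every pod whose name contains "sky" and whose STATUS is not "Running".
# Exits non-zero if any such pod is found, or if no sky pod is listed at all.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
sky_pods_not_running() {
  awk '$2 ~ /sky/ {
         found = 1
         if ($4 != "Running") { print $1 "/" $2 ": " $4; bad = 1 }
       }
       END { exit !(found && !bad) }'
}

# Usage (requires cluster access):
#   kubectl get po -A | sky_pods_not_running && echo "sky pods are running"
```

If the helper prints a pod name, inspect that pod before continuing with the next step.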
Open a shell in the container of the sky controller pod found in the previous step. Use the following command, replacing <sky-controller-pod> with the pod name.
kubectl exec -it <sky-controller-pod> -n d3x -- bash
For example:
kubectl exec -it sky-controller-v2-0 -n d3x -- bash
Run the following command and enter your AWS credentials when prompted:
aws configure
Note
This command prompts for your configuration details (AWS Access Key ID, AWS Secret Access Key, default region, and output format). These credentials authenticate your CLI commands and give you access to the resources associated with your AWS account.
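After `aws configure` completes, the AWS CLI stores the keys in `~/.aws/credentials`. The sketch below checks that both entries are present; `has_aws_keys` is a hypothetical helper name, not a DKubeX or AWS command. `aws sts get-caller-identity`, shown in the usage comment, is a standard AWS CLI call that confirms the credentials are accepted by AWS.

```shell
# Hypothetical helper: succeeds if the given credentials file (default:
# ~/.aws/credentials, where `aws configure` stores the keys) contains both
# an access key ID entry and a secret access key entry.
has_aws_keys() {
  file="${1:-$HOME/.aws/credentials}"
  [ -f "$file" ] &&
    grep -q '^[[:space:]]*aws_access_key_id' "$file" &&
    grep -q '^[[:space:]]*aws_secret_access_key' "$file"
}

# Usage inside the sky controller container:
#   has_aws_keys && aws sts get-caller-identity
```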
Providing required files¶
Place the files required to run the SkyPilot job in your workspace. You can upload them with the filebrowser application in DKubeX, or fetch them directly from a repository using the DKubeX CLI.
The examples in this guide use the dkubex-examples repository, which contains the wine and llama2 models for training.
Clone the repository and switch to the example files using the following commands:
git clone https://github.com/dkubeio/dkubex-examples.git
cd dkubex-examples
git checkout apps-v2
cd sky
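Before launching a job, it can help to confirm that the task files referenced later in this guide are present. The following is a minimal sketch: `missing_sky_files` is a hypothetical helper, the file names are taken from the commands in this guide, and their location in the `sky` directory is an assumption.

```shell
# Hypothetical helper: prints each task file used later in this guide that is
# missing from the given directory (default: the current directory).
# Prints nothing when all four files are present.
missing_sky_files() {
  dir="${1:-.}"
  for f in wine.yaml llama2.yaml wine-benchmark.yaml llama2-benchmark.yaml; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
    fi
  done
}

# Usage, from the dkubex-examples/sky directory:
#   missing_sky_files
```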
Running a SkyPilot Job¶
Two examples of running SkyPilot jobs are provided. To run an example, follow the corresponding link in the table below.
Example | Description | Link |
---|---|---|
Wine Model Finetuning | This example demonstrates how to finetune a wine model using SkyPilot. | |
Llama2 Finetuning | This example demonstrates how to finetune a llama2 model using SkyPilot. | |
Additional Commands¶
Cost Reporting¶
Use this command to report the cost and duration of your model runs on different accelerators, which helps you identify the most suitable resource for running the model:
d3x sky cost-report
Benchmarking¶
Use this command to launch benchmark clusters on different accelerators. Replace <benchmark_name> with the name you want to give your benchmark.
For the wine model finetuning example, use the following command:
d3x sky bench launch -y -n wine wine-benchmark.yaml --benchmark <benchmark_name>
For the llama2 finetuning example, use the following command:
d3x sky bench launch -y -n llama2 llama2-benchmark.yaml --benchmark <benchmark_name>
Note
For the benchmark training to succeed, A10, V100, and T4 accelerators must be available in the cloud account you have configured.
Use this command to display the benchmark report:
d3x sky bench show <benchmark_name>
Use this command to list the benchmark history:
d3x sky bench ls
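The launch and report commands share the same benchmark name, so it can be convenient to generate them together. The sketch below only prints the commands from this section for review and does not run them; `bench_commands` and the sample name `my-bench` are hypothetical.

```shell
# Hypothetical helper: prints the benchmark commands from this section for one
# example ("wine" or "llama2") and a benchmark name, without executing them.
bench_commands() {
  example="$1"
  name="$2"
  case "$example" in
    wine|llama2) ;;
    *) echo "unknown example: $example" >&2; return 1 ;;
  esac
  if [ -z "$name" ]; then
    echo "benchmark name is required" >&2
    return 1
  fi
  echo "d3x sky bench launch -y -n $example $example-benchmark.yaml --benchmark $name"
  echo "d3x sky bench show $name"
}

# Usage: review the printed lines, then run each one in turn.
#   bench_commands wine my-bench
```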
Checkpointing¶
Managed Spot jobs can be preempted. To recover from a preemption, the application code can periodically checkpoint its progress to a cloud bucket mounted through SkyPilot Storage, so that the program can reload the latest checkpoint when it is restarted.
For the wine model finetuning example, use the following command:
d3x sky launch spot -y --env MLFLOW_TRACKING_TOKEN=$APIKEY -n wine wine.yaml
For the llama2 finetuning example, use the following command:
d3x sky launch spot -y --env MLFLOW_TRACKING_TOKEN=$APIKEY -n llama2 llama2.yaml
Use this command to list the storage buckets where the artifacts are stored at each checkpoint:
d3x sky storage ls
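The checkpoint-and-reload pattern described in this section can be sketched as a startup step that looks for the newest checkpoint in the mounted bucket. This is an illustration under stated assumptions, not the logic used by the example repositories: the helper names, the `ckpt-<step>` file naming, and the directory passed in are all hypothetical.

```shell
# Hypothetical layout: checkpoints are files named ckpt-<step> inside a
# directory backed by the mounted bucket (the path is an assumption).
# Prints the path of the checkpoint with the highest numeric step, if any.
latest_checkpoint() {
  best=""
  for f in "$1"/ckpt-*; do
    step="${f##*/ckpt-}"
    case "$step" in
      ''|*[!0-9]*) continue ;;  # skip non-numeric names and an unmatched glob
    esac
    if [ -z "$best" ] || [ "$step" -gt "$best" ]; then
      best="$step"
    fi
  done
  if [ -n "$best" ]; then
    echo "$1/ckpt-$best"
  fi
}

# On (re)start: resume from the newest checkpoint if one exists.
resume_or_start() {
  ckpt=$(latest_checkpoint "$1")
  if [ -n "$ckpt" ]; then
    echo "resuming from $ckpt"
  else
    echo "starting from scratch"
  fi
}
```

Comparing the step numerically rather than sorting file names avoids picking `ckpt-20` over `ckpt-100`.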