Using SkyPilot

SkyPilot is a framework for running Large Language Models (LLMs) across cloud platforms. It aims to lower cost, improve GPU availability, and simplify execution management by provisioning each task on the most advantageous resources available.

Prerequisites

Note

Currently, SkyPilot supports VM-based clusters such as EKS and RKE, but not physical/local clusters.

Onboarding user

  • You must have a cloud account (for example, AWS or GCP) that can be configured when onboarding SkyPilot.

  • If you belong to an organization, you need an IAM user account to set up the SkyPilot policy for the account.

  • If you use an AWS cloud account, an Access Key and a Secret Access Key are also required to access the AWS account resources.

Setting up SkyPilot

  • On your terminal, search for the sky controller pod using the following command. Ensure that the pod is running.

    kubectl get po -A | grep sky
    
  • Open a shell in the container of the sky controller pod found in the previous step. Use the following command, replacing <sky-controller-pod> with the pod name.

    kubectl exec -it <sky-controller-pod> -n d3x -- bash

  • Use the following command and provide your AWS credentials:

    aws configure
    

    Note

    This command prompts the user to provide their configuration details (AWS Access Key ID, AWS Secret Access Key, Default Region and Output Format). These credentials are required to authenticate the user’s CLI commands and allow the user to access the AWS resources associated with their AWS account.
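
The pod lookup in the first two steps can also be scripted. The following is a minimal sketch and not part of DKubeX: `sky_pods_running` is a hypothetical helper name, and it assumes the default column order of `kubectl get po -A` (NAMESPACE, NAME, READY, STATUS). Unlike the plain `grep sky`, it matches only on the pod name and keeps only pods whose status is Running.

```shell
# Hypothetical helper: read `kubectl get po -A` output on stdin and print
# "<namespace> <pod-name>" for every pod whose name contains "sky" and
# whose STATUS column is Running.
sky_pods_running() {
  awk '$2 ~ /sky/ && $4 == "Running" { print $1, $2 }'
}

# Usage against a live cluster (assumes kubectl is already configured):
#   kubectl get po -A | sky_pods_running
```

If the helper prints nothing, the sky controller pod is either missing or not yet running, and the `kubectl exec` step should wait.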

Providing required files

  • Place the files required to run the SkyPilot job in your workspace. You can use the filebrowser application in DKubeX for this, or fetch them directly from an available repository using the DKubeX CLI.

    • The examples in this guide use the dkubex-examples repository, which contains the wine and llama2 models used for training.

      Clone the repository and access the files using the following commands:

      git clone https://github.com/dkubeio/dkubex-examples.git
      
      cd dkubex-examples
      
      git checkout apps-v2
      
      cd sky
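
For repeated setups, the clone and checkout steps above can be wrapped so that they are safe to rerun. This is a sketch under assumptions: `fetch_examples` and its four arguments are illustrative names, not part of the DKubeX CLI, and it only relies on standard `git` options.

```shell
# Hypothetical helper: clone a repository if it is not already present,
# switch to the requested branch, and print the path of a subdirectory.
# Arguments: <repo-url> <branch> <destination-dir> <subdir>
fetch_examples() {
  repo_url=$1; branch=$2; dest=$3; subdir=$4
  # Skip the clone when the destination already holds a git checkout.
  if [ ! -d "$dest/.git" ]; then
    git clone --quiet "$repo_url" "$dest" || return 1
  fi
  git -C "$dest" checkout --quiet "$branch" || return 1
  # Fail loudly if the expected directory is not on this branch.
  [ -d "$dest/$subdir" ] || { echo "missing $subdir in $dest" >&2; return 1; }
  printf '%s\n' "$dest/$subdir"
}

# Usage with the values from this guide:
#   cd "$(fetch_examples https://github.com/dkubeio/dkubex-examples.git apps-v2 dkubex-examples sky)"
```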
      

Running a SkyPilot Job

Two examples of running SkyPilot jobs are provided here. To run the example you want, follow the appropriate link below.

Example                 Description                                                               Link
Wine Model Finetuning   This example demonstrates how to finetune a wine model using SkyPilot.    Finetuning Wine model using Skypilot Job
Llama2 Finetuning       This example demonstrates how to finetune a llama2 model using SkyPilot.  Finetuning Llama2 model using Skypilot Job

Additional Commands

Cost Reporting

  • Use this command to report the cost and duration of your model runs on different accelerators, which helps you identify the best resource for running the model:

    d3x sky cost-report
    

Benchmarking

  • Use this command to launch the benchmark clusters on different accelerators. Replace <benchmark_name> with the name you want to give your benchmark.

    • For the wine model finetuning example, use the following command:

      d3x sky bench launch -y -n wine wine-benchmark.yaml --benchmark <benchmark_name>
      
    • For the llama2 finetuning example, use the following command:

      Note

      For benchmark training to succeed, A10, V100, and T4 accelerators must be available in the cloud account you have configured.

      d3x sky bench launch -y -n llama2 llama2-benchmark.yaml --benchmark <benchmark_name>
      
  • Use this command to display the benchmark report:

    d3x sky bench show <benchmark_name>
    
  • Use this command to list the benchmark history:

    d3x sky bench ls
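
The launch and show steps above share the same benchmark name, so it can help to generate both commands from one place. The sketch below is illustrative only: `bench_commands` is a hypothetical helper that prints the commands from this section for review and does not run them.

```shell
# Hypothetical helper: print the benchmark launch and report commands for
# one example. Arguments: <job-name> <yaml-file> <benchmark-name>
bench_commands() {
  name=$1; yaml=$2; bench=$3
  printf 'd3x sky bench launch -y -n %s %s --benchmark %s\n' "$name" "$yaml" "$bench"
  printf 'd3x sky bench show %s\n' "$bench"
}

# Usage: review the commands first, then run them one at a time.
#   bench_commands wine wine-benchmark.yaml my-wine-bench
```

Printing instead of executing keeps the sketch safe to try, since launching benchmark clusters incurs cloud cost.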
    

Checkpointing

  • To recover from preemptions in Managed Spot jobs, the user application code can periodically checkpoint its progress to a cloud bucket mounted through SkyPilot Storage. The program can then reload the latest checkpoint when it is restarted.

    • For the wine model finetuning example, use the following command:

      d3x sky launch spot -y --env MLFLOW_TRACKING_TOKEN=$APIKEY -n wine wine.yaml
      
    • For the llama2 finetuning example, use the following command:

      d3x sky launch spot -y --env MLFLOW_TRACKING_TOKEN=$APIKEY -n llama2 llama2.yaml
      
  • Use this command to display the storage bucket where the artifacts are stored at every checkpoint:

    d3x sky storage ls
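
The checkpoint-and-reload pattern described at the start of this section can be sketched as follows. Everything here is an assumption for illustration: the `ckpt-<step>` file naming, the helper names, and the directory argument (which stands in for the path where the storage bucket is mounted) do not come from the example repository.

```shell
# Hypothetical helper: print the path of the newest checkpoint in a
# directory, or nothing if none exist. Assumes files named ckpt-<step>.
latest_checkpoint() {
  dir=$1
  [ -d "$dir" ] || return 0
  ls "$dir" | awk '/^ckpt-[0-9]+$/' | sort -t- -k2,2n | tail -n 1 |
    while read -r f; do printf '%s/%s\n' "$dir" "$f"; done
}

# Hypothetical helper: print the step to resume from (0 means start fresh).
resume_step() {
  ckpt=$(latest_checkpoint "$1")
  if [ -n "$ckpt" ]; then
    echo "${ckpt##*-}"   # step number encoded after the last "-"
  else
    echo 0               # no checkpoint yet
  fi
}
```

A training entrypoint could call `resume_step` on the mounted bucket path at startup, so that a job restarted after a preemption continues from the last saved step instead of from the beginning.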