Data Ingestion with SkyPilot

This tutorial will guide you through launching a data ingestion job on DKubeX using SkyPilot resources and creating a dataset.

Prerequisites

  • Make sure SkyPilot is properly configured on your DKubeX setup. For details, visit Configuring SkyPilot on DKubeX.

  • SkyPilot accesses all the files it needs to run a sky job from the /home/data/ directory in your workspace. Check whether this directory is present in your workspace by running the following command in the DKubeX terminal:

    ls /home | grep -i data
    

    If it shows data as output, the directory is present. Otherwise, run the following command to create the /home/data/ directory.

    sudo mkdir /home/data
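
    Alternatively, the check and creation can be combined into a single idempotent command (a minimal sketch using standard shell built-ins):

    # Create /home/data only if it does not already exist.
    test -d /home/data || sudo mkdir -p /home/data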
    
  • You need to use an embedding model to convert the generated text chunks into vectors.

    Embedding model from OpenAI

    If you are using OpenAI’s embedding model, you need to have access to an OpenAI API key that is authorised for that embedding model.
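
    Before launching the ingestion job, you can sanity-check that your key works against OpenAI's embeddings endpoint. This is a minimal sketch using the public OpenAI REST API; text-embedding-3-small is only a placeholder, so substitute the embedding model you intend to use:

    # Replace <your-api-key> with your actual OpenAI API key.
    export OPENAI_API_KEY=<your-api-key>
    # Request one embedding; a JSON response containing an "embedding" array
    # confirms the key is authorised for the model.
    curl -s https://api.openai.com/v1/embeddings \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "text-embedding-3-small", "input": "connectivity test"}'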

    Embedding model from Huggingface

    If you are using a Huggingface embedding model present in the DKubeX embedding model catalog, note down the embedding model's path on Huggingface.

    In this example, we are going to use the BAAI/bge-large-en-v1.5 embedding model. This model is present in the DKubeX embedding model catalog.
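
    To verify that you have the exact model path, you can query the public Hugging Face Hub API; this check is purely illustrative and is not part of the DKubeX workflow:

    # Succeeds (HTTP 200) only if the model repo exists on the Hub.
    curl -sf https://huggingface.co/api/models/BAAI/bge-large-en-v1.5 > /dev/null \
      && echo "model path OK" || echo "model path not found"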

    Embedding model deployed locally or on cloud with SkyPilot

    If you are using a local or Sky embedding model deployment, you need to deploy the model on DKubeX first. To do so, follow the guidelines provided here: Deploying Embedding Models on DKubeX.

  • You will need to decide which LlamaIndex reader to use to extract data from your documents, files, or data source. LlamaIndex offers a wide variety of readers to choose from depending on the data type or source you are using. For more information, visit Llama Hub - Data Loaders.

    • In this example, we are going to use the file reader from LlamaIndex, which extracts data from documents present locally on the workspace.

      • In the case of the file reader, you need to provide the files to be ingested in a folder inside the /home/data directory. This example uses the ContractNLI dataset. Run the following commands to download and unzip the dataset, remove unnecessary files, and place the folder containing the documents in /home/data.

        sudo wget https://stanfordnlp.github.io/contract-nli/resources/contract-nli.zip -P /home/data/
        sudo unzip /home/data/contract-nli.zip -d /home/data/
        sudo rm -rf /home/data/contract-nli/dev.json /home/data/contract-nli/LICENSE /home/data/contract-nli/README.md /home/data/contract-nli/TERMS /home/data/contract-nli/test.json /home/data/contract-nli/train.json /home/data/contract-nli.zip
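
        After the cleanup, /home/data/contract-nli should contain only the documents to be ingested. You can spot-check the folder as follows (the exact file count depends on the dataset version):

        # List a few of the extracted documents and count them.
        ls /home/data/contract-nli | head -n 5
        ls /home/data/contract-nli | wc -l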
        
  • Download the configuration file to be used for the ingestion process into the /home/data/ directory by running the following command.

    wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.5.3/rag/ingestion/ingest.yaml -P /home/data/
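
    Optionally, preview the downloaded configuration before editing it in the next section (the exact contents depend on the release tag):

    # Show the first lines of the ingestion configuration.
    head -n 20 /home/data/ingest.yaml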
    

Setting up Ingestion Configuration