Data Ingestion with SkyPilot

This tutorial walks you through launching a data ingestion job on DKubeX using SkyPilot resources and creating a dataset.

Prerequisites

  • Make sure SkyPilot is properly configured on your DKubeX setup. For details, visit Configuring SkyPilot on DKubeX.

  • SkyPilot reads all files needed to run a sky job from the /home/data/ directory in your workspace. Check whether this directory exists in your workspace by running the following command in the DKubeX terminal:

    ls /home | grep -i data
    

    If the output shows data, the directory is present. Otherwise, run the following command to create the /home/data/ directory.

    sudo mkdir /home/data
    
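    If you prefer a single idempotent check-and-create step, the following sketch (assuming a standard bash shell) is equivalent to the two commands above:

    # Create /home/data only if it does not already exist
    [ -d /home/data ] || sudo mkdir -p /home/data
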
  • You need to use an embedding model to convert the generated text chunks into vectors.

    Embedding model from OpenAI

    If you are using OpenAI’s embedding model, you need to have access to an OpenAI API key that is authorised for that embedding model.

    Embedding model deployed locally

    If you are using a local embedding model deployment, you need to deploy the model first on DKubeX. You can deploy an embedding model on DKubeX by following the guidelines provided here: Deploying Embedding Models on DKubeX.

    Here we will deploy the BGE-Large embedding model, which is already pre-registered with DKubeX.

    Attention

    Running an ingestion job on SkyPilot using an embedding model that is itself deployed on SkyPilot is not supported in this release.

  • You will need to decide which type of LlamaIndex reader to use to extract data from the documents/files/data source. LlamaIndex offers a wide variety of readers to choose from depending on the data type or source you are using. For more information, visit Llama Hub - Data Loaders.

    • In this example, we are going to use the file reader from LlamaIndex. It extracts data from documents that are present locally on the workspace.

      • For the file reader, you need to place the files to be ingested in a folder inside the /home/data directory. This example uses the ContractNLI dataset. To download and unzip the dataset, remove unnecessary files, and put the folder containing the documents in /home/data, run the following commands.

        sudo wget https://stanfordnlp.github.io/contract-nli/resources/contract-nli.zip -P /home/data/
        sudo unzip /home/data/contract-nli.zip -d /home/data/
        sudo rm -rf /home/data/contract-nli/dev.json /home/data/contract-nli/LICENSE /home/data/contract-nli/README.md /home/data/contract-nli/TERMS /home/data/contract-nli/test.json /home/data/contract-nli/train.json /home/data/contract-nli.zip
        
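        Optionally, you can verify that the documents are in place before moving on; a quick sanity check (the exact output will vary with the dataset contents):

        # List the extracted ContractNLI folder to confirm the documents are present
        ls /home/data/contract-nli | head
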
  • Place the configuration file to be used for the ingestion process in the /home/data/ directory by running the following command.

    sudo wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.6.3/rag/ingestion/ingest.yaml -P /home/data/
    
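    Optionally, you can peek at the downloaded template to get familiar with the sections you will edit in the next step:

    # Show the first part of the ingestion config template
    head -n 40 /home/data/ingest.yaml
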
  • Export the following environment variables by running the following commands on your terminal.

    • Replace the <your DKubeX URL> part with the URL of your setup and the <your DKubeX API key> part with your DKubeX API key.

      Hint

      Use the following steps to find your DKubeX API key:

      • Open the DKubeX UI and click on your username in the upper-right corner of the UI.

      • Click on the API Key option from the dropdown menu. A pop-up dialog box containing your DKubeX API key will open. Copy and note down this key.

      export DKUBEX_URL="<your DKubeX URL>"
      export DKUBEX_APIKEY="<your DKubeX API key>"
      
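      To confirm the variables are set in your current shell, a quick check (the variable names are the ones exported above):

      # Both commands should print non-empty values; they fail with an error if the variable is unset
      echo "${DKUBEX_URL:?DKUBEX_URL is not set}"
      echo "${DKUBEX_APIKEY:?DKUBEX_APIKEY is not set}"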

Setting up Ingestion Configuration

  • Run vim /home/data/ingest.yaml to open the configuration file in the vim editor.

  • Provide the following details in the configuration file, making sure to give the absolute path of your dataset folder in the input_dir field of the file reader section (for this example, /home/data/contract-nli). A sketch of the edited sections is provided after the reference tables below. Once done, save and exit the file.

    • In the embedding section, select dkubex, as we are going to use the BAAI/bge-large-en-v1.5 embedding model deployment created earlier.

    • In the reader section, select file, as we are going to use the file reader from LlamaIndex to read the documents for ingestion. For more information about the file reader, visit the LlamaIndex documentation.

    • Uncomment the entire dkubex section under Embedding Model Details. This is where the details of the embedding model to be used (bge-large) are provided. Fill in the following details:

      • In the embedding_url field, provide the serving endpoint of the embedding deployment. You can find this by going to the Deployments page in DKubeX UI and clicking on the deployed model name. The serving endpoint will be available on the model details page.

      • In the embedding_key field, provide the serving token for the deployed model to be used. To find the serving token, go to the Deployments page in DKubeX UI and click on the deployed model name. The serving token will be available on the model details page.

    • Make sure the file section under Data Reader Details is uncommented. In its input_dir field, provide the absolute path to your dataset folder, which in this case is /home/data/contract-nli.

    Configuration Options

    • splitter: Used to split documents/text/data into chunks. Options: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter.

    • embedding: Type of embedding model deployment to be used. Options: sky (embedding deployment with SkyPilot; not available in this release), dkubex (local embedding deployment), openai (OpenAI's embedding model).

    • metadata: Additional metadata to be added to the chunks. Options: default (default metadata), custom (custom metadata).

    • reader: LlamaIndex data reader to be used to extract data from the files. Options: file, scrapeddatareader, confluence, scrapyreader, sharepointreader.

    • adjacent_chunks: Enable or disable use of the previous and next chunks while storing the current chunk. Options: true, false.

    • mlflow -> experiment: Provide the MLflow experiment name.

    Text Splitter Details (uncomment the section matching the splitter chosen in the splitter field)

    • sentence_text_splitter_LC: Details for the sentence_text_splitter_LC splitter.

    • sentence_text_splitter: Details for the sentence_text_splitter splitter.

    • token_text_splitter: Details for the token_text_splitter splitter.

    Subfields within these sections:

    • chunk_size: Size of the chunk.

    • chunk_overlap: Overlap between chunks.

    Embedding Model Details (uncomment the section matching the embedding model chosen in the embedding field)

    • sky: Details of an embedding model deployed with SkyPilot.

      • embedding_url: Deployment service endpoint.

      • embedding_key: Deployment service token.

    • dkubex: Local embedding model deployment details (the embedding_url and embedding_key fields described above also go here).

      • batch_size: Batch size.

    • openai: Details for OpenAI's embedding model.

      • model: OpenAI embedding model name.

      • embedding_model: OpenAI embedding model name.

      • llmkey: OpenAI API key.
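
    Putting the choices above together, the following is a minimal, illustrative sketch of how the edited sections of /home/data/ingest.yaml might look for this example. Treat it only as an orientation aid: the field names are taken from the tables above, but the exact layout and nesting follow the downloaded template, and the placeholder values (endpoint, token, batch size, experiment name) must be replaced with your own.

      # Top-level choices
      splitter: sentence_text_splitter
      embedding: dkubex                  # use the local bge-large deployment
      metadata: default
      reader: file
      adjacent_chunks: true
      mlflow:
        experiment: contract-nli-ingest  # any MLflow experiment name of your choice

      # Embedding Model Details: dkubex section (uncommented)
      dkubex:
        embedding_url: <serving endpoint of the bge-large deployment>
        embedding_key: <serving token of the bge-large deployment>
        batch_size: 32                   # placeholder; keep the template's default if unsure

      # Data Reader Details: file section (uncommented)
      file:
        input_dir: /home/data/contract-nli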

Attention

  • The reader section in the ingest.yaml file denotes the type of dataloader to be used for the ingestion process. If you are going to use a data source other than the local directory shown in this example, you need to provide the appropriate details for that type of dataloader.

    • For more information about dataloaders please visit How to use different Data Loaders (Data Readers).

    • You can use multiple types of data sources by providing the corresponding reader details together under the reader section in the ingest.yaml file.

  • Some dataloaders require separate pyloader files. Make sure to provide them if needed.

Running Ingestion on SkyPilot

  • Use the following command to trigger the ingestion process.

    d3x dataset ingest -d <dataset name> --config <ingestion config absolute path> --faq --dkubex-apikey ${DKUBEX_APIKEY} --dkubex-url ${DKUBEX_URL} --remote-sky
    
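    For example, with the dataset name and config path used in this tutorial (the dataset name contract-nli is only an illustration; choose any name you like):

    d3x dataset ingest -d contract-nli --config /home/data/ingest.yaml --faq --dkubex-apikey ${DKUBEX_APIKEY} --dkubex-url ${DKUBEX_URL} --remote-sky
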

    Note

    • The time taken for the ingestion process to complete depends on the size of the dataset. Please wait patiently for the process to complete.

    • If the terminal shows a timed-out error, the ingestion is still in progress; run the command shown on the CLI after the error message to continue following the ingestion logs.

    • The record of the ingestion and related artifacts are also stored in the MLflow application in the DKubeX UI.

  • To check whether the dataset has been created, stored, and is ready to use, use the following command:

    d3x dataset list
    
  • To check the list of documents that have been ingested into the dataset, use the following command:

    d3x dataset show -d <dataset name>
    
  • You can also check the dataset details in the DKubeX UI by navigating to the Datasets section. To see the details of the dataset, click on the dataset name.