Data Ingestion with SkyPilot¶
This tutorial guides you through launching a data ingestion job on DKubeX using SkyPilot resources and creating a dataset.
Prerequisites¶
Make sure SkyPilot is properly configured on your DKubeX setup. For details, visit Configuring SkyPilot on DKubeX.
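Before proceeding, you can optionally confirm that SkyPilot can reach at least one enabled cloud by running SkyPilot's built-in credential check. This assumes the sky CLI is available in your DKubeX terminal:

sky check

The command lists each cloud and whether valid credentials were found; at least one cloud must be enabled for a sky job to launch.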
SkyPilot accesses all the files needed to run a sky job from the /home/data/ directory on your workspace. Check whether this directory is present on your workspace by running the following command on the DKubeX terminal:

ls /home | grep -i data

If it shows data as output, the directory is present. Otherwise, run the following command to create the /home/data/ directory:

sudo mkdir /home/data
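Because the directory is created with sudo, it is owned by root, which is why the download commands later in this tutorial also use sudo. To confirm its ownership and permissions, you can run:

ls -ld /home/data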
You need to use an embedding model to convert the generated text chunks into vectors.
Embedding model from OpenAI
If you are using OpenAI’s embedding model, you need to have access to an OpenAI API key that is authorised for that embedding model.
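How the key is supplied to the ingestion job depends on your DKubeX configuration; a common convention, shown here as an assumption rather than a DKubeX requirement, is to export it as the OPENAI_API_KEY environment variable in your terminal session before launching the job:

export OPENAI_API_KEY="<your-openai-api-key>"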
Embedding model from Huggingface
If you are using a Huggingface embedding model present in the DKubeX embedding model catalog, note down the model path on Huggingface for the embedding model.
In this example, we are going to use the BAAI/bge-large-en-v1.5 embedding model. This model is present in the DKubeX embedding model catalog.
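If you want to verify the model path before referencing it in your configuration, you can check that the corresponding Hugging Face model page responds. This is an optional sanity check, not a DKubeX requirement:

curl -s -o /dev/null -w "%{http_code}\n" https://huggingface.co/BAAI/bge-large-en-v1.5

A 200 status code indicates the model path exists.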
Embedding model deployed locally or on cloud with SkyPilot
If you are using a local or Sky embedding model deployment, you need to deploy the model first on DKubeX. You can deploy an embedding model on DKubeX by following the guidelines provided here: Deploying Embedding Models on DKubeX.
You will need to decide which type of LlamaIndex reader to use to extract data from your documents, files, or data source. LlamaIndex offers a wide variety of readers to choose from depending on the data type or source you are using. For more information, visit Llama Hub - Data Loaders.
In this example, we are going to use the file reader from LlamaIndex. It extracts data from documents that are present locally on the workspace.

For the file reader, you need to provide the files to be ingested in a folder inside the /home/data directory. This example uses the ContractNLI dataset. To download and unzip the dataset, remove unnecessary files, and put the folder containing the documents in /home/data, run the following commands:

sudo wget https://stanfordnlp.github.io/contract-nli/resources/contract-nli.zip -P /home/data/
sudo unzip /home/data/contract-nli -d /home/data/
sudo rm -rf /home/data/contract-nli/dev.json /home/data/contract-nli/LICENSE /home/data/contract-nli/README.md /home/data/contract-nli/TERMS /home/data/contract-nli/test.json /home/data/contract-nli/train.json /home/data/contract-nli.zip
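After the commands complete, you can confirm that only the contract documents remain in the dataset folder. A quick sanity check:

ls /home/data/contract-nli | head -5
ls /home/data/contract-nli | wc -l

The first command lists a few of the remaining files, and the second reports how many documents the file reader will pick up.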
You have to put the configuration file to be used for the ingestion process in the /home/data/ directory by running the following command:

sudo wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.5.3/rag/ingestion/ingest.yaml -P /home/data/
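Before launching the ingestion job, it is worth reviewing the downloaded configuration and adjusting it to your setup, for example the reader section and the embedding model chosen above. The exact keys depend on the ingest.yaml schema shipped in the dkubex-examples repository, so inspect the file rather than assuming field names:

cat /home/data/ingest.yaml

You can edit the file in place with any terminal editor, for example sudo nano /home/data/ingest.yaml.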