Data Ingestion with SkyPilot¶
This tutorial walks you through launching a data ingestion job on DKubeX using SkyPilot resources and creating a dataset.
Prerequisites¶
Make sure SkyPilot is properly configured on your DKubeX setup. For details, visit Configuring SkyPilot on DKubeX.
SkyPilot accesses all the files needed to run a sky job from the /home/data/ directory on your workspace. Check whether this directory is present on your workspace by running the following command on the DKubeX terminal:

ls /home | grep -i data

If the output shows data, the directory is present. Otherwise, run the following command to create the /home/data/ directory:

sudo mkdir /home/data
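If you prefer a single step, the check and the directory creation above can be combined; this is just a convenience sketch of the same two commands:

[ -d /home/data ] || sudo mkdir /home/data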
You need to use an embedding model to convert the generated text chunks into vectors.
Embedding model from OpenAI
If you are using OpenAI’s embedding model, you need access to an OpenAI API key that is authorized for that embedding model.
Embedding model deployed locally
If you are using a local embedding model deployment, you need to deploy the model first on DKubeX. You can deploy an embedding model on DKubeX by following the guidelines provided here: Deploying Embedding Models on DKubeX.
Here we will deploy the BGE-Large embedding model, which is already pre-registered with DKubeX.
Attention
Running an ingestion job on SkyPilot using an embedding model that is itself deployed on SkyPilot is not supported in this release.
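Optionally, you can sanity-check that the local embedding deployment's serving endpoint responds before starting ingestion. This is only a sketch: it assumes the deployment exposes an OpenAI-compatible embeddings API (suggested by the /v1/ path used in ingest.yaml), and the endpoint, token, and model name below are placeholders for the values from your own deployment's details page:

curl -s "http://xxx.xxx.xxx.xxx:xxxxx/v1/embeddings" \
  -H "Authorization: Bearer <serving token>" \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-large-en-v1.5", "input": "hello world"}'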
You will need to decide which type of LlamaIndex reader you will use to extract data from your documents, files, or data source. LlamaIndex offers a wide variety of readers to choose from depending on the data type or source you are using. For more information, visit Llama Hub - Data Loaders.
In this example, we are going to use the file reader from LlamaIndex, which extracts data from documents that are present locally on the workspace.

When using the file reader, you need to place the files to be ingested in a folder inside the /home/data directory. This example uses the ContractNLI dataset. To download and unzip the dataset, remove unnecessary files, and put the folder containing the documents in /home/data, run the following commands:

sudo wget https://stanfordnlp.github.io/contract-nli/resources/contract-nli.zip -P /home/data/
sudo unzip /home/data/contract-nli -d /home/data/
sudo rm -rf /home/data/contract-nli/dev.json /home/data/contract-nli/LICENSE /home/data/contract-nli/README.md /home/data/contract-nli/TERMS /home/data/contract-nli/test.json /home/data/contract-nli/train.json /home/data/contract-nli.zip
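As an optional sanity check, you can confirm that the documents were extracted where the file reader expects them:

ls /home/data/contract-nli | head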
Put the configuration file to be used for the ingestion process in the /home/data/ directory by running the following command:

sudo wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.6.3/rag/ingestion/ingest.yaml -P /home/data/
Export the following environment variables by running the following commands on your terminal.
Replace the <your DKubeX URL> part with the URL of your setup and the <your DKubeX API key> part with your DKubeX API key.

Hint
Use the following steps to find your DKubeX API key:
Open the DKubeX UI and click on your username in the upper-right corner of the UI.
Click on the API Key option from the dropdown menu. A pop-up dialog box containing your DKubeX API key will open. Copy and note down this key.
export DKUBEX_URL="<your DKubeX URL>"
export DKUBEX_APIKEY="<your DKubeX API key>"
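To confirm both variables are set in your current shell, you can, for example, run:

env | grep DKUBEX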
Setting up Ingestion Configuration¶
Run the following command to open the configuration file in the vim editor:

vim /home/data/ingest.yaml

Provide the following details in the configuration file. Make sure to provide the absolute path of your dataset folder in the input_dir field of the file reader section, as shown in the example (for this example, it is /home/data/contract-nli). Once done, save and exit the file.

In the embedding section, select dkubex, as we are going to use the BAAI/bge-large-en-v1.5 embedding model deployment we created earlier.

In the reader section, select file, as we are going to use the file reader from LlamaIndex to read the documents for ingestion. For more information regarding the file reader, visit the LlamaIndex documentation.

Uncomment the entire dkubex section under Embedding Model Details. This is where the details of the embedding model to be used (bge-large) are provided. Provide the following details:

In the embedding_url field, provide the serving endpoint of the embedding deployment. You can find this by going to the Deployments page in the DKubeX UI and clicking on the deployed model name; the serving endpoint is shown on the model details page.

In the embedding_key field, provide the serving token for the deployed model. To find the serving token, go to the Deployments page in the DKubeX UI and click on the deployed model name; the serving token is shown on the model details page.

Make sure the file section under Data Reader Details is uncommented. In its input_dir field, provide the absolute path to your dataset folder, i.e. in this case, /home/data/contract-nli.
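Taken together, the edited portions of /home/data/ingest.yaml should look roughly like the sketch below; the endpoint URL and token are placeholders you must replace with the values from your own deployment's details page:

embedding: dkubex

dkubex:
  embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"   # serving endpoint from the model details page
  embedding_key: "eyJhbGc**************V5s3b0"        # serving token from the model details page
  batch_size: 10

reader:
  - file

file:
  inputs:
    loader_args:
      input_dir: /home/data/contract-nli
      recursive: true
      exclude_hidden: true
      raise_on_error: true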
Top-level configuration fields:

splitter: Used to split documents/text/data into chunks. Options: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter.
embedding: Type of embedding model deployment to be used. Options: sky (for embedding deployment with SkyPilot; not available in this release), dkubex (for local embedding deployment), openai (to use OpenAI’s embedding model).
metadata: Additional metadata to be added to the chunks. Options: default (default metadata), custom (custom metadata).
reader: LlamaIndex data reader to be used to extract data from the files. Options: file, scrapeddatareader, confluence, scrapyreader, sharepointreader.
adjacent_chunks: Enable or disable use of the previous and next chunks while storing the current chunk. Options: true, false.
mlflow -> experiment: Provide the MLflow experiment name.
Text splitter details:

sentence_text_splitter_LC: Details for the sentence_text_splitter_LC splitter.
sentence_text_splitter: Details for the sentence_text_splitter splitter.
token_text_splitter: Details for the token_text_splitter splitter.

Each splitter section takes the following subfields:

chunk_size: Size of the chunk.
chunk_overlap: Overlap between chunks.

Embedding model details:

sky: Details of the embedding model deployed with SkyPilot.
dkubex: Local embedding model deployment details.

Both the sky and dkubex sections take the following subfields:

embedding_url: Deployment service endpoint.
embedding_key: Deployment service token.
batch_size: Batch size.

openai: Details for the OpenAI embedding model, with the following subfields:

model: OpenAI embedding model name.
embedding_model: OpenAI embedding model name.
llmkey: OpenAI API key.
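For reference, if you were using OpenAI’s embedding model instead of the local deployment, the relevant sections of ingest.yaml would look roughly like this (the API key is a placeholder, and the model name is only an example taken from the template below):

embedding: openai

openai:
  model: "text-embedding-ada-002"
  embedding_model: "text-embedding-ada-002"
  llmkey: "sk-****************"   # your OpenAI API key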
# Choose your preferred text splitter. Once done, provide details in the appropriate section below.
splitter: sentence_text_splitter_LC   # OPTIONS: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter

# Choose embedding model type to be used in ingestion. Once done, provide details in the appropriate section below.
embedding: dkubex   # OPTIONS: dkubex, sky, huggingface, openai

# Uncomment 'custom' and comment out 'default' here if you want to provide additional metadata. Once done, provide details in the appropriate section below.
metadata:
  - default
  # - custom

# Uncomment the Llamaindex data reader that you want to be used to extract data from your files and comment the other ones. Once done, provide details in the appropriate section below.
reader:
  - file
  # - scrapeddatareader
  # - confluence
  # - scrapyreader
  # - sharepointreader

# Enable or disable use of previous and next chunks while storing current chunk
adjacent_chunks: true   # true/false

########################################################################################################################################
# -------------------- Provide appropriate details. Uncomment if your selected option's section is commented below. --------------------
########################################################################################################################################

#######################
# Text Splitter Details
#######################

sentence_text_splitter_LC:
  chunk_size: 256
  chunk_overlap: 0

# sentence_text_splitter:
#   chunk_size: 256
#   chunk_overlap: 0

# token_text_splitter:
#   chunk_size: 256
#   chunk_overlap: 0

################################
# Provide MLFlow Experiment Name
################################

mlflow:
  experiment: demo-ingestion

#########################
# Embedding Model Details
#########################

# sky:
#   embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"   # Provide service endpoint of Skypilot deployment
#   embedding_key: "eyJhbGc**************V5s3b0"        # Provide serving token of SkyPilot deployment
#   batch_size: 10

dkubex:
  embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"   # Provide serving url of local deployment in DKubeX
  embedding_key: "eyJhbGc**************V5s3b0"        # Provide service token of local deployment in DKubeX
  batch_size: 10

# huggingface:
#   model: "BAAI/bge-large-en-v1.5"   # Provide Huggingface embedding model name

# openai:
#   model: "text-embedding-ada-002"             # Provide OpenAI Embedding Model Name
#   embedding_model: "text-embedding-ada-002"   # Provide OpenAI Embedding Model Name
#   llmkey: "sk-TM6c9*****************EPnG"     # Provide OpenAI API key

#########################
# Custom Metadata Details
#########################

# custom:   # Provide absolute path of custom script to add additional metadata
#   adjacent_chunks: False   # True/False
#   extractor_path: <absolute path to .py file>

#####################
# Data Reader Details
#####################

file:
  inputs:
    loader_args:
      input_dir: /home/data/contract-nli   # Provide absolute path to the folder containing documents to be ingested
      recursive: true
      exclude_hidden: true
      raise_on_error: true

# scrapeddatareader:
#   inputs:
#     loader_args:
#       input_dir: <path to directory containing docs>   # Provide absolute path to the folder containing documents to be ingested. Make sure the folder has a mapping file named "url_file_name_map.json" for links of pdf files.
#       exclude_hidden: true
#       raise_on_error: true
#     data_args:
#       doc_source:
#       state_category:
#       designation_category:
#       topic_category:
#       num_workers:

# confluence:
#   inputs:
#     loader_args:
#       base_url: <confluence page URL>   # Provide Confluence URL
#       # api_token: ATATT3*********************3EA   # Provide API token for Confluence
#       user_name: demo@dkube.io   # Provide Confluence user ID
#       password: Abc@123          # Provide Confluence Password
#     data_args:
#       include_attachments: True   # 'True' if attachments in the Confluence site needs to be downloaded, else 'false'
#       space_key: llamaindex

# scrapyreader:   # ----------TODO- Discuss description--------------
#   inputs:
#     loader_args:
#       test1: ""
#     data_args:
#       spiders:
#         myspider:
#           - path: /home/configs/spiders/quotess.py
#             url:
#               - "https://example.com/page/1/"
#               - "https://example.com/page/2/"

# sharepointreader:
#   inputs:
#     loader_args:
#       client_id:
#       client_secret:
#       tenant_id:
#       sharepoint_site_id:
#       drive_id:
#       sharepoint_site_name: ""
#       sharepoint_folder_path: ""
#     data_args:
#       doc_source:
#       state_category:
#       designation_category:
#       topic_category:
Attention
The reader section in the ingest.yaml file denotes the type of dataloader to be used for the ingestion process. If you are going to use a data source other than the local directory data shown in this example, you need to provide the appropriate details for that type of dataloader. For more information about dataloaders, please visit How to use different Data Loaders (Data Readers).
You can use multiple types of data sources at once by providing the reader details for each of them under the reader section in the ingest.yaml file, as shown in the sketch after this note.
Some of the dataloaders require separate pyloader files. Make sure to provide them, if needed.
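As a hypothetical illustration of using multiple data sources at once, the reader list could enable more than one reader; each enabled reader also needs its own details section filled in under Data Reader Details:

reader:
  - file
  - confluence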
Running Ingestion on SkyPilot¶
Use the following command to trigger the ingestion process.
Syntax:

d3x dataset ingest -d <dataset name> --config <ingestion config absolute path> --faq --dkubex-apikey ${DKUBEX_APIKEY} --dkubex-url ${DKUBEX_URL} --remote-sky

Example:

d3x dataset ingest -d contracts --config /home/data/ingest.yaml --faq --dkubex-apikey ${DKUBEX_APIKEY} --dkubex-url ${DKUBEX_URL} --remote-sky
Command options:

-d, --dataset: Name of the dataset to be created.
-p, --pipeline: Ingestion pipeline to be used.
-c, --config: Absolute path of the ingestion configuration file.
-s, --remote-sky: Run the ingestion job on SkyPilot.
-r, --remote-ray: Run the ingestion job on a remote Ray cluster.
-rc, --ray-config: Absolute path of the configuration file (in JSON format) for running a remote Ray job.
-m, --remote-command: Sky/Ray job command to run.
--dkubex-url: Your DKubeX URL.
-k, --dkubex-apikey: Your DKubeX API key. You can get this key by running d3x apikey get on your DKubeX terminal, or by navigating to Account -> Settings -> Developer on the DKubeX UI and copying the key shown in API Credentials.
-w, --num-workers: Number of processes to use for parallelization.
--type: Select a particular node on the cluster on which the job will run. Make sure to provide the node label added for that particular node during cluster installation.
--faq: Enable cache for the dataset.
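Combining a few of the options above, a hypothetical invocation that also sets the number of worker processes might look like:

d3x dataset ingest -d contracts --config /home/data/ingest.yaml --faq -w 4 --dkubex-apikey ${DKUBEX_APIKEY} --dkubex-url ${DKUBEX_URL} --remote-sky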
Note
The time taken for the ingestion process to complete depends on the size of the dataset. Please wait patiently for the process to complete.
If the terminal shows a timed-out error, the ingestion is still in progress; run the command shown on the CLI after the error message to continue streaming the ingestion logs.
The record of the ingestion and related artifacts is also stored in the MLflow application in the DKubeX UI.
To check whether the dataset has been created, stored, and is ready to use, run the following command:
d3x dataset list
To check the list of documents that have been ingested in the dataset, use the following command:
d3x dataset show -d <dataset name>

For example:

d3x dataset show -d contracts
You can also check the dataset details in the DKubeX UI by navigating to the Datasets section. To see the details of the dataset, click on the dataset name.