Data Ingestion with SkyPilot¶
This tutorial guides you through launching a data ingestion job on DKubeX using SkyPilot resources and creating a dataset.
Prerequisites¶
Make sure SkyPilot is properly configured and set up on your DKubeX setup. For details, visit Configuring SkyPilot on DKubeX.
Export the following variables by running the following commands on your DKubeX terminal. Replace the <username> part with your DKubeX username and <access token> with your Huggingface token.

export HOMEDIR=/home/<username>
export HF_TOKEN=<access token>
You need to use an embedding model to convert the generated text chunks into vectors.
Embedding model from OpenAI
If you are using OpenAI’s embedding model, you need to have access to an OpenAI API key that is authorised for that embedding model.
Embedding model deployed locally
You can deploy an embedding model on DKubeX by following the guidelines provided here: Deploying Embedding Models on DKubeX.
Embedding model deployed with SkyPilot
You can deploy an embedding model with SkyPilot by following the guidelines provided here: Deploying Embedding Models with SkyPilot.
In this example, we will deploy the BGE-Large embedding model locally. To deploy the model, use the following command.
d3x emb deploy --name=bge-large --model=BAAI--bge-large-en-v1-5 --token ${HF_TOKEN} --kserve
You will need to decide which type of LlamaIndex reader to use to extract data from your documents/files/data source. LlamaIndex offers a wide variety of readers to choose from depending on the data type or source you are using. For more information, visit Llama Hub - Data Loaders.
In this example, we are going to use the file reader from LlamaIndex. It extracts data from documents that are present locally in a folder on your workspace. This example uses the ContractNLI dataset. To download and extract the dataset, run the following commands.

wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.7.1/rag/sample-datasets/contract-nli.zip -P ${HOMEDIR}/
unzip ${HOMEDIR}/contract-nli.zip && rm -rf ${HOMEDIR}/contract-nli.zip
Download the configuration file to be used for the ingestion process to your workspace by running the following command. A short sketch for verifying these prerequisites follows.
wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.7.1/rag/ingestion/ingest.yaml -P ${HOMEDIR}/
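Before moving on, you may want to confirm that the exported variables, the extracted dataset, and the downloaded configuration file are all in place. The following is a minimal sanity-check sketch using standard shell utilities; it assumes the ContractNLI archive was extracted into ${HOMEDIR} and that ingest.yaml was downloaded there as shown above.

# Verify the exported variables (sketch; HF_TOKEN is only checked for presence, not printed)
echo "HOMEDIR=${HOMEDIR}"
test -n "${HF_TOKEN}" && echo "HF_TOKEN is set" || echo "HF_TOKEN is NOT set"
# Verify the extracted dataset and the downloaded ingestion configuration
ls ${HOMEDIR}/contract-nli | head
ls -l ${HOMEDIR}/ingest.yaml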
Setting up Ingestion Configuration¶
Run vim ${HOMEDIR}/ingest.yaml to open the configuration file in the vim editor. Provide the following details in the configuration file. Once done, save and exit the file.
In the embedding section, select dkubex, as we are going to use the BAAI/bge-large-en-v1.5 embedding model deployment we created earlier.

In the reader section, select file, as we are going to use the file reader from LlamaIndex to read the documents for ingestion. For more information regarding the file reader, visit the LlamaIndex documentation.

Uncomment the entire dkubex section under Embedding Model Details. Here the details of the embedding model to be used (bge-large) are provided. Provide the following details:

In the embedding_url field, provide the serving endpoint of the embedding deployment. You can find this by going to the Deployments page in the DKubeX UI and clicking on the deployed model name. The serving endpoint is available on the model details page.

In the embedding_key field, provide the serving token for the deployed model. To find the serving token, go to the Deployments page in the DKubeX UI and click on the deployed model name. The serving token is available on the model details page.

Make sure the file section under Data Reader Details is uncommented. In the input_dir field of this section, provide the absolute path to your dataset folder, i.e. in the case of this example, /home/<username>/contract-nli/. Provide your DKubeX username in place of <username>. A scripted way to fill in these values is sketched after this list.
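If you prefer to script these edits rather than make them by hand in vim, the placeholders can be filled in with sed. The snippet below is only an illustrative sketch: EMB_URL and EMB_KEY are hypothetical shell variables holding the serving endpoint and serving token copied from the Deployments page, and the patterns assume the dkubex and file sections are uncommented with the indentation shown in the ingest.yaml listing below.

# Hypothetical values copied from the model details page in the DKubeX UI
EMB_URL="http://<serving endpoint>/v1/"
EMB_KEY="<serving token>"
# Point the uncommented dkubex section at the bge-large deployment
sed -i "s|^  embedding_url:.*|  embedding_url: \"${EMB_URL}\"|" ${HOMEDIR}/ingest.yaml
sed -i "s|^  embedding_key:.*|  embedding_key: \"${EMB_KEY}\"|" ${HOMEDIR}/ingest.yaml
# Point the file reader at the extracted ContractNLI folder
sed -i "s|input_dir: /home/<username>/contract-nli|input_dir: ${HOMEDIR}/contract-nli|" ${HOMEDIR}/ingest.yaml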
Configuration Options¶

splitter
  Used to split documents/text/data into chunks. Options: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter.

embedding
  Type of embedding model deployment to be used. Options: sky (for embedding deployment with SkyPilot; not available in this release), dkubex (for local embedding deployment), openai (to use OpenAI's embedding model).

metadata
  Additional metadata to be added to the chunks. Options: default (default metadata), custom (custom metadata).

reader
  LlamaIndex data reader to be used to extract data from the files. Options: file, scrapeddatareader, confluence, scrapyreader, sharepointreader.

adjacent_chunks
  Enable or disable use of previous and next chunks while storing the current chunk. Options: true, false.

mlflow -> experiment
  Provide the MLflow experiment name.
Text Splitter Details (Uncomment the section based on the splitter chosen in the splitter field)¶

sentence_text_splitter_LC
  Details for the sentence_text_splitter_LC splitter. Subfields: chunk_size (size of the chunk), chunk_overlap (overlap between chunks).

sentence_text_splitter
  Details for the sentence_text_splitter splitter. Subfields: chunk_size, chunk_overlap.

token_text_splitter
  Details for the token_text_splitter splitter. Subfields: chunk_size, chunk_overlap.

Embedding Model Details (Uncomment the section based on the embedding model chosen in the embedding field)¶

sky
  Details of the embedding model deployed with SkyPilot. Subfields: embedding_url (deployment service endpoint), embedding_key (deployment service token), batch_size (batch size).

dkubex
  Details of the local embedding model deployment. Subfields: embedding_url (deployment service endpoint), embedding_key (deployment service token), batch_size (batch size).

openai
  Details for the OpenAI embedding model. Subfields: model (OpenAI embedding model name), embedding_model (OpenAI embedding model name), llmkey (OpenAI API key).
ingest.yaml¶

# Choose your preferred text splitter. Once done, provide details in the appropriate section below.
splitter: sentence_text_splitter_LC   # OPTIONS: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter

# Choose embedding model type to be used in ingestion. Once done, provide details in the appropriate section below.
embedding: dkubex                     # OPTIONS: dkubex, sky, huggingface, openai

# Uncomment 'custom' and comment out 'default' here if you want to provide additional metadata. Once done, provide details in the appropriate section below.
metadata:
  - default
  # - custom

# Uncomment the Llamaindex data reader that you want to be used to extract data from your files and comment the other ones. Once done, provide details in the appropriate section below.
reader:
  - file
  # - scrapeddatareader
  # - confluence
  # - scrapyreader
  # - sharepointreader

# Enable or disable use of previous and next chunks while storing current chunk
adjacent_chunks: true                 # true/false

########################################################################################################################################
# -------------------- Provide appropriate details. Uncomment if your selected option's section is commented below. --------------------
########################################################################################################################################

#######################
# Text Splitter Details
#######################
sentence_text_splitter_LC:
  chunk_size: 256
  chunk_overlap: 0

# sentence_text_splitter:
#   chunk_size: 256
#   chunk_overlap: 0

# token_text_splitter:
#   chunk_size: 256
#   chunk_overlap: 0

################################
# Provide MLFlow Experiment Name
################################
mlflow:
  experiment: demo-ingestion

#########################
# Embedding Model Details
#########################
# sky:
#   embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"   # Provide service endpoint of Skypilot deployment
#   embedding_key: "eyJhbGc**************V5s3b0"        # Provide serving token of SkyPilot deployment
#   batch_size: 10

dkubex:
  embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"     # Provide serving url of local deployment in DKubeX
  embedding_key: "eyJhbGc**************V5s3b0"          # Provide service token of local deployment in DKubeX
  batch_size: 10

# huggingface:
#   model: "BAAI/bge-large-en-v1.5"                     # Provide Huggingface embedding model name

# openai:
#   model: "text-embedding-ada-002"                     # Provide OpenAI Embedding Model Name
#   embedding_model: "text-embedding-ada-002"           # Provide OpenAI Embedding Model Name
#   llmkey: "sk-TM6c9*****************EPnG"             # Provide OpenAI API key

#########################
# Custom Metadata Details
#########################
# custom:                                               # Provide absolute path of custom script to add additional metadata
#   adjacent_chunks: False                              # True/False
#   extractor_path: <absolute path to .py file>

#####################
# Data Reader Details
#####################
file:
  inputs:
    loader_args:
      input_dir: /home/<username>/contract-nli          # Provide absolute path to the folder containing documents to be ingested
      recursive: true
      exclude_hidden: true
      raise_on_error: true

# scrapeddatareader:
#   inputs:
#     loader_args:
#       input_dir: <path to directory containing docs>  # Provide absolute path to the folder containing documents to be ingested. Make sure the folder has a mapping file named "url_file_name_map.json" for links of pdf files.
#       exclude_hidden: true
#       raise_on_error: true
#     data_args:
#       doc_source:
#       state_category:
#       designation_category:
#       topic_category:
#       num_workers:

# confluence:
#   inputs:
#     loader_args:
#       base_url: <confluence page URL>                 # Provide Confluence URL
#       # api_token: ATATT3*********************3EA     # Provide API token for Confluence
#       user_name: demo@dkube.io                        # Provide Confluence user ID
#       password: Abc@123                               # Provide Confluence Password
#     data_args:
#       include_attachments: True                       # 'True' if attachments in the Confluence site needs to be downloaded, else 'false'
#       space_key: llamaindex

# scrapyreader:
# ----------TODO- Discuss description--------------
#   inputs:
#     loader_args:
#       test1: ""
#     data_args:
#       spiders:
#         myspider:
#           - path: /home/configs/spiders/quotess.py
#             url:
#               - "https://example.com/page/1/"
#               - "https://example.com/page/2/"

# sharepointreader:
#   inputs:
#     loader_args:
#       client_id:
#       client_secret:
#       tenant_id:
#       sharepoint_site_id:
#       drive_id:
#       sharepoint_site_name: ""
#       sharepoint_folder_path: ""
#     data_args:
#       doc_source:
#       state_category:
#       designation_category:
#       topic_category:
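Indentation mistakes are easy to introduce while editing this file, so it can be worth confirming that the edited configuration is still valid YAML before launching the job. The following optional one-liner is a sketch that assumes python3 with the PyYAML package is available in the DKubeX terminal.

# Parse the config and print its top-level keys; a traceback here indicates a YAML syntax error
python3 -c "import yaml; cfg = yaml.safe_load(open('${HOMEDIR}/ingest.yaml')); print(sorted(cfg))"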
Attention
The reader section in the ingest.yaml file denotes the type of data loader to be used for the ingestion process. If you are going to use any data source other than the local directory shown in this example, you need to provide the appropriate details for that type of data loader. For more information about data loaders, please visit How to use different Data Loaders (Data Readers).

You can use multiple types of data sources by providing the corresponding reader details simultaneously under the reader section in the ingest.yaml file.

Some of the data loaders require separate pyloader files. Make sure to provide them, if needed.
Running Ingestion on SkyPilot¶
Use the following command to trigger the ingestion process. Replace the <username> part with your DKubeX username.

Format¶
d3x dataset ingest -d <dataset name> --config <ingestion config absolute path> --faq -s

Example¶
d3x dataset ingest -d contracts --config /home/<username>/ingest.yaml --faq -s

d3x dataset ingest <options>¶

-d, --dataset
  Name of the dataset to be created.

-p, --pipeline
  Ingestion pipeline to be used.

-c, --config
  Absolute path of the ingestion configuration file.

-s, --remote-sky
  To run the ingestion job on SkyPilot.

-r, --remote-ray
  To run the ingestion job on a remote Ray cluster.

-rc, --ray-config
  Absolute path of the configuration file for running a remote Ray job, in JSON format.

-m, --remote-command
  Sky/Ray job command to run.

--dkubex-url
  Your DKubeX URL.

-k, --dkubex-apikey
  Your DKubeX API key. (You can get this key by running d3x apikey get on your DKubeX terminal, or by navigating to Account -> Settings -> Developer on the DKubeX UI and copying the key shown in API Credentials.)

-w, --num-workers
  Number of processes to use for parallelization.

--type
  To select a particular node on the cluster on which the job will run. (Make sure to provide the node label added for that particular node in the cluster during installation.)

--faq
  To enable cache for the dataset.
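For reference, the options above can be combined in a single invocation. The following is only an illustrative sketch: the worker count and the <node label> placeholder are values you would adjust for your own cluster, and the node label is only needed if you want to pin the job to a specific node.

# Run the SkyPilot ingestion with 4 worker processes (sketch)
d3x dataset ingest -d contracts --config ${HOMEDIR}/ingest.yaml --faq -s -w 4
# Optionally pin the job to a labelled node (replace <node label> with the label added during installation)
d3x dataset ingest -d contracts --config ${HOMEDIR}/ingest.yaml --faq -s -w 4 --type <node label>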
Note
The time taken for the ingestion process to complete depends on the size of the dataset. Please wait patiently for the process to complete.
If the terminal shows a timed-out error, the ingestion is still in progress; run the command shown on the CLI after the error message to continue streaming the ingestion logs.
A record of the ingestion and its related artifacts is also stored in the MLflow application on the DKubeX UI.
To check whether the dataset has been created, stored, and is ready to use, run the following command:

d3x dataset list

To check the list of documents that have been ingested into the dataset, use the following command:

d3x dataset show -d <dataset name>

d3x dataset show -d contracts

You can also check the dataset details in the DKubeX UI by navigating to the Datasets section. To see the details of the dataset, click on the dataset name.
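If many datasets already exist on the cluster, the listing can be filtered for the one created in this example. This is a small convenience sketch that assumes the standard grep utility is available and that the dataset was named contracts as above.

# Filter the dataset listing for the dataset created in this tutorial
d3x dataset list | grep contracts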