Data Ingestion with SkyPilot¶
This tutorial guides you through launching a data ingestion job on DKubeX using SkyPilot resources and creating a dataset.
Prerequisites¶
Make sure SkyPilot is properly configured and set up on your DKubeX setup. For details, visit Configuring SkyPilot on DKubeX.
Export the following variables by running the following commands on your DKubeX terminal. Replace the <username> part with your DKubeX username and <access token> with your Huggingface token.

export HOMEDIR=/home/<username>
export HF_TOKEN=<access token>
You need to use an embedding model to convert the generated text chunks into vectors.
Embedding model from OpenAI
If you are using OpenAI’s embedding model, you need to have access to an OpenAI API key that is authorised for that embedding model.
Embedding model deployed locally
You can deploy an embedding model on DKubeX by following the guidelines provided here: Deploying Embedding Models on DKubeX.
Embedding model deployed with SkyPilot
You can deploy an embedding model with SkyPilot by following the guidelines provided here: Deploying Embedding Models with SkyPilot.
In this example, we will deploy the BGE-Large embedding model locally. To deploy the model, use the following command.
d3x emb deploy --name=bge-large --model=BAAI--bge-large-en-v1-5 --token ${HF_TOKEN} --kserve
You will need to decide which type of LlamaIndex reader to use to extract data from your documents/files/data source. LlamaIndex offers a wide variety of readers to choose from depending on the data type or source you are using. For more information, visit Llama Hub - Data Loaders.
In this example, we are going to use the file reader from LlamaIndex. It extracts data from documents that are present locally in a folder on your workspace. This example uses the ContractNLI dataset. To download and extract the dataset, run the following commands.

wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.7.1/rag/sample-datasets/contract-nli.zip -P ${HOMEDIR}/
unzip ${HOMEDIR}/contract-nli.zip && rm -rf ${HOMEDIR}/contract-nli.zip
Download the configuration file to be used for the ingestion process to your workspace by running the following command. A short sketch for verifying these prerequisites follows.
wget https://raw.githubusercontent.com/dkubeio/dkubex-examples/refs/tags/v0.8.7.1/rag/ingestion/ingest.yaml -P ${HOMEDIR}/
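Before moving on, you may want to confirm that the exported variables, the extracted dataset, and the downloaded configuration file are all in place. The following is a minimal sanity-check sketch using standard shell utilities; it assumes the ContractNLI archive was extracted into ${HOMEDIR} and that ingest.yaml was downloaded there as shown above.

# Verify the exported variables (sketch; HF_TOKEN is only checked for presence, not printed)
echo "HOMEDIR=${HOMEDIR}"
test -n "${HF_TOKEN}" && echo "HF_TOKEN is set" || echo "HF_TOKEN is NOT set"
# Verify the extracted dataset and the downloaded ingestion configuration
ls ${HOMEDIR}/contract-nli | head
ls -l ${HOMEDIR}/ingest.yaml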
Setting up Ingestion Configuration¶
Run vim ${HOMEDIR}/ingest.yaml to open the configuration file in the vim editor. Provide the following details in the configuration file. Once done, save and exit the file.
In the embedding section, select dkubex, as we are going to use the BAAI/bge-large-en-v1.5 embedding model deployment we created earlier.

In the reader section, select file, as we are going to use the file reader from LlamaIndex to read the documents for ingestion. For more information regarding the file reader, visit the LlamaIndex documentation.

Uncomment the entire dkubex section under Embedding Model Details. Here the details of the embedding model to be used (bge-large) are provided. Provide the following details:

In the embedding_url field, provide the serving endpoint of the embedding deployment. You can find this by going to the Deployments page in the DKubeX UI and clicking on the deployed model name. The serving endpoint is available on the model details page.

In the embedding_key field, provide the serving token for the deployed model. To find the serving token, go to the Deployments page in the DKubeX UI and click on the deployed model name. The serving token is available on the model details page.

Make sure the file section under Data Reader Details is uncommented. In the input_dir field of this section, provide the absolute path to your dataset folder, i.e. in the case of this example, /home/<username>/contract-nli/. Provide your DKubeX username in place of <username>. A scripted way to fill in these values is sketched after this list.
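If you prefer to script these edits rather than make them by hand in vim, the placeholders can be filled in with sed. The snippet below is only an illustrative sketch: EMB_URL and EMB_KEY are hypothetical shell variables holding the serving endpoint and serving token copied from the Deployments page, and the patterns assume the dkubex and file sections are uncommented with the indentation shown in the ingest.yaml listing below.

# Hypothetical values copied from the model details page in the DKubeX UI
EMB_URL="http://<serving endpoint>/v1/"
EMB_KEY="<serving token>"
# Point the uncommented dkubex section at the bge-large deployment
sed -i "s|^  embedding_url:.*|  embedding_url: \"${EMB_URL}\"|" ${HOMEDIR}/ingest.yaml
sed -i "s|^  embedding_key:.*|  embedding_key: \"${EMB_KEY}\"|" ${HOMEDIR}/ingest.yaml
# Point the file reader at the extracted ContractNLI folder
sed -i "s|input_dir: /home/<username>/contract-nli|input_dir: ${HOMEDIR}/contract-nli|" ${HOMEDIR}/ingest.yaml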
Configuration Options¶

splitter
  Used to split documents/text/data into chunks. Options: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter.

embedding
  Type of embedding model deployment to be used. Options: sky (for embedding deployment with SkyPilot; not available in this release), dkubex (for local embedding deployment), openai (to use OpenAI's embedding model).

metadata
  Additional metadata to be added to the chunks. Options: default (default metadata), custom (custom metadata).

reader
  LlamaIndex data reader to be used to extract data from the files. Options: file, scrapeddatareader, confluence, scrapyreader, sharepointreader.

adjacent_chunks
  Enable or disable use of previous and next chunks while storing the current chunk. Options: true, false.

mlflow -> experiment
  Provide the MLflow experiment name.
Text Splitter Details (Uncomment the section based on the splitter chosen in the splitter field)¶

sentence_text_splitter_LC
  Details for the sentence_text_splitter_LC splitter. Subfields: chunk_size (size of the chunk), chunk_overlap (overlap between chunks).

sentence_text_splitter
  Details for the sentence_text_splitter splitter. Subfields: chunk_size, chunk_overlap.

token_text_splitter
  Details for the token_text_splitter splitter. Subfields: chunk_size, chunk_overlap.

Embedding Model Details (Uncomment the section based on the embedding model chosen in the embedding field)¶

sky
  Details of the embedding model deployed with SkyPilot. Subfields: embedding_url (deployment service endpoint), embedding_key (deployment service token), batch_size (batch size).

dkubex
  Details of the local embedding model deployment. Subfields: embedding_url (deployment service endpoint), embedding_key (deployment service token), batch_size (batch size).

openai
  Details for the OpenAI embedding model. Subfields: model (OpenAI embedding model name), embedding_model (OpenAI embedding model name), llmkey (OpenAI API key).
ingest.yaml¶

# Choose your preferred text splitter. Once done, provide details in the appropriate section below.
splitter: sentence_text_splitter_LC   # OPTIONS: sentence_text_splitter_LC, sentence_text_splitter, token_text_splitter

# Choose embedding model type to be used in ingestion. Once done, provide details in the appropriate section below.
embedding: dkubex                     # OPTIONS: dkubex, sky, huggingface, openai

# Uncomment 'custom' and comment out 'default' here if you want to provide additional metadata. Once done, provide details in the appropriate section below.
metadata:
  - default
  # - custom

# Uncomment the Llamaindex data reader that you want to be used to extract data from your files and comment the other ones. Once done, provide details in the appropriate section below.
reader:
  - file
  # - scrapeddatareader
  # - confluence
  # - scrapyreader
  # - sharepointreader

# Enable or disable use of previous and next chunks while storing current chunk
adjacent_chunks: true                 # true/false

########################################################################################################################################
# -------------------- Provide appropriate details. Uncomment if your selected option's section is commented below. --------------------
########################################################################################################################################

#######################
# Text Splitter Details
#######################
sentence_text_splitter_LC:
  chunk_size: 256
  chunk_overlap: 0

# sentence_text_splitter:
#   chunk_size: 256
#   chunk_overlap: 0

# token_text_splitter:
#   chunk_size: 256
#   chunk_overlap: 0

################################
# Provide MLFlow Experiment Name
################################
mlflow:
  experiment: demo-ingestion

#########################
# Embedding Model Details
#########################
# sky:
#   embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"   # Provide service endpoint of Skypilot deployment
#   embedding_key: "eyJhbGc**************V5s3b0"        # Provide serving token of SkyPilot deployment
#   batch_size: 10

dkubex:
  embedding_url: "http://xxx.xxx.xxx.xxx:xxxxx/v1/"     # Provide serving url of local deployment in DKubeX
  embedding_key: "eyJhbGc**************V5s3b0"          # Provide service token of local deployment in DKubeX
  batch_size: 10

# huggingface:
#   model: "BAAI/bge-large-en-v1.5"                     # Provide Huggingface embedding model name

# openai:
#   model: "text-embedding-ada-002"                     # Provide OpenAI Embedding Model Name
#   embedding_model: "text-embedding-ada-002"           # Provide OpenAI Embedding Model Name
#   llmkey: "sk-TM6c9*****************EPnG"             # Provide OpenAI API key

#########################
# Custom Metadata Details
#########################
# custom:                                               # Provide absolute path of custom script to add additional metadata
#   adjacent_chunks: False                              # True/False
#   extractor_path: <absolute path to .py file>

#####################
# Data Reader Details
#####################
file:
  inputs:
    loader_args:
      input_dir: /home/<username>/contract-nli          # Provide absolute path to the folder containing documents to be ingested
      recursive: true
      exclude_hidden: true
      raise_on_error: true

# scrapeddatareader:
#   inputs:
#     loader_args:
#       input_dir: <path to directory containing docs>  # Provide absolute path to the folder containing documents to be ingested. Make sure the folder has a mapping file named "url_file_name_map.json" for links of pdf files.
#       exclude_hidden: true
#       raise_on_error: true
#     data_args:
#       doc_source:
#       state_category:
#       designation_category:
#       topic_category:
#       num_workers:

# confluence:
#   inputs:
#     loader_args:
#       base_url: <confluence page URL>                 # Provide Confluence URL
#       # api_token: ATATT3*********************3EA     # Provide API token for Confluence
#       user_name: demo@dkube.io                        # Provide Confluence user ID
#       password: Abc@123                               # Provide Confluence Password
#     data_args:
#       include_attachments: True                       # 'True' if attachments in the Confluence site needs to be downloaded, else 'false'
#       space_key: llamaindex

# scrapyreader:
# ----------TODO- Discuss description--------------
#   inputs:
#     loader_args:
#       test1: ""
#     data_args:
#       spiders:
#         myspider:
#           - path: /home/configs/spiders/quotess.py
#             url:
#               - "https://example.com/page/1/"
#               - "https://example.com/page/2/"

# sharepointreader:
#   inputs:
#     loader_args:
#       client_id:
#       client_secret:
#       tenant_id:
#       sharepoint_site_id:
#       drive_id:
#       sharepoint_site_name: ""
#       sharepoint_folder_path: ""
#     data_args:
#       doc_source:
#       state_category:
#       designation_category:
#       topic_category:
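Indentation mistakes are easy to introduce while editing this file, so it can be worth confirming that the edited configuration is still valid YAML before launching the job. The following optional one-liner is a sketch that assumes python3 with the PyYAML package is available in the DKubeX terminal.

# Parse the config and print its top-level keys; a traceback here indicates a YAML syntax error
python3 -c "import yaml; cfg = yaml.safe_load(open('${HOMEDIR}/ingest.yaml')); print(sorted(cfg))"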
Attention
The reader section in the ingest.yaml file denotes the type of data loader to be used for the ingestion process. If you are going to use any data source other than the local directory shown in this example, you need to provide the appropriate details for that type of data loader. For more information about data loaders, please visit How to use different Data Loaders (Data Readers).

You can use multiple types of data sources by providing the corresponding reader details simultaneously under the reader section in the ingest.yaml file.

Some of the data loaders require separate pyloader files. Make sure to provide them, if needed.
Running Ingestion on SkyPilot¶
Use the following command to trigger the ingestion process. Replace the <username> part with your DKubeX username.

Format¶
d3x dataset ingest -d <dataset name> --config <ingestion config absolute path> --faq -s

Example¶
d3x dataset ingest -d contracts --config /home/<username>/ingest.yaml --faq -s

d3x dataset ingest <options>¶

-d, --dataset
  Name of the dataset to be created.

-p, --pipeline
  Ingestion pipeline to be used.

-c, --config
  Absolute path of the ingestion configuration file.

-s, --remote-sky
  To run the ingestion job on SkyPilot.

-r, --remote-ray
  To run the ingestion job on a remote Ray cluster.

-rc, --ray-config
  Absolute path of the configuration file for running a remote Ray job, in JSON format.

-m, --remote-command
  Sky/Ray job command to run.

--dkubex-url
  Your DKubeX URL.

-k, --dkubex-apikey
  Your DKubeX API key. (You can get this key by running d3x apikey get on your DKubeX terminal, or by navigating to Account -> Settings -> Developer on the DKubeX UI and copying the key shown in API Credentials.)

-w, --num-workers
  Number of processes to use for parallelization.

--type
  To select a particular node on the cluster on which the job will run. (Make sure to provide the node label added for that particular node in the cluster during installation.)

--faq
  To enable cache for the dataset.
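For reference, the options above can be combined in a single invocation. The following is only an illustrative sketch: the worker count and the <node label> placeholder are values you would adjust for your own cluster, and the node label is only needed if you want to pin the job to a specific node.

# Run the SkyPilot ingestion with 4 worker processes (sketch)
d3x dataset ingest -d contracts --config ${HOMEDIR}/ingest.yaml --faq -s -w 4
# Optionally pin the job to a labelled node (replace <node label> with the label added during installation)
d3x dataset ingest -d contracts --config ${HOMEDIR}/ingest.yaml --faq -s -w 4 --type <node label>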
Note
The time taken for the ingestion process to complete depends on the size of the dataset. Please wait patiently for the process to complete.
If the terminal shows a timed-out error, the ingestion is still in progress; run the command shown on the CLI after the error message to continue streaming the ingestion logs.
A record of the ingestion and its related artifacts is also stored in the MLflow application on the DKubeX UI.
To check whether the dataset has been created, stored, and is ready to use, run the following command:

d3x dataset list

To check the list of documents that have been ingested into the dataset, use the following command:

d3x dataset show -d <dataset name>

d3x dataset show -d contracts

You can also check the dataset details in the DKubeX UI by navigating to the Datasets section. To see the details of the dataset, click on the dataset name.
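If many datasets already exist on the cluster, the listing can be filtered for the one created in this example. This is a small convenience sketch that assumes the standard grep utility is available and that the dataset was named contracts as above.

# Filter the dataset listing for the dataset created in this tutorial
d3x dataset list | grep contracts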