Data Ingestion and Creating Datasets¶
Note
This section describes how to run a data ingestion job in the local DKubeX cluster. For launching a data ingestion job on a SkyPilot cluster, refer to Data Ingestion with SkyPilot.
Prerequisites¶
You must have an embedding model deployed on DKubeX, access to a Huggingface embedding model registered in the DKubeX embedding model catalog, or access to an OpenAI embedding model. For more information regarding deployment on DKubeX, refer to Deploying Embedding Models on DKubeX.
Creating an Ingestion Job¶
To create a data ingestion job on DKubeX and create a dataset, follow the steps provided below.
On your DKubeX workspace, click on the Datasets menu on the left panel. The Datasets page will open.
Click on the + button on the top left of the datasets window. This is the create dataset button. The Dataset Create window will open.
On the first tab General:
Under General Configuration, provide the following details:
In the Name field, provide a name for the dataset.
Additionally, under Advanced Settings, you can provide the following details:
In the MLFlow Experiment field, you can provide a name for the MLFlow experiment in which the ingestion run and artifacts will be stored (the default is Default).
You can enable the Adjacent Chunks option to include adjacent chunks while creating a chunk, and the Cache Enabled option to turn on caching of frequently asked questions.
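Because the ingestion run and its artifacts are logged under the named MLFlow experiment, you can inspect them afterwards with the standard MLflow client. A minimal sketch, assuming the MLFLOW_TRACKING_URI environment variable points at the MLflow server used by your DKubeX setup:

    import mlflow

    # Lists the runs logged under the experiment named in the
    # MLFlow Experiment field ("Default" unless you changed it).
    runs = mlflow.search_runs(experiment_names=["Default"])
    print(runs[["run_id", "status", "start_time"]])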
Under Splitter Configuration, provide the following details:
From the Splitter dropdown, select the type of splitter you want to use.
In the Chunk Size field, provide the chunk size in number of tokens. The default value is 256.
In the Chunk Overlap field, provide the chunk overlap in number of tokens. The default value is 0.
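The exact splitting behavior depends on the selected splitter, but chunk size and chunk overlap interact the same way in all of them. The following minimal Python sketch illustrates the idea; it uses whitespace tokens as a simplification, whereas real splitters count model-tokenizer tokens and respect sentence boundaries:

    # Minimal sketch of sliding-window chunking with overlap.
    def chunk_tokens(text, chunk_size=256, chunk_overlap=0):
        assert 0 <= chunk_overlap < chunk_size
        tokens = text.split()               # simplified "tokens"
        step = chunk_size - chunk_overlap   # how far the window advances
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + chunk_size]))
            if start + chunk_size >= len(tokens):
                break                       # last window reached the end
        return chunks

    # chunk_overlap > 0 repeats the tail of each chunk at the head of
    # the next one, which helps retrieval across chunk boundaries.
    print(len(chunk_tokens("token " * 1000, chunk_size=256, chunk_overlap=32)))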
Once done, click on the Next button to proceed to the next tab, Embedding. In this tab:
From the Embedding Provider dropdown, select the embedding model provider that you are going to use. The options available are:
DKubeX/Sky: Select one of these options if you are going to use an embedding model deployed on DKubeX locally or with SkyPilot, respectively. For these two options, the details to be provided are:
From the Embedding Model dropdown, select the embedding model deployment you want to use.
In the Batch Size field, specify the number of chunks sent to the embedding model per request to generate embeddings. The default value is 32 (see the sketch after this list for how batch size and workers interact).
In the Number of Workers field, specify the number of workers to be used for parallel processing. The default value is 1.
Huggingface: Select this option if you are going to use a Huggingface embedding model. The model should be registered in DKubeX embedding model catalog. For this option, the details to be provided are:
In the Model field, provide the full Huggingface model path of the embedding model that you are going to use. E.g., BAAI/bge-large-en-v1.5.
OpenAI: Select this option if you are going to use an OpenAI embedding model. For this option, the details to be provided are:
From the Model dropdown, choose the OpenAI embedding model that you are going to use. E.g., text-embedding-ada-002.
In the LLM Key field, provide the OpenAI API key. The key should be in the format sk-<key>. E.g., sk-4q*********dQ.
In the Number of Workers field, specify the number of workers to be used for parallel processing. The default value is 1.
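As an illustration of what the Batch Size and Number of Workers fields control (not of how DKubeX calls the provider internally), the sketch below embeds a list of chunks against the OpenAI embeddings API, sending batch_size chunks per request and issuing num_workers requests in parallel:

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI  # pip install openai

    client = OpenAI(api_key="sk-...")  # the key from the LLM Key field

    def embed_batch(batch):
        # One request embeds up to `batch_size` chunks at a time.
        resp = client.embeddings.create(model="text-embedding-ada-002",
                                        input=batch)
        return [item.embedding for item in resp.data]

    def embed_chunks(chunks, batch_size=32, num_workers=1):
        batches = [chunks[i:i + batch_size]
                   for i in range(0, len(chunks), batch_size)]
        # Up to `num_workers` batches are embedded concurrently.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            results = pool.map(embed_batch, batches)
        return [vec for batch in results for vec in batch]

Larger batch sizes reduce the number of requests; more workers increase throughput at the cost of more concurrent load on the embedding deployment.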
Once done, click on the Next button to proceed to the next tab, Reader and Data Source. In this tab, you can add one or more data readers and their data source details. The steps are as follows:
Click on the + Add New button on the top right part of the window, or Get Started in the middle of the window to add your first reader. The Add New Reader sidebar will open.
From the Type dropdown, select the type of reader you want to use. Currently available options are as follows:
File Reader is used to get the files to be ingested from the workspace file system. For this reader, provide the following details:
Under the Loader Configuration section, you can provide the documents you want to ingest using one of the three available options:
With the Select File option, you can directly upload the files you want to ingest to your workspace.
With the Select Folder option, you can directly upload a folder containing the files you want to ingest to your workspace.
If your files are already present in your workspace, use the Path option to provide the absolute path of the folder containing those files. You can also upload the files to your workspace using the File Browser application in your DKubeX workspace and then provide the absolute path of the folder here.
You can enable the Exclude Hidden option to exclude the hidden files from your selected file source from being ingested.
You can enable the Raise on Error option to raise an error if any file in the selected file source fails to be ingested. If the option is disabled, the failing file is skipped and the rest of the files are ingested.
You can enable the Recursive option to ingest all the files in the selected folder and its subfolders.
Once done, click on the Save button to save the file reader configuration and data source details.
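To make the File Reader options concrete, here is a hedged sketch of how a reader might interpret them; it mirrors the behavior described above rather than DKubeX's actual implementation (process is a hypothetical per-file ingestion step):

    from pathlib import Path

    def collect_files(root, exclude_hidden=True, recursive=False):
        base = Path(root)
        # Recursive walks subfolders too; otherwise only the top level.
        pattern = "**/*" if recursive else "*"
        for path in base.glob(pattern):
            if path.is_dir():
                continue
            # Exclude Hidden skips dotfiles anywhere below the root.
            rel = path.relative_to(base)
            if exclude_hidden and any(part.startswith(".") for part in rel.parts):
                continue
            yield path

    def ingest(root, raise_on_error=False, **opts):
        for path in collect_files(root, **opts):
            try:
                process(path)      # hypothetical per-file ingestion step
            except Exception:
                if raise_on_error:
                    raise          # Raise on Error enabled: surface the failure
                continue           # disabled: skip this file, keep going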
Sharepoint Reader is used to get the files to be ingested from a Sharepoint site. For this reader, provide the following details:
Under the Loader Configuration section, provide the following details:
In the Sharepoint Site field, provide the Sharepoint site URL from which you want to ingest the files. E.g., example.sharepoint.com.
In the Sharepoint Site Path field, provide the path of the Sharepoint site from which you want to ingest the files. E.g., /sites/example.
In the Access Token field, provide the current access token for the Sharepoint account.
In the Tenant ID field, provide the tenant ID of the Sharepoint account.
In the Client ID field, provide the client ID of the Sharepoint account.
In the Client Secret field, provide the client secret of the Sharepoint account.
Additionally, under the Data Configuration section, you can add custom key-value pairs for data configuration by clicking on the + Add key-value button.
Once done, click on the Save button to save the Sharepoint reader configuration and data source details.
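The Tenant ID, Client ID, and Client Secret identify an app registration in Azure AD (Microsoft Entra ID). If that is how your Sharepoint access is set up, one common way to obtain an access token is the client-credentials flow against Microsoft Graph; a hedged sketch with placeholder values (DKubeX itself may acquire or refresh tokens differently):

    import requests

    TENANT_ID = "<tenant-id>"          # value for the Tenant ID field
    CLIENT_ID = "<client-id>"          # value for the Client ID field
    CLIENT_SECRET = "<client-secret>"  # value for the Client Secret field

    # Standard Azure AD client-credentials flow for Microsoft Graph.
    resp = requests.post(
        f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
        data={
            "grant_type": "client_credentials",
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "scope": "https://graph.microsoft.com/.default",
        },
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]  # for the Access Token field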
The Scraped Data Reader is a specialized type of file reader that is used to ingest data from files in the local file system and add custom data configuration values to them. For this reader, provide the following details:
Under the Loader Configuration section, you can provide the documents you want to ingest using one of the three available options:
With the Select File option, you can directly upload the files you want to ingest to your workspace.
With the Select Folder option, you can directly upload a folder containing the files you want to ingest to your workspace.
If your files are already present in your workspace, use the Path option to provide the absolute path of the folder containing those files. You can also upload the files to your workspace using the File Browser application in your DKubeX workspace and then provide the absolute path of the folder here.
You can enable the Exclude Hidden option to exclude the hidden files from your selected file source from being ingested.
You can enable the Raise on Error option to raise an error if any file in the selected file source fails to be ingested. If the option is disabled, the failing file is skipped and the rest of the files are ingested.
You can enable the Recursive option to ingest all the files in the selected folder and its subfolders.
Additionally, under the Data Configuration section, you can add custom data configurations like Doc Source, Topic category, etc.
Once done, click on the Save button to save the scraped data reader configuration and data source details.
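Conceptually, the Data Configuration key-value pairs travel with every ingested document as metadata. A minimal sketch, where the keys (doc_source, topic_category) are examples rather than a fixed schema:

    from pathlib import Path

    # Custom Data Configuration values attached to each scraped file.
    data_config = {"doc_source": "internal-wiki", "topic_category": "billing"}

    documents = [
        {"path": str(p), "text": p.read_text(errors="ignore"),
         "metadata": dict(data_config)}
        for p in Path("/home/user/scraped").glob("*") if p.is_file()
    ]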
Confluence Reader is used to get the files to be ingested from a Confluence site. For this reader, provide the following details:
Under the Loader Configuration section, provide the following details:
In the Username field, provide the Confluence user ID of the account.
In the Password field, provide the password of the Confluence account.
In the API Token field, provide the current API token of the Confluence account.
In the Confluence URL field, provide the Confluence site URL from which you want to ingest the files.
Under the Data Configuration section, provide the following details:
In the Space Key field, provide the space key of the Confluence site from which you want to ingest the files.
You can enable the Include Attachments option to include the attachments in the Confluence pages.
From the Page Status dropdown, select the status of the pages you want to ingest. The options available are None, Current, Archived and Draft.
In the Max Number of Results field, provide the maximum number of results you want to ingest. The default value is 10.
Once done, click on the Save button to save the Confluence reader configuration and data source details.
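For reference, the same parameters map naturally onto Confluence Cloud's REST API, authenticated with the username and API token; a hedged sketch with placeholder values (not necessarily how DKubeX queries Confluence):

    import requests

    CONFLUENCE_URL = "https://example.atlassian.net/wiki"  # Confluence URL field
    AUTH = ("user@example.com", "<api-token>")  # Username and API Token fields

    # Fetch up to `limit` pages from one space; `status` mirrors the
    # Page Status dropdown (current, archived, draft).
    resp = requests.get(
        f"{CONFLUENCE_URL}/rest/api/content",
        params={"spaceKey": "EXAMPLE", "status": "current",
                "limit": 10, "expand": "body.storage"},
        auth=AUTH,
    )
    resp.raise_for_status()
    for page in resp.json()["results"]:
        print(page["title"])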
Github Reader is used to get the files to be ingested from a Github repository. For this reader, provide the following details:
Under the Loader Configuration section, provide the following details:
In the Include Directory field, specify the directories of the Github repository from which you want to ingest the files.
In the Owner field, provide the username of the owner of the Github repository.
In the Repository field, provide the name of the Github repository from which you want to ingest the files.
In the GitHub Token field, provide your current Github access token (PAT).
You can enable the Use Parser option if .pdf files are present in your repository.
In the Include File Extensions section, you can specify the file extensions of the files you want to ingest by clicking on the + Add Extension button. If nothing is provided, all the files in the repository will be ingested.
Under the Data Configuration section, provide the following details:
In the Branch field, provide the branch of the Github repository from which you want to ingest the files.
Once done, click on the Save button to save the Github reader configuration and data source details.
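The Owner, Repository, Branch, Include Directory, and GitHub Token fields map directly onto GitHub's REST API. As a hedged illustration (placeholder values; DKubeX's reader may fetch content differently), this sketch lists the files in one directory of a branch and filters them by extension:

    import requests

    OWNER, REPO, BRANCH = "octocat", "hello-world", "main"  # example values
    TOKEN = "<github-pat>"              # value for the GitHub Token field
    INCLUDE_EXT = {".md", ".py"}        # like the Include File Extensions list

    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/contents/docs",  # Include Directory
        params={"ref": BRANCH},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    for entry in resp.json():
        # An empty include list would mean: ingest everything.
        if entry["type"] == "file" and any(entry["name"].endswith(ext)
                                           for ext in INCLUDE_EXT):
            print(entry["path"], entry["download_url"])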
You can add multiple readers and data sources by clicking on the + Add New button again and repeating the above steps for each one.
Once you have added all readers and data sources for your dataset, click on the Submit button to create the dataset. The Dataset Create window will close and you will be redirected to the Datasets page.
Checking Dataset Details¶
You can check the status of the newly created dataset and all existing datasets on the Datasets page on your DKubeX workspace.
Once the data ingestion job is complete, the status of the new dataset changes to the Completed state.
You can click on the dataset name to check all details regarding the dataset.
To check the list of documents ingested in the dataset, click on the Documents tab on the top-right corner of the dataset details page.