Data Ingestion and Creating Datasets

Note

This section describes how to run a data ingestion job in the local DKubeX cluster. For launching a data ingestion job on a SkyPilot cluster, refer to Data Ingestion with SkyPilot.

Prerequisites

  • You must have an embedding model deployed on DKubeX, or access to a Huggingface embedding model (registered in the DKubeX embedding model catalog) or an OpenAI embedding model. For more information regarding deployment on DKubeX, refer to Deploying Embedding Models on DKubeX.

Creating an Ingestion Job

To create a data ingestion job on DKubeX and create a dataset, follow the steps provided below.

  • On your DKubeX workspace, click on the Datasets menu on the left panel. The Datasets page will open.

  • Click on the + (create dataset) button on the top left of the Datasets window. The Dataset Create window will open.

  • On the first tab General:

    • Under General Configuration, provide the following details:

      • In the Name field, provide a name for the dataset.

      • Additionally, under Advanced Settings, you can provide the following details:

        • In the MLFlow Experiment field, you can provide a name for the MLFlow experiment in which the ingestion run and artifacts will be stored (Default is Default).

        • You can enable the Adjacent Chunks option to include adjacent chunks while creating a chunk, and the Cache Enabled option to turn on frequent-question caching.

    • Under Splitter Configuration, provide the following details:

      • From the Splitter dropdown, select the type of splitter you want to use.

      • In the Chunk Size field, provide the chunk size in number of tokens. The default value is 256.

      • In the Chunk Overlap field, provide the chunk overlap in number of tokens. The default value is 0.
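
      The Chunk Size and Chunk Overlap settings can be sketched conceptually. The following is a minimal illustration of fixed-size chunking with overlap, assuming a pre-tokenized document; DKubeX's actual splitter implementations may chunk differently (e.g., respecting sentence boundaries):

      ```python
      def chunk_tokens(tokens, chunk_size=256, chunk_overlap=0):
          """Split a token list into fixed-size chunks; consecutive
          chunks share chunk_overlap tokens."""
          step = chunk_size - chunk_overlap
          if step <= 0:
              raise ValueError("chunk_overlap must be smaller than chunk_size")
          return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

      # Toy example: a 10-token document, chunk size 4, overlap 1.
      tokens = [f"t{i}" for i in range(10)]
      chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
      print([len(c) for c in chunks])  # → [4, 4, 4, 1]
      ```

      With the defaults (chunk size 256, overlap 0), chunks are simply consecutive non-overlapping windows of 256 tokens.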

  • Once done, click on the Next button to proceed to the next tab, Embedding. In this tab:

    • From the Embedding Provider dropdown, select the embedding model provider that you are going to use. The options available are:

      • DKubeX / Sky: Select DKubeX if you are going to use an embedding model deployed locally on DKubeX, or Sky if the model is deployed through SkyPilot. For both options, the details to be provided are:

        • From the Embedding Model dropdown, select the embedding model deployment you want to use.

        • In the Batch Size field, specify the number of chunks sent to the embedding model to generate embeddings. The default value is 32.

        • In the Number of Workers field, specify the number of workers to be used for parallel processing. The default value is 1.

      • Huggingface: Select this option if you are going to use a Huggingface embedding model. The model should be registered in DKubeX embedding model catalog. For this option, the details to be provided are:

        • In the Model field, provide the full Huggingface model path of the embedding model that you are going to use. E.g., BAAI/bge-large-en-v1.5.

      • OpenAI: Select this option if you are going to use an OpenAI embedding model. For this option, the details to be provided are:

        • From the Model dropdown, choose the OpenAI embedding model that you are going to use. E.g., text-embedding-ada-002.

        • In the LLM Key field, provide the OpenAI API key. The key should be in the format sk-<key>. E.g., sk-4q*********dQ.

        • In the Number of Workers field, specify the number of workers to be used for parallel processing. The default value is 1.
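
    The Batch Size and Number of Workers settings interact as follows: chunks are grouped into batches of Batch Size, and up to Number of Workers batches are embedded concurrently. A rough sketch of this flow, where embed_batch is a hypothetical stand-in for the call to the deployed embedding model:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def embed_batch(batch):
        # Placeholder: a real implementation would call the deployed
        # embedding endpoint (DKubeX/Sky, Huggingface, or OpenAI).
        return [[0.0, 0.0, 0.0] for _ in batch]  # dummy 3-dim vectors

    def embed_chunks(chunks, batch_size=32, num_workers=1):
        """Embed chunks in batches, num_workers batches at a time."""
        batches = [chunks[i:i + batch_size]
                   for i in range(0, len(chunks), batch_size)]
        embeddings = []
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            for vectors in pool.map(embed_batch, batches):
                embeddings.extend(vectors)
        return embeddings

    vecs = embed_chunks([f"chunk-{i}" for i in range(70)],
                        batch_size=32, num_workers=2)
    print(len(vecs))  # one embedding vector per chunk
    ```

    Larger batch sizes reduce request overhead per chunk, while more workers increase parallelism at the cost of higher load on the embedding deployment.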

  • Once done, click on the Next button to proceed to the next tab, Reader and Data Source. In this tab, you can add one or more data readers and their data source details. The steps are as follows:

    • Click on the + Add New button on the top right part of the window, or Get Started in the middle of the window to add your first reader. The Add New Reader sidebar will open.

    • From the Type dropdown, select the type of reader you want to use. Currently available options are as follows:

      File Reader is used to get the files to be ingested from the workspace file system. For this reader, provide the following details:

      • Under the Loader Configuration section, you can provide the documents you want to ingest with the three available options:

        • With the Select File option, you can directly upload the files you want to ingest to your workspace.

        • With the Select Folder option, you can directly upload a folder containing the files you want to ingest to your workspace.

        • If your files are already present in your workspace, use the Path option to provide the absolute path of the folder containing them. You can also upload files to your workspace using the File Browser application in your DKubeX workspace and then provide that folder's absolute path here.

      • You can enable the Exclude Hidden option to exclude the hidden files from your selected file source from being ingested.

      • You can enable the Raise on Error option to stop the ingestion job with an error if any of the files in the selected file source fails to be ingested. If this option is disabled, a file that fails is skipped and the rest of the files are ingested.

      • You can enable the Recursive option to ingest all the files in the selected folder and its subfolders.

      Once done, click on the Save button to save the file reader configuration and data source details.
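
      The Exclude Hidden and Recursive flags map naturally onto a directory walk. The following is a minimal sketch of how they might filter the files under the given path; it is an illustration only, not DKubeX's actual File Reader implementation:

      ```python
      from pathlib import Path

      def collect_files(root, recursive=True, exclude_hidden=True):
          """Gather candidate files under root the way the File Reader
          options suggest: optionally descend into subfolders, and
          optionally skip dotfiles and files inside dot-directories."""
          pattern = "**/*" if recursive else "*"
          files = []
          for path in Path(root).glob(pattern):
              if not path.is_file():
                  continue
              rel = path.relative_to(root)
              if exclude_hidden and any(part.startswith(".")
                                        for part in rel.parts):
                  continue
              files.append(path)
          return sorted(files)
      ```

      For example, with Recursive enabled the reader would pick up files in every subfolder of the given path, while Exclude Hidden would drop entries such as .DS_Store or anything under a .git directory.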

    • You can add another data source by clicking on the + Add New button again. You can add multiple data sources by repeating the above steps.

  • Once you have added all readers and data sources for your dataset, click on the Submit button to create the dataset. The Dataset Create window will close and you will be redirected to the Datasets page.

Checking Dataset Details

You can check the status of the newly created dataset and all existing datasets on the Datasets page on your DKubeX workspace.

  • Once the data ingestion job is complete, the status of the new dataset changes to the Completed state.

  • You can click on the dataset name to check all details regarding the dataset.

  • To check the list of documents ingested in the dataset, click on the Documents tab on the top-right corner of the dataset details page.