Evaluating Base and Finetuned LLMs¶
In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare each against OpenAI. This example uses the base Llama2-7B and finetuned Llama2-7B models.
Prerequisites¶
- You need to ingest your data corpus and create a dataset from it. Refer to the Data ingestion and creating dataset tutorial for more information on how to do this. The dataset name used in this tutorial is `contracts`.
- You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.
- You need to deploy the base Llama2-7B and the finetuned Llama2-7B models on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial. The names of the base and finetuned Llama2-7B deployments used in this tutorial are `llama27bbase` and `llama27bft`, respectively.
- Export the following variables to your workspace by running the following commands on your DKubeX terminal. Replace the `<username>` part with your DKubeX workspace name.

```bash
export NAMESPACE="<username>"
export HOMEDIR=/home/${NAMESPACE}
```
- A few .yaml files are required for the evaluation process. On the Terminal application in the DKubeX UI, run the following commands:

```bash
git clone -b v0.8.3 https://github.com/dkubeio/dkubex-examples.git
cd && cp dkubex-examples/rag/query/query.yaml ${HOMEDIR}/query.yaml && cp dkubex-examples/rag/evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
```
Evaluating LLM Models¶
In this example, we will first evaluate the base Llama2-7B model against OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.
To evaluate the base Llama2-7B model, follow the steps provided below:
Provide the appropriate details in the `query.yaml` file, which will be used during the evaluation process. Run `vim query.yaml` and provide the following details:

Note

In the `chat_engine:url:` section, provide the endpoint URL of the deployed model to be used. The endpoint URL can be found on the Deployments page of the DKubeX UI.

| Field | Sub-field | Description |
|-------|-----------|-------------|
| input | question | The input question to be answered by the RAG system. |
| | mode | The mode of interaction with the pipeline. |
| vectorstore_retriever | kind | Specifies the type of vector store retriever. |
| | provider | Provider for the vector store retriever. |
| | embedding_class | Class of embedding used for retrieval. |
| | embedding_model | Name of the embedding model from Hugging Face. |
| | dataset | Name of the ingested dataset. |
| | textkey | Key identifying the text data within the dataset. |
| | top_k | The number of results to retrieve per query. |
| prompt_builder | prompt_str | The prompt string used for generation. |
| | prompt_file | The file containing the prompt string. |
| nodes_sorter | max_sources | Maximum number of sources to consider during sorting. |
| reranker | model | Name of the re-ranker model from Hugging Face. |
| | top_n | The number of results to re-rank. |
| contexts_joiner | separator | Separator used for joining different contexts. |
| chat_engine | llm | Specifies the LLM to be used for generation. |
| | url | Service URL for the LLM deployment to be used. |
| | llmkey | Authentication key for accessing the LLM service. |
| | window_size | Size of the window for context generation. |
| | max_tokens | Maximum number of tokens for generation. |
| tracking | experiment | MLflow experiment name for tracking. |
```yaml
input:
  question: ""
  mode: "cli"
vectorstore_retriever:
  kind: weaviate
  vectorstore_provider: dkubex
  embedding_class: HuggingFaceEmbedding  # Use 'HuggingFaceEmbedding' for embedding models from HuggingFace, or 'OpenAIEmbedding' for OpenAI embeddings
  embedding_model: 'BAAI/bge-large-en-v1.5'  # Embedding model name
  llmkey: ""  # API key for the embedding model (if required)
  textkey: 'paperchunks'
  top_k: 3
prompt_builder:
  prompt_str: ""
  prompt_file: ""
nodes_sorter:
  max_sources: 3
contexts_joiner:
  separator: "\n\n"
chat_engine:
  llm: dkubex  # Use 'dkubex' for DKubeX deployments and 'openai' to use OpenAI API
  url: "https://123.45.67.890/deployment/1/llama27bbase/"  # Endpoint URL for the DKubeX LLM deployment which will be used for generating responses. If using OpenAI, keep blank
  llmkey: "eyJh***********************JSg"  # If using a DKubeX deployment, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  window_size: 2
  max_tokens: 2048  # Maximum number of tokens to be used for generating responses
securellm:  # SecureLLM configuration. Comment out this section if not using SecureLLM
  appkey: sk-zxr**************************ya  # Provide the SecureLLM application key to be used
  dkubex_url: "https://123.45.67.890:32443"  # Provide the URL of the DKubeX deployment
tracking:
  experiment: dkubexfm-rag  # Provide MLflow experiment name
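```

To make the retrieval-related fields concrete, here is a minimal, self-contained sketch of what the pipeline does with `top_k`, `max_sources`, and `separator`. Everything in it (the toy corpus, vectors, and scoring) is a hypothetical stand-in; the actual pipeline retrieves from Weaviate with the configured embedding model and sends the assembled prompt to the deployed LLM.

```python
# Minimal sketch of the retrieval stages configured in query.yaml.
# All names and data here are hypothetical stand-ins.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy "vector store": (embedding, chunk text) pairs.
chunks = [
    ([0.9, 0.1], "Clause 12: termination requires 30 days notice."),
    ([0.8, 0.3], "Clause 7: payment is due within 45 days."),
    ([0.1, 0.9], "Appendix A: definitions of terms."),
]

def retrieve(query_vec, top_k=3):
    # vectorstore_retriever.top_k: keep the k most similar chunks.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_context(texts, max_sources=3, separator="\n\n"):
    # nodes_sorter.max_sources caps how many chunks survive;
    # contexts_joiner.separator joins them into one context string.
    return separator.join(texts[:max_sources])

context = build_context(retrieve([1.0, 0.2], top_k=3), max_sources=3)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: ..."
print(prompt)  # in the real pipeline, this goes to the chat_engine LLM
```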
Provide the appropriate details in the `eval.yaml` file, which will be used during the evaluation process. Run `vim eval.yaml` and provide the following details:

Note

- Provide your own OpenAI API key in the `questions_generator:llmkey:` section.
- In the `semantic_similarity_evaluator:llmurl:` section, provide the endpoint URL of the deployed model to be used. The endpoint URL can be found on the Deployments page of the DKubeX UI.
- In the `semantic_similarity_evaluator:llmkey:` section, provide the serving token of the deployed model to be used. To find the serving token, go to the Deployments page of the DKubeX UI, click on the deployed model, and copy the token from the Serving Token section.
| Field | Sub-field | Description |
|-------|-----------|-------------|
| vectorstore_reader | kind | Specifies the type of vectorstore reader. |
| | provider | Indicates the provider of the vectorstore reader. |
| | properties | Lists the properties of the vectorstore reader. |
| questions_generator | prompt_str | Defines the strategy for generating prompts. |
| | prompt_file | File containing a custom prompt. |
| | num_questions_per_chunk | The number of questions to generate per data chunk. |
| | max_chunks | Maximum number of data chunks to generate questions from. |
| | llm | The language model (LLM) to use for generating questions. |
| | llm_key | API key for the chosen LLM. |
| | llmurl | URL where the chosen LLM service is deployed. |
| | max_tokens | Maximum number of tokens allowed in each question. |
| retrieval_evaluator | vector_retriever: kind | Indicates the type of vector retriever. |
| | vector_retriever: provider | Specifies the provider of the vector retriever. |
| | vector_retriever: textkey | Key used to access the text data within the vector retriever. |
| | vector_retriever: embedding_model | Name of the embedding model used for text representation. |
| | vector_retriever: similarity_top_k | The number of similar items to retrieve for each query. |
| | metrics | Evaluation metrics used for retrieval evaluation. |
| semantic_similarity_evaluator | prompt_str | Defines the strategy for similarity evaluation. |
| | prompt_file | File containing a custom prompt. |
| | llm | The language model (LLM) to use for semantic similarity evaluation. |
| | llmkey | A dummy value when using a local DKubeX deployment; the authentication key when using an external endpoint. |
| | llmurl | URL where the chosen LLM service is deployed. |
| | max_tokens | Maximum number of tokens allowed in each semantic similarity evaluation prompt. |
| | metrics | Evaluation metric used for semantic similarity evaluation. |
| tracking | experiment | A unique name for the MLflow experiment, allowing tracking and comparison of different runs of the pipeline. |
```yaml
vectorstore_reader:
  kind: weaviate
  provider: dkubex
  properties:
    - paperchunks
    - dkubexfm
questions_generator:  # Generates the questions to be used for the dataset evaluation
  prompt_str: "default"
  prompt_file: ""
  num_questions_per_chunk: 1  # Number of questions to be generated per chunk
  max_chunks: 100  # Maximum number of chunks to be used for question generation
  llm: openai  # Language model to be used for question generation. To use OpenAI to generate questions, use 'openai'. To use a DKubeX LLM deployment, use 'dkubex'
  llmkey: "sk-4aYW**********************ZRLQe"  # If using a DKubeX deployment, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  llmurl: ""  # Endpoint URL for the DKubeX LLM deployment which will be used for generating questions. If using OpenAI, keep blank
  max_tokens: 2048  # Maximum number of tokens to be used for generating questions
retrieval_evaluator:
  vector_retriever:
    kind: weaviate
    vectorstore_provider: dkubex
    textkey: paperchunks
    embedding_class: HuggingFaceEmbedding  # Use 'HuggingFaceEmbedding' for embedding models from HuggingFace, or 'OpenAIEmbedding' for OpenAI embeddings
    embedding_model: "BAAI/bge-large-en-v1.5"  # Embedding model name
    llmkey: ""  # API key for the embedding model (if required)
    similarity_top_k: 3
  metrics:
    - mrr
    - hit_rate
semantic_similarity_evaluator:
  prompt_str: "default"
  prompt_file: ""
  llm: openai  # Language model to be used for generating groundtruth responses for semantic similarity evaluation. To use a DKubeX LLM deployment, use 'dkubex'. To use OpenAI, use 'openai'
  llmkey: "eyJh******************************dJSg"  # If using a DKubeX deployment for generating groundtruth responses, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  url: ""  # Endpoint URL for the DKubeX LLM deployment which will be used for generating groundtruth responses. If using OpenAI, keep blank
  max_tokens: 2048  # Maximum number of tokens to be used for generating groundtruth responses
  rag_configuration: "/absolute/path/to/rag/config"  # Absolute path to the RAG config (query.yaml) file. This file contains the details of the LLM which is to be evaluated against the groundtruth responses
  metrics:
    - similarity_score
tracking:
  experiment: dkubexfm-rag-evaluate  # Provide MLflow experiment name
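```

For intuition about the retrieval metrics named above, the snippet below shows the conventional definitions of `hit_rate` and `mrr` (mean reciprocal rank) over a set of generated questions. This is an illustrative sketch of the standard formulas with made-up ranks, not DKubeX's internal implementation.

```python
# Conventional definitions of the retrieval metrics named in eval.yaml.
# Illustrative only; the example ranks below are made up.

def hit_rate(rankings):
    # Fraction of queries whose source chunk appears anywhere in the
    # retrieved list (rank is None when it was missed).
    return sum(r is not None for r in rankings) / len(rankings)

def mrr(rankings):
    # Mean reciprocal rank: average of 1/rank of the first relevant
    # chunk per query; a miss contributes 0.
    return sum(1.0 / r for r in rankings if r is not None) / len(rankings)

# Rank (1-based) at which each generated question's source chunk was
# retrieved with similarity_top_k = 3; None means it was not retrieved.
ranks = [1, 2, None, 1, 3]
print(f"hit_rate = {hit_rate(ranks):.2f}")  # 4/5 = 0.80
print(f"mrr      = {mrr(ranks):.2f}")       # (1 + 1/2 + 0 + 1 + 1/3)/5 ≈ 0.57
```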
Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the `<dataset name>` part with the name of the dataset created during ingestion (for this example, `contracts`).

```bash
d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml
```

For example:

```bash
d3x dataset evaluate -d contracts --config ${HOMEDIR}/eval.yaml
```
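To evaluate the finetuned model, repeat the same steps with the `chat_engine:url:` and `llmkey:` fields in `query.yaml` pointing at the `llama27bft` deployment, then rerun the command above. Both runs are tracked under the MLflow experiment named in `eval.yaml`, so you can compare them afterwards. Below is a hedged sketch of one way to pull the logged metrics with the MLflow Python client; it assumes the MLflow tracking URI is already configured in your workspace, and the exact metric column names depend on what the evaluator logs.

```python
# Sketch: compare the base and finetuned evaluation runs logged to MLflow.
# Assumes the tracking URI is already set in your environment and that
# both runs were logged under the experiment named in eval.yaml.
import mlflow

runs = mlflow.search_runs(experiment_names=["dkubexfm-rag-evaluate"])

# Logged metrics appear as "metrics.<name>" columns in the DataFrame,
# e.g. metrics.mrr, metrics.hit_rate, metrics.similarity_score.
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id", "start_time"] + metric_cols].to_string(index=False))
```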