# Evaluating Base and Finetuned LLMs
In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare them against OpenAI. For this example, the base Llama2-7B and finetuned Llama2-7B models will be used.
## Prerequisites
- You need to ingest your data corpus and create a dataset from it. You can refer to the Data ingestion and creating dataset tutorial for more information on how to do this. The dataset name used in this tutorial is `contracts`.
- You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.
- You need to deploy the base Llama2-7B and the finetuned Llama2-7B models on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial. The names of the base and finetuned Llama2-7B deployments used in this tutorial are `llama27bbase` and `llama27bft`, respectively.
- Export the following variables to your workspace by running the following commands on your DKubeX Terminal. Replace the `<username>` part with your DKubeX workspace name.

  ```bash
  export PYTHONWARNINGS="ignore"
  export OPENAI_API_KEY="dummy"
  export NAMESPACE="<username>"
  export HOMEDIR=/home/${NAMESPACE}
  ```
- A few .yaml files are required for the evaluation process. On the Terminal application in the DKubeX UI, run the following commands (you can sanity-check the copied files with the sketch after this list):

  ```bash
  git clone https://github.com/dkubeio/dkubex-examples.git
  cd dkubex-examples
  git checkout llamaidx
  cp query/query.yaml ${HOMEDIR}/query.yaml && cp evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
  ```
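Optionally, before editing the copied files, you can verify that both are in place and parse as valid YAML. Below is a minimal sketch using Python and PyYAML; it assumes both are available in your workspace environment and that the files sit in your home directory as copied above.

```python
import os
import yaml  # PyYAML

home = os.path.expanduser("~")  # ${HOMEDIR} from the exported variables
for name in ("query.yaml", "eval.yaml"):
    path = os.path.join(home, name)
    # safe_load raises yaml.YAMLError if the file is malformed
    with open(path) as f:
        yaml.safe_load(f)
    print(f"{path}: valid YAML")
```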
## Evaluating LLM Models
In this example, we will first evaluate the base Llama2-7B model, comparing it against OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.
To evaluate the base Llama2-7B model, follow the steps provided below:
Provide the appropriate details in the `query.yaml` file, which will be used during the evaluation process. Run `vim query.yaml` and provide the following details:

**Note**

In the `chat_engine: url:` section, provide the endpoint URL of the deployed model to be used. The syntax for the URL is provided below. Replace the `<your username>` part with your username.

```
"http://llama27bbase-serve-svc.<your username>:8000"
```

You provide your own username here because the `llama27bbase` deployment was created from your workspace earlier. If you are going to use a model deployed by any other user, you will need to provide the proper deployment name in place of `llama27bbase` and the username of that user.

| Field | Sub-field | Description |
| --- | --- | --- |
| input | question | The input question to be answered by the RAG system. |
| | mode | The mode of interaction with the pipeline. |
| vectorstore_retriever | kind | Specifies the type of vector store retriever. |
| | provider | Provider for the vector store retriever. |
| | embedding_class | Class of embedding used for retrieval. |
| | embedding_model | Name of the embedding model from Hugging Face. |
| | dataset | Name of the ingested dataset. |
| | textkey | Key identifying the text data within the dataset. |
| | top_k | The number of results to retrieve per query. |
| prompt_builder | prompt_str | The prompt string used for generation. |
| | prompt_file | The file containing the prompt string. |
| nodes_sorter | max_sources | Maximum number of sources to consider during sorting. |
| reranker | model | Name of the re-ranker model from Hugging Face. |
| | top_n | The number of results to re-rank. |
| contexts_joiner | separator | Separator used for joining different contexts. |
| chat_engine | llm | Specifies the LLM to be used for generation. |
| | url | Service URL for the LLM deployment to be used. |
| | llmkey | Authentication key for accessing the LLM service. |
| | window_size | Size of the window for context generation. |
| | max_tokens | Maximum number of tokens for generation. |
| tracking | experiment | MLflow experiment name for tracking. |
```yaml
input:
  question: ""
  mode: "cli"
vectorstore_retriever:
  kind: weaviate
  provider: dkubex
  embedding_class: HuggingFaceEmbedding
  embedding_model: 'BAAI/bge-large-en-v1.5'
  dataset: 'dataset001'   # name of your dataset
  textkey: 'paperchunks'
  top_k: 3
prompt_builder:
  prompt_str: ""
  prompt_file: ""
nodes_sorter:
  max_sources: 3
reranker:
  model: 'BAAI/bge-reranker-large'
  top_n: 3
contexts_joiner:
  separator: "\n\n"
chat_engine:
  llm: dkubex             # use "dkubex" for DKubeX deployments
  url: "http://llama27bbase-serve-svc.<your username>:8000"
  llmkey: "dummy"
  window_size: 2
  max_tokens: 1024
tracking:
  experiment: query-experiment-1
```
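Before moving on, it can be worth confirming that the endpoint set in `chat_engine: url:` is reachable from your workspace. The sketch below only checks basic HTTP connectivity; the exact routes the deployment serves depend on how DKubeX exposes it, so treat the URL and response behavior as assumptions to adapt.

```python
import requests

# Replace <your username> with your DKubeX workspace name.
URL = "http://llama27bbase-serve-svc.<your username>:8000"

try:
    # Any HTTP response (even a 404) means the service is reachable;
    # a connection error usually means a wrong deployment name or namespace.
    resp = requests.get(URL, timeout=10)
    print(f"Endpoint reachable (HTTP {resp.status_code})")
except requests.exceptions.RequestException as err:
    print(f"Could not reach the deployment: {err}")
```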
Provide the appropriate details in the `eval.yaml` file, which will be used during the evaluation process. Run `vim eval.yaml` and provide the following details:

**Note**

- Provide your own OpenAI API key in the `questions_generator: llmkey:` section.
- In the `semantic_similarity_evaluator: llmurl:` section, provide the endpoint URL of the deployed model to be used. The syntax for the URL is provided below. Replace the `<your username>` part with your username.

  ```
  "http://llama27bbase-serve-svc.<your username>:8000"
  ```

  You provide your own username here because the `llama27bbase` deployment was created from your workspace earlier. If you are going to use a model deployed by any other user, you will need to provide the proper deployment name in place of `llama27bbase` and the username of that user.

| Field | Sub-field | Description |
| --- | --- | --- |
| vectorstore_reader | kind | Specifies the type of vectorstore reader. |
| | provider | Indicates the provider of the vectorstore reader. |
| | properties | Lists the properties of the vectorstore reader. |
| questions_generator | prompt_str | Defines the strategy for generating prompts. |
| | prompt_file | File containing a custom prompt. |
| | num_questions_per_chunk | Specifies the number of questions to generate per data chunk. |
| | max_chunks | Sets the maximum number of data chunks from which to generate questions. |
| | llm | Determines the language model (LLM) to use for generating questions. |
| | llmkey | The API key for the chosen LLM. |
| | llmurl | Indicates the URL where the chosen LLM service is deployed. |
| | max_tokens | Specifies the maximum number of tokens allowed in each question. |
| retrieval_evaluator: vector_retriever | kind | Indicates the type of vector retriever. |
| | provider | Specifies the provider of the vector retriever. |
| | textkey | Refers to the key used to access the text data within the vector retriever. |
| | embedding_model | Specifies the name of the embedding model used for text representation. |
| | similarity_top_k | Sets the number of similar items to retrieve for each query. |
| retrieval_evaluator | metrics | Specifies the evaluation metrics used for retrieval evaluation. |
| semantic_similarity_evaluator | prompt_str | Defines the strategy for similarity evaluation. |
| | prompt_file | File containing a custom prompt. |
| | llm | Specifies the language model (LLM) to use for semantic similarity evaluation. |
| | llmkey | Set to "dummy" for local deployments available within DKubeX, or used to pass an auth key when using an external endpoint. |
| | llmurl | Indicates the URL where the chosen LLM service is deployed. |
| | max_tokens | Specifies the maximum number of tokens allowed in each semantic similarity evaluation prompt. |
| | metrics | Specifies the evaluation metric used for semantic similarity evaluation. |
| tracking | experiment | Provides a unique name for the MLflow experiment, allowing for tracking and comparison of different runs of the pipeline. |
```yaml
# Weaviate is supported as a vectorstore_reader as of now
vectorstore_reader:
  kind: weaviate
  provider: dkubex
  properties:
    - paperchunks
    - dkubexfm
questions_generator:
  prompt_str: "default"
  prompt_file: ""
  num_questions_per_chunk: 1
  max_chunks: 1
  llm: openai             # or dkubex
  llmkey: "sk-hjy*********************lkyij"   # provide your OpenAI API key here
  llmurl: ""
  max_tokens: 2048
retrieval_evaluator:
  vector_retriever:
    kind: weaviate
    provider: dkubex
    textkey: paperchunks
    embedding_model: "BAAI/bge-large-en-v1.5"
    similarity_top_k: 3
  metrics:
    - mrr
    - hit_rate
semantic_similarity_evaluator:
  prompt_str: "default"
  prompt_file: ""
  llm: dkubex
  llmkey: "dummy"
  llmurl: "http://llama27bbase-serve-svc.<your username>:8000"   # service URL for the LLM deployment to be used; replace username with the workspace name in which the deployment was created
  max_tokens: 2048
  metrics:
    - similarity_score
tracking:
  # MLflow experiment name
  experiment: eval-experiment-1   # provide a unique experiment name
```
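For context on the two retrieval metrics configured above: `hit_rate` is the fraction of generated questions whose source chunk appears among the top-k retrieved results, and `mrr` (mean reciprocal rank) averages 1/rank of the first correct result over all questions. The following is an illustrative sketch of the arithmetic, not the evaluator's actual implementation:

```python
def retrieval_metrics(results):
    """results: list of (expected_id, ranked_ids) pairs, where ranked_ids
    are the chunk IDs returned by the retriever, best match first."""
    hits, reciprocal_ranks = 0, []
    for expected_id, ranked_ids in results:
        if expected_id in ranked_ids:
            hits += 1
            # ranks are 1-based: first position contributes 1.0, second 0.5, ...
            reciprocal_ranks.append(1.0 / (ranked_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(results)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}

# Example: 2 of 3 queries find their source chunk in the top-3 results.
print(retrieval_metrics([
    ("c1", ["c1", "c7", "c9"]),   # rank 1 -> RR 1.0
    ("c2", ["c5", "c2", "c8"]),   # rank 2 -> RR 0.5
    ("c3", ["c4", "c6", "c7"]),   # miss   -> RR 0.0
]))  # {'hit_rate': 0.666..., 'mrr': 0.5}
```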
Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the `<dataset name>` part with the name of the dataset created during ingestion (for this example, `contracts`).

```bash
d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml
```

For example:

```bash
d3x dataset evaluate -d contracts --config ${HOMEDIR}/eval.yaml
```
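The evaluation results are logged under the MLflow experiment named in `tracking: experiment:`. If you want to pull them programmatically rather than through the UI, a rough sketch with the `mlflow` client follows; the tracking URI is a placeholder and the metric column names are assumptions to verify against your actual runs.

```python
import mlflow

# Placeholder; point this at the MLflow tracking server used by DKubeX.
mlflow.set_tracking_uri("http://<mlflow-host>:5000")

# Returns a pandas DataFrame with one row per run in the experiment.
runs = mlflow.search_runs(experiment_names=["eval-experiment-1"])

# Logged metrics appear as "metrics.<name>" columns (e.g. metrics.mrr).
print(runs.filter(regex=r"^metrics\.").T)
```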