Evaluating Base and Finetuned LLMs¶
In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare each against OpenAI. This example uses the base Llama2-7B and finetuned Llama2-7B models.
Prerequisites¶
- You need to ingest your data corpus and create a dataset from it. Refer to the Data ingestion and creating dataset tutorial for more information on how to do this. The dataset name used in this tutorial is `contracts`.
- You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.
- You need to deploy the base Llama2-7B and the finetuned Llama2-7B models on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial. The names of the base and finetuned Llama2-7B deployments used in this tutorial are `llama27bbase` and `llama27bft`, respectively.
- Export the following variables to your workspace by running the following commands on your DKubeX terminal. Replace the `<username>` part with your DKubeX workspace name.

```bash
export NAMESPACE="<username>"
export HOMEDIR=/home/${NAMESPACE}
```
- A few .yaml files are required for the evaluation process. On the Terminal application in the DKubeX UI, run the following commands:

```bash
git clone -b v0.8.3 https://github.com/dkubeio/dkubex-examples.git
cd && cp dkubex-examples/rag/query/query.yaml ${HOMEDIR}/query.yaml && cp dkubex-examples/rag/evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
```
Evaluating LLM Models¶
In this example, we will first evaluate the base Llama2-7B model against OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.
To evaluate the base Llama2-7B model, follow the steps provided below:
Provide the appropriate details in the `query.yaml` file, which will be used during the evaluation process. Run `vim query.yaml` and provide the following details:

Note

In the `chat_engine:url:` section, provide the endpoint URL of the deployed model to be used. The endpoint URL can be found on the Deployments page of the DKubeX UI.

| Field | Sub-field | Description |
|-------|-----------|-------------|
| input | question | The input question to be answered by the RAG system. |
| | mode | The mode of interaction with the pipeline. |
| vectorstore_retriever | kind | Specifies the type of vector store retriever. |
| | provider | Provider for the vector store retriever. |
| | embedding_class | Class of embedding used for retrieval. |
| | embedding_model | Name of the embedding model from Hugging Face. |
| | dataset | Name of the ingested dataset. |
| | textkey | Key identifying the text data within the dataset. |
| | top_k | The number of results to retrieve per query. |
| prompt_builder | prompt_str | The prompt string used for generation. |
| | prompt_file | The file containing the prompt string. |
| nodes_sorter | max_sources | Maximum number of sources to consider during sorting. |
| reranker | model | Name of the re-ranker model from Hugging Face. |
| | top_n | The number of results to re-rank. |
| contexts_joiner | separator | Separator used for joining different contexts. |
| chat_engine | llm | Specifies the LLM to be used for generation. |
| | url | Service URL for the LLM deployment to be used. |
| | llmkey | Authentication key for accessing the LLM service. |
| | window_size | Size of the window for context generation. |
| | max_tokens | Maximum number of tokens for generation. |
| tracking | experiment | MLflow experiment name for tracking. |
```yaml
input:
  question: ""
  mode: "cli"
vectorstore_retriever:
  kind: weaviate
  vectorstore_provider: dkubex
  embedding_class: HuggingFaceEmbedding  # Use 'HuggingFaceEmbedding' for embedding models from HuggingFace, or 'OpenAIEmbedding' for OpenAI embeddings
  embedding_model: 'BAAI/bge-large-en-v1.5'  # Embedding model name
  llmkey: ""  # API key for the embedding model (if required)
  textkey: 'paperchunks'
  top_k: 3
prompt_builder:
  prompt_str: ""
  prompt_file: ""
nodes_sorter:
  max_sources: 3
contexts_joiner:
  separator: "\n\n"
chat_engine:
  llm: dkubex  # Use 'dkubex' for DKubeX deployments and 'openai' to use OpenAI API
  url: "https://123.45.67.890/deployment/1/llama27bbase/"  # Endpoint URL for the DKubeX LLM deployment which will be used for generating responses. If using OpenAI, keep blank
  llmkey: "eyJh***********************JSg"  # If using a DKubeX deployment, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  window_size: 2
  max_tokens: 2048  # Maximum number of tokens to be used for generating responses
securellm:  # SecureLLM configuration. Comment out this section if not using SecureLLM
  appkey: sk-zxr**************************ya  # Provide the SecureLLM application key to be used
  dkubex_url: "https://123.45.67.890:32443"  # Provide the URL of the DKubeX deployment
tracking:
  experiment: dkubexfm-rag  # Provide MLflow experiment name
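```

To make the retrieval-related fields concrete, here is a minimal, self-contained sketch of what the pipeline does with `top_k`, `max_sources`, and `separator`. Everything in it (the toy corpus, vectors, and scoring) is a hypothetical stand-in; the actual pipeline retrieves from Weaviate with the configured embedding model and sends the assembled prompt to the deployed LLM.

```python
# Minimal sketch of the retrieval stages configured in query.yaml.
# All names and data here are hypothetical stand-ins.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy "vector store": (embedding, chunk text) pairs.
chunks = [
    ([0.9, 0.1], "Clause 12: termination requires 30 days notice."),
    ([0.8, 0.3], "Clause 7: payment is due within 45 days."),
    ([0.1, 0.9], "Appendix A: definitions of terms."),
]

def retrieve(query_vec, top_k=3):
    # vectorstore_retriever.top_k: keep the k most similar chunks.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_context(texts, max_sources=3, separator="\n\n"):
    # nodes_sorter.max_sources caps how many chunks survive;
    # contexts_joiner.separator joins them into one context string.
    return separator.join(texts[:max_sources])

context = build_context(retrieve([1.0, 0.2], top_k=3), max_sources=3)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: ..."
print(prompt)  # in the real pipeline, this goes to the chat_engine LLM
```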
Provide the appropriate details in the `eval.yaml` file, which will be used during the evaluation process. Run `vim eval.yaml` and provide the following details:

Note

- Provide your own OpenAI API key in the `questions_generator:llmkey:` section.
- In the `semantic_similarity_evaluator:llmurl:` section, provide the endpoint URL of the deployed model to be used. The endpoint URL can be found on the Deployments page of the DKubeX UI.
- In the `semantic_similarity_evaluator:llmkey:` section, provide the serving token of the deployed model to be used. To find the serving token, go to the Deployments page of the DKubeX UI, click on the deployed model, and copy the token from the Serving Token section.
| Field | Sub-field | Description |
|-------|-----------|-------------|
| vectorstore_reader | kind | Specifies the type of vectorstore reader. |
| | provider | Indicates the provider of the vectorstore reader. |
| | properties | Lists the properties of the vectorstore reader. |
| questions_generator | prompt_str | Defines the strategy for generating prompts. |
| | prompt_file | File containing a custom prompt. |
| | num_questions_per_chunk | The number of questions to generate per data chunk. |
| | max_chunks | Maximum number of data chunks to generate questions from. |
| | llm | The language model (LLM) to use for generating questions. |
| | llm_key | API key for the chosen LLM. |
| | llmurl | URL where the chosen LLM service is deployed. |
| | max_tokens | Maximum number of tokens allowed in each question. |
| retrieval_evaluator | vector_retriever: kind | Indicates the type of vector retriever. |
| | vector_retriever: provider | Specifies the provider of the vector retriever. |
| | vector_retriever: textkey | Key used to access the text data within the vector retriever. |
| | vector_retriever: embedding_model | Name of the embedding model used for text representation. |
| | vector_retriever: similarity_top_k | The number of similar items to retrieve for each query. |
| | metrics | Evaluation metrics used for retrieval evaluation. |
| semantic_similarity_evaluator | prompt_str | Defines the strategy for similarity evaluation. |
| | prompt_file | File containing a custom prompt. |
| | llm | The language model (LLM) to use for semantic similarity evaluation. |
| | llmkey | A dummy value when using a local DKubeX deployment; the authentication key when using an external endpoint. |
| | llmurl | URL where the chosen LLM service is deployed. |
| | max_tokens | Maximum number of tokens allowed in each semantic similarity evaluation prompt. |
| | metrics | Evaluation metric used for semantic similarity evaluation. |
| tracking | experiment | A unique name for the MLflow experiment, allowing tracking and comparison of different runs of the pipeline. |
```yaml
vectorstore_reader:
  kind: weaviate
  provider: dkubex
  properties:
    - paperchunks
    - dkubexfm
questions_generator:  # Generates the questions to be used for the dataset evaluation
  prompt_str: "default"
  prompt_file: ""
  num_questions_per_chunk: 1  # Number of questions to be generated per chunk
  max_chunks: 100  # Maximum number of chunks to be used for question generation
  llm: openai  # Language model to be used for question generation. To use OpenAI to generate questions, use 'openai'. To use a DKubeX LLM deployment, use 'dkubex'
  llmkey: "sk-4aYW**********************ZRLQe"  # If using a DKubeX deployment, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  llmurl: ""  # Endpoint URL for the DKubeX LLM deployment which will be used for generating questions. If using OpenAI, keep blank
  max_tokens: 2048  # Maximum number of tokens to be used for generating questions
retrieval_evaluator:
  vector_retriever:
    kind: weaviate
    vectorstore_provider: dkubex
    textkey: paperchunks
    embedding_class: HuggingFaceEmbedding  # Use 'HuggingFaceEmbedding' for embedding models from HuggingFace, or 'OpenAIEmbedding' for OpenAI embeddings
    embedding_model: "BAAI/bge-large-en-v1.5"  # Embedding model name
    llmkey: ""  # API key for the embedding model (if required)
    similarity_top_k: 3
  metrics:
    - mrr
    - hit_rate
semantic_similarity_evaluator:
  prompt_str: "default"
  prompt_file: ""
  llm: openai  # Language model to be used for generating groundtruth responses for semantic similarity evaluation. To use a DKubeX LLM deployment, use 'dkubex'. To use OpenAI, use 'openai'
  llmkey: "eyJh******************************dJSg"  # If using a DKubeX deployment for generating groundtruth responses, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  url: ""  # Endpoint URL for the DKubeX LLM deployment which will be used for generating groundtruth responses. If using OpenAI, keep blank
  max_tokens: 2048  # Maximum number of tokens to be used for generating groundtruth responses
  rag_configuration: "/absolute/path/to/rag/config"  # Absolute path to the RAG config (query.yaml) file. This file contains the details of the LLM which is to be evaluated against the groundtruth responses
  metrics:
    - similarity_score
tracking:
  experiment: dkubexfm-rag-evaluate  # Provide MLflow experiment name
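```

For intuition about the retrieval metrics named above, the snippet below shows the conventional definitions of `hit_rate` and `mrr` (mean reciprocal rank) over a set of generated questions. This is an illustrative sketch of the standard formulas with made-up ranks, not DKubeX's internal implementation.

```python
# Conventional definitions of the retrieval metrics named in eval.yaml.
# Illustrative only; the example ranks below are made up.

def hit_rate(rankings):
    # Fraction of queries whose source chunk appears anywhere in the
    # retrieved list (rank is None when it was missed).
    return sum(r is not None for r in rankings) / len(rankings)

def mrr(rankings):
    # Mean reciprocal rank: average of 1/rank of the first relevant
    # chunk per query; a miss contributes 0.
    return sum(1.0 / r for r in rankings if r is not None) / len(rankings)

# Rank (1-based) at which each generated question's source chunk was
# retrieved with similarity_top_k = 3; None means it was not retrieved.
ranks = [1, 2, None, 1, 3]
print(f"hit_rate = {hit_rate(ranks):.2f}")  # 4/5 = 0.80
print(f"mrr      = {mrr(ranks):.2f}")       # (1 + 1/2 + 0 + 1 + 1/3)/5 ≈ 0.57
```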
Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the `<dataset name>` part with the name of the dataset created during ingestion (for this example, `contracts`).

```bash
d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml
```

For example:

```bash
d3x dataset evaluate -d contracts --config ${HOMEDIR}/eval.yaml
```
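To evaluate the finetuned model, repeat the same steps with the `chat_engine:url:` and `llmkey:` fields in `query.yaml` pointing at the `llama27bft` deployment, then rerun the command above. Both runs are tracked under the MLflow experiment named in `eval.yaml`, so you can compare them afterwards. Below is a hedged sketch of one way to pull the logged metrics with the MLflow Python client; it assumes the MLflow tracking URI is already configured in your workspace, and the exact metric column names depend on what the evaluator logs.

```python
# Sketch: compare the base and finetuned evaluation runs logged to MLflow.
# Assumes the tracking URI is already set in your environment and that
# both runs were logged under the experiment named in eval.yaml.
import mlflow

runs = mlflow.search_runs(experiment_names=["dkubexfm-rag-evaluate"])

# Logged metrics appear as "metrics.<name>" columns in the DataFrame,
# e.g. metrics.mrr, metrics.hit_rate, metrics.similarity_score.
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id", "start_time"] + metric_cols].to_string(index=False))
```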