Evaluating Base and Finetuned LLMs¶
In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare each against an OpenAI model. For this example, the base Llama2-7B and finetuned Llama2-7B models will be used.
Prerequisites¶
You need to ingest your data corpus and create a dataset from it. You can refer to the Data ingestion and creating dataset tutorial for more information on how to do this.
The dataset name used in this tutorial is `contracts`.
You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.
You need to deploy the base Llama2-7B and the finetuned Llama2-7B models on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial.
The names of the base and finetuned Llama2-7B deployments used in this tutorial are `llama27bbase` and `llama27bft`, respectively.
Export the following variables to your workspace by running the following commands on your DKubeX Terminal. Replace the `<username>` part with your DKubeX workspace name.

```
export PYTHONWARNINGS="ignore"
export OPENAI_API_KEY="dummy"
export NAMESPACE="<username>"
export HOMEDIR=/home/${NAMESPACE}
```
A few .yaml files are required for the evaluation process. On the Terminal application in the DKubeX UI, run the following commands:

```
git clone https://github.com/dkubeio/dkubex-examples.git
cd dkubex-examples
git checkout llamaidx
cp query/query.yaml ${HOMEDIR}/query.yaml && cp evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
```
Evaluating LLM Models¶
In this example, we will first evaluate the base Llama2-7B model against OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.
To evaluate the base Llama2-7B model, follow the steps provided below:
Provide the appropriate details in the `query.yaml` file, which will be used during the evaluation process. Run `vim query.yaml` and provide the following details:

Note

In the `chat_engine:url:` section, provide the endpoint URL of the deployed model to be used. The syntax for the URL is shown below. Replace the `<your username>` part with your username.

`"http://llama27bbase-serve-svc.<your username>:8000"`

You provide your own username here because the `llama27bbase` deployment was created from your workspace earlier. If you are going to use a model deployed by another user, provide that deployment's name in place of `llama27bbase` along with that user's username.

| Field | Sub-field | Description |
|---|---|---|
| input | question | The input question to be answered by the RAG system. |
| input | mode | The mode of interaction with the pipeline. |
| vectorstore_retriever | kind | Specifies the type of vector store retriever. |
| vectorstore_retriever | provider | Provider for the vector store retriever. |
| vectorstore_retriever | embedding_class | Class of embedding used for retrieval. |
| vectorstore_retriever | embedding_model | Name of the embedding model from Hugging Face. |
| vectorstore_retriever | dataset | Name of the ingested dataset. |
| vectorstore_retriever | textkey | Key identifying the text data within the dataset. |
| vectorstore_retriever | top_k | The number of results to retrieve per query. |
| prompt_builder | prompt_str | The prompt string used for generation. |
| prompt_builder | prompt_file | The file containing the prompt string. |
| nodes_sorter | max_sources | Maximum number of sources to consider during sorting. |
| reranker | model | Name of the re-ranker model from Hugging Face. |
| reranker | top_n | The number of results to re-rank. |
| contexts_joiner | separator | Separator used for joining different contexts. |
| chat_engine | llm | Specifies the LLM to be used for generation. |
| chat_engine | url | Service URL for the LLM deployment to be used. |
| chat_engine | llmkey | Authentication key for accessing the LLM service. |
| chat_engine | window_size | Size of the window for context generation. |
| chat_engine | max_tokens | Maximum number of tokens for generation. |
| tracking | experiment | MLflow experiment name for tracking. |
```yaml
input:
  question: ""
  mode: "cli"
vectorstore_retriever:
  kind: weaviate
  provider: dkubex
  embedding_class: HuggingFaceEmbedding
  embedding_model: 'BAAI/bge-large-en-v1.5'
  dataset: 'dataset001' # name of your dataset
  textkey: 'paperchunks'
  top_k: 3
prompt_builder:
  prompt_str: ""
  prompt_file: ""
nodes_sorter:
  max_sources: 3
reranker:
  model: 'BAAI/bge-reranker-large'
  top_n: 3
contexts_joiner:
  separator: "\n\n"
chat_engine:
  llm: dkubex # use "dkubex" for dkubex deployments
  url: "http://llama27bbase-serve-svc.<your username>:8000"
  llmkey: "dummy"
  window_size: 2
  max_tokens: 1024
tracking:
  experiment: query-experiment-1
```
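The `chat_engine:url:` value follows the in-cluster service pattern `http://<deployment>-serve-svc.<workspace>:8000` described in the note above. The following helper is purely illustrative (it is not part of DKubeX or `d3x`); it just makes the URL pattern explicit:

```python
def service_url(deployment: str, workspace: str, port: int = 8000) -> str:
    """Build the in-cluster service URL for an LLM deployment, following the
    pattern shown in this tutorial: http://<deployment>-serve-svc.<workspace>:<port>
    """
    return f"http://{deployment}-serve-svc.{workspace}:{port}"

# The base deployment from this tutorial, in a hypothetical workspace "alice":
print(service_url("llama27bbase", "alice"))
# http://llama27bbase-serve-svc.alice:8000
```

The same pattern yields the finetuned deployment's URL by swapping in `llama27bft` as the deployment name.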
Provide the appropriate details in the `eval.yaml` file, which will be used during the evaluation process. Run `vim eval.yaml` and provide the following details:

Note

Provide your own OpenAI API key in the `questions_generator:llmkey:` section.

In the `semantic_similarity_evaluator:llmurl:` section, provide the endpoint URL of the deployed model to be used. The syntax for the URL is shown below. Replace the `<your username>` part with your username.

`"http://llama27bbase-serve-svc.<your username>:8000"`

You provide your own username here because the `llama27bbase` deployment was created from your workspace earlier. If you are going to use a model deployed by another user, provide that deployment's name in place of `llama27bbase` along with that user's username.

| Field | Sub-field | Description |
|---|---|---|
| vectorstore_reader | kind | Specifies the type of vectorstore reader. |
| vectorstore_reader | provider | Indicates the provider of the vectorstore reader. |
| vectorstore_reader | properties | Lists the properties of the vectorstore reader. |
| questions_generator | prompt_str | Defines the strategy for generating prompts. |
| questions_generator | prompt_file | File containing a custom prompt. |
| questions_generator | num_questions_per_chunk | The number of questions to generate per data chunk. |
| questions_generator | max_chunks | The maximum number of data chunks to generate questions from. |
| questions_generator | llm | The language model (LLM) to use for generating questions. |
| questions_generator | llmkey | The API key for the chosen LLM. |
| questions_generator | llmurl | The URL where the chosen LLM service is deployed. |
| questions_generator | max_tokens | The maximum number of tokens allowed in each question. |
| retrieval_evaluator: vector_retriever | kind | The type of vector retriever. |
| retrieval_evaluator: vector_retriever | provider | The provider of the vector retriever. |
| retrieval_evaluator: vector_retriever | textkey | The key used to access the text data within the vector retriever. |
| retrieval_evaluator: vector_retriever | embedding_model | The name of the embedding model used for text representation. |
| retrieval_evaluator: vector_retriever | similarity_top_k | The number of similar items to retrieve for each query. |
| retrieval_evaluator | metrics | The evaluation metrics used for retrieval evaluation. |
| semantic_similarity_evaluator | prompt_str | Defines the strategy for similarity evaluation. |
| semantic_similarity_evaluator | prompt_file | File containing a custom prompt. |
| semantic_similarity_evaluator | llm | The language model (LLM) to use for semantic similarity evaluation. |
| semantic_similarity_evaluator | llmkey | Set to "dummy" for local deployments within DKubeX; otherwise used to pass the auth key for an external endpoint. |
| semantic_similarity_evaluator | llmurl | The URL where the chosen LLM service is deployed. |
| semantic_similarity_evaluator | max_tokens | The maximum number of tokens allowed in each semantic similarity evaluation prompt. |
| semantic_similarity_evaluator | metrics | The evaluation metric used for semantic similarity evaluation. |
| tracking | experiment | A unique name for the MLflow experiment, allowing tracking and comparison of different pipeline runs. |
```yaml
# Weaviate is supported as a vectorstore_reader as of now
vectorstore_reader:
  kind: weaviate
  provider: dkubex
  properties:
    - paperchunks
    - dkubexfm
questions_generator:
  prompt_str: "default"
  prompt_file: ""
  num_questions_per_chunk: 1
  max_chunks: 1
  llm: openai # dkubex
  llmkey: "sk-hjy*********************lkyij" # provide your OpenAI API key here
  llmurl: ""
  max_tokens: 2048
retrieval_evaluator:
  vector_retriever:
    kind: weaviate
    provider: dkubex
    textkey: paperchunks
    embedding_model: "BAAI/bge-large-en-v1.5"
    similarity_top_k: 3
  metrics:
    - mrr
    - hit_rate
semantic_similarity_evaluator:
  prompt_str: "default"
  prompt_file: ""
  llm: dkubex # dkubex
  llmkey: "dummy"
  llmurl: "http://llama27bbase-serve-svc.<your username>:8000" # service URL for the LLM deployment to be used; replace username with the workspace name in which the deployment was created
  max_tokens: 2048
  metrics:
    - similarity_score
tracking:
  # MLflow experiment name
  experiment: eval-experiment-1 # provide a unique experiment name
```
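The `retrieval_evaluator` reports `mrr` (mean reciprocal rank) and `hit_rate`. DKubeX computes these internally during the run; the standalone sketch below only illustrates what the two metrics measure, using hypothetical chunk IDs:

```python
def hit_rate(retrieved: list[list[str]], expected: list[str]) -> float:
    """Fraction of queries whose expected source chunk appears anywhere
    in the retrieved list."""
    hits = sum(1 for docs, gold in zip(retrieved, expected) if gold in docs)
    return hits / len(expected)

def mrr(retrieved: list[list[str]], expected: list[str]) -> float:
    """Mean reciprocal rank: average of 1/rank of the expected chunk,
    counting 0 when it was not retrieved at all."""
    total = 0.0
    for docs, gold in zip(retrieved, expected):
        if gold in docs:
            total += 1.0 / (docs.index(gold) + 1)
    return total / len(expected)

# Two hypothetical queries with similarity_top_k = 3:
retrieved = [["chunk7", "chunk2", "chunk9"],   # expected chunk2 found at rank 2
             ["chunk1", "chunk4", "chunk8"]]   # expected chunk5 not retrieved
expected = ["chunk2", "chunk5"]
print(hit_rate(retrieved, expected))  # 0.5
print(mrr(retrieved, expected))       # 0.25
```

A higher `similarity_top_k` tends to raise `hit_rate` (more chances to find the right chunk) while `mrr` additionally rewards ranking it near the top.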
Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the `<dataset name>` part with the name of the dataset created during ingestion (`contracts` in this example).

```
d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml
```

For example:

```
d3x dataset evaluate -d contracts --config ${HOMEDIR}/eval.yaml
```
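The `similarity_score` metric reported by the `semantic_similarity_evaluator` compares each generated answer against a reference answer. A common formulation (for example, in LlamaIndex-style evaluators) is the cosine similarity between embeddings of the two texts; the sketch below illustrates only that underlying computation with hypothetical embedding vectors, and DKubeX's exact scoring may differ:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: dot product
    divided by the product of their magnitudes. 1.0 means identical
    direction; 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings of a generated answer and a reference:
generated = [0.8, 0.6, 0.0]
reference = [0.6, 0.8, 0.0]
print(cosine_similarity(generated, reference))  # ≈ 0.96
```

Scores closer to 1.0 indicate the generated answer is semantically close to the reference, which is how the base and finetuned deployments can be compared across the same question set.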