# Evaluating Base and Finetuned LLMs
In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare them against OpenAI. For this example, the base Llama2-7B and finetuned Llama2-7B models will be used.
## Prerequisites
You need to ingest your data corpus and create a dataset from it. You can refer to the Data ingestion and creating dataset tutorial for more information on how to do this.
The dataset name used in this tutorial is `contracts`.
You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.
You need to deploy the base Llama2-7B and the finetuned Llama2-7B models on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial.
The names of the base and finetuned Llama2-7B deployments used in this tutorial are `llama27bbase` and `llama27bft`, respectively.
Export the following variables to your workspace by running the following commands on your DKubeX Terminal.
Replace the `<username>` part with your DKubeX workspace name.

```shell
export NAMESPACE="<username>"
export HOMEDIR=/home/${NAMESPACE}
```
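As an optional sanity check (a hypothetical helper, not part of the DKubeX tooling), you can confirm the required workspace variables are set before continuing with the later steps:

```python
import os

def missing_env(names, env=None):
    """Return the names of required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [n for n in names if not env.get(n)]

# Example with an explicit mapping standing in for os.environ:
print(missing_env(["NAMESPACE", "HOMEDIR"], {"NAMESPACE": "alice"}))  # ['HOMEDIR']
```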
A few .yaml files are required for the evaluation process.
On the Terminal application in DKubeX UI, run the following commands:
```shell
git clone -b v0.8.3 https://github.com/dkubeio/dkubex-examples.git
cd && cp dkubex-examples/rag/query/query.yaml ${HOMEDIR}/query.yaml && cp dkubex-examples/rag/evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
```
## Evaluating LLM Models
In this example, we will first evaluate the base Llama2-7B model, comparing it to the performance of OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.
To evaluate the base Llama2-7B model, follow the steps provided below:
Provide the appropriate details in the `query.yaml` file, which will be used during the evaluation process. Run `vim query.yaml` and provide the following details:

Note: In the `chat_engine: url:` section, provide the endpoint URL of the deployed model to be used. The endpoint URL can be found on the Deployments page of the DKubeX UI.

| Field | Sub-field | Description |
| --- | --- | --- |
| input | question | The input question to be answered by the RAG system. |
| | mode | The mode of interaction with the pipeline. |
| vectorstore_retriever | kind | Specifies the type of vector store retriever. |
| | provider | Provider for the vector store retriever. |
| | embedding_class | Class of embedding used for retrieval. |
| | embedding_model | Name of the embedding model from HuggingFace. |
| | dataset | Name of the ingested dataset. |
| | textkey | Key identifying the text data within the dataset. |
| | top_k | The number of results to retrieve per query. |
| prompt_builder | prompt_str | The prompt string used for generation. |
| | prompt_file | The file containing the prompt string. |
| nodes_sorter | max_sources | Maximum number of sources to consider during sorting. |
| reranker | model | Name of the re-ranker model from Hugging Face. |
| | top_n | The number of results to re-rank. |
| contexts_joiner | separator | Separator used for joining different contexts. |
| chat_engine | llm | Specifies the LLM to be used for generation. |
| | url | Service URL for the LLM deployment to be used. |
| | llmkey | Authentication key for accessing the LLM service. |
| | window_size | Size of the window for context generation. |
| | max_tokens | Maximum number of tokens for generation. |
| tracking | experiment | MLflow experiment name for tracking. |
```yaml
input:
  question: ""
  mode: "cli"
vectorstore_retriever:
  kind: weaviate
  vectorstore_provider: dkubex
  embedding_class: HuggingFaceEmbedding # Use 'HuggingFaceEmbedding' for embedding models from HuggingFace, or 'OpenAIEmbedding' for OpenAI embeddings
  embedding_model: 'BAAI/bge-large-en-v1.5' # Embedding model name
  llmkey: "" # API key for the embedding model (if required)
  textkey: 'paperchunks'
  top_k: 3
prompt_builder:
  prompt_str: ""
  prompt_file: ""
nodes_sorter:
  max_sources: 3
contexts_joiner:
  separator: "\n\n"
chat_engine:
  llm: dkubex # Use 'dkubex' for DKubeX deployments and 'openai' to use OpenAI API
  url: "https://123.45.67.890/deployment/1/llama27bbase/" # Endpoint URL for the DKubeX LLM deployment which will be used for generating responses. If using OpenAI, keep blank
  llmkey: "eyJh***********************JSg" # If using DKubeX deployment, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  window_size: 2
  max_tokens: 2048 # Maximum number of tokens to be used for generating responses
securellm: # SecureLLM configuration. Comment out this section if not using SecureLLM
  appkey: sk-zxr**************************ya # Provide SecureLLM Application Key to be used
  dkubex_url: "https://123.45.67.890:32443" # Provide the URL of the DKubeX deployment
tracking:
  experiment: dkubexfm-rag # Provide MLFlow experiment name
```
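Before launching a long evaluation run, a quick sanity check of `query.yaml` can catch missing fields. The helper below is hypothetical (not part of DKubeX): it naively scans the file text for top-level key names documented above, ignoring comments; adjust the required list to your setup:

```python
def missing_keys(yaml_text, required):
    """Return the required keys that never appear as 'key:' anywhere in the file."""
    present = set()
    for line in yaml_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" in stripped:
            present.add(stripped.split(":", 1)[0].strip())
    return [k for k in required if k not in present]

REQUIRED = ["embedding_model", "textkey", "top_k", "url", "llmkey", "max_tokens"]

sample = """\
chat_engine:
  llm: dkubex
  url: "https://123.45.67.890/deployment/1/llama27bbase/"
  llmkey: "token"
"""
print(missing_keys(sample, REQUIRED))  # ['embedding_model', 'textkey', 'top_k', 'max_tokens']
```

A real YAML parser (e.g. PyYAML, if installed) would validate nesting as well; this sketch only checks key presence.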
Provide the appropriate details in the `eval.yaml` file, which will be used during the evaluation process. Run `vim eval.yaml` and provide the following details:

Note:
- Provide your own OpenAI API key in the `questions_generator: llmkey:` section.
- In the `semantic_similarity_evaluator: llmurl:` section, provide the endpoint URL of the deployed model to be used. The endpoint URL can be found on the Deployments page of the DKubeX UI.
- In the `semantic_similarity_evaluator: llmkey:` section, provide the serving token of the deployed model to be used. To find the serving token, go to the Deployments page of the DKubeX UI, click on the deployed model, and copy the serving token from the Serving Token section.
| Field | Sub-field | Description |
| --- | --- | --- |
| vectorstore_reader | kind | Specifies the type of vectorstore reader. |
| | provider | Indicates the provider of the vectorstore reader. |
| | properties | Lists the properties of the vectorstore reader. |
| questions_generator | prompt_str | Defines the strategy for generating prompts. |
| | prompt_file | File containing a custom prompt. |
| | num_questions_per_chunk | Specifies the number of questions to generate per data chunk. |
| | max_chunks | Sets the maximum number of data chunks used to generate questions. |
| | llm | Determines the language model (LLM) to use for generating questions. |
| | llm_key | API key for the chosen LLM. |
| | llmurl | Indicates the URL where the chosen LLM service is deployed. |
| | max_tokens | Specifies the maximum number of tokens allowed in each question. |
| retrieval_evaluator | vector_retriever: kind | Indicates the type of vector retriever. |
| | vector_retriever: provider | Specifies the provider of the vector retriever. |
| | vector_retriever: textkey | Key used to access the text data within the vector retriever. |
| | vector_retriever: embedding_model | Name of the embedding model used for text representation. |
| | vector_retriever: similarity_top_k | Number of similar items to retrieve for each query. |
| | metrics | Evaluation metrics used for retrieval evaluation. |
| semantic_similarity_evaluator | prompt_str | Defines the strategy for similarity evaluation. |
| | prompt_file | File containing a custom prompt. |
| | llm | Specifies the language model (LLM) to use for semantic similarity evaluation. |
| | llmkey | Set to a dummy value for local deployments within DKubeX; used to pass the authentication key when using an external endpoint. |
| | llmurl | Indicates the URL where the chosen LLM service is deployed. |
| | max_tokens | Maximum number of tokens allowed in each semantic similarity evaluation prompt. |
| | metrics | Evaluation metric used for semantic similarity evaluation. |
| tracking | experiment | Unique name for the MLflow experiment, allowing tracking and comparison of different pipeline runs. |
```yaml
vectorstore_reader:
  kind: weaviate
  provider: dkubex
  properties:
    - paperchunks
    - dkubexfm
questions_generator: # Generates the questions to be used for the dataset evaluation
  prompt_str: "default"
  prompt_file: ""
  num_questions_per_chunk: 1 # Number of questions to be generated per chunk
  max_chunks: 100 # Maximum number of chunks to be used for question generation
  llm: openai # Language model to be used for question generation. To use OpenAI to generate questions, use 'openai'. To use DKubeX LLM deployment, use 'dkubex'
  llmkey: "sk-4aYW**********************ZRLQe" # If using DKubeX deployment, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  llmurl: "" # Endpoint URL for the DKubeX LLM deployment which will be used for generating responses. If using OpenAI, keep blank.
  max_tokens: 2048 # Maximum number of tokens to be used for generating questions
retrieval_evaluator:
  vector_retriever:
    kind: weaviate
    vectorstore_provider: dkubex
    textkey: paperchunks
    embedding_class: HuggingFaceEmbedding # Use 'HuggingFaceEmbedding' for embedding models from HuggingFace, or 'OpenAIEmbedding' for OpenAI embeddings
    embedding_model: "BAAI/bge-large-en-v1.5" # Embedding model name
    llmkey: "" # API key for the embedding model (if required)
    similarity_top_k: 3
  metrics:
    - mrr
    - hit_rate
semantic_similarity_evaluator:
  prompt_str: "default"
  prompt_file: ""
  llm: openai # Language model to be used for generating groundtruth responses for semantic similarity evaluation. To use DKubeX LLM deployment, use 'dkubex'. To use OpenAI, use 'openai'
  llmkey: "eyJh******************************dJSg" # If using DKubeX deployment for generating groundtruth responses, provide the serving_token for the deployment. If using OpenAI, provide the OpenAI API key
  url: "" # Endpoint URL for the DKubeX LLM deployment which will be used for generating groundtruth responses. If using OpenAI, keep blank
  max_tokens: 2048 # Maximum number of tokens to be used for generating groundtruth responses
  rag_configuration: "/absolute/path/to/rag/config" # Absolute path to the RAG config (query.yaml) file. This file contains the details of the LLM which is to be evaluated against the groundtruth responses.
  metrics:
    - similarity_score
tracking:
  experiment: dkubexfm-rag-evaluate # Provide MLFlow experiment name
```
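For intuition, the retrieval metrics named in `eval.yaml` (`hit_rate` and `mrr`) can be computed as follows. This is an illustrative sketch of the standard metric definitions, not the exact DKubeX evaluator code; the chunk IDs are made up:

```python
def hit_rate(ranked_ids, relevant_id):
    """1.0 if the relevant chunk appears anywhere in the retrieved list, else 0.0."""
    return 1.0 if relevant_id in ranked_ids else 0.0

def mrr(ranked_ids, relevant_id):
    """Reciprocal of the rank at which the relevant chunk appears (0.0 if absent)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# One query where the right chunk is retrieved at rank 2 of similarity_top_k=3:
print(hit_rate(["c7", "c3", "c9"], "c3"))  # 1.0
print(mrr(["c7", "c3", "c9"], "c3"))       # 0.5
```

In practice both are averaged over all generated questions, so higher values mean the vector retriever surfaces the source chunk earlier.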
Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the `<dataset name>` part with the name of the dataset created during ingestion (for this example, `contracts`).

```shell
d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml
```

For example:

```shell
d3x dataset evaluate -d contracts --config ${HOMEDIR}/eval.yaml
```
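To then evaluate the finetuned model, the only change needed is pointing the `chat_engine: url:` in `query.yaml` at the `llama27bft` deployment before re-running the same command. A minimal sketch of that edit, assuming the placeholder URLs shown earlier (a YAML library would be more robust than this naive text substitution):

```python
import re

def point_query_at(yaml_text, new_url):
    """Replace the value of the first standalone 'url:' entry (the chat_engine
    endpoint in query.yaml) with new_url. The \\b keeps it from touching
    keys like 'dkubex_url'."""
    return re.sub(r'(\burl:\s*)"[^"]*"', rf'\1"{new_url}"', yaml_text, count=1)

base_cfg = 'chat_engine:\n  llm: dkubex\n  url: "https://123.45.67.890/deployment/1/llama27bbase/"\n'
ft_cfg = point_query_at(base_cfg, "https://123.45.67.890/deployment/2/llama27bft/")
print(ft_cfg)
```

Remember to also update `llmkey` with the finetuned deployment's serving token before re-running `d3x dataset evaluate`.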