# Evaluating Base and Finetuned LLMs
In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare them against OpenAI. For this example, the base Llama2-7B and finetuned Llama2-7B models will be used.
## Prerequisites
- You need to ingest your data corpus and create a dataset from it. You can refer to the Data ingestion and creating dataset tutorial for more information on how to do this. The dataset name used in this tutorial is `contracts`.
- You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.
- You need to deploy the base Llama2-7B and the finetuned Llama2-7B models on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial. The names of the base and finetuned Llama2-7B deployments used in this tutorial are `llama27bbase` and `llama27bft`, respectively.
- Export the following variables to your workspace by running the following commands on your DKubeX Terminal. Replace the `<username>` part with your DKubeX workspace name.

  ```bash
  export PYTHONWARNINGS="ignore"
  export OPENAI_API_KEY="dummy"
  export NAMESPACE="<username>"
  export HOMEDIR=/home/${NAMESPACE}
  ```
- A few .yaml files are required for the evaluation process. On the Terminal application in the DKubeX UI, run the following commands (you can sanity-check the copied files with the sketch after this list):

  ```bash
  git clone https://github.com/dkubeio/dkubex-examples.git
  cd dkubex-examples
  git checkout llamaidx
  cp query/query.yaml ${HOMEDIR}/query.yaml && cp evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
  ```
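Optionally, before editing the copied files, you can verify that both are in place and parse as valid YAML. Below is a minimal sketch using Python and PyYAML; it assumes both are available in your workspace environment and that the files sit in your home directory as copied above.

```python
import os
import yaml  # PyYAML

home = os.path.expanduser("~")  # ${HOMEDIR} from the exported variables
for name in ("query.yaml", "eval.yaml"):
    path = os.path.join(home, name)
    # safe_load raises yaml.YAMLError if the file is malformed
    with open(path) as f:
        yaml.safe_load(f)
    print(f"{path}: valid YAML")
```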
## Evaluating LLM Models
In this example, we will first evaluate the base Llama2-7B model, comparing it against OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.
To evaluate the base Llama2-7B model, follow the steps provided below:
Provide the appropriate details in the `query.yaml` file, which will be used during the evaluation process. Run `vim query.yaml` and provide the following details:

**Note**

In the `chat_engine: url:` section, provide the endpoint URL of the deployed model to be used. The syntax for the URL is provided below. Replace the `<your username>` part with your username.

```
"http://llama27bbase-serve-svc.<your username>:8000"
```

You provide your own username here because the `llama27bbase` deployment was created from your workspace earlier. If you are going to use a model deployed by any other user, you will need to provide the proper deployment name in place of `llama27bbase` and the username of that user.

| Field | Sub-field | Description |
| --- | --- | --- |
| input | question | The input question to be answered by the RAG system. |
| | mode | The mode of interaction with the pipeline. |
| vectorstore_retriever | kind | Specifies the type of vector store retriever. |
| | provider | Provider for the vector store retriever. |
| | embedding_class | Class of embedding used for retrieval. |
| | embedding_model | Name of the embedding model from Hugging Face. |
| | dataset | Name of the ingested dataset. |
| | textkey | Key identifying the text data within the dataset. |
| | top_k | The number of results to retrieve per query. |
| prompt_builder | prompt_str | The prompt string used for generation. |
| | prompt_file | The file containing the prompt string. |
| nodes_sorter | max_sources | Maximum number of sources to consider during sorting. |
| reranker | model | Name of the re-ranker model from Hugging Face. |
| | top_n | The number of results to re-rank. |
| contexts_joiner | separator | Separator used for joining different contexts. |
| chat_engine | llm | Specifies the LLM to be used for generation. |
| | url | Service URL for the LLM deployment to be used. |
| | llmkey | Authentication key for accessing the LLM service. |
| | window_size | Size of the window for context generation. |
| | max_tokens | Maximum number of tokens for generation. |
| tracking | experiment | MLflow experiment name for tracking. |
```yaml
input:
  question: ""
  mode: "cli"
vectorstore_retriever:
  kind: weaviate
  provider: dkubex
  embedding_class: HuggingFaceEmbedding
  embedding_model: 'BAAI/bge-large-en-v1.5'
  dataset: 'dataset001'   # name of your dataset
  textkey: 'paperchunks'
  top_k: 3
prompt_builder:
  prompt_str: ""
  prompt_file: ""
nodes_sorter:
  max_sources: 3
reranker:
  model: 'BAAI/bge-reranker-large'
  top_n: 3
contexts_joiner:
  separator: "\n\n"
chat_engine:
  llm: dkubex             # use "dkubex" for DKubeX deployments
  url: "http://llama27bbase-serve-svc.<your username>:8000"
  llmkey: "dummy"
  window_size: 2
  max_tokens: 1024
tracking:
  experiment: query-experiment-1
```
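Before moving on, it can be worth confirming that the endpoint set in `chat_engine: url:` is reachable from your workspace. The sketch below only checks basic HTTP connectivity; the exact routes the deployment serves depend on how DKubeX exposes it, so treat the URL and response behavior as assumptions to adapt.

```python
import requests

# Replace <your username> with your DKubeX workspace name.
URL = "http://llama27bbase-serve-svc.<your username>:8000"

try:
    # Any HTTP response (even a 404) means the service is reachable;
    # a connection error usually means a wrong deployment name or namespace.
    resp = requests.get(URL, timeout=10)
    print(f"Endpoint reachable (HTTP {resp.status_code})")
except requests.exceptions.RequestException as err:
    print(f"Could not reach the deployment: {err}")
```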
Provide the appropriate details in the `eval.yaml` file, which will be used during the evaluation process. Run `vim eval.yaml` and provide the following details:

**Note**

- Provide your own OpenAI API key in the `questions_generator: llmkey:` section.
- In the `semantic_similarity_evaluator: llmurl:` section, provide the endpoint URL of the deployed model to be used. The syntax for the URL is provided below. Replace the `<your username>` part with your username.

  ```
  "http://llama27bbase-serve-svc.<your username>:8000"
  ```

  You provide your own username here because the `llama27bbase` deployment was created from your workspace earlier. If you are going to use a model deployed by any other user, you will need to provide the proper deployment name in place of `llama27bbase` and the username of that user.

| Field | Sub-field | Description |
| --- | --- | --- |
| vectorstore_reader | kind | Specifies the type of vectorstore reader. |
| | provider | Indicates the provider of the vectorstore reader. |
| | properties | Lists the properties of the vectorstore reader. |
| questions_generator | prompt_str | Defines the strategy for generating prompts. |
| | prompt_file | File containing a custom prompt. |
| | num_questions_per_chunk | Specifies the number of questions to generate per data chunk. |
| | max_chunks | Sets the maximum number of data chunks from which to generate questions. |
| | llm | Determines the language model (LLM) to use for generating questions. |
| | llmkey | The API key for the chosen LLM. |
| | llmurl | Indicates the URL where the chosen LLM service is deployed. |
| | max_tokens | Specifies the maximum number of tokens allowed in each question. |
| retrieval_evaluator: vector_retriever | kind | Indicates the type of vector retriever. |
| | provider | Specifies the provider of the vector retriever. |
| | textkey | Refers to the key used to access the text data within the vector retriever. |
| | embedding_model | Specifies the name of the embedding model used for text representation. |
| | similarity_top_k | Sets the number of similar items to retrieve for each query. |
| retrieval_evaluator | metrics | Specifies the evaluation metrics used for retrieval evaluation. |
| semantic_similarity_evaluator | prompt_str | Defines the strategy for similarity evaluation. |
| | prompt_file | File containing a custom prompt. |
| | llm | Specifies the language model (LLM) to use for semantic similarity evaluation. |
| | llmkey | Set to "dummy" for local deployments available within DKubeX, or used to pass an auth key when using an external endpoint. |
| | llmurl | Indicates the URL where the chosen LLM service is deployed. |
| | max_tokens | Specifies the maximum number of tokens allowed in each semantic similarity evaluation prompt. |
| | metrics | Specifies the evaluation metric used for semantic similarity evaluation. |
| tracking | experiment | Provides a unique name for the MLflow experiment, allowing for tracking and comparison of different runs of the pipeline. |
```yaml
# Weaviate is supported as a vectorstore_reader as of now
vectorstore_reader:
  kind: weaviate
  provider: dkubex
  properties:
    - paperchunks
    - dkubexfm
questions_generator:
  prompt_str: "default"
  prompt_file: ""
  num_questions_per_chunk: 1
  max_chunks: 1
  llm: openai             # or dkubex
  llmkey: "sk-hjy*********************lkyij"   # provide your OpenAI API key here
  llmurl: ""
  max_tokens: 2048
retrieval_evaluator:
  vector_retriever:
    kind: weaviate
    provider: dkubex
    textkey: paperchunks
    embedding_model: "BAAI/bge-large-en-v1.5"
    similarity_top_k: 3
  metrics:
    - mrr
    - hit_rate
semantic_similarity_evaluator:
  prompt_str: "default"
  prompt_file: ""
  llm: dkubex
  llmkey: "dummy"
  llmurl: "http://llama27bbase-serve-svc.<your username>:8000"   # service URL for the LLM deployment to be used; replace username with the workspace name in which the deployment was created
  max_tokens: 2048
  metrics:
    - similarity_score
tracking:
  # MLflow experiment name
  experiment: eval-experiment-1   # provide a unique experiment name
```
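For context on the two retrieval metrics configured above: `hit_rate` is the fraction of generated questions whose source chunk appears among the top-k retrieved results, and `mrr` (mean reciprocal rank) averages 1/rank of the first correct result over all questions. The following is an illustrative sketch of the arithmetic, not the evaluator's actual implementation:

```python
def retrieval_metrics(results):
    """results: list of (expected_id, ranked_ids) pairs, where ranked_ids
    are the chunk IDs returned by the retriever, best match first."""
    hits, reciprocal_ranks = 0, []
    for expected_id, ranked_ids in results:
        if expected_id in ranked_ids:
            hits += 1
            # ranks are 1-based: first position contributes 1.0, second 0.5, ...
            reciprocal_ranks.append(1.0 / (ranked_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(results)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}

# Example: 2 of 3 queries find their source chunk in the top-3 results.
print(retrieval_metrics([
    ("c1", ["c1", "c7", "c9"]),   # rank 1 -> RR 1.0
    ("c2", ["c5", "c2", "c8"]),   # rank 2 -> RR 0.5
    ("c3", ["c4", "c6", "c7"]),   # miss   -> RR 0.0
]))  # {'hit_rate': 0.666..., 'mrr': 0.5}
```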
Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the `<dataset name>` part with the name of the dataset created during ingestion (for this example, `contracts`).

```bash
d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml
```

For example:

```bash
d3x dataset evaluate -d contracts --config ${HOMEDIR}/eval.yaml
```
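The evaluation results are logged under the MLflow experiment named in `tracking: experiment:`. If you want to pull them programmatically rather than through the UI, a rough sketch with the `mlflow` client follows; the tracking URI is a placeholder and the metric column names are assumptions to verify against your actual runs.

```python
import mlflow

# Placeholder; point this at the MLflow tracking server used by DKubeX.
mlflow.set_tracking_uri("http://<mlflow-host>:5000")

# Returns a pandas DataFrame with one row per run in the experiment.
runs = mlflow.search_runs(experiment_names=["eval-experiment-1"])

# Logged metrics appear as "metrics.<name>" columns (e.g. metrics.mrr).
print(runs.filter(regex=r"^metrics\.").T)
```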