Evaluating Base and Finetuned LLMs

In this tutorial, we will evaluate the performance of a base and a finetuned LLM and compare them against the performance of OpenAI. For this example, the base Llama2-7B and a finetuned Llama2-7B model will be used.

Prerequisites

  • You need to ingest your data corpus and create a dataset from it. You can refer to the Data ingestion and creating dataset tutorial for more information on how to do this.

    • The dataset name used in this tutorial is contracts.

  • You need to finetune the Llama2-7B model on DKubeX. For a comprehensive guide on how to finetune a model, refer to the Finetuning Open Source LLMs tutorial.

  • You need to deploy the base Llama2-7B and the finetuned Llama2-7B model on DKubeX. To learn how to deploy an LLM on DKubeX, refer to the Deploying LLMs in DKubeX tutorial.

    • The names of the base and finetuned Llama2-7B deployments used in this tutorial are llama27bbase and llama27bft respectively.

  • Export the following variables to your workspace by running these commands on your DKubeX Terminal.

    • Replace the <username> part with your DKubeX workspace name.

      export PYTHONWARNINGS="ignore"
      export OPENAI_API_KEY="dummy"
      export NAMESPACE="<username>"
      export HOMEDIR=/home/${NAMESPACE}
      
  • A few .yaml configuration files are required for the evaluation process.

    • On the Terminal application in DKubeX UI, run the following commands:

      git clone https://github.com/dkubeio/dkubex-examples.git
      cd dkubex-examples
      git checkout llamaidx
      cp query/query.yaml ${HOMEDIR}/query.yaml && cp evaluation/eval.yaml ${HOMEDIR}/eval.yaml && cd
      

Evaluating LLM Models

In this example, we will first evaluate the base Llama2-7B model, comparing it to the performance of OpenAI, and then follow the same steps to evaluate the finetuned Llama2-7B model.

To evaluate the base Llama2-7B model, follow the steps provided below:

  • Provide the appropriate details in the query.yaml file, which will be used during the evaluation process. Run vim query.yaml and provide the following details:

    Note

    In the chat_engine:url: section, provide the endpoint URL of the deployed model to be used. The URL syntax is shown below; replace the <your username> part with your username.

    "http://llama27bbase-serve-svc.<your username>:8000"
    

    You provide your own username here because the llama27bbase deployment was created from your workspace earlier. If you are going to use a model deployed by another user, replace llama27bbase with that deployment's name and use that user's username.

    input
      • question: The input question to be answered by the RAG system.
      • mode: The mode of interaction with the pipeline.

    vectorstore_retriever
      • kind: Specifies the type of vector store retriever.
      • provider: Provider for the vector store retriever.
      • embedding_class: Class of embedding used for retrieval.
      • embedding_model: Name of the embedding model from Hugging Face.
      • dataset: Name of the ingested dataset.
      • textkey: Key identifying the text data within the dataset.
      • top_k: The number of results to retrieve per query.

    prompt_builder
      • prompt_str: The prompt string used for generation.
      • prompt_file: The file containing the prompt string.

    nodes_sorter
      • max_sources: Maximum number of sources to consider during sorting.

    reranker
      • model: Name of the re-ranker model from Hugging Face.
      • top_n: The number of results to re-rank.

    contexts_joiner
      • separator: Separator used for joining different contexts.

    chat_engine
      • llm: Specifies the LLM to be used for generation.
      • url: Service URL for the LLM deployment to be used.
      • llmkey: Authentication key for accessing the LLM service.
      • window_size: Size of the window for context generation.
      • max_tokens: Maximum number of tokens for generation.

    tracking
      • experiment: MLflow experiment name for tracking.
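
    Taken together, a filled-in query.yaml might look like the sketch below. Only the dataset name (contracts), the chat_engine URL syntax, and the "dummy" key come from this tutorial; every other value (embedding model, prompt, top_k, and so on) is an illustrative assumption, so check the query.yaml you copied for the actual defaults.

    ```yaml
    # Illustrative sketch of query.yaml. Only `dataset` and the
    # `chat_engine: url` syntax come from this tutorial; all other
    # values are assumptions to show the shape of the file.
    input:
      question: "What are the termination terms in the contract?"
      mode: "query"
    vectorstore_retriever:
      kind: "vectorstore"
      provider: "weaviate"
      embedding_class: "HuggingFaceEmbedding"
      embedding_model: "BAAI/bge-large-en-v1.5"
      dataset: "contracts"
      textkey: "text"
      top_k: 5
    prompt_builder:
      prompt_str: "Answer the question using only the provided context."
    nodes_sorter:
      max_sources: 3
    reranker:
      model: "BAAI/bge-reranker-large"
      top_n: 3
    contexts_joiner:
      separator: "\n\n"
    chat_engine:
      llm: "llama2-7b"
      url: "http://llama27bbase-serve-svc.<your username>:8000"
      llmkey: "dummy"
      window_size: 3
      max_tokens: 256
    tracking:
      experiment: "llama27bbase-query"
    ```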

  • Provide the appropriate details in the eval.yaml file, which will be used during the evaluation process. Run vim eval.yaml and provide the following details:

    Note

    • Provide your own OpenAI API key in the questions_generator:llmkey: section.

    • In the semantic_similarity_evaluator:llmurl: section, provide the endpoint URL of the deployed model to be used. The URL syntax is shown below; replace the <your username> part with your username.

      "http://llama27bbase-serve-svc.<your username>:8000"
      

      You provide your own username here because the llama27bbase deployment was created from your workspace earlier. If you are going to use a model deployed by another user, replace llama27bbase with that deployment's name and use that user's username.

    vectorstore_reader
      • kind: Specifies the type of vectorstore reader.
      • provider: Indicates the provider of the vectorstore reader.
      • properties: Lists the properties of the vectorstore reader.

    questions_generator
      • prompt_str: Defines the strategy for generating prompts.
      • prompt_file: File containing a custom prompt.
      • num_questions_per_chunk: Specifies the number of questions to generate per data chunk.
      • max_chunks: Sets the maximum number of data chunks from which to generate questions.
      • llm: Determines the language model (LLM) to use for generating questions.
      • llm_key: The API key for the chosen LLM.
      • llmurl: Indicates the URL where the chosen LLM service is deployed.
      • max_tokens: Specifies the maximum number of tokens allowed in each question.

    retrieval_evaluator
      • vector_retriever
          • kind: Indicates the type of vector retriever.
          • provider: Specifies the provider of the vector retriever.
          • textkey: The key used to access the text data within the vector retriever.
          • embedding_model: Specifies the name of the embedding model used for text representation.
          • similarity_top_k: Sets the number of similar items to retrieve for each query.
      • metrics: Specifies the evaluation metrics used for retrieval evaluation.

    semantic_similarity_evaluator
      • prompt_str: Defines the strategy for similarity evaluation.
      • prompt_file: File containing a custom prompt.
      • llm: Specifies the language model (LLM) to use for semantic similarity evaluation.
      • llmkey: Set to "dummy" for local deployments available within DKubeX, or used to pass the authentication key when using an external endpoint.
      • llmurl: Indicates the URL where the chosen LLM service is deployed.
      • max_tokens: Specifies the maximum number of tokens allowed in each semantic similarity evaluation prompt.
      • metrics: Specifies the evaluation metric used for semantic similarity evaluation.

    tracking
      • experiment: Provides a unique name for the MLflow experiment, allowing for tracking and comparison of different runs of the pipeline.
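
    A filled-in eval.yaml might look like the sketch below. Only the llmurl syntax, the OpenAI key placement, and the "dummy" key convention come from this tutorial; all other values (providers, models, metrics, limits) are illustrative assumptions, so check the eval.yaml you copied for the actual defaults.

    ```yaml
    # Illustrative sketch of eval.yaml. Only the `llmurl` syntax and
    # the "dummy"/OpenAI key conventions come from this tutorial;
    # other values are assumptions to show the shape of the file.
    vectorstore_reader:
      kind: "vectorstore"
      provider: "weaviate"
    questions_generator:
      num_questions_per_chunk: 2
      max_chunks: 20
      llm: "gpt-3.5-turbo"
      llm_key: "<your OpenAI API key>"
      max_tokens: 128
    retrieval_evaluator:
      vector_retriever:
        kind: "vectorstore"
        provider: "weaviate"
        textkey: "text"
        embedding_model: "BAAI/bge-large-en-v1.5"
        similarity_top_k: 5
      metrics: ["hit_rate", "mrr"]
    semantic_similarity_evaluator:
      llm: "llama2-7b"
      llmkey: "dummy"
      llmurl: "http://llama27bbase-serve-svc.<your username>:8000"
      max_tokens: 256
      metrics: ["semantic_similarity"]
    tracking:
      experiment: "llama27bbase-eval"
    ```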

  • Once done, run the following command to start the evaluation process of the base Llama2-7B model. Replace the <dataset name> part with the name of the dataset created during ingestion (for this example, contracts).

    d3x dataset evaluate -d <dataset name> --config ${HOMEDIR}/eval.yaml