Merging Finetuned Models¶
In this tutorial, we will go through the steps of merging finetuned model checkpoints on DKubeX.
Prerequisites¶
Make sure that your embedding model or LLM finetuning run has completed successfully. You can refer to the respective tutorials for finetuning embedding models and LLMs on DKubeX.
Finetuning tutorials
Finetuning Embedding Models: Tutorial regarding finetuning embedding models locally on DKubeX.
Finetuning LLMs: Tutorial regarding finetuning LLMs locally on DKubeX.
For this example, we will assume that you have successfully finetuned the meta-llama/Meta-Llama-3-8B-Instruct LLM model. The finetuning run name assumed for this example is llama-3-ft.

Make sure that at least one of the worker nodes of the cluster running DKubeX contains an NVIDIA A10 GPU (at minimum an AWS g5.4xlarge type instance, with at least 16 vCPU cores and 64 GiB of memory).

In case of an RKE2 cluster, make sure the node is labeled as a10 during installation.

In case of an AWS EKS cluster, make sure that the cluster contains a g5.4xlarge type nodegroup with a maximum size of 1 or more.
Make sure that you have an active Huggingface access token which has access to the model you are finetuning and merging. For this example you need to have an active access token for the meta-llama/Meta-Llama-3-8B-Instruct model on Huggingface. You can generate these tokens on the Access Tokens page on Huggingface. For more information, refer to the Huggingface documentation.
Open the DKubeX terminal and export the following environment variables. Replace <username> with your DKubeX username, and <access-token> with your Huggingface access token.

export HOMEDIR=/home/<username>
export HF_TOKEN=<access-token>

For example:

export HOMEDIR=/home/demo
export HF_TOKEN=hf_aJ0eX**************WJlIn0
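
As an optional sanity check before launching the merge, you can confirm that the exported token actually has access to the gated model. The short Python sketch below is only an illustration and assumes the huggingface_hub package is available in the DKubeX terminal environment.

# Optional sanity check (assumes the huggingface_hub package is installed):
# confirm the token can reach the gated meta-llama/Meta-Llama-3-8B-Instruct repo.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
print(api.whoami()["name"])  # account the token belongs to
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print("Token has access to:", info.id)  # raises an error if access was not granted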
Merging Finetuned Model Checkpoints¶
Create a Ray cluster on which the merge job will run by using the command provided below. Replace <cluster-name> with the name of the Ray cluster you want to create.

d3x ray create --name <cluster-name> --cpu 8 --memory 64 --gpu 1 --type a10

For this example, to create the Ray cluster, run:

d3x ray create --name ftmerge --cpu 8 --memory 64 --gpu 1 --type a10
Note

In case you are using an AWS EKS setup, please change the value of the --type flag from a10 to g5.4xlarge in the command.

Check the status of the Ray cluster by running the following command:

d3x ray list
Once the Ray cluster status becomes running, you can launch the merge job on the cluster with the following command. Replace the following in the command:

<merge-job-name>: Unique name for the merge job.
<ft-run-name>: Name of the finetuning run.
<ray-cluster-name>: Name of the Ray cluster on which the merge job will run.
<hf-token>: Huggingface access token.
d3x ft merge --name <merge-job-name> --ft-name <ft-run-name> --ray_cluster <ray-cluster-name> --hf-token <hf-token>
d3x ft merge --name llama3merge --ft-name llama-3-ft --ray_cluster ftmerge --hf-token ${HF_TOKEN}
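
For context, a merge job of this kind conceptually folds the finetuned adapter weights (for example, a LoRA checkpoint) back into the base model and saves a standalone checkpoint. The d3x ft merge command performs this for you on the Ray cluster; the Python sketch below only illustrates the idea using the transformers and peft libraries, with hypothetical checkpoint paths, and assumes the finetuning run produced a LoRA-style adapter.

# Illustration only: roughly what an adapter merge does under the hood.
# The checkpoint paths are hypothetical; the actual merge is performed by d3x ft merge.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16)
# Load the finetuned adapter on top of the base model.
model = PeftModel.from_pretrained(base, "/path/to/llama-3-ft/adapter-checkpoint")
# Fold the adapter weights into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("/path/to/merged-model")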
Once the merge run goes into the succeeded state, open MLFlow in the DKubeX workspace and open the experiment corresponding to the merge run to view the merge run metrics and artifacts, along with the recorded merged model checkpoint. The experiment name in MLFlow will be the same as the merge run name (llama3merge for this example).
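
If you prefer to inspect the merge run programmatically instead of through the MLFlow UI, a minimal sketch such as the one below can list the artifacts recorded for the run. It assumes the MLFLOW_TRACKING_URI environment variable points at the DKubeX MLFlow instance and uses the experiment name from this example (llama3merge).

# Minimal sketch: list the artifacts recorded by the merge run through the MLflow API.
# Assumes MLFLOW_TRACKING_URI is set to the DKubeX MLflow endpoint.
import mlflow

experiment = mlflow.get_experiment_by_name("llama3merge")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
run_id = runs.iloc[0]["run_id"]

client = mlflow.MlflowClient()
for artifact in client.list_artifacts(run_id):
    print(artifact.path)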