Introduction
Recent advances in Large Language Models (LLMs) have shown promise in systematic reasoning tasks, with open-source models like DeepSeek-R1 demonstrating impressive capabilities in breaking down complex problems into logical steps. By fine-tuning these reasoning-focused models for medical applications, we can create proof-of-concept AI assistants that could potentially support healthcare professionals in their clinical decision-making processes while maintaining transparent chains of reasoning. In this tutorial, we’ll explore how to leverage DigitalOcean’s GPU Droplets to fine-tune a distilled, quantized version of DeepSeek-R1, transforming it into a specialized reasoning assistant that can help analyze patient cases, suggest possible diagnoses, and provide verified, structured explanations for its recommendations.
Shoutout to this awesome DataCamp tutorial and the paper, HuatuoGPT-o1: Towards Medical Complex Reasoning with LLMs, for inspiring this tutorial.
Prerequisites
Knowledge of these prerequisites will be helpful for following along with this tutorial:
- Python and PyTorch
- Deep Learning fundamentals (e.g., neural networks, hyperparameters)
- Experience working with Hugging Face models and the Transformers library
When should we use Fine-Tuning?
Fine-tuning adapts a pre-trained model’s existing knowledge to perform specific tasks by training it further on a curated dataset. Fine-tuning shines in scenarios where consistent formatting, specific tone requirements, or complex instruction following are needed, as it can optimize the model’s behavior for these particular use cases. This approach typically requires fewer computational resources and less time than training a model from scratch. Before proceeding with fine-tuning, however, it is good practice for developers to first consider the advantages of alternatives such as prompt engineering, Retrieval-Augmented Generation (RAG), and even training a model from scratch.
| Alternative | Description |
| --- | --- |
| Prompt Engineering | Prompt engineering involves crafting precise instructions to guide the model’s behavior using its existing capabilities. We have tutorials that refine system prompts for specific use cases with DigitalOcean’s 1-click models: Getting Started with LLMs for Social Media Analytics & How to Create an Email Newsletter Generator. |
| Retrieval-Augmented Generation | In cases where the goal is to incorporate new or up-to-date information, Retrieval-Augmented Generation (RAG) is typically more appropriate. RAG allows the model to access external knowledge without modifying its underlying parameters. |
| Training From Scratch | Training a model from scratch can be beneficial in applications where model interpretability and explainability are desired. This approach gives you greater control over the model’s architecture, data, and decision-making process. |
One can also combine approaches, such as fine-tuning and RAG. By combining fine-tuning to establish a robust baseline with RAG to handle dynamic updates, the system achieves both adaptability and efficiency without requiring constant re-training. It really all comes down to organizational resource constraints and desired performance.
Monitoring whether outputs deliver on the standards of the intended utility, and iterating or pivoting if not, is absolutely critical.
Once we know that fine-tuning is the approach we want to take, we need to assemble the necessary components.
What do we need to Fine-Tune a Model?
A pre-trained model
A pre-trained model is a neural network that has already been trained on a large, general-purpose corpus of data. Hugging Face has a plethora of open-source models available for you to use.
In this tutorial, we will be using a very popular reasoning model, DeepSeek-R1. Reasoning models excel at intricate tasks like advanced problems in math or coding. We chose “unsloth/DeepSeek-R1-Distill-Llama-8B-bnb-4bit” because it is distilled and pre-quantized, making it a more memory-efficient and cost-effective model to perform experiments with. We were especially curious about its potential for complex tasks such as medical analysis. Note that using reasoning models for simpler tasks such as summarization or translation would be overkill, due to their tendency to be computationally expensive and verbose.
Dataset
Hugging Face has a great selection of datasets. We will be using the Medical O1 Reasoning Dataset. This dataset was generated with GPT-4o by searching for solutions to verifiable medical problems and validating them through a medical verifier.
This dataset will be used to perform supervised fine-tuning (SFT), where models are trained on a dataset of instructions and responses. To minimize the difference between the generated answers and ground-truth responses, SFT adjusts the weights in the LLM.
GPUs
GPUs aren’t ever basal to fine-tune a model. However, utilizing a GPU (or aggregate GPUs) tin velocity up the process significantly, particularly for larger models aliases datasets for illustration the ones utilized successful this tutorial. In this article, we will show you really you tin make usage of DigitalOcean GPU Droplets.
Before starting this tutorial, it is recommended to familiarize yourself with the following libraries and tools:
Unsloth
Unsloth is all about making LLM training faster, with a particular focus on fine-tuning. The FastLanguageModel class, part of the Unsloth library, provides a simplified abstraction for fine-tuning LLMs. This class can handle loading the trained model weights, preprocessing input text, and executing inference to generate outputs.
Transformer Reinforcement Learning (TRL)
The Hugging Face library TRL is used to train transformer language models with Reinforcement Learning. This tutorial will use its SFTTrainer class.
Transformers
Transformers is also a Hugging Face library. We will be using its TrainingArguments class to specify our desired arguments for SFTTrainer.
Weights and Biases
The W&B platform will be used for experiment tracking. Specifically, loss curves will be monitored.
Part 2: Implementation
Step 1: Set up a GPU Droplet and Launch Jupyter Labs
Follow this tutorial, “Setting Up the GPU Droplet Environment for AI/ML Coding”, to set up a GPU Droplet environment for our Jupyter Notebook.
Step 2: Install unsloth
```python
%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
```
Step 3: Configure Access Tokens
Hugging Face tokens can be obtained from the Hugging Face Access Token page. Note that you may need to create a Hugging Face account.
```python
from huggingface_hub import login

hf_token = "Replace with your actual token"
login(hf_token)
```
Similarly, you will also need a Weights & Biases account to obtain a token for this step.
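To enable experiment tracking later in the notebook, you can log in to W&B and start a run. Here is a minimal sketch; the project name below is an arbitrary placeholder, not part of the original tutorial.

```python
import wandb

# Paste your W&B API key when prompted (or pass it via key=...).
wandb.login()

# Start a tracking run; the project name is an illustrative placeholder.
run = wandb.init(project="deepseek-r1-medical-finetuning", job_type="training")
```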
Step 4: Loading the model and tokenizer
```python
from unsloth import FastLanguageModel

max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    dtype = None,
    token = hf_token,
)
```
Step 5: Testing Model Outputs Before Fine-Tuning
Creating a System Prompt
It is good practice to verify whether model outputs match your standards for format, quality, accuracy, etc. to assess whether fine-tuning is necessary. Since we are interested in reasoning, we will formulate a system prompt that elicits a chain of thought.
Instead of writing the prompt directly in our input, let’s start by writing a prompt template that incorporates placeholders.
In this prompt template, we will specify exactly what we are looking for.
prompt_template= """### Role: You are a aesculapian master specializing successful objective reasoning, diagnostics, and curen planning. Your responses should: - Be evidence-based and clinically relevant - Include differential diagnoses erstwhile appropriate - Consider diligent information and modular of care - Note immoderate important limitations aliases uncertainties ### Question: {question} ### Thinking Process: {thinking} ### Clinical Assessment: {response} """Notice the {thinking} placeholder. The superior extremity of this measurement is to instruct the LLM to explicitly articulate its reasoning process earlier providing the last answer. This is often what is referred to arsenic "chain-of-thought prompting”.
Inference with our System Prompt (Before Fine-tuning)
Here, we format the question using the system prompt (prompt_template) to ensure the model follows a logical reasoning process. We will tokenize the input, return the tokens as PyTorch tensors, and move them to the GPU (cuda) for faster inference.
```python
# `question` is a string containing the patient case we want the model to analyze.
# We leave the thinking and response placeholders empty so the model generates them.
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [prompt_template.format(question=question, thinking="", response="")],
    return_tensors="pt",
).to("cuda")
```
Afterwards, we will generate a response with the model, specifying key parameters like max_new_tokens=1200 (which limits the response length).
```python
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
```
To get the final readable answer, we will decode the output tokens back into text.
```python
response = tokenizer.batch_decode(outputs)
# Split on the final section header of our template to show only the model's answer.
print(response[0].split("### Clinical Assessment:")[1])
```
Feel free to experiment with different prompt formulations and see how they affect your outputs.
Step 6: Load the dataset
The dataset we’re using, FreedomIntelligence/medical-o1-reasoning-SFT, has three columns: Question, Complex_CoT, and Response.
We will create a function (formatting_prompts_func) to format the input prompts in the dataset.
```python
from datasets import load_dataset

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        # Fill the template with the question, chain of thought, and answer,
        # and append the end-of-sequence token.
        text = prompt_template.format(question=input, thinking=cot, response=output) + tokenizer.eos_token
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:500]", trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched=True)
dataset["text"][0]
```
Step 7: Prepare the Model for Parameter Efficient Fine-Tuning (PEFT)
Instead of updating all of the model’s parameters during fine-tuning, PEFT methods typically modify only a small subset of parameters, resulting in savings in computational power and time.
Here is an overview of some of the parameters and arguments we will be using in the .get_peft_model method of Unsloth’s FastLanguageModel class; a sketch of the full call follows the table.
| Parameter | Description | Notes |
| --- | --- | --- |
| r | LoRA rank. This determines the number of trainable adapters. | Select any number greater than 0; recommended values are 8, 16, 32, 64, and 128. Note that a higher rank yields more intelligent, but slower, model outputs. |
| target_modules | The modules (layers) within the transformer architecture where LoRA will be applied. | q_proj, k_proj, v_proj: the query, key, and value projection layers in the attention mechanism. Fine-tuning these is important for adapting the model’s attention to the new task. o_proj: the output projection layer in the attention mechanism. gate_proj, up_proj, down_proj: the projection layers in the feed-forward network (FFN) part of the transformer block. Fine-tuning these can help the model learn task-specific representations in the FFN. |
| lora_alpha | A scaling factor for the LoRA updates. It helps control the magnitude of the updates applied to the original weights. It’s related to the learning rate, and tuning it can be important for performance. | Often set to a multiple of r (e.g., 2r or 4r). |
| lora_dropout | The dropout probability applied to the LoRA updates. Dropout is a regularization technique that helps prevent overfitting. | When set to 0, no dropout is applied. You might increase this if you observe overfitting. |
| bias | Indicates how biases, which are constants added to offset the output, are handled by the model. | Set to “none” if no bias is to be added. Other possible arguments include “all” or “lora_only”, specifying which layers bias is added to. |
| use_gradient_checkpointing | Gradient checkpointing is a technique to reduce memory usage during training at the cost of some extra computation. It recomputes activations during the backward pass instead of storing them. | The “unsloth” argument can be used for an optimized implementation of gradient checkpointing for long contexts within the Unsloth library. Alternatively, this argument can be set to True for standard gradient checkpointing (to save memory at the expense of a slower backward pass) or False to disable it. |
| random_state | Sets the random seed for initializing the LoRA weights. Using a fixed random seed ensures reproducibility: you’ll get the same results if you run the code again with the same seed. | It doesn’t matter what value this is, as long as it’s consistent throughout your code. |
| use_rslora | rsLoRA introduces a scaling factor to stabilize gradients during training, addressing the issue of gradient collapse that can occur in standard LoRA as the rank increases. | rsLoRA is applied when set to True (sets the adapter scaling factor to lora_alpha/math.sqrt(r)); this is recommended for higher r values. The default value is False (scaling factor of lora_alpha/r). |
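Putting these together, here is a minimal sketch of the call. The specific values (rank, alpha, seed) are illustrative assumptions rather than requirements; tune them for your own run.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,                         # scaling factor for the LoRA updates
    lora_dropout=0,                        # no dropout on the LoRA updates
    bias="none",                           # no additional bias terms
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing for long contexts
    random_state=3407,                     # fixed seed for reproducibility
    use_rslora=False,                      # standard LoRA scaling (lora_alpha / r)
)
```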
Now that we’ve evaluated exemplary outputs, it is clip to usage our SFT dataset to fine-tune the pre-trained model.
Step 8: Model Training with SFTTrainer
The Supervised Fine-tuning Trainer (SFTTrainer) is a class from TRL for creating supervised fine-tuned models.
We will also be using the TrainingArguments class to configure the training run; a sketch of the full trainer setup follows the table below.
Training Arguments
| Parameter | Description | Typical Values |
| --- | --- | --- |
| per_device_train_batch_size | Number of samples processed per device/GPU during a training step. | Typically powers of 2: 1, 2, 4, 8, 16, 32… |
| gradient_accumulation_steps | Number of forward passes to accumulate before performing a backward pass. | Higher values allow for larger effective batch sizes. (Effective batch size = per_device_train_batch_size * gradient_accumulation_steps.) |
| warmup_steps | Number of steps for the learning rate warmup phase. | Non-negative integer, typically 5-10% of total training steps (max_steps). |
| max_steps | Total number of training steps to perform. | Positive integer, depends on dataset size and training needs. |
| learning_rate | Step size used for model weight updates. | Typically between 1e-5 and 1e-3 (e.g., 2e-4, 3e-4, 5e-5). |
| fp16, bf16 | Control whether to use 16-bit floating point (fp16) or brain floating point (bf16) precision. | Enables mixed-precision training for faster training, if supported by the hardware. Potential values include not is_bfloat16_supported() or is_bfloat16_supported(). |
| logging_steps | How often to log training metrics. | Positive integer indicating the interval of steps to pass before logging the training metrics. The value chosen involves striking a balance between having enough information to track training progress and keeping the overhead of logging manageable. |
| optim | Optimization algorithm for training. | adamw 8-bit performs similarly to adamw (a popular, robust optimizer), but with reduced GPU memory usage, making it a recommended choice. |
| weight_decay | A regularization technique to prevent overfitting, where the value corresponds to the amount of weight decay to apply. | A float value that defaults to 0. |
| lr_scheduler_type | Schedule for learning rate adjustments. | The default and suggested value is “linear”. Other alternatives include “cosine”, “polynomial”, etc., and may be chosen to achieve faster convergence. |
| seed | Random seed for reproducibility. | It doesn’t matter what value this is, as long as it’s consistent throughout your code. |
| output_dir | Location to save training outputs. | A string of the directory path. |
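Here is a minimal sketch of the trainer setup using these arguments. The hyperparameter values are illustrative assumptions, not prescriptions, and depending on your TRL version, dataset_text_field and max_seq_length may need to move into an SFTConfig instead of being passed directly.

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",            # column produced by formatting_prompts_func
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,    # effective batch size of 8
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="wandb",                # assumes wandb.init() was run earlier; remove to skip W&B logging
    ),
)
```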
This command will start the training process.
```python
trainer_stats = trainer.train()
```
Step 9: Monitoring Experiments
Experiment tracking can be done with Weights and Biases. Essentially, we want to ensure that the training loss decreases over time, confirming that model performance is improving with fine-tuning.
If model performance is degrading, it may be worth experimenting with the hyperparameter values.
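If you want a quick local check alongside the W&B dashboard, the trainer keeps the same logged metrics in memory; a minimal sketch:

```python
# Print the training loss recorded every `logging_steps` steps.
# These are the same values Weights & Biases plots as the loss curve.
for entry in trainer.state.log_history:
    if "loss" in entry:
        print(f"step {entry['step']}: loss {entry['loss']:.4f}")
```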
Step 10: Model Inference After Fine-Tuning
question = "A 58-year-old female reports a 3-year history of urine leakage erstwhile laughing, exercising, aliases lifting dense objects. She denies immoderate nighttime incontinence aliases feelings of urgency. On beingness exam, she demonstrates urine nonaccomplishment pinch Valsalva maneuver, and a Q-tip trial shows hypermobility of the urethrovesical junction pinch a 45-degree excursion. What would urodynamic testing astir apt show regarding her post-void residual measurement and detrusor musculus activity?" FastLanguageModel.for_inference(model) inputs = tokenizer([prompt_template.format(question, "")], return_tensors="pt").to("cuda") outputs = model.generate( input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=1200, use_cache=True, ) response = tokenizer.batch_decode(outputs) print(response[0].split("### Response:")[1])Step 11: Saving the Model Locally
```python
new_model_local = "DeepSeek-R1-Medical-COT"
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)
model.save_pretrained_merged(new_model_local, tokenizer, save_method="merged_16bit")
```
Step 12: Pushing the Model to HuggingFace Hub
If it is desirable to make the model accessible and beneficial to the wider AI community, we can publish the adapter, tokenizer, and model to the Hugging Face Hub. This will allow others to easily merge our model into their own projects and systems.
```python
new_model_online = "HuggingFaceUSERNAME/DeepSeek-R1-Medical-COT"
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)
model.push_to_hub_merged(new_model_online, tokenizer, save_method="merged_16bit")
```
Conclusion
Fine-tuning is how smart teams transform those pre-trained models into precise, targeted tools that solve real problems. Here, we’re not reinventing the wheel, but rather aligning these wheels so that they take us where we want to go. While pre-trained models are powerful, they can be generic, with outputs that may lack the structure and detail characteristic of professional-grade work.
We hope that through this tutorial, you gained an intuition about when to use and fine-tune reasoning models, as well as some inspiration to further refine this application for your use case.
References and Additional Resources
- Fine-Tuning DeepSeek R1 (Reasoning Model) | DataCamp
- HuatuoGPT-o1: Towards Medical Complex Reasoning with LLMs
- Train your own R1 reasoning model locally (GRPO)
- Unslothai Llama3.1_(8B)-GRPO.ipynb
- Fine-Tuning Your Own Llama 3 Model