Deploy Fine-Tuned LLM to Prod with BYOM + Dedicated Inference

Jun 01, 2026 07:00 AM - 3 weeks ago 22010

Introduction

Picture a merchandise squad that fine-tuned a language model connected objective notes. In the lab, prompts and answers looked great. Then personification asked the difficult question: How do we vessel this to the diligent app without sending PHI done a shared nationalist API?

That spread shows up everywhere, not only successful healthtech. You mightiness have:

  • A support bot trained connected soul runbooks
  • A ineligible adjunct tuned connected your statement style
  • A coding helper adapted to your backstage codebase patterns

Training gets you a files of weights. Production needs thing else: a unchangeable URL, predictable latency erstwhile galore users deed it astatine once, and web rules that support postulation wrong your unreality boundary.

Serverless inference connected DigitalOcean is fantabulous erstwhile you want to telephone instauration models quickly and salary per token. You do not upload your ain fine-tuned weights location successful the aforesaid way.

Dedicated Inference gives you a GPU reserved for your model. BYOM (Bring Your Own Model) is really you registry those weights successful the Model Catalog truthful the level knows what to load.

Let’s first understand erstwhile to usage Dedicated Inference and erstwhile not to.

When to usage Dedicated Inference and erstwhile not to?

Dedicated Inference makes consciousness erstwhile you want afloat power complete the exemplary being served (for example, your ain fine-tuned weights, a proprietary model, aliases thing not disposable successful the default catalog) and you expect users aliases applications to deed the endpoint pinch steady, production-like traffic.

With Dedicated, your exemplary lives connected its ain reserved GPU, down a backstage endpoint wrong your VPC. This intends you get accordant latency, web isolation, and the expertise to use civilization entree controls.

It’s the correct prime erstwhile you attraction astir privacy, compliance, predictable performance, aliases request to tally models that can’t vessel arsenic a instauration exemplary successful Serverless. The tradeoff: you’ll salary for the reserved GPU while the deployment exists, moreover if requests are infrequent.

Dedicated is simply a bully fit Try thing else
Your fine-tuned aliases civilization weights Catalog-only exemplary → Serverless
Steady traffic; good pinch per GPU hour Spiky aliases mini postulation → Serverless (per token, nary idle GPU bill)
Private GPU, VPC endpoint, aliases compliance needs Quick tests without civilization weights
Live API aliases chat Overnight bulk jobs → Batch Inference

You tin publication much astir the differences betwixt Serverless, Dedicated, and Batch Inference and scaling Dedicated vs Serverless.

Prerequisites

Before you start, make judge you have:

  • A DigitalOcean account.
  • Dedicated Inference and BYOM
  • A VPC successful ATL1, NYC2, TOR1, aliases RIC1, positive a ;Droplet successful that VPC for the Step 4 test.
  • Weights successful Spaces aliases Hugging Face spot BYOM limits
  • Optional: a GPU Droplet for fine-tuning; Firewalls erstwhile you fastener down production

Same region for import, VPC, and deploy erstwhile you can, cross-region works, but first requests tin consciousness slow.

Region tip: Use the aforesaid region for import penchant and deployment erstwhile possible. Cross-region works, but the first petition aft deploy whitethorn consciousness slower while information and GPUs align.

A speedy representation of terms

Term Think of it as
Fine-tuning Teaching a guidelines exemplary your tone, format, aliases domain vocabulary connected illustration pairs
Weights / checkpoint The saved files that correspond what the exemplary learned
BYOM Uploading your weights into DigitalOcean Model Catalog
Dedicated Inference A managed GPU endpoint that serves your imported model
VPC A backstage web segment. Only resources wrong it scope each different easily
Private endpoint An conclusion URL that is meant to beryllium called from wrong that VPC
OpenAI-compatible API Same JSON style arsenic chat/completions, truthful galore SDKs activity unchanged

What you will build

[Optional] GPU Droplet Hugging Face aliases Spaces | | v v fine-tuned weights ---------> BYOM import (My Models) | v your app successful VPC ------------> Dedicated Inference (private URL)

By the extremity you will have:

  1. A exemplary registered nether My Models pinch position Ready.
  2. A Dedicated Inference deployment successful position Active.
  3. A successful trial connection from wrong your VPC.
  4. A shape you reuse for the existent exertion backend.

How to usage this tutorial

  • Work successful a lab project aliases sandbox relationship first.
  • Keep a elemental notes file: exemplary sanction you chose, region, whether the nationalist endpoint is connected aliases off, and wherever you stored the entree token.
  • Do not skip the trial measurement successful Step 4. It is the impervious that serving useful earlier you ligament the app.
  • If you already person fine-tuned files connected Hugging Face, commencement astatine Step 2.

The 5 steps astatine a glance

Step Goal Control Panel path
1 Create your civilization weights (optional) GPU Droplet
2 Register weights successful the catalog INFERENCEModel CatalogMy Models
3 Turn weights into a unrecorded GPU endpoint INFERENCEDedicated Inference
4 Prove the endpoint answers from your VPC Deployment Overview + curl
5 Check quality, compliance, and metrics Model Evaluations, Observability, settings

Step 1: Fine-tune connected a GPU Droplet (optional)

A guidelines exemplary from the net speaks generic language. Your merchandise needs vocabulary, information style, and reply format that lucifer your users. Fine-tuning is the accustomed measurement to get there. You show the exemplary thousands of examples of bully input and bully output, and it adjusts its weights.

If personification connected your squad already handed you a vanished Hugging Face folder, skip to Step 2. You do not request to train twice.

What you will do successful the Control Panel

  1. Create a GPU Droplet successful the region you scheme to usage later for inference.
  2. SSH in, instal Python tooling, and tally training.
  3. Save a normal Hugging Face directory and upload it.

Prepare a mini dataset

Many beginners commencement pinch a nationalist aesculapian Q&A group to study the plumbing. Do not put existent diligent information successful a tutorial lab. For production, portion identifiers first.

We’ll create a Python book to load and preprocess your dataset for fine-tuning.

1. Create a caller Python file:

Save the pursuing codification arsenic prepare_dataset.py connected your GPU Droplet.

# If you spot "ModuleNotFoundError: No module named 'datasets'", # you request to instal the 'datasets' package first. # Run this successful your Droplet's terminal earlier moving the script: # pip instal datasets from datasets import load_dataset # Load a nationalist demo dataset. # You tin switch this for your ain CSV aliases JSON later for accumulation use. dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en") def preprocess_function(example): # "prompt" is what the personification asks; "completion" is the perfect answer. return { "prompt": [{"role": "user", "content": example["Question"]}], "completion": [ { "role": "assistant", "content": ( f"<thinking>{example['Complex_CoT']}</thinking>" f"{example['Response']}" ), } ], } # Apply the preprocessing usability to each example. processed_dataset = dataset.map( preprocess_function, remove_columns=["Question", "Response", "Complex_CoT"], ) # Optionally, prevention the processed dataset for later steps. processed_dataset.save_to_disk("./processed_medical_chat") print("Processed dataset saved to ./processed_medical_chat")

2. Run the script:

python prepare_dataset.py

Output

Processed dataset saved to ./processed_medical_chat

This will create a processed_medical_chat files successful your moving directory. You’ll constituent your trainer astatine this files successful the adjacent step.

Notes:

  • load_dataset makes it easy to propulsion distant information truthful you don’t person to hand-build JSON files arsenic a beginner.
  • The punctual and completion keys usage the format expected by TRL SFTTrainer.
  • The <thinking>...</thinking> wrapper is conscionable a style example—a measurement to train chain-of-thought responses. Remove aliases alteration it to lucifer your needs aliases information guidelines.

Adapt this book if you request to usage your ain dataset (for example, switch the dataset loading statement pinch a section CSV aliases JSONL path).

Install training devices connected the Droplet

On a GPU Droplet, instal PyTorch pinch a CUDA build that matches the driver connected the image. A plain pip instal torch often pulls a instrumentality compiled for a newer CUDA than your Droplet driver supports, which makes torch.cuda.is_available() return False and breaks bf16 training.

1. Check the GPU and driver first:

nvidia-smi

You should spot a GPU sanction and a driver version. If nvidia-smi fails, hole the Droplet image aliases GPU connect earlier continuing—training a 8B-class exemplary connected CPU is not applicable for this tutorial.

2. Install PyTorch, past the rest:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 pip install transformers datasets trl accelerate peft bitsandbytes "jinja2>=3.1.0"

jinja2>=3.1.0 is required because SFTTrainer tokenizes punctual / completion chat rows pinch apply_chat_template. Ubuntu images often vessel jinja2 3.0.x, which triggers ImportError: apply_chat_template requires jinja2>=3.1.0.

If cu121 still warns astir the driver, effort cu118 from the aforesaid scale URL, aliases usage DigitalOcean’s AI/ML GPU image erstwhile you create the Droplet (Jupyter connected GPU Droplets)—those images usually vessel pinch a matched driver and CUDA stack.

3. Confirm PyTorch sees the GPU:

python -c "import torch; print('cuda:', torch.cuda.is_available()); print('device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() other 'none')"

Output

cuda: True device: NVIDIA H200

Only tally train_model.py aft cuda: True. If you spot cuda: False, do not alteration bf16 aliases fp16—fix the PyTorch/CUDA instal first.

4. Confirm Jinja2 is caller capable for chat templates:

python -c "import jinja2; print(jinja2.__version__)"

You request 3.1.0 aliases higher. If you spot 3.0.x, upgrade:

pip install --upgrade "jinja2>=3.1.0"
  • torch runs the mathematics connected the GPU.
  • transformers loads the guidelines exemplary architecture.
  • trl supplies SFTTrainer, which hides a batch of training boilerplate.
  • accelerate / peft thief erstwhile you usage businesslike fine-tuning modes later.
  • jinja2 renders chat templates erstwhile TRL tokenizes your punctual / completion fields.

1. Create a caller Python record connected the Droplet:

Save the pursuing arsenic train_model.py successful the aforesaid directory wherever you ran prepare_dataset.py (so the comparative way to the dataset files resolves correctly).

# Run this book connected your GPU Droplet aft confirming: # - CUDA is disposable (torch.cuda.is_available() returns True) # - jinja2 >= 3.1.0 is installed import sys import torch from datasets import load_from_disk from trl import SFTConfig, SFTTrainer # Path to the processed dataset generated by prepare_dataset.py PROCESSED_DATA_DIR = "./processed_medical_chat" # Output directory wherever the fine-tuned exemplary will beryllium saved OUTPUT_DIR = "./my-fine-tuned-model" # Base exemplary from Hugging Face Hub (may request authentication for gated models) BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct" # Check for GPU availability if not torch.cuda.is_available(): sys.exit( "CUDA is not available. Run: python -c \"import torch; print(torch.cuda.is_available())\"\n" "If False, reinstall PyTorch pinch an due instrumentality aliases hole your GPU drivers." ) # Detect optimal precision settings based connected your GPU use_bf16 = torch.cuda.is_bf16_supported() use_fp16 = not use_bf16 and torch.cuda.get_device_capability(0)[0] >= 7 print(f"Training pinch bf16={use_bf16}, fp16={use_fp16}") # Load the preprocessed dataset dataset = load_from_disk(PROCESSED_DATA_DIR) train_dataset = dataset["train"] if "train" in dataset else dataset # Set up supervised fine-tuning pinch the TRL SFTTrainer trainer = SFTTrainer( model=BASE_MODEL, train_dataset=train_dataset, args=SFTConfig( output_dir=OUTPUT_DIR, num_train_epochs=1, per_device_train_batch_size=1, gradient_accumulation_steps=4, learning_rate=2e-5, logging_steps=10, save_steps=500, bf16=use_bf16, fp16=use_fp16, ), ) trainer.train() trainer.save_model(OUTPUT_DIR) print(f"Saved fine-tuned weights to {OUTPUT_DIR}")

What this book does:

  • Checks that a compatible NVIDIA GPU is available, and exits pinch instructions if not.
  • Detects whether the GPU supports businesslike mixed-precision training (bf16 aliases fp16) and sets training flags accordingly.
  • Loads your preprocessed training information from disk.
  • Uses Hugging Face’s TRL SFTTrainer to fine-tune a guidelines connection exemplary (Qwen/Qwen2.5-1.5B-Instruct) connected your civilization dataset.
  • Configures basal training parameters: 1 epoch, batch size of 1, gradient accumulation, and regular model/logging checkpoints.
  • Trains the exemplary and saves the resulting fine-tuned weights successful ./my-fine-tuned-model, fresh for conclusion and BYOM deployment.
  • Prints a confirmation erstwhile the process finishes and the output files is ready.

2. Run the script:

Confirm Hugging Face auth if you usage a gated guidelines exemplary (huggingface-cli whoami). If tokenization grounded earlier pinch a Jinja2 error, tally pip instal --upgrade "jinja2>=3.1.0" first. Then:

python train_model.py

Training tin return a agelong clip depending connected GPU size, model, and epoch count. You should spot measurement logs and a falling nonaccomplishment if the tally is healthy. You tin publication much astir LLM fine-tuning.

Output

... Saved fine-tuned weights to ./my-fine-tuned-model

3. Confirm the output files exists:

ls -la ./my-fine-tuned-model

You should spot astatine slightest config.json, tokenizer files, and 1 aliases much .safetensors weight files. That directory is what BYOM imports—not the .py scripts.

Notes:

  • model= is the starting checkpoint from Hugging Face. Larger models request larger GPUs; trim exemplary size aliases usage QLoRA (not covered here) if you deed out-of-memory errors.
  • train_dataset must beryllium the preprocessed information from prepare_dataset.py. If you skipped that step, switch load_from_disk(...) pinch your ain load_dataset(...) telephone and the aforesaid punctual / completion format.
  • Tune num_train_epochs, per_device_train_batch_size, and learning_rate utilizing TRL SFTTrainer guidance and short trial runs connected your Droplet earlier a afloat training job.
  • save_model writes the files you upload successful the adjacent subsection. Track nonaccomplishment during training; if nonaccomplishment flatlines aliases explodes, hole training earlier you salary for conclusion GPUs.

Export files BYOM will accept

Your files should look for illustration a modular Hugging Face repo:

  • config.json (architecture and hyperparameters)
  • Tokenizer files specified arsenic tokenizer.json aliases tokenizer_config.json
  • Weight files successful Safetensors format (.safetensors)

Check BYOM limits for supported architectures.

Upload aft training

To Hugging Face:

huggingface-cli upload your-org/your-model-name ./my-fine-tuned-model .
  • your-org/your-model-name becomes the repo ID you paste into BYOM successful Step 2.
  • ./my-fine-tuned-model is the section directory save_model created.
  • The last . intends upload the contents of the folder, not the files sanction arsenic a azygous file.

To Spaces: sync the aforesaid directory into a bucket prefix utilizing the Control Panel upload UI aliases s3cmd pinch your Spaces endpoint URL.

Step 2: Import your exemplary pinch BYOM

Why this measurement exists

Dedicated Inference does not publication random files from your laptop. BYOM is the handoff: DigitalOcean copies and validates your weights, stores them successful a managed location, and lists the exemplary nether My Models. Only past tin you connect the exemplary to a GPU deployment.

Think of BYOM arsenic airdrome information for exemplary folders. It checks record types, architecture, and tokenizer beingness earlier thing touches a accumulation GPU.

  1. INFERENCEModel CatalogMy ModelsImport Model
  2. Pick Hugging Face aliases Spaces
  3. Fill successful exemplary name, description, tags, and preferred GPU region
  4. Add a Hugging Face token if the repo is gated
  5. Accept position and confirm

You should spot the exemplary successful My Models pinch position Ready.

BYOM Import Model

Wait for the correct status

Status Meaning Your action
Importing Copy and validation successful progress Wait. Large models return longer.
Ready You deploy successful Step 3 Proceed
Failed Layout, architecture, aliases permissions problem Open the error, hole files aliases token, import again

Official reference: How to Import Your Own Models (BYOM).

Step 3: Deploy Dedicated Inference

Importing registers what to serve. Deploying allocates where it runs: GPU type, VPC, endpoints, and autoscaling hooks managed by the platform. Your app will yet telephone this deployment, not the Hugging Face URL.

Dedicated Inference is aimed astatine dependable postulation connected your weights. You salary for GPU clip while the deployment exists. That is different from serverless inference, wherever you salary per token connected shared instauration models. For a longer comparison, spot Dedicated vs Serverless Inference arsenic You Scale.

From the power panel, you tin deploy the exemplary by pursuing the steps below:

  1. INFERENCEDedicated InferenceDeploy Dedicated Inference
  2. Region: lucifer your app VPC region erstwhile possible
  3. Model tab: take My Models, past the exemplary you imported
  4. GPU plan: commencement pinch 1 GPU for mini models (7B–8B class). Scale up for larger models aliases higher concurrency
  5. Deployment name: descriptive, e.g. clinical-notes-prod-v1
  6. Public endpoint: time off abnormal if compliance requires private-only access
  7. Deploy and watch status

Deploy Dedicated Inference

You tin besides mention to our How to Use Dedicated Inference archiving for the steps.

Wait for adjacent 15-30 minutes for the deployment to beryllium successful Active status.

DI Deployment Active Status

Status you will see

Status Meaning
Provisioning GPUs and routing are still coming online
Active Safe to nonstop trial traffic
Error Often capacity aliases config. Check the message, effort different scheme aliases region

Save the entree token immediately

The UI shows the endpoint entree token erstwhile astatine create time. Copy it into a password head aliases secrets store. If you suffer it, create a caller cardinal nether Endpoint Access Keys connected the deployment settings page.

Step 4: Test the backstage endpoint from your VPC

A greenish Active badge is not enough. You want impervious that a existent petition returns text. This measurement besides teaches an important rule: the private endpoint is meant to beryllium reached from resources successful the same VPC, not from your location laptop (unless you deliberately VPN aliases bastion in).

That creation is what teams want erstwhile they opportunity PHI must not time off our web boundary.

What you request from the Overview page

Open your deployment Overview and copy:

  1. Private Endpoint URL (starts pinch https:// and ends pinch thing for illustration private-dedicated-inference.do-infra.ai)
  2. CA certificate record (download button)
  3. The model name precisely arsenic shown successful BYOM (case and pronunciation matter)
  4. Your endpoint entree token

Run the trial from a Droplet, Kubernetes node, aliases different VM successful that VPC. Many teams support a mini “jump” Droplet for precisely this benignant of check.

The trial request, explained statement by line

curl --cacert /path/to/ca.pem \ --location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \ -d '{ "model": "YOUR_MODEL_NAME", "messages": [ {"role": "user", "content": "Say hullo successful 1 sentence."} ], "max_tokens": 150 }'

Here is an illustration of the output:

{"id":"chatcmpl-00d342fa-748d-4fd6-9368-a26e0899ec32","object":"chat.completion","created":1779880993,"model":"anish-byom-qwen3-4b-tutorial","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! 😊","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":19,"completion_tokens":5,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}%

You tin besides usage the jq bid to beautiful people the output:

curl --cacert /path/to/ca.pem \ --location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \ -d '{ "model": "YOUR_MODEL_NAME", "messages": [ {"role": "user", "content": "Say hullo successful 1 sentence."} ], "max_tokens": 150 }' | jq . % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 774 0 613 100 161 1521 399 --:--:-- --:--:-- --:--:-- 1925 { "id": "chatcmpl-b8175ac2-c4f0-444e-a69d-c2066d0a3a99", "object": "chat.completion", "created": 1779881059, "model": "anish-byom-qwen3-4b-tutorial", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Hello! 😊", "refusal": null, "annotations": null, "audio": null, "function_call": null, "tool_calls": [], "reasoning": null }, "logprobs": null, "finish_reason": "stop", "stop_reason": null, "token_ids": null } ], "service_tier": null, "system_fingerprint": null, "usage": { "prompt_tokens": 14, "total_tokens": 19, "completion_tokens": 5, "prompt_tokens_details": null }, "prompt_logprobs": null, "prompt_token_ids": null, "kv_transfer_params": null }
Piece Role
curl Sends an HTTP petition from the bid line. Good for first interaction earlier you constitute app code.
--cacert /path/to/ca.pem Trusts the backstage TLS certificate DigitalOcean issued for this deployment. Without it, TLS verification fails.
--location "https://.../v1/chat/completions" Hits the OpenAI-style chat way your SDKs will besides use.
Authorization: Bearer ... Proves this customer is allowed to talk to your deployment.
"model": "YOUR_MODEL_NAME" Must lucifer the your BYOM import name, not needfully the Hugging Face repo name.
messages Chat history array. For a fume test, 1 personification connection is enough.
max_tokens Upper bound connected really agelong the reply grow. Keeps trial costs predictable.

Let’s effort different curl bid to trial the response:

curl --cacert /path/to/ca.pem \ --location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \ -d '{ "model": "YOUR_MODEL_NAME", "messages": [ {"role": "user", "content": "What is the superior of Uttarakhand?" ], "max_tokens": 512 }' | jq . { "id": "chatcmpl-96e1bc75-1c4d-463a-9dd4-b9f17565f373", "object": "chat.completion", "created": 1779881516, "model": "anish-byom-qwen3-4b-tutorial", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "The superior of Uttarakhand is **Dehradun**.", "refusal": null, "annotations": null, "audio": null, "function_call": null, "tool_calls": [], "reasoning": null }, "logprobs": null, "finish_reason": "stop", "stop_reason": null, "token_ids": null } ], "service_tier": null, "system_fingerprint": null, "usage": { "prompt_tokens": 17, "total_tokens": 32, "completion_tokens": 15, "prompt_tokens_details": null }, "prompt_logprobs": null, "prompt_token_ids": null, "kv_transfer_params": null }

What a bully consequence looks like

You should get JSON pinch a choices array. Inside is message.content pinch generated text. If you spot 401, the token is incorrect aliases missing. If TLS fails, the CA way is incorrect aliases you are not connected the VPC network.

Public endpoint (when you alteration it)

Some teams alteration a nationalist endpoint for staging. The aforesaid JSON assemblage works. You skip --cacert erstwhile the certificate concatenation is standard. Production regulated stacks usually enactment private-only.

Here is an example:

curl --location 'https://YOUR-PUBLIC-ENDPOINT/v1/chat/completions' \ -H 'accept: application/json' \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN' \ -d '{ "model": "anish-byom-qwen3-4b-tutorial", "messages": [ { "role": "user", "content": "What is the superior of Uttarakhand State successful India?" } ], "max_tokens": 150 }' | jq id": "chatcmpl-96e1bc75-1c4d-463a-9dd4-b9f17565f373", "object": "chat.completion", "created": 1779881516, "model": "anish-byom-qwen3-4b-tutorial", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "The superior of Uttarakhand is **Dehradun**.", "refusal": null, "annotations": null, "audio": null, "function_call": null, "tool_calls": [], "reasoning": null }, "logprobs": null, "finish_reason": "stop", "stop_reason": null, "token_ids": null } ], "service_tier": null, "system_fingerprint": null, "usage": { "prompt_tokens": 17, "total_tokens": 32, "completion_tokens": 15, "prompt_tokens_details": null }, "prompt_logprobs": null, "prompt_token_ids": null, "kv_transfer_params": null }

Step 5: Compliance, value checks, and monitoring

Shipping is much than 1 successful curl. You request assurance the exemplary still sounds for illustration your fine-tune, that information handling matches policy, and that latency will not astonishment you connected Monday greeting traffic.

VPC and Zero Data Retention

  • Send accumulation app postulation to the private URL erstwhile isolation is required.
  • Add Cloud Firewalls truthful only your load balancer aliases app tier reaches the conclusion customer hosts.
  • Turn connected Zero Data Retention successful Inference workspace settings if your statement says prompts and completions must not beryllium stored for maltreatment reappraisal aliases value sampling.

Map controls to immoderate model your institution uses. Healthcare teams often reference NIST RMF families for entree control, audit, and transmission protection. Your information partner translates merchandise settings into power language.

Model Evaluations

For system value testing astatine scale, usage Model Evaluations.

INFERENCEModel EvaluationsNew Evaluation

This is different from a chat playground: you upload a dataset (CSV aliases JSONL, up to 1,000 rows and 1 GB), prime your Dedicated Inference deployment arsenic the candidate model, take a judge exemplary (for illustration GPT-4o aliases Claude Sonnet), and tally metrics specified arsenic correctness, completeness, aliases toxicity. The level scores each statement pinch an LLM-as-a-judge workflow.

Sample information dataset for this tutorial

Upload medical-eval-tutorial.csv (15 rows below, aesculapian Q&A prompts aligned pinch this walkthrough).

I person added the dataset beneath for your reference, which you tin transcript and paste into the dataset upload field.

input,expected_response "{""messages"": [{""role"": ""user"", ""content"": ""What is the superior of Uttarakhand, India?""}]}","Dehradun is the superior of Uttarakhand." "{""messages"": [{""role"": ""user"", ""content"": ""What are communal symptoms of seasonal influenza?""}]}","Common symptoms see fever, cough, sore throat, assemblage aches, and fatigue." "{""messages"": [{""role"": ""user"", ""content"": ""What is hypertension?""}]}","Hypertension is persistently elevated humor unit successful the arteries." "{""messages"": [{""role"": ""user"", ""content"": ""How does manus washing thief forestall infection?""}]}","Hand washing removes germs from hands and reduces dispersed to the mouth, nose, and surfaces." "{""messages"": [{""role"": ""user"", ""content"": ""What is type 2 diabetes?""}]}","Type 2 glucosuria is simply a information wherever the assemblage does not usage insulin efficaciously and humor sweetener rises complete time." "{""messages"": [{""role"": ""user"", ""content"": ""When should personification telephone emergency services for thorax pain?""}]}","Call emergency services for abrupt terrible thorax pain, symptom pinch shortness of breath, aliases symptom spreading to the arm, jaw, aliases back." "{""messages"": [{""role"": ""user"", ""content"": ""What is simply a normal resting bosom complaint for adults?""}]}","A normal resting bosom complaint for astir adults is astir 60 to 100 thumps per minute." "{""messages"": [{""role"": ""user"", ""content"": ""What are signs of dehydration?""}]}","Signs see thirst, barren mouth, acheronian urine, dizziness, and reduced urination." "{""messages"": [{""role"": ""user"", ""content"": ""What is asthma?""}]}","Asthma is simply a chronic information wherever airways constrictive and swell, causing wheezing and shortness of breath." "{""messages"": [{""role"": ""user"", ""content"": ""Why are vaccines important?""}]}","Vaccines train the immune strategy to admit pathogens and trim the consequence of terrible illness." "{""messages"": [{""role"": ""user"", ""content"": ""What is anemia?""}]}","Anemia is simply a information pinch little than normal reddish humor cells aliases hemoglobin, often causing fatigue." "{""messages"": [{""role"": ""user"", ""content"": ""How overmuch slumber do adults typically need?""}]}","Most adults request astir 7 to 9 hours of slumber per nighttime for bully health." "{""messages"": [{""role"": ""user"", ""content"": ""What is nutrient poisoning?""}]}","Food poisoning is unwellness from eating contaminated food, often causing nausea, vomiting, and diarrhea." "{""messages"": [{""role"": ""user"", ""content"": ""What is simply a migraine?""}]}","A migraine is simply a neurological headache section that whitethorn see throbbing pain, nausea, and sensitivity to light." "{""messages"": [{""role"": ""user"", ""content"": ""What does BMI measure?""}]}","BMI estimates assemblage fat based connected tallness and weight, utilized arsenic a wide screening tool."

Columns: input (JSON messages for chat completions—not query) and expected_response (reference reply for the judge). Lab usage only—not existent diligent data.

In the power panel: Model EvaluationsNew EvaluationAdd datasetUpload → take that file.

Typical flow:

  • Select a dataset — upload medical-eval-tutorial.csv.

Choose Dataset

  • Select a candidate — take your BYOM Dedicated Inference deployment.

Choose Candidate

  • System prompt — paste the aforesaid strategy connection you scheme to usage successful accumulation truthful scores bespeak existent behavior. For example:
You assistance clinicians. You do not diagnose. Ask users to corroborate pinch a professional.
  • Select a judge model — a abstracted catalog exemplary scores your candidate’s outputs (this incurs further token cost).
  • Pick metrics and a star metric walk threshold, past click Run.
  • When the tally finishes, reappraisal per-row scores and Download results arsenic JSON.

System Prompt

Use Evaluations erstwhile you request repeatable, dataset-driven impervious earlier launch. Use curl erstwhile you only request a accelerated fume trial that the endpoint answers.

Run Evaluation

Here is an illustration of the output:

Evaluation Output

When things spell wrong

Symptom Likely cause What to try
BYOM import Failed Missing tokenizer aliases incorrect record types Compare files layout to import docs
invalid exemplary provider Used hugging_face for a BYOM model Use model_catalog and provider_model_id successful API calls
Stuck connected Provisioning GPU capacity successful that region Wait, aliases retry different region aliases GPU size
401 Unauthorized Token typo aliases expired key Create a caller endpoint entree key
TLS / certificate error Wrong CA way aliases calling from extracurricular VPC Run from a big successful the aforesaid VPC pinch the downloaded CA file
Answers consciousness off Training aliases punctual issue Re-test pinch curl aliases Model Evaluations; set strategy punctual aliases fine-tune data
bf16/gpu aliases CUDA driver excessively aged connected GPU Droplet PyTorch instrumentality mismatch pinch Droplet driver See If train_model.py fails connected bf16 aliases CUDA successful Step 1
Gated repo / Please log successful erstwhile training Hugging Face auth aliases license Log successful to Hugging Face successful Step 1
jinja2>=3.1.0 during Tokenizing train dataset Old Jinja2 connected the Droplet pip instal --upgrade "jinja2>=3.1.0"

Cleanup (so laboratory walk stops)

  1. Destroy the Dedicated Inference deployment (INFERENCEDedicated InferenceDestroy).
  2. Delete the exemplary successful My Models only aft the deployment is gone (INFERENCEModel CatalogMy ModelsDelete).
  3. Power disconnected aliases delete the training GPU Droplet if you created 1 for this exercise.

GPU hours and BYOM retention charges accrue while resources still exist.

FAQs

1. What is serverless inference?

Serverless conclusion is simply a managed measurement to telephone instauration models without provisioning GPUs yourself. You nonstop requests to a shared endpoint and salary for usage (on DigitalOcean, typically per token) alternatively than reserving hardware. It fits prototyping, bursty traffic, and catalog models. It does not big arbitrary fine-tuned weights—that way is Dedicated Inference pinch BYOM.

2. What is the quality betwixt serverless and dedicated inference?

Serverless conclusion shares infrastructure, scales pinch demand, and bills by usage. Dedicated Inference reserves a GPU for your deployment, bills by GPU hr while the deployment exists, and supports backstage VPC endpoints positive civilization weights via BYOM. Dedicated trades idle costs for predictable latency and isolation; serverless trades per-request elasticity for nary custom-model hosting.

3. When should I not usage serverless conclusion for my LLM?

Skip serverless erstwhile you request your ain fine-tuned aliases proprietary weights, dependable accumulation postulation pinch strict latency targets, aliases a backstage endpoint wrong a VPC. Those are the signals to move to Dedicated. You tin publication Dedicated vs Serverless arsenic you scale.

4. How do I deploy a fine-tuned exemplary to production?

A applicable way connected DigitalOcean: fine-tune (for illustration pinch TRL connected a GPU Droplet), export weights, import via BYOM, create a Dedicated Inference deployment connected a GPU size that fits your model, past telephone the OpenAI-compatible endpoint pinch curl aliases your app SDK. This tutorial walks that travel extremity to end, including VPC testing and optional Model Evaluations.

5. Do I request a abstracted “deploy” measurement aft fine-tuning?

Yes. Fine-tuning produces caller weights; serving them requires an conclusion deployment. On DigitalOcean, import is not the aforesaid arsenic serving: BYOM registers weights, past Dedicated Inference loads them onto a reserved GPU and exposes an API. Destroy the deployment erstwhile you are done truthful GPU-hour charges stop.

To Wrap this up

In this tutorial, we person learned really to deploy a connection exemplary to accumulation pinch BYOM and Dedicated Inference. We person besides learned really to trial the deployment and really to cleanup the resources.

You tin now usage this deployment to service your ain exemplary to your users.

  • How to Import Your Own Models (BYOM)
  • How to Use Dedicated Inference
  • Dedicated Inference method heavy dive
  • Dedicated vs Serverless Inference arsenic You Scale
  • TRL SFT Trainer

Creative CommonsThis activity is licensed nether a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.

More