Where to Host Your Open-Source Model (Under 10B Parameters)

Jun 04, 2026 09:45 PM

Intro - framing the real question

For a sub-10B model, the hard part isn’t whether you can host it, since almost any service can provide a machine that can; it’s matching the hosting model to your traffic pattern, customization needs, and budget. Most users are focused on the largest models, and for good reason: they are the highest performers thanks to their size. But a growing number of users are finding that, with advances in LLM technology, smaller models fit their needs exceedingly well.

Small models change the math. A 7–9B model fits on a single mid-tier GPU, which unlocks options (serverless, per-token, single-GPU droplets) that aren’t viable for 70B+ models. Economically, this opens a lot of doors for new users to start using AI technologies that were previously unviable, across all sorts of new scenarios.

So where should you actually run one? The short answer: for most teams running a model under 10B parameters, a managed inference platform that supports bring-your-own-model is the best starting point. You get production-grade serving without managing GPUs, and you can host your own fine-tuned weights, not just an off-the-shelf catalog. Self-managed GPUs only start to make sense at sustained, predictable, high volume.

The rest of this guide gives you the reasoning behind that answer. We’ll start with a quick sizing framework so you know exactly how much GPU memory a sub-10B model needs, walk through the four ways to host it and what each is good for, and finish with a concrete recommendation and a step-by-step quick-start. By the end you’ll be able to match your model to the right hosting approach in a few minutes, and know when it’s worth switching.

Key Takeaways

  • Match the hosting model to your traffic, not your model size. For a sub-10B model with bursty or unpredictable traffic, serverless per-token inference is cheapest because you pay nothing while idle; a dedicated GPU only wins when your utilization is consistently high.
  • Most teams should start on a managed platform. Use serverless inference if an off-the-shelf model (Llama, Mistral, Qwen) fits, or Bring-Your-Own-Model (BYOM) if you’re serving your own fine-tune. Both give you production-grade serving without operating GPUs, drivers, or an inference server yourself.
  • Small models change the cost math. A quantized 7–9B model fits on a single 48 GB GPU (for 7B weights: ≈14 GB at FP16, ≈7 GB at 8-bit, ≈3.5 GB at 4-bit), so you choose among inexpensive single-GPU options rather than wiring together multi-GPU clusters, and the per-token-vs-per-GPU-hour crossover arrives sooner than it does for 70B+ models.

First, size your model: how much GPU do you actually need?

Before you compare providers, work out how much GPU memory your model actually needs; it’s the single number that determines which hosting options are even on the table. The rule of thumb is simple: VRAM scales with the bytes you spend per parameter. Full precision (FP32) costs 4 bytes per parameter, half precision (FP16/BF16) costs 2, 8-bit costs 1, and 4-bit costs about 0.5. Multiply by your parameter count and you have the weights footprint.
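To make the rule of thumb concrete, here is a minimal sketch that multiplies parameter count by bytes per parameter. The function and dictionary names are illustrative, and 1 GB is taken as 1e9 bytes; the result covers weights only, not KV cache or runtime overhead.

```python
# Bytes per parameter at each precision, as quoted in the rule of thumb above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(f"{precision}: 7B ~{weights_gb(7, precision):g} GB, "
          f"9B ~{weights_gb(9, precision):g} GB")
```

Treat the output as a floor: the amount you should actually budget is higher once context and concurrency are added.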

VRAM requirements for a small model (~7–9B parameters)

| Precision | Bytes / param | Weights only (7B) | Weights only (9B) | Realistic VRAM to budget (7B / 9B) | Typical GPU that fits |
| --- | --- | --- | --- | --- | --- |
| FP32 (full) | 4 | ~28 GB | ~36 GB | ~34 / ~43 GB | A6000 48 GB, L40 48 GB (tight at 9B); H100 / H200 with room to spare |
| FP16 / BF16 (half) | 2 | ~14 GB | ~18 GB | ~17 / ~22 GB | A6000 48 GB, L40 48 GB comfortably; H100 / H200 for high concurrency |
| 8-bit (INT8) | 1 | ~7 GB | ~9 GB | ~9 / ~11 GB | A6000 / L40 with ample KV-cache headroom; H100 / H200 are overkill |
| 4-bit (NF4 / GPTQ / AWQ) | 0.5 | ~3.5 GB | ~4.5 GB | ~5 / ~6 GB | Any of the four; an H100 / H200 only earns its cost at high concurrency |

Table 1 - VRAM requirements by precision: A 7–9B model needs about 14–18 GB of VRAM at FP16, about 7–9 GB at 8-bit, and about 3.5–4.5 GB at 4-bit, so a quantized sub-10B model fits comfortably on a single 48 GB GPU.

The takeaway: a quantized sub-10B model runs comfortably on a single 48 GB GPU, and a 4-bit version fits even more comfortably. That’s the whole reason the cost calculus differs from 70B+ models: you’re choosing among single-GPU options (serverless, per-token, one GPU Droplet) rather than wiring together multi-GPU clusters.

Two caveats are worth keeping in mind. First, quantization is a quality-versus-footprint trade. Each step down (FP16 to 8-bit, then 8-bit to 4-bit) roughly halves the weights footprint, so dropping from FP16 to 4-bit cuts it to about a quarter, with minimal quality loss for most use cases. It isn’t free, though, so verify on your own eval set before shipping (Lee et al. 2024, “A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs”). Second, the weights are only the floor. Real-world VRAM is driven as much by context length and concurrency as by the model itself, because the KV cache grows with both. At long contexts and many simultaneous requests that cache can add several gigabytes on top of the weights, so size for your peak concurrent load, not just the model. (These figures are deliberately rounded rules of thumb; modern architectures using grouped-query attention consume noticeably less KV cache than older designs.) (Ainslie et al. 2023, “GQA,” EMNLP)
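For a rough feel of the second caveat, the sketch below uses the standard KV-cache estimate: two tensors (key and value) per layer, each holding kv_heads × head_dim values per token. The model dimensions are assumptions chosen to resemble an 8B-class model, not figures from this guide, so read your own model's config before relying on the result. It also assumes every request fills its whole context, which is a worst case.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_requests: int,
                bytes_per_value: int = 2) -> float:
    """Worst-case KV-cache size in GB (1 GB = 1e9 bytes), FP16 values by default."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * concurrent_requests / 1e9


# Assumed, illustrative dimensions: 32 layers, head_dim 128.
# 8 KV heads models grouped-query attention; 32 models full multi-head attention.
gqa = kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                  context_tokens=4096, concurrent_requests=8)
mha = kv_cache_gb(layers=32, kv_heads=32, head_dim=128,
                  context_tokens=4096, concurrent_requests=8)

print(f"GQA: ~{gqa:.1f} GB, MHA: ~{mha:.1f} GB")
```

Under these assumptions, eight concurrent 4,096-token requests need roughly 4.3 GB of cache with grouped-query attention and four times that without it, which is why peak concurrent load matters as much as the weights.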

What are the ways to host a small open-source model?

There are really only four ways to run inference, and each suits a different combination of traffic, customization, and operational appetite. Here’s what each is, who it’s for, and what it costs you, kept vendor-neutral; the recommendation comes later.

Serverless / pay-per-token inference APIs

You send requests to a hosted endpoint and pay only for the tokens you consume. The platform handles everything underneath, scaling to zero when idle. This is the best fit for bursty or unpredictable traffic, prototypes, and anything with variable load, because you pay nothing while you’re not serving. The trade-offs are less control over the serving stack, occasional cold starts after idle periods, and a per-token rate that can exceed a dedicated GPU when your traffic is high and steady.

Managed model hosting with Bring-Your-Own-Model (BYOM)

You upload your own weights, usually a fine-tune, and the platform runs them on its optimized serving stack. This is the sweet spot for teams that have a custom sub-10B model and want production-grade serving without operating it themselves: no managing vLLM, no tuning batching, no chasing kernel optimizations. The trade-off is that you’re bound to the platform’s supported model formats and architectures, but in exchange you skip an enormous amount of operational overhead. For most teams shipping a fine-tuned small model, this is the path of least resistance.

Self-managed GPU instances

You rent a GPU virtual machine and run your own inference server: vLLM for high-throughput production, Text Generation Inference as an alternative, or Ollama when you just want something running in minutes. This gives you maximum control and is the most cost-effective option at sustained, predictable load or when you have special requirements. The catch is that you now own scaling, batching, monitoring, and uptime; vLLM will get you excellent throughput per GPU, but keeping it healthy is your job.

Local / edge / on-prem

You run the model on your own hardware: a workstation GPU, an on-prem server, or an edge device. This is the right call for strict data residency, air-gapped environments, development, or hobby projects. The trade-offs are no elastic scale and an upfront capital cost instead of a usage-based one.

Serverless vs. BYOM vs. self-managed GPU: how do they compare?

| Approach | Best for | Control | Ops burden | Cost model | Scaling |
| --- | --- | --- | --- | --- | --- |
| Serverless / per-token API | Bursty, unpredictable, prototypes | Low | None | Pay per token | Automatic, scale to zero |
| Managed BYOM | Custom/fine-tuned models, lean teams | Medium | Low | Per token or per hour | Managed by platform |
| Self-managed GPU | Sustained high volume, full control | High | High | Per GPU-hour | You build it |
| Local / edge / on-prem | Compliance, air-gapped, dev | Full | High (hardware) | Capex | None (fixed capacity) |

Table 2 - Four-way hosting comparison: Serverless suits bursty traffic with zero ops burden, managed BYOM suits custom fine-tunes for lean teams, self-managed GPUs suit sustained high volume at the cost of running everything yourself, and local/on-prem suits compliance or air-gapped needs.

How to actually choose: a decision framework

You don’t need to weigh all four options equally; a handful of questions usually points to one. Run through these in order:

Is your traffic predictable? If it’s bursty, spiky, or you simply don’t know yet, serverless per-token billing protects you from paying for idle GPUs. If it’s steady and high, a dedicated GPU starts to win on cost.

Are you running your own model? If an off-the-shelf model meets your needs, a hosted inference API is the fastest route. If you’ve fine-tuned your own sub-10B model, you need either BYOM or a self-managed GPU to serve those weights.

How much infrastructure do you want to own? If you’d rather not manage GPUs, drivers, and an inference server, a managed or serverless platform is the answer. If you have a platform team and want to squeeze every dollar of throughput, self-managed gives you the levers.

What’s your cost crossover? Per-token pricing is cheapest at low and variable volume; per-GPU-hour pricing is cheapest when utilization is consistently high. There’s a break-even point, and it moves in favor of dedicated GPUs as your sustained traffic grows.

Do you have compliance or data-residency constraints? Requirements about where data lives, or air-gapped operation, can override everything above and push you toward a specific region or fully on-prem hosting.

For most teams launching a small model, the first three questions land in the same place: unpredictable early traffic, a model you want to control, and no desire to babysit infrastructure, which is exactly what the next section is about.
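If it helps to see the questions as a single decision, the sketch below encodes them as a small function. It is a simplification under stated assumptions: the function name and boolean parameters are hypothetical, the compliance question is reduced to "must run on-prem", and the cost crossover is folded into the steady-traffic flag rather than computed.

```python
def choose_hosting(needs_on_prem: bool, steady_high_traffic: bool,
                   custom_weights: bool, wants_to_run_infra: bool) -> str:
    """Map the framework's questions to one of the four hosting approaches."""
    # Compliance or air-gap requirements override everything else.
    if needs_on_prem:
        return "local / edge / on-prem"
    # A dedicated GPU wins on cost only when traffic is steady and high,
    # and only if someone is willing to operate it.
    if steady_high_traffic and wants_to_run_infra:
        return "self-managed GPU"
    # Your own fine-tune without the ops burden points to managed BYOM.
    if custom_weights:
        return "managed BYOM"
    # Off-the-shelf model, unpredictable traffic, no infrastructure to own.
    return "serverless / per-token API"


# A typical early-stage team: unknown traffic, its own fine-tune, no platform team.
print(choose_hosting(needs_on_prem=False, steady_high_traffic=False,
                     custom_weights=True, wants_to_run_infra=False))
```

Real decisions have more nuance than four booleans, so use this as a checklist rather than a verdict.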

Recommendation: where to start for a sub-10B model

For most teams, start on a managed inference platform and only move to self-managed GPUs when sustained volume justifies it. Our concrete recommendation is DigitalOcean’s Gradient AI Platform, because it covers the full sub-10B journey (a hosted model catalog, your own fine-tune, and a dedicated GPU) on one account and one bill. Which entry point you use depends on whether you’re running a popular open model or your own:

Running a popular open model (Llama, Mistral, Qwen, Gemma, DeepSeek)? Use Serverless Inference. DigitalOcean’s Serverless Inference exposes 30+ foundation models through a single OpenAI-compatible endpoint (https://inference.do-ai.run/v1/) and one model access key, billed per input and output token with no GPU to provision. It scales automatically and you pay only for tokens consumed, ideal for the bursty, hard-to-forecast traffic a small model sees in its early life. New accounts get a usage allowance before billing starts (for example, $25 on tier 1). If a catalog model meets your needs, this is the fastest possible start.

Running your own fine-tune? Use Bring-Your-Own-Model (BYOM). BYOM lets you import your own weights from Hugging Face (gated repos included) or a DigitalOcean Spaces bucket and have DigitalOcean serve them on an optimized stack, with no vLLM to operate. Two specifics to plan around: imports must be Safetensors files, and supported architectures are currently the Qwen family (Qwen2ForCausalLM and Qwen3ForCausalLM), which conveniently covers strong sub-10B bases like Qwen3-8B. BYOM models deploy through Dedicated Inference, billed per GPU-hour rather than per token, so this path suits steadier workloads. Import is done in the Control Panel (it isn’t yet exposed via the API, CLI, or SDK), and imported weights live in a managed Spaces location that incurs storage charges.
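Before pushing weights to Hugging Face or Spaces, you can sanity-check a local copy against the two requirements above. This is a minimal local sketch, not the platform's own validation: it assumes the usual Hugging Face layout, where `config.json` lists the model class under `architectures`, and the function name is hypothetical. The platform may require additional files that this check does not cover.

```python
import json
from pathlib import Path

# Architectures named as supported in the text above.
SUPPORTED_ARCHITECTURES = {"Qwen2ForCausalLM", "Qwen3ForCausalLM"}


def byom_preflight(model_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    root = Path(model_dir)
    problems = []

    # Requirement 1: weights must be Safetensors files.
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")

    # Requirement 2: the architecture must be in the supported list.
    config_path = root / "config.json"
    if not config_path.is_file():
        problems.append("config.json is missing")
    else:
        architectures = json.loads(config_path.read_text()).get("architectures", [])
        if not SUPPORTED_ARCHITECTURES.intersection(architectures):
            problems.append(f"unsupported architecture: {architectures}")

    return problems
```

Calling `byom_preflight("path/to/your-finetune")` on a local checkout returns an empty list when both requirements look satisfied, and a short list of reasons otherwise.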

Outgrowing managed, or need a different architecture? Use GPU Droplets. When traffic is steady and high, or your model falls outside the BYOM architecture list, self-managed GPU Droplets give you full control on the same platform. On-demand pricing is transparent and single-GPU-friendly for sub-10B models:

| GPU Droplet | VRAM | On-demand price | Good for a sub-10B model |
| --- | --- | --- | --- |
| NVIDIA RTX 4000 Ada | 20 GB | $0.76 / hr | A quantized (4-bit / 8-bit) 7–9B model |
| NVIDIA RTX 6000 Ada | 48 GB | $1.57 / hr | FP16 sub-10B with headroom for context |
| NVIDIA L40S | 48 GB | $1.57 / hr | FP16 sub-10B, higher throughput |
| NVIDIA H100 | 80 GB | $3.39 / hr | Overkill for one small model; useful at high concurrency |

Table 3 - GPU Droplet pricing: A quantized 7–9B model runs on a $0.76/hr RTX 4000 Ada, while an FP16 sub-10B model wants a 48 GB card (RTX 6000 Ada or L40S at $1.57/hr), making the H100 overkill except at high concurrency.
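As a quick way to read Table 3, the sketch below picks the cheapest listed card whose VRAM covers a given budget. The prices and sizes are copied from the table and may change, and the helper name is hypothetical. It compares VRAM only, so it says nothing about throughput differences between cards at the same price.

```python
# (name, VRAM in GB, on-demand $/hr), copied from Table 3; check current pricing.
GPU_DROPLETS = [
    ("NVIDIA RTX 4000 Ada", 20, 0.76),
    ("NVIDIA RTX 6000 Ada", 48, 1.57),
    ("NVIDIA L40S", 48, 1.57),
    ("NVIDIA H100", 80, 3.39),
]


def cheapest_fit(vram_needed_gb: float):
    """Return the cheapest (name, vram, price) tuple that fits, or None."""
    candidates = [gpu for gpu in GPU_DROPLETS if gpu[1] >= vram_needed_gb]
    # min() keeps the first entry on a price tie, so list order breaks ties.
    return min(candidates, key=lambda gpu: gpu[2]) if candidates else None


print(cheapest_fit(6))    # 4-bit 9B budget from Table 1
print(cheapest_fit(22))   # FP16 9B budget from Table 1
```

Pass in the realistic VRAM budget rather than the weights-only figure, and leave extra headroom if you expect long contexts or many concurrent requests.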

Billing is per-second with a five-minute minimum, and reserved contracts lower the hourly rate for sustained use. DigitalOcean’s 1-Click Models can also deploy popular open models (Llama 3, Mistral, Qwen, Gemma) onto a Droplet with an OpenAI-compatible endpoint in a few clicks.
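For short-lived experiments, the five-minute minimum matters more than the hourly rate. The sketch below assumes the minimum is charged as 300 seconds at the per-second equivalent of the hourly rate; that interpretation and the function name are assumptions, so confirm the details on the pricing page.

```python
def droplet_cost(hourly_rate: float, seconds_used: int,
                 minimum_seconds: int = 300) -> float:
    """Estimated on-demand charge for one run, with per-second billing."""
    # Assumption: anything shorter than the minimum is billed as the minimum.
    billable_seconds = max(seconds_used, minimum_seconds)
    return hourly_rate * billable_seconds / 3600


# A one-minute smoke test and a full hour on the $0.76/hr card from Table 3.
print(f"${droplet_cost(0.76, 60):.3f}")
print(f"${droplet_cost(0.76, 3600):.2f}")
```

Under that assumption, a one-minute test costs the same as a five-minute one, roughly six cents on the $0.76/hr card.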

The honest pitch: predictable pricing, a simple UI, and OpenAI-compatible APIs make this a strong fit for solo developers through mid-size teams. It’s not the right choice if your fine-tune uses an architecture outside the current BYOM list, or if you depend on hyperscaler-specific services elsewhere in your stack. (Per-token serverless rates are set per model, so check the current pricing page for the exact model you’ll run.)

What are the alternatives to DigitalOcean for hosting a small model?

No single platform is right for everyone, and a guide that pretends otherwise won’t earn anyone’s trust. Here’s where the alternatives genuinely shine:

Specialized GPU renters (RunPod, Lambda, Vast.ai). Choose these if your priority is the cheapest possible raw GPU-hours and you’re comfortable doing the setup yourself. You’ll get strong per-hour rates, especially on consumer cards like the RTX 4090 that handle a quantized 7B model well, in exchange for a more DIY experience.

Hyperscalers (AWS, GCP, Azure). Choose these if you’re already deep in one of their ecosystems or you want spot-instance discounts and tight integration with the rest of your infrastructure. The trade-off is more complexity and generally higher cost than a focused provider.

Dedicated inference APIs / serverless model providers. Choose these if an off-the-shelf open model meets your needs and you want to ship today. They’re the fastest path to a working endpoint, but they typically won’t host a custom fine-tune the way BYOM does.

The point of laying these out plainly is that the recommendation above holds up on the merits: small models genuinely suit managed BYOM and serverless, and you can verify that by comparing against the honest version of each alternative.

Quick-start: hosting your fine-tune on DigitalOcean BYOM

Here’s the path from a set of weights to a live, OpenAI-compatible endpoint:

First, prep your model. Confirm your weights are in Safetensors format, use a supported architecture (Qwen2 or Qwen3 family today), and live in a Hugging Face repo (gated is fine) or a DigitalOcean Spaces bucket. You can’t upload files directly from your computer.

Second, import it. In the DigitalOcean Control Panel, open INFERENCE → Model Catalog → My Models and start an import pointing at your Hugging Face repo or Spaces location. You can run several imports at once without waiting for each to finish.

Third, wait for Ready. Track status on the My Models tab. A failed import usually means a missing required file or an unsupported architecture.

Fourth, deploy to Dedicated Inference. From the model card, create a dedicated inference deployment and choose your GPU. This gives you a dedicated, OpenAI-compatible endpoint billed per GPU-hour.

Fifth, call it. Point any OpenAI-compatible client at your deployment’s base URL with a model access key:

    from openai import OpenAI

    # Point the standard OpenAI client at your Dedicated Inference deployment.
    client = OpenAI(
        base_url="https://<your-deployment-url>.do-ai.run/v1/",  # shown in the Control Panel
        api_key="<your-model-access-key>",
    )

    # Send a single chat message to the imported model.
    resp = client.chat.completions.create(
        model="<your-imported-model>",
        messages=[{"role": "user", "content": "Hello!"}],
    )

    # Print the text of the first completion.
    print(resp.choices[0].message.content)

Calling a catalog model through Serverless Inference is identical, except the base URL is the shared https://inference.do-ai.run/v1/, so you can prototype against a stock model and swap in your fine-tune later by changing the base URL and model name.
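One way to make that swap a configuration change rather than a code change is to read the endpoint settings from the environment. This is a sketch under assumptions: the environment variable names (`LLM_BASE_URL`, `LLM_MODEL`, `MODEL_ACCESS_KEY`) and the helper are hypothetical, and the default model name is a placeholder you would replace with a real catalog entry.

```python
import os

# Shared Serverless Inference endpoint quoted in this guide.
SERVERLESS_BASE_URL = "https://inference.do-ai.run/v1/"


def client_settings(env=os.environ) -> dict:
    """Resolve endpoint settings, defaulting to the shared serverless endpoint."""
    return {
        # Set LLM_BASE_URL to your deployment URL once the fine-tune is live.
        "base_url": env.get("LLM_BASE_URL", SERVERLESS_BASE_URL),
        # Placeholder default; use the catalog or imported model name.
        "model": env.get("LLM_MODEL", "<catalog-model-name>"),
        "api_key": env.get("MODEL_ACCESS_KEY", ""),
    }


settings = client_settings()
# With the openai package installed, the client would then be built as:
#   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
print(settings["base_url"])
```

With this in place, moving from a catalog model to your fine-tune means changing two environment variables and redeploying, with no edits to application code.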

Get started: BYOM import guide · create an account · GPU Droplet pricing.

How much does it cost to host a small open-source LLM?

The core cost decision is per-token versus per-GPU-hour, and small models make the crossover happen sooner. The way to reason about it is to convert both to the same unit, cost per million tokens, and compare.

A worked example (illustrative throughput; verify against your own benchmarks): run a 4-bit 7B model on an NVIDIA RTX 4000 Ada GPU Droplet at $0.76/hour. At a conservative single-stream rate of ~50 tokens/second, that’s about 180,000 tokens/hour, or about $4.20 per million tokens, if the GPU stays busy. The catch is that word “if.” Continuous batching (the vLLM style) can push aggregate throughput several times higher, bringing the effective cost toward $1 per million tokens or below at high concurrency; conversely, a GPU you’re paying for at 20% utilization quintuples your real per-token cost. Serverless per-token pricing, by contrast, charges you only for tokens actually generated, so at low or spiky volume it’s almost always cheaper. The break-even arrives when your sustained utilization is high enough that the dedicated GPU’s hourly cost, spread across the tokens it actually serves, drops below the per-token rate. (Anyscale, continuous batching throughput benchmark)
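The arithmetic in the worked example can be reproduced with a few lines. The function name is illustrative, the 50 tokens/second figure is the same illustrative rate used above, and the 250 tokens/second batched rate is an assumed "several times higher" value, not a benchmark.

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Effective $ per million tokens for a GPU billed by the hour."""
    # Tokens actually served in an hour, given how busy the GPU is.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


busy = cost_per_million_tokens(0.76, 50)                        # GPU always busy
mostly_idle = cost_per_million_tokens(0.76, 50, utilization=0.2)
batched = cost_per_million_tokens(0.76, 250)                    # assumed 5x from batching

print(f"busy: ${busy:.2f}, 20% utilized: ${mostly_idle:.2f}, batched: ${batched:.2f}")
```

The three results (about $4.22, $21.11, and $0.84 per million tokens) show how much more the effective price depends on utilization and batching than on the hourly rate itself.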

You can move that break-even in your favor with a few levers: quantization (a smaller footprint lets you use a cheaper GPU; a 4-bit 7B fits the $0.76/hr RTX 4000 Ada), batching (the single biggest throughput multiplier per GPU), scale-to-zero (serverless charges nothing while idle), and reserved capacity (committed-use GPU Droplet contracts reduce the hourly rate for steady workloads). Serverless per-token rates are published per model, so check the pricing page for the specific model you plan to run.
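To locate the break-even for your own numbers, divide the GPU's hourly price by the serverless per-token rate. In the sketch below, the $0.50 per million tokens serverless rate is a made-up placeholder, not a published price, and the function name is hypothetical; substitute the real rate for your model.

```python
def break_even_tokens_per_hour(gpu_hourly_rate: float,
                               serverless_rate_per_million: float) -> float:
    """Sustained tokens/hour above which the dedicated GPU is cheaper."""
    return gpu_hourly_rate / serverless_rate_per_million * 1_000_000


# Hypothetical serverless rate of $0.50 per million tokens (placeholder only).
tokens_per_hour = break_even_tokens_per_hour(0.76, 0.50)
print(f"~{tokens_per_hour:,.0f} tokens/hour, "
      f"~{tokens_per_hour / 3600:.0f} tokens/second sustained")
```

With those placeholder inputs the dedicated GPU only pays off above roughly 1.5 million tokens per hour, which also has to be within what the GPU can actually serve once batching is accounted for.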

FAQ

What’s the cheapest way to host a small open-source LLM? Serverless per-token inference is cheapest for low or unpredictable traffic, because you pay nothing while idle. For sustained high traffic, a single dedicated GPU running a quantized model costs less per token. The crossover comes earlier for small models because they fit on one inexpensive GPU.

How much VRAM does a 7B model need? Roughly 14 GB at FP16, about 7 GB at 8-bit, and about 3.5 GB at 4-bit, before context and concurrency overhead. Real-world usage runs higher because the KV cache grows with context length and the number of simultaneous requests, so size for your peak concurrent load, not just the weights.

Do I need a GPU to run a 7B model? In practice, yes, for usable speed. A quantized 7B runs comfortably on a single 48 GB GPU like an A6000 or L40, with plenty of headroom for context. CPU-only inference works but is too slow for most production use.

What GPU do I need for a quantized 7B model? A single 48 GB card (A6000 or L40) comfortably runs a quantized 7–9B model with room for context and concurrency; on DigitalOcean the RTX 6000 Ada or L40S at $1.57/hr is the natural fit, or an H100 if you’re serving many concurrent requests. If context and concurrency are modest, the 20 GB RTX 4000 Ada at $0.76/hr also fits a quantized 7–9B model.

Serverless vs. GPU Droplet: which is cheaper? Serverless per-token pricing wins at low or variable volume; a dedicated GPU Droplet wins when your utilization is consistently high. To find your break-even, estimate sustained tokens per day and compare the per-token rate against the per-GPU-hour cost spread over the tokens you’d actually serve. Batching and quantization push that break-even in the dedicated GPU’s favor.

Can I host a fine-tuned model, not just a catalog one? Yes, that’s what Bring-Your-Own-Model (BYOM) is for. You import your own fine-tuned weights and the platform serves them on its optimized stack, so you’re not limited to a fixed catalog.

What model formats are supported for BYOM? On DigitalOcean, BYOM imports accept Safetensors weights (plus standard companion files like config and tokenizer) from Hugging Face or a Spaces bucket. Supported architectures are currently the Qwen family (Qwen2ForCausalLM and Qwen3ForCausalLM), so check the import requirements before you start if you’re bringing a different base.

Can DigitalOcean host a fine-tuned Qwen model? Yes. Qwen is the supported BYOM architecture today (Qwen2 and Qwen3), which covers strong sub-10B bases like Qwen3-8B. Import your Safetensors weights, wait for the model to reach Ready, then deploy it to Dedicated Inference for an OpenAI-compatible endpoint. If your fine-tune uses a non-Qwen base like Llama or Mistral, run it yourself on a GPU Droplet instead.

How do I import a fine-tuned model into DigitalOcean BYOM? In the Control Panel, open INFERENCE → Model Catalog → My Models and start an import pointing at your Hugging Face repo (gated repos are fine) or a Spaces bucket. Wait for the status to reach Ready, then create a Dedicated Inference deployment. Imports must be Safetensors files, and you can’t upload directly from your machine; the weights have to live in Hugging Face or Spaces first.

Does serverless inference have cold starts? It can, after an idle period: the first request following idle time may be slower while capacity spins up. That’s the trade-off for scaling to zero and paying nothing while idle, which is usually worth it for the bursty traffic a small model sees early on. Steady, latency-sensitive workloads are a better fit for a dedicated deployment.

Can I switch from a catalog model to my own fine-tune later? Yes, with almost no code change. Both Serverless Inference and a BYOM Dedicated Inference deployment expose the same OpenAI-compatible API, so you prototype against a stock catalog model and later swap in your fine-tune by changing the base URL and model name, one line each.

How do I host a model on DigitalOcean? For a popular open model, send requests to the Serverless Inference API at https://inference.do-ai.run/v1/ with a model access key; no setup is needed. For your own fine-tune, import the Safetensors weights via INFERENCE → Model Catalog → My Models, wait for Ready, then create a Dedicated Inference deployment to get an endpoint. For full control, run it yourself on a GPU Droplet.

Conclusion

For a model under 10B parameters, the question isn’t whether you can host it; it’s matching the approach to your traffic, your customization needs, and your budget. For most teams that means starting on a managed platform: Serverless Inference if a catalog model fits, BYOM on Dedicated Inference if you’re serving your own fine-tune, and graduating to a self-managed GPU Droplet only when steady volume makes it cheaper. DigitalOcean’s Gradient AI Platform covers that whole arc on one bill. Start with the BYOM quick-start or see current pricing.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
